  • Crawler demo

    2020-02-08 15:57:31
    After seeing other people's crawlers I really wanted to try one myself. This crawler uses the WebMagic framework and crawls articles from CSDN: find the URL patterns, then crawl. It's a fun thing to play with and everyone is welcome to try it. If you're interested you can add me on QQ: 770666529

    Crawler Demo


    Dependencies

    Either JPA or MyBatis can be used here to persist the data.

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      
        <modelVersion>4.0.0</modelVersion>
    
        <groupId>com.thh</groupId>
        <artifactId>blog-spider</artifactId>
        
         <parent>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-parent</artifactId>
            <version>2.1.4.RELEASE</version>
            <relativePath />
        </parent>
        <!-- WebMagic crawler framework -->
        <dependencies>
    
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-data-jpa</artifactId>
            </dependency>
    
    
            <dependency>
                <groupId>mysql</groupId>
                <artifactId>mysql-connector-java</artifactId>
            </dependency>
    
    
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-data-redis</artifactId>
            </dependency>
            <dependency>
                <groupId>org.projectlombok</groupId>
                <artifactId>lombok</artifactId>
            </dependency>
            <dependency>
                <groupId>io.jsonwebtoken</groupId>
                <artifactId>jjwt</artifactId>
                <version>0.6.0</version>
            </dependency>
    
        <!-- Spring Boot web starter -->
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-web</artifactId>
            </dependency>
    
    
            <dependency>
                <groupId>us.codecraft</groupId>
                <artifactId>webmagic-core</artifactId>
                <version>0.7.3</version>
                <exclusions>
                    <exclusion>
                        <groupId>org.slf4j</groupId>
                        <artifactId>slf4j-log4j12</artifactId>
                    </exclusion>
                </exclusions>
            </dependency>
    
            <dependency>
                <groupId>us.codecraft</groupId>
                <artifactId>webmagic-extension</artifactId>
                <version>0.7.3</version>
            </dependency>
    
        <!-- MyBatis pagination plugin -->
            <dependency>
                <groupId>com.github.pagehelper</groupId>
                <artifactId>pagehelper-spring-boot-starter</artifactId>
                <version>1.2.3</version>
            </dependency>
            <dependency>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-configuration-processor</artifactId>
                <optional>true</optional>
            </dependency>
    
        <!-- tk.mybatis generic mapper -->
            <dependency>
                <groupId>tk.mybatis</groupId>
                <artifactId>mapper-spring-boot-starter</artifactId>
                <version>2.1.5</version>
            </dependency>
    
    
    
        </dependencies>
    
    </project>
    
    
    application.yml
    server:
      port: 7777
    spring:
      datasource:
        username: root
        password: 123456
        url: jdbc:mysql://192.168.139.128:3306/thh_blog?useUnicode=true&characterEncoding=utf-8&zeroDateTimeBehavior=convertToNull&transformedBitIsBoolean=true&useSSL=false&serverTimezone=GMT%2B8
        driver-class-name: com.mysql.cj.jdbc.Driver
    
      jpa:
        show-sql: true
        generate-ddl: false
        database: mysql
      redis:
        host: 192.168.139.128
        port: 6379
    
    mybatis:
      mapper-locations: classpath:mappers/*.xml  # path to the MyBatis Mapper.xml files
      type-aliases-package: com.thh.spider.pojo  # package of the entity beans
    
    

    logback-spring.xml (logging configuration)

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <contextName>${APP_NAME}</contextName>
    
        <springProperty name="APP_NAME" scope="context" source="spring.application.name"/>
        <springProperty name="LOG_FILE" scope="context" source="logging.file" defaultValue="./logs/${APP_NAME}"/>
        <springProperty name="LOG_POINT_FILE" scope="context" source="logging.file" defaultValue="./logs/point"/>
        <springProperty name="LOG_MAXFILESIZE" scope="context" source="logback.filesize" defaultValue="50MB"/>
        <springProperty name="LOG_FILEMAXDAY" scope="context" source="logback.filemaxday" defaultValue="7"/>
        <springProperty name="ServerIP" scope="context" source="spring.cloud.client.ip-address" defaultValue="0.0.0.0"/>
        <springProperty name="ServerPort" scope="context" source="server.port" defaultValue="0000"/>
    
        <!-- colored console logging -->
        <!-- converter classes required for colored output -->
        <conversionRule conversionWord="clr" converterClass="org.springframework.boot.logging.logback.ColorConverter"/>
        <conversionRule conversionWord="wex"
                        converterClass="org.springframework.boot.logging.logback.WhitespaceThrowableProxyConverter"/>
        <conversionRule conversionWord="wEx"
                        converterClass="org.springframework.boot.logging.logback.ExtendedWhitespaceThrowableProxyConverter"/>
    
        <!-- colored log pattern -->
        <property name="CONSOLE_LOG_PATTERN"
                  value="[${APP_NAME}:${ServerIP}:${ServerPort}] [%clr(%X{traceid}){yellow},%clr(%X{X-B3-TraceId}){yellow}] %clr(%d{yyyy-MM-dd HH:mm:ss.SSS}){faint} %clr(%level){blue} %clr(${PID}){magenta} %clr([%thread]){orange} %clr(%logger){cyan} %m%n${LOG_EXCEPTION_CONVERSION_WORD:-%wEx}"/>
        <property name="CONSOLE_LOG_PATTERN_NO_COLOR"
                  value="[${APP_NAME}:${ServerIP}:${ServerPort}] [%X{traceid},%X{X-B3-TraceId}] %d{yyyy-MM-dd HH:mm:ss.SSS} %level ${PID} [%thread] %logger %m%n${LOG_EXCEPTION_CONVERSION_WORD:-%wEx}"/>
    
    
        <!-- console appender -->
        <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
            <withJansi>true</withJansi>
            <encoder>
                <pattern>${CONSOLE_LOG_PATTERN}</pattern>
                <charset>UTF-8</charset>
            </encoder>
        </appender>
    
        <!-- daily rolling log file (ERROR level) -->
        <appender name="ERROR" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <file>${LOG_FILE}/${APP_NAME}-error.log</file>
            <!-- size- and time-based rolling policy -->
            <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
                <fileNamePattern>${LOG_FILE}/${APP_NAME}-error.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
                <maxFileSize>100MB</maxFileSize>
                <!-- retention period, in days -->
                <maxHistory>60</maxHistory>
            </rollingPolicy>
            <encoder>
                <pattern>${CONSOLE_LOG_PATTERN_NO_COLOR}</pattern>
                <charset>UTF-8</charset>
            </encoder>
            <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
                <MaxFileSize>100MB</MaxFileSize>
            </triggeringPolicy>
            <filter class="ch.qos.logback.classic.filter.LevelFilter"><!-- only log ERROR level -->
                <level>ERROR</level>
                <onMatch>ACCEPT</onMatch>
                <onMismatch>DENY</onMismatch>
            </filter>
        </appender>
    
        <!-- daily rolling log file (INFO level) -->
        <appender name="INFO" class="ch.qos.logback.core.rolling.RollingFileAppender">
            <file>${LOG_FILE}/${APP_NAME}-info.log</file>
            <!-- size- and time-based rolling policy -->
            <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
                <fileNamePattern>${LOG_FILE}/${APP_NAME}-info.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
                <maxFileSize>100MB</maxFileSize>
                <!-- retention period, in days -->
                <maxHistory>60</maxHistory>
            </rollingPolicy>
            <encoder>
                <pattern>${CONSOLE_LOG_PATTERN_NO_COLOR}</pattern>
                <charset>UTF-8</charset>
            </encoder>
            <filter class="ch.qos.logback.classic.filter.LevelFilter">
                <level>INFO</level>
                <onMatch>ACCEPT</onMatch>
                <onMismatch>DENY</onMismatch>
            </filter>
        </appender>
    
        <root level="INFO">
            <appender-ref ref="STDOUT"/>
            <appender-ref ref="ERROR"/>
            <appender-ref ref="INFO"/>
        </root>
    
    </configuration>
    


    sql
    DROP TABLE IF EXISTS `t_blog_spider`;
    CREATE TABLE `t_blog_spider`  (
      `uid` varchar(64) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL COMMENT 'unique uid',
      `title` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'blog title',
      `summary` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'blog summary',
      `content` longtext CHARACTER SET utf8 COLLATE utf8_general_ci NULL COMMENT 'blog content',
      `tag_uid` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'tag uid',
      `click_count` int(11) NULL DEFAULT 0 COMMENT 'click count',
      `collect_count` int(11) NULL DEFAULT 0 COMMENT 'favorite count',
      `file_uid` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'cover image file uid',
      `status` tinyint(1) UNSIGNED NOT NULL DEFAULT 1 COMMENT 'status',
      `create_time` timestamp(0) NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP(0) COMMENT 'create time',
      `update_time` timestamp(0) NULL DEFAULT NULL COMMENT 'update time',
      `admin_uid` varchar(32) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'admin uid',
      `is_original` varchar(1) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT '1' COMMENT 'original or not (0: no, 1: yes)',
      `author` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'author',
      `articles_part` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'article source',
      `blog_sort_uid` varchar(32) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL COMMENT 'blog category uid',
      `level` tinyint(1) NULL DEFAULT 0 COMMENT 'recommendation level (0: normal)',
      `is_publish` varchar(1) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT '1' COMMENT 'published: 0 no, 1 yes',
      `sort` int(11) NOT NULL DEFAULT 0 COMMENT 'sort field',
      PRIMARY KEY (`uid`) USING BTREE
    ) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci COMMENT = 'crawled blog table' ROW_FORMAT = Dynamic;
    
    dao
    package com.thh.spider.dao;
    
    import org.springframework.data.jpa.repository.JpaRepository;
    import org.springframework.data.jpa.repository.JpaSpecificationExecutor;
    
    import com.thh.spider.pojo.Blog;
    import tk.mybatis.mapper.common.Mapper;
    
    /**
     * Data-access interface for Blog
     * @author Administrator
     *
     */
    public interface BlogDao  extends Mapper<Blog> {
    
    }
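
    The interface above imports the Spring Data JPA types but actually extends the tk.mybatis generic Mapper. If you prefer the JPA option mentioned at the top, a minimal sketch of an equivalent repository might look like this (the BlogJpaDao name is hypothetical, and the Blog class would also need an @Entity annotation for JPA to manage it):

    package com.thh.spider.dao;

    import org.springframework.data.jpa.repository.JpaRepository;
    import org.springframework.data.jpa.repository.JpaSpecificationExecutor;

    import com.thh.spider.pojo.Blog;

    /**
     * Hypothetical JPA-based alternative to BlogDao (sketch only, not part of the original post).
     */
    public interface BlogJpaDao extends JpaRepository<Blog, String>, JpaSpecificationExecutor<Blog> {
        // blogJpaDao.save(blog) would then replace blogDao.insertSelective(blog) in the pipeline
    }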
    
    
    pipeline
    package com.thh.spider.pipeline;
    
    import com.thh.spider.dao.BlogDao;
    import com.thh.spider.pojo.Blog;
    import com.thh.spider.util.DownloadUtil;
    import com.thh.spider.util.IdWorker;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Component;
    import org.springframework.util.StringUtils;
    import us.codecraft.webmagic.ResultItems;
    import us.codecraft.webmagic.Task;
    import us.codecraft.webmagic.pipeline.Pipeline;
    
    import java.io.IOException;
    import java.util.Date;
    
    @Component
    public class BlogPipeline implements Pipeline {
    
    
    	@Autowired
    	private IdWorker idWorker;
    
    	@Autowired
    	private BlogDao blogDao;
    
    
    	private final String SAVE_PATH = "C:/Users/King/Desktop/tensquare/webmgicFile/piantuUrl/youxi/";
    
    
    	@Override
    	public void process(ResultItems res, Task task) {
    		// Get the title and content extracted by the processor
    		String  title = res.get("title");
    		String  content = res.get("content");
    		System.out.println("title: "+title);
    		System.out.println("content: "+content);
    		if (!StringUtils.isEmpty(title) && !StringUtils.isEmpty(content)){
    
    			try {
    				Blog blog = new Blog();
    				blog.setUid(idWorker.nextId()+""); // primary key
    				blog.setTitle(title);                   // title
    				blog.setSummary("爬取到的页面");
    				blog.setContent(content);              // blog content
    				blog.setTagUid("1");
    				blog.setClickCount(0); // click count
    				blog.setCollectCount(0); // favorite count
    				blog.setFile_uid(null);
    				blog.setStatus(1);     // status = 1
    				Date now = new Date();
    				blog.setCreateTime(now); // create time
    				blog.setUpdateTime(now);
    				blog.setAdminUid("1f01cd1d2f474743b241d74008b12333"); // admin uid, hard-coded
    				blog.setAuthor("作者");
    				blog.setArticlesPart("辉皇博客");
    				blog.setBlogSortUid("1");
    				blog.setLevel(1);
    				blog.setIsPublish("1");
    				blog.setSort(0);
    				blogDao.insertSelective(blog);
    				// download to local disk (left disabled)
    				//DownloadUtil.download("http://pic.netbian.com"+fileUrl,fileName,SAVE_PATH);
    			} catch (Exception e) {
    				e.printStackTrace();
    			}
    		}
    
    
    
    
    	}
    
    
    }
    
    
    pojo

    This uses the Lombok plugin. If you'd rather not install it, just write all the getters and setters by hand; the @Data annotation is then not needed (see the short sketch after the class below).

    package com.thh.spider.pojo;
    
    import lombok.Data;
    
    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;
    import java.io.Serializable;
    import java.util.Date;
    
    /**
     * Blog entity class
     * @author thh
     *
     */
    
    @Table(name="t_blog_spider")
    @Data
    public class Blog implements Serializable{
    
    	@Id
    	@Column(name="uid")
    	private String uid; // unique uid

    	@Column(name="title")
    	private String title; // blog title

    	@Column(name="summary")
    	private String summary; // blog summary

    	@Column(name="content")
    	private String content; // blog content

    	@Column(name="tag_uid")
    	private String tagUid; // tag uid

    	@Column(name="click_count")
    	private Integer clickCount; // click count

    	@Column(name="collect_count")
    	private Integer collectCount; // favorite count

    	@Column(name="file_uid")
    	private String file_uid; // cover image file uid (the table column is file_uid)

    	@Column(name="status")
    	private Integer status; // status

    	@Column(name="create_time")
    	private Date createTime; // create time

    	@Column(name="update_time")
    	private Date updateTime; // update time

    	@Column(name="admin_uid")
    	private String adminUid; // admin uid

    	@Column(name="is_original")
    	private String isOriginal; // original or not (0: no, 1: yes)

    	@Column(name="author")
    	private String author; // author

    	@Column(name="articles_part")
    	private String articlesPart; // article source

    	@Column(name="blog_sort_uid")
    	private String blogSortUid; // blog category uid


    	@Column(name="level")
    	private Integer level; // recommendation level (0: normal)

    	@Column(name="is_publish")
    	private String isPublish; // published: 0 no, 1 yes

    	@Column(name="sort")
    	private Integer sort; // sort field
    
    
    }
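
    As noted above, @Data can be dropped if you don't want Lombok; a minimal hand-written equivalent for the first two fields looks like this (sketch only, the remaining fields follow the same pattern):

    // Without Lombok: no @Data, each field gets an explicit getter and setter.
    @Table(name="t_blog_spider")
    public class Blog implements Serializable {

    	@Id
    	@Column(name="uid")
    	private String uid;

    	@Column(name="title")
    	private String title;

    	// ... remaining fields unchanged ...

    	public String getUid() { return uid; }
    	public void setUid(String uid) { this.uid = uid; }

    	public String getTitle() { return title; }
    	public void setTitle(String title) { this.title = title; }
    }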
    
    

    processer

    package com.thh.spider.processer;
    
    import org.springframework.stereotype.Component;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.selector.Selectable;
    
    import java.util.List;
    
    @Component
    public class BlogProcesser implements PageProcessor {
    
    
    	/**
    	 * Process the pages we care about
    	 */
    
    
    	@Override
    	public void process(Page page) {
    		List<String> list = page.getHtml().regex("https://blog.csdn.net/[a-zA-Z0-9_]+/article/details/[0-9]{9}").all();
    		this.saveBlogInfo(page);
    		page.addTargetRequests(list);
    
    	/*	if(list==null || list.size()==0){
    			// if the list is empty, this is a detail page
    			this.saveBlogInfo(page);
    		}else {
    			// otherwise this is a list page: extract the detail-page URLs and put them into the task queue
    			*//*for (Selectable selectable : list) {
    				// get the URL
    			String details = 	selectable.links().toString();
    			page.addTargetRequest(details);
    			}*//*

    			for (String details : list) {
    				// get the URL
    				page.addTargetRequest(details);
    			}
    		}*/
    
    	}
    
    	private void saveBlogInfo(Page page) {
    
    		// Extract what we need from the page: the title and the content
    		String title =  page.getHtml().xpath("//*[@id=\"mainBox\"]/main/div[1]/div/div/div[1]/h1/text()").toString();
    		String content =  page.getHtml().xpath("//*[@id=\"article_content\"]").toString();
    
    
    		if(title!=null){
    			page.putField("title", title);
    			page.putField("content", content);
    		}else {
    			page.setSkip(true); // skip this page
    		}
    
    	}
    
    	// Site configuration (charset, retries, politeness delay, timeout)
    	@Override
    	public Site getSite() {
    		return Site.me().setCharset("utf8").setRetryTimes(2).setSleepTime(500).setTimeOut(2000);
    	}
    
    }
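
    To sanity-check the processor on its own (outside the scheduled task shown next), WebMagic can also be run directly with a console pipeline; a minimal sketch using the same start URL (the test class name is only for illustration):

    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.ConsolePipeline;

    public class BlogProcesserTest {
        public static void main(String[] args) {
            // Print the extracted title/content to the console instead of saving to MySQL.
            Spider.create(new BlogProcesser())
                    .addUrl("https://www.csdn.net/")
                    .addPipeline(new ConsolePipeline())
                    .thread(5)
                    .run();
        }
    }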
    
    
    Scheduled task class
    package com.thh.spider.task;
    
    import com.thh.spider.pipeline.BlogPipeline;
    import com.thh.spider.processer.BlogProcesser;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.pipeline.FilePipeline;
    import us.codecraft.webmagic.scheduler.QueueScheduler;
    import us.codecraft.webmagic.scheduler.RedisScheduler;
    
    /**
     * Scheduled crawl task
     */
    @Component
    public class BlogTask {
    
        @Autowired
        private BlogProcesser blogProcesser;
    
        @Autowired
        private BlogPipeline blogPipeline;
    
        @Autowired
        private RedisScheduler redisScheduler;
    
    
    
    
        /**
         * Crawl articles (by the categories stored in the database)
         */
        //@Scheduled(cron = "0/20 * * * * ?")
        // initialDelay: delay after startup before the first run
        // fixedDelay: interval between runs
        @Scheduled(initialDelay = 1000,fixedDelay = 100*1000)
        public void webArticleTask(){
            // Start the spider
            Spider.create(blogProcesser)
                    .addUrl("https://www.csdn.net/")
                    .addPipeline(blogPipeline)
                    //.setScheduler(redisScheduler) // use Redis to store and deduplicate crawled URLs
                   .setScheduler(new QueueScheduler()) // in-memory deduplication; lost once the app stops
                    .thread(10)  // ten threads
                    .run();
        }
    
    }
    
    
    util

    Download utility class

    DownloadUtil

    package com.thh.spider.util;
    import java.io.*;
    import java.net.URL;
    import java.net.URLConnection;
    
    /**
     * Download utility
     */
    public class DownloadUtil {
    
        public static void download(String urlStr,String filename,String savePath) throws IOException {
            URL url = new URL(urlStr);
            // open the URL connection
            URLConnection connection = url.openConnection();
            // connection timeout
            connection.setConnectTimeout(5000);
            // input stream
            InputStream in = connection.getInputStream();
            // read buffer
            byte [] bytes = new byte[1024];
            // number of bytes read
            int len;
            // target directory
            File file = new File(savePath);
            if(!file.exists())
                file.mkdirs();
            OutputStream out = new FileOutputStream(file.getPath()+"\\"+filename);
            // read into the buffer...
            while ((len=in.read(bytes))!=-1){
            // ...then write the buffer to the file
                out.write(bytes,0,len);
            }
            // close the streams
            out.close();
            in.close();
        }
    }
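
    A quick usage example (the URL and paths are placeholders, not from the original post):

    public class DownloadDemo {
        public static void main(String[] args) throws java.io.IOException {
            // Downloads the image into C:/tmp/spider-images/ (the directory is created if missing).
            DownloadUtil.download("https://example.com/images/sample.jpg", "sample.jpg", "C:/tmp/spider-images/");
        }
    }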
    
    

    IdWorker

    package com.thh.spider.util;
    
    import java.lang.management.ManagementFactory;
    import java.net.InetAddress;
    import java.net.NetworkInterface;
    
    /**
     * <p>Name: IdWorker.java</p>
     * <p>Description: distributed auto-increment ID generator</p>
     * <pre>
     *     A Java implementation of Twitter's Snowflake scheme
     * </pre>
     * The core is this IdWorker class. Writing each bit as 0 and separating the parts with ---, the layout is:
     * 1||0---0000000000 0000000000 0000000000 0000000000 0 --- 00000 ---00000 ---000000000000
     * The first bit is unused (it can also serve as the long's sign bit); the next 41 bits are the
     * millisecond timestamp, followed by 5 datacenter-id bits and 5 worker-id bits (effectively a
     * thread identifier), and finally a 12-bit counter within the current millisecond, which adds
     * up to exactly 64 bits, i.e. one long.
     * The benefit is that IDs are roughly ordered by time and do not collide across the distributed
     * system (the datacenter id and worker id keep them apart), and generation is fast: Snowflake
     * can produce roughly 260,000 IDs per second, which is more than enough here.
     * <p>
     * 64-bit ID (41 bits milliseconds + 5 bits worker id + 5 bits datacenter id + 12 bits sequence)
     *
     * @author Polim
     */
    public class IdWorker {
        // Epoch baseline timestamp (usually a recent system time; must never change once set)
        private final static long twepoch = 1288834974657L;
        // Number of bits for the worker id
        private final static long workerIdBits = 5L;
        // Number of bits for the datacenter id
        private final static long datacenterIdBits = 5L;
        // Maximum worker id
        private final static long maxWorkerId = -1L ^ (-1L << workerIdBits);
        // Maximum datacenter id
        private final static long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);
        // Number of sequence bits within one millisecond
        private final static long sequenceBits = 12L;
        // Worker id is shifted left by 12 bits
        private final static long workerIdShift = sequenceBits;
        // Datacenter id is shifted left by 17 bits
        private final static long datacenterIdShift = sequenceBits + workerIdBits;
        // Timestamp is shifted left by 22 bits
        private final static long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;

        private final static long sequenceMask = -1L ^ (-1L << sequenceBits);
        /* Timestamp of the last generated id */
        private static long lastTimestamp = -1L;
        // Sequence within the current millisecond (concurrency control)
        private long sequence = 0L;

        private final long workerId;
        // Datacenter id part
        private final long datacenterId;
    
        public IdWorker(){
            this.datacenterId = getDatacenterId(maxDatacenterId);
            this.workerId = getMaxWorkerId(datacenterId, maxWorkerId);
        }
        /**
         * @param workerId
         *            worker (machine) id
         * @param datacenterId
         *            datacenter id
         */
        public IdWorker(long workerId, long datacenterId) {
            if (workerId > maxWorkerId || workerId < 0) {
                throw new IllegalArgumentException(String.format("worker Id can't be greater than %d or less than 0", maxWorkerId));
            }
            if (datacenterId > maxDatacenterId || datacenterId < 0) {
                throw new IllegalArgumentException(String.format("datacenter Id can't be greater than %d or less than 0", maxDatacenterId));
            }
            this.workerId = workerId;
            this.datacenterId = datacenterId;
        }
        /**
         * Get the next ID
         *
         * @return the next unique id
         */
        public synchronized long nextId() {
            long timestamp = timeGen();
            if (timestamp < lastTimestamp) {
                throw new RuntimeException(String.format("Clock moved backwards.  Refusing to generate id for %d milliseconds", lastTimestamp - timestamp));
            }
    
            if (lastTimestamp == timestamp) {
                // Same millisecond: increment the sequence
                sequence = (sequence + 1) & sequenceMask;
                if (sequence == 0) {
                    // Sequence exhausted within this millisecond: wait for the next millisecond
                    timestamp = tilNextMillis(lastTimestamp);
                }
            } else {
                sequence = 0L;
            }
            lastTimestamp = timestamp;
            // Combine the shifted parts into the final ID and return it
            long nextId = ((timestamp - twepoch) << timestampLeftShift)
                    | (datacenterId << datacenterIdShift)
                    | (workerId << workerIdShift) | sequence;
    
            return nextId;
        }
    
        private long tilNextMillis(final long lastTimestamp) {
            long timestamp = this.timeGen();
            while (timestamp <= lastTimestamp) {
                timestamp = this.timeGen();
            }
            return timestamp;
        }
    
        private long timeGen() {
            return System.currentTimeMillis();
        }
    
        /**
         * <p>
         * Compute the worker id (derived from the datacenter id and the JVM pid)
         * </p>
         */
        protected static long getMaxWorkerId(long datacenterId, long maxWorkerId) {
            StringBuffer mpid = new StringBuffer();
            mpid.append(datacenterId);
            String name = ManagementFactory.getRuntimeMXBean().getName();
            if (!name.isEmpty()) {
                /*
                 * GET jvmPid
                 */
                mpid.append(name.split("@")[0]);
            }
            /*
             * take the low 16 bits of the hashcode of MAC + PID
             */
            return (mpid.toString().hashCode() & 0xffff) % (maxWorkerId + 1);
        }
    
        /**
         * <p>
         * Datacenter id part (derived from the MAC address)
         * </p>
         */
        protected static long getDatacenterId(long maxDatacenterId) {
            long id = 0L;
            try {
                InetAddress ip = InetAddress.getLocalHost();
                NetworkInterface network = NetworkInterface.getByInetAddress(ip);
                if (network == null) {
                    id = 1L;
                } else {
                    byte[] mac = network.getHardwareAddress();
                    id = ((0x000000FF & (long) mac[mac.length - 1])
                            | (0x0000FF00 & (((long) mac[mac.length - 2]) << 8))) >> 6;
                    id = id % (maxDatacenterId + 1);
                }
            } catch (Exception e) {
                System.out.println(" getDatacenterId: " + e.getMessage());
            }
            return id;
        }
    
    
        public static void main(String[] args) {
        // Twitter Snowflake-style IDs: generate a batch of unique IDs
            IdWorker idWorker = new IdWorker(0,0);
            for (int i = 0; i <2600 ; i++) {
                System.out.println(idWorker.nextId());
            }
        }
    
    }
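
    Given the bit layout described above, a generated ID can also be unpacked back into its parts; a small illustrative sketch using the same shift constants (not part of the original post):

    public class IdWorkerDecodeDemo {
        public static void main(String[] args) {
            long id = new IdWorker(1, 1).nextId();
            long twepoch = 1288834974657L;                 // same epoch as IdWorker
            long timestamp = (id >> 22) + twepoch;         // 41-bit millisecond timestamp
            long datacenterId = (id >> 17) & 0x1F;         // 5-bit datacenter id
            long workerId = (id >> 12) & 0x1F;             // 5-bit worker id
            long sequence = id & 0xFFF;                    // 12-bit sequence within the millisecond
            System.out.println(new java.util.Date(timestamp)
                    + " datacenter=" + datacenterId + " worker=" + workerId + " seq=" + sequence);
        }
    }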
    
    

    Application main class

    package com.thh.spider;
    
    
    import com.thh.spider.util.IdWorker;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    
    import org.springframework.context.annotation.Bean;
    import org.springframework.scheduling.annotation.EnableScheduling;
    import tk.mybatis.spring.annotation.MapperScan;
    import us.codecraft.webmagic.scheduler.RedisScheduler;
    
    
    @SpringBootApplication
    // enable scheduled task execution
    @EnableScheduling
    @MapperScan(basePackages = {"com.thh.spider.dao"})
    public class SpiderApplication {
    
        public static void main(String[] args) {
    
    
            SpringApplication.run(SpiderApplication.class,args);
    
        }
    
        // inject the Redis host address from configuration
        @Value("${spring.redis.host}")
        private String host;
    
        // used for URL deduplication
        @Bean
        public RedisScheduler redisScheduler() {
            return new RedisScheduler(host);
        }
    
        @Bean
        public IdWorker idWorker() {
            return new IdWorker(1,1);
        }
    
    }
    
    
  • Crawler Demo

    2017-12-08 16:26:00

    Tested the crawler with multithreading, multiprocessing, and gevent respectively.

    from gevent import monkey
    
    monkey.patch_all()
    
    from os import (
        path,
        makedirs
    )
    from time import time
    from threading import Thread
    from multiprocessing import Pool as MPool
    from gevent.pool import Pool
    import requests
    from bs4 import BeautifulSoup
    from functools import wraps
    
    
    def timer(fun):
        @wraps(fun)
        def wrapper(*args, **kwargs):
            self = args[0]
            start_at = time()
            f = fun(*args, **kwargs)
            use_dt = int(time() - start_at)
            print("Total time: {} seconds".format(use_dt))
            print("Downloaded {} images in total".format(self.count))
            return f
    
        return wrapper
    
    
    class DownLoadBaiDu(object):
        def __init__(self, keyword, full=False, start: int = 0, stop: int = 1):
            self.full = full
            self.search_url = "https://tieba.baidu.com/f?ie=utf-8&kw={}&fr=search".format(keyword)
            self.start = start
            self.stop = stop
            self.count = 0
    
        @property
        def image_path(self):
            img_path = path.join(path.dirname(__file__), "img")
            if not path.exists(img_path):
                makedirs(img_path)
            return img_path
    
        def save_jpg(self, url):
            print("url:{}".format(url))
            _jpg = path.join(self.image_path, path.split(url)[1])
            jpg_content = requests.get(url).content
            with open("{}".format(_jpg), "wb") as f:
                f.write(jpg_content)
            self.count += 1
    
        def download_img(self, uri):
            url = "https://tieba.baidu.com/{}".format(uri)
            html = requests.get(url).text
            print("\nThread URL: {}\n".format(url))
            soup = BeautifulSoup(html, "html.parser")
            imgs = soup.find_all("img")
            for img in imgs:
                _img = img.attrs["src"]
                if _img.startswith("http") and _img.endswith("jpg"):
                    self.save_jpg(url=_img)
    
        def get_url_number(self):
            """获取搜索结果的id"""
            numbers = []
            h = requests.get(self.search_url).text
            soup = BeautifulSoup(h, "html.parser")
            ds = soup.find_all(attrs={"class": "j_th_tit"})
            print("Found {} threads in total".format(len(ds)))
    
            if not self.full:
                ds = ds[self.start: self.stop]
    
            for index, i in enumerate(ds):
                if len(i) == 1:
                    uri = i.attrs["href"]
                else:
                    uri = i.find("a").attrs["href"]
                numbers.append(uri)
            return numbers
    
        @timer
        def multipr_down(self):
            """多进程,跑多进程时注释掉gevent的相关import"""
            urls = self.get_url_number()
            p = MPool()
            for url in urls:
                p.apply_async(self.download_img, args=(url,))
            p.close()
            p.join()
    
        @timer
        def thread_down(self):
            """多线程"""
            urls = self.get_url_number()
            ts = []
            for url in urls:
                t = Thread(target=self.download_img, args=[url])
                ts.append(t)
            for i in ts:
                i.start()
    
            for i in ts:
                i.join()
    
        @timer
        def gevent_down(self):
            """gevent处理"""
            urls = self.get_url_number()
            pool = Pool(len(urls))
            pool.map(self.download_img, urls)
    
        @timer
        def get(self):
            """单进程"""
            numbers = self.get_url_number()
            for index, number in enumerate(numbers):
                print("Downloading images from thread {}".format(index + 1))
                self.download_img(number)
    
    
    if __name__ == '__main__':
        d = DownLoadBaiDu(keyword="美女", full=True)
        d.gevent_down()
    
    """
    Single process:
    Total time: 155 seconds
    718 images downloaded

    Multithreading:
    Total time: 34 seconds
    866 images downloaded

    Gevent:
    Total time: 30 seconds
    770 images downloaded
    """
    
    
  • python crawler demo

    2017-12-04 14:34:59
    A Python crawler demo
  • Java crawler Demo

    2018-11-26 10:21:26
    A simple Java crawler demo
  • Java Crawler Demo

    2018-11-28 10:22:50
    A simple Java crawler demo; easy to understand, put together by myself, and hopefully helpful to everyone.
  • jsoup crawler demo

    2018-07-06 11:42:59
    A Java jsoup crawler demo that crawls page content and writes it to local files with an output stream
  • HtmlUnit crawler Demo

    2017-03-23 17:07:20
    An HtmlUnit crawler demo with a very complete set of methods
  • python3.4 crawler demo

    2020-09-19 16:54:33
    A shared write-up about a python3.4 crawler demo; the content is quite good and has solid reference value for anyone who needs it
  • Python crawler Demo for Baidu Baike
  • .Net crawler Demo
  • Java web-page crawler demo

    2016-04-26 23:17:08
    A complete Java web-page crawler demo; SpiderWidth.java is the main class
  • Crawler search: a simple search engine, a Java crawler, a search-engine example, a crawler demo, Java web-content scraping, search engines demystified; a Java crawler program, web search, crawler program, sigar search, scheduled crawling of internet content.
  • python-爬虫demo.zip

    2019-08-05 18:44:28
    A crawler demo written in Python that can crawl detailed data from web pages; the demo is simple, easy to understand, and ready to use
  • Python Crawler Demo

    2021-04-14 13:53:15

    Python Crawler Demo

    import json
    import os

    import pymysql
    import requests
    from lxml import etree
    import time
    import csv


    # Fetch a single detail record
    def crow_first(a):
        # Build the URL for this record
        url = 'https://*******/portal_server/lawyer/query/'+str(a or '')
        r = requests.get(url)
        # Set the encoding explicitly, otherwise the text comes back garbled
        r.encoding = 'utf-8'

        # Serialize the response to a JSON string
        json_str = json.dumps(r.json())

        # Parse the JSON string back into a Python dict
        user_dic = json.loads(json_str)

        a=str(a or '')
        nameStr=str(user_dic['data']['name'] or '')
        licenseNumber=str(user_dic['data']['licenseNumber']or '')
        description=str(user_dic['data']['description']or '')
        phone=str(user_dic['data']['phone']or '')
        photo=str(user_dic['data']['photo']['name']or '')

        config = {
            'host': '127.0.0.1'
            , 'user': 'root'
            , 'password': 'root'
            , 'database': 'test'
            , 'charset': 'utf8'
            , 'port': 3306  # note: the port must be an int, not a str
        }

        db = pymysql.connect(**config)
        cursor = db.cursor()

        try:
            db.select_db("test")
            sql = "INSERT INTO test.cdtf_data(id,name,licenseNumber,description,phone,photo)VALUES(%s,%s,%s,%s,%s,%s)"
            cursor.execute(sql, (a, nameStr, licenseNumber, description,phone,photo))
            db.commit()
            print('Row inserted successfully')
        except Exception as e:
            db.rollback()
            print("Failed to insert row")
            print('Failed:', e)

        cursor.close()
        db.close()
        time.sleep(3)
        # Save the photo
        url = "https://cld.cdtf.gov.cn/attachment_server/"+photo
        root = "d://pics//"
        path = root + url.split('/')[-1]
        try:
            if not os.path.exists(root):
                os.mkdir(root)
            if not os.path.exists(path):
                r = requests.get(url)
                with open(path, 'wb') as f:
                    f.write(r.content)
                    f.close()
                    print("Image saved successfully")
            else:
                print("Image already exists, skipping")
        except:
            print('Download failed')

    # Fetch one page of the listing
    def crow_list(a):
        # Paged query
        url = 'https://******/portal_server/lawyer/query/page?pageNum='+str(a or '')
        r = requests.get(url)
        # Set the encoding explicitly, otherwise the text comes back garbled
        r.encoding = 'utf-8'

        # Serialize the response to a JSON string
        json_str = json.dumps(r.json())

        # Parse the JSON string back into a Python dict
        user_dic = json.loads(json_str)

        print("------ current page: " + str(a))
        list=user_dic['data']['records']
        for user in list:
            print('current lawyer ID: '+user['id'])
            crow_first(user['id'])

    if __name__ == '__main__':
        for i in range(150, 2399):  # 2399 is the total number of pages, at ten records per page
            crow_list(i)

  • Java crawler demo

    2019-04-02 15:04:06

    Basic concepts of web crawlers

    A web crawler (also called a web spider or web information collector) is a program or automated script that automatically fetches or downloads information from the web according to a set of rules; crawlers are an important component of today's search engines.

    In the narrow sense: a program that uses the standard HTTP protocol to traverse the web's information space by following hyperlinks (such as https://www.baidu.com/) and applying a document-retrieval strategy (such as depth-first search).

    In the functional sense: maintain a queue of URLs to crawl, fetch the page content (HTML/JSON, etc.) for each URL, parse that content, and store the corresponding data.

    Types of web crawlers

    By system architecture and implementation technique, web crawlers can be roughly divided into: general-purpose web crawlers, focused web crawlers, incremental web crawlers, and deep-web crawlers. Real crawler systems are usually a combination of several of these techniques.

    General-purpose crawlers: the crawl scope expands from a set of seed URLs to the whole web; they mainly collect data for portal-site search engines and large web service providers.

    Because their crawl scope and data volume are huge, general-purpose crawlers need high crawl speed and plenty of storage, care relatively little about the order in which pages are visited, and usually work in parallel; they have considerable practical value.

    Focused crawlers (also called topic crawlers): selectively crawl only pages related to predefined topics.

    Compared with general-purpose crawlers, a focused crawler only needs to visit topic-relevant pages, which greatly saves hardware and network resources; because far fewer pages are saved, they can be refreshed quickly, and they serve the needs of specific user groups for information in specific domains very well.

    Take dangdang.com as an example (screenshot of a Dangdang page omitted).
    A typical crawler workflow (flow diagram omitted):
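
    A minimal sketch of that loop in Java, using jsoup (which is introduced below); the seed URL, class name, and page limit are illustrative only:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    public class CrawlLoopSketch {
        public static void main(String[] args) throws Exception {
            Queue<String> frontier = new ArrayDeque<>();   // URLs waiting to be crawled
            Set<String> visited = new HashSet<>();         // URLs already crawled (deduplication)
            frontier.add("http://www.w3school.com.cn/b.asp");

            while (!frontier.isEmpty() && visited.size() < 10) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;                       // skip duplicates
                Document doc = Jsoup.connect(url).timeout(5000).get(); // fetch the page
                System.out.println(doc.title());                       // "store" (here: print) the parsed data
                // enqueue newly discovered links
                doc.select("a[href]").forEach(a -> frontier.add(a.absUrl("href")));
            }
        }
    }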

    Fetching page content with jsoup

    jsoup dependency

    <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.10.3</version>
            </dependency>
    

    jsoup's connect(String url) method creates a new Connection, and get() fetches the HTML for the page, as follows:

    Document doc = Jsoup.connect("http://www.w3school.com.cn/b.asp").get();
    System.out.println(doc);
    

    jsoup can also request page content with post(), used as follows:

    //this URL can be fetched with either method
    Document doc = Jsoup.connect("http://www.w3school.com.cn/b.asp").post();
    System.out.println(doc);
    

    jsoup: parsing a Document loaded from a URL

    //Fetch the HTML for the URL
    Document doc = Jsoup.connect("http://www.w3school.com.cn/b.asp").timeout(5000).get();
    //Parse the content with a select() selector
    Element element = doc.select("div#w3school").get(0); //get the Element; equivalent to div[id=w3school]
    String text1 = element.select("h1").text(); //extract the text of one node from the Element
    String text2 = element.select("p").text(); //extract the text of one node from the Element
    System.out.println("Parsed element content:");
    System.out.println(element);
    System.out.println("Extracted text:");
    System.out.println(text1 + "\t" + text2);
    

    Traversal with jsoup

    <div id="course">
      <ul>
        <li>
          <a href="/js/index.asp" title="JavaScript 教程">JavaScript</a></li>
        <li>
          <a href="/htmldom/index.asp" title="HTML DOM 教程">HTML DOM</a></li>
        <li>
          <a href="/jquery/index.asp" title="jQuery 教程">jQuery</a></li>
        <li>
          <a href="/ajax/index.asp" title="AJAX 教程">AJAX</a></li>
        <li>
          <a href="/json/index.asp" title="JSON 教程">JSON</a></li>
        <li>
          <a href="/dhtml/index.asp" title="DHTML 教程">DHTML</a></li>
        <li>
          <a href="/e4x/index.asp" title="E4X 教程">E4X</a></li>
        <li>
          <a href="/wmlscript/index.asp" title="WMLScript 教程">WMLScript</a></li>
      </ul>
    </div>
    
    //Fetch the HTML for the URL
    Document doc = Jsoup.connect("http://www.w3school.com.cn/b.asp").timeout(5000).get();
    //Drill down to the content to parse; note that it contains multiple li tags
    Elements elements = doc.select("div[id=course]").select("li"); 
    //Iterate over each li node
    for (Element ele : elements) {
        String title = ele.select("a").text();  //.text() returns the text inside the tag
        String course_url = ele.select("a").attr("href");  //.attr(String) returns the value of the given attribute
        System.out.println("Course title: " + title + "\tURL: " + course_url);
    }
    

    Using jsoup selectors

    Basic syntax
    Element.select(String selector);
    Elements.select(String selector);

    //Fetch the HTML for the URL
    Document doc = Jsoup.connect("http://www.w3school.com.cn/b.asp").timeout(5000).get();
    //[attr=value]: find elements by attribute value, e.g. [id=course]; tagname: find elements by tag name, e.g. a
    System.out.println(doc.select("[id=course]").select("a").get(0).text());
    //tag[attr=value]: combine tag name and attribute
    System.out.println(doc.select("div[id=course]").select("a").get(0).text());
    //#id: find elements by id, e.g. #course
    System.out.println(doc.select("#course").select("a").get(0).text());
    //[attr]: find elements that carry a given attribute, e.g. [href]
    System.out.println(doc.select("#course").select("[href]").get(0).text());
    //.class: find elements by class name
    System.out.println(doc.select(".browserscripting").text());
    //[attr^=value], [attr$=value], [attr*=value]: match attribute values that start with, end with, or contain a value (very handy)
    System.out.println(doc.select("#course").select("[href$=index.asp]").text());
    //[attr~=regex]: match attribute values against a regular expression; * matches all elements
    System.out.println(doc.select("#course").select("[href~=/*]").text());
    
    

    Fetching page content with HttpClient

    HttpClient jar dependency

            <dependency>
                <groupId>org.apache.httpcomponents</groupId>
                <artifactId>httpcore</artifactId>
                <version>4.4.9</version>
            </dependency>
    

    Using HttpClient

    Using the get() method

    HttpClient client = new DefaultHttpClient();   //initialize the HttpClient
    String personalUrl = "http://www.w3school.com.cn/b.asp";     //request URL
    HttpGet getMethod = new HttpGet(personalUrl);       //GET request
    HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
            HttpStatus.SC_OK, "OK");                        //initialize the HTTP response
    response = client.execute(getMethod);                   //execute the request
    String status = response.getStatusLine().toString();    //response status line
    int StatusCode = response.getStatusLine().getStatusCode(); //response status code
    ProtocolVersion protocolVersion = response.getProtocolVersion(); //protocol version
    String phrase = response.getStatusLine().getReasonPhrase(); //reason phrase (e.g. OK)
    if(StatusCode == 200){                          //status code 200 means success
        //get the entity content, i.e. the HTML
        String entity = EntityUtils.toString (response.getEntity(),"gbk");
        //print the entity content
        System.out.println(entity);
        EntityUtils.consume(response.getEntity());       //consume the entity
    }else {
        //close the HttpEntity stream
        EntityUtils.consume(response.getEntity());        //consume the entity
    }
    

    Using the post() method

    List<NameValuePair> nvps= new ArrayList<NameValuePair>();
    nvps.add(new BasicNameValuePair("param1", "value1"));
    nvps.add(new BasicNameValuePair("param2", "value2"));
    UrlEncodedFormEntity entity = new UrlEncodedFormEntity(nvps, Consts.UTF_8);
    HttpPost httppost = new HttpPost("http://localhost/handler.do");
    httppost.setEntity(entity);
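
    The snippet above only builds the request; executing it and reading the response might look like this (a sketch in the same style as the get() example above):

    HttpClient client = new DefaultHttpClient();
    HttpResponse response = client.execute(httppost);        //execute the POST request
    if (response.getStatusLine().getStatusCode() == 200) {
        String body = EntityUtils.toString(response.getEntity(), Consts.UTF_8);  //response body
        System.out.println(body);
    }
    EntityUtils.consume(response.getEntity());                //consume the entity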
    

    A worked example

    Let's crawl one book page from dangdang.com:
    JdongMain is the program entry point; JdongBook models the book for sale (despite the "Jdong" naming, the demo actually fetches a Dangdang product page); URLHandle handles the URL and the client and returns the processed data; HTTPUtils sends the actual HTTP request and returns the response; JdParse parses the entity content of the response.

    Code:

    Required jars:

    <dependencies>
            <dependency>
                <groupId>org.apache.httpcomponents</groupId>
                <artifactId>httpclient</artifactId>
                <version>4.5.2</version>
            </dependency>
            <dependency>
                <groupId>org.apache.httpcomponents</groupId>
                <artifactId>httpcore</artifactId>
                <version>4.4.9</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
            <dependency>
                <groupId>org.jsoup</groupId>
                <artifactId>jsoup</artifactId>
                <version>1.10.3</version>
            </dependency>
    
            <!-- https://mvnrepository.com/artifact/commons-logging/commons-logging -->
            <dependency>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
                <version>1.2</version>
            </dependency>
    </dependencies>
    
    
    public class JdongBook {
    
        private String bookId;
        private String bookName;
        private String bookPrice;
    
        public JdongBook() {
        }
    
        public String getBookId() {
            return bookId;
        }
    
        public void setBookId(String bookId) {
            this.bookId = bookId;
        }
    
        public String getBookName() {
            return bookName;
        }
    
        public void setBookName(String bookName) {
            this.bookName = bookName;
        }
    
        public String getBookPrice() {
            return bookPrice;
        }
    
        public void setBookPrice(String bookPrice) {
            this.bookPrice = bookPrice;
        }
    
        @Override
        public String toString() {
            return "Book [bookId=" + bookId + ", bookName=" + bookName + ", bookPrice=" + bookPrice + "]";
        }
    
    
    }
    
    import org.apache.http.HttpResponse;
    import org.apache.http.ParseException;
    import org.apache.http.client.HttpClient;
    import org.apache.http.util.EntityUtils;
    
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    
    public class URLHandle {
    
        /**
         * @param client the HttpClient
         * @param url the request URL
         * @return the extracted data: List<JdongBook>
         * @throws ParseException
         * @throws IOException
         */
        public static List<JdongBook> urlParser(HttpClient client, String url) throws ParseException, IOException {
            System.out.println(client+"     client     "+url+"        url   ");
            List<JdongBook> data = new ArrayList<JdongBook>();
    
            //Get the response
            HttpResponse response = HTTPUtils.getHtml(client, url);
            //Get the response status code
            int sattusCode = response.getStatusLine().getStatusCode();
            System.err.println(sattusCode+"   2000    ");
            if(sattusCode == 200) {//200 means success
                //Get the response entity and convert it to a UTF-8 string
                String entity = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(" 121"+entity);
                data = JdParse.getData(entity);
                System.out.println("________________________________");
                System.out.println(data);
            } else {
                EntityUtils.consume(response.getEntity());//release the response entity
            }
            return data;
        }
    
    
    }
    
    import org.jsoup.Jsoup;
    import org.jsoup.select.Elements;
    
    
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    
    import java.util.ArrayList;
    import java.util.List;
    
    public class JdParse {
    
        /**
         * Extract the data the program needs from the response entity
         * @param entity the HTTP response entity content (HTML)
         * @return the parsed book list
         */
        public static List<JdongBook> getData(String entity) {
            System.out.println("================================getDatagetDatagetDatagetDatagetDatagetDatagetDatagetData===================================");
            List<JdongBook> data =new ArrayList<JdongBook>();
            System.out.println("1111"+data);
        //Parse with jsoup (see the jsoup section above for how jsoup is used)
            Document doc = Jsoup.parse(entity);
           // System.err.println(doc+"1");
    
    
    
        //Locate the required elements based on the page structure
            Elements elements = doc.select("#price_notice_box").select(".price_notice_alert").select(".price_remind_notice");
            System.err.println(elements+"3");
    
            for(Element element : elements) {
                JdongBook book = new JdongBook();
                book.setBookId(element.select(".top_title").attr(".top_title"));
                book.setBookName(element.select("div[class=remind_info]").select("span").text());
                book.setBookPrice(element.select("div[class=now_price]").select("span").text());
                System.out.println("=====================================");
                System.out.println(book);
                data.add(book);
                System.out.println("-------------------");
                System.err.println(data);
            }
            return data;
        }
    
    }
    
    
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.client.HttpClient;
    
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    
    public class JdongMain {
    
        public static void main(String[] args) {
        //Create a client; it sends requests to the server for a URL and receives the response
            HttpClient client = new DefaultHttpClient();
            String url="http://product.dangdang.com/26445445.html";
            List<JdongBook> bookList =null;
            try {
                bookList = URLHandle.urlParser(client, url);
                System.err.println("bookList:             "+bookList);
    
                for(JdongBook book : bookList) {
                    System.out.println(" #  "+book);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
    
        }
    
    
    }
    
    import org.apache.http.HttpResponse;
    import org.apache.http.HttpStatus;
    import org.apache.http.HttpVersion;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.message.BasicHttpResponse;
    
    import java.io.IOException;
    
    public class HTTPUtils {
    
        public static HttpResponse getHtml(HttpClient client, String url) {
            System.out.println("HttpResponse"+client+"   client  "+url+"     url   ");
    
        //Fetch the response document (the HTML) using a GET request
            HttpGet getMethod = new HttpGet(url);
    //        HttpPost getMethod=new HttpPost(url);
            System.out.println("getHttpGet     "+getMethod);
            HttpResponse response =new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK,"OK");
            System.err.println(response+"   HttpResponse          ");
    
            try {
            //Execute the GET request through the client
                response = client.execute(getMethod);
                System.out.println(response+"                  respone   ");
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                //getMethod.abort();
            }
    
            return response;
        }
    }
    
    

    Results (screenshot of the console output omitted)

  • Preface: last time we used the automated-testing tool nightmare to crawl Baidu image-search results for a keyword, which was rather inefficient. Now we try a different, more efficient way to crawl. The main npm packages used here are axios and cheerio //mysql is not used this time //we need to...
  • A self-written SSM integration project: three CSDN crawler demos (from simple to slightly more complex) written with the WebMagic framework, plus a demo that crawls emoji-pack data from a meme site; everything runs after testing and was then packaged and uploaded. Uses a MySQL 5.5 database. Includes a front end for displaying the crawled CSDN...
  • From the Huawei Cloud developer conference: an experiment that uses a Python crawler to scrape images and text, with the Scrapy framework for data collection and a MySQL database for storage. The experiment used an online server, while everything here is reproduced locally; if anything differs, it's definitely because this noob tinkered with it! step1. configure...
  • Crawler Demo example

    2018-01-29 13:13:59
    A demo that helps you understand crawlers; after working through it, I'm sure you'll understand them a bit better
  • python crawler demo

    2018-12-13 18:56:26
    A small Python crawler demo that crawls and downloads images from a website; tested on images from Gamersky, haha
  • python crawler demo01

    2019-10-02 17:07:08
    python crawler demo01: import requests, json, time, sys; from bs4 import BeautifulSoup; from contextlib import closing; url = 'https://image.xiaozhustatic1.com/12/9,0,27,3473,1800...
  • Collection of Python crawler demos

    2016-09-19 14:40:42
    A collection of Python crawler demos, updated from time to time
  • Golang crawler code: this demo crawls Tieba's paginated listings and can fetch the content behind each URL! It finds the DIVs and hrefs. ... e.g.: [\s\S]+?href="(\/p\/[\s\S]+?... Usage: on the command line, run: go run 10Golang方式实现贴吧爬虫demo.go
  • PHP crawler demo

    2018-11-23 14:13:18
    The program used in "I 'stole' a million Zhihu users with a crawler in one day, just to prove PHP is the best language in the world"; it contains a ready-made demo you can follow along with
  • A simple crawler demo based on Python 3.7 that crawls Baidu Baike and 51job Beijing Java job postings and saves the results to a MySQL database
