  • Java 网络爬虫工具类 SpiderHttpUtils

    在这个数据为王的时代,爬虫应用得越来越广泛。对于一个萌新程序员来说,如果你要做爬虫,那么Python是你的不二之选;但对于那些老腊肉的Java程序员(亦或者你是程序媛)来说,想用Java做爬虫也不是不行,只是没有Python那么方便。身为一块Java老腊肉的我,在此记录一下自己用Java做网络爬虫时使用的工具类。

    在pom.xml文件中引入commons-lang3 依赖:

    		<dependency>
    			<groupId>org.apache.commons</groupId>
    			<artifactId>commons-lang3</artifactId>
    			<version>3.6</version>
    		</dependency>

     SpiderHttpUtils 工具类完整代码如下: 

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLConnection;
    import java.net.URLEncoder;
    import java.security.cert.CertificateException;
    import java.security.cert.X509Certificate;
    import java.util.Map;
    
    import javax.net.ssl.HttpsURLConnection;
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.SSLSocketFactory;
    import javax.net.ssl.TrustManager;
    import javax.net.ssl.X509TrustManager;
    
    import org.apache.commons.lang3.StringUtils;
    
    public class SpiderHttpUtils {
    
    	public static String sendGet(boolean isHttps, String requestUrl, Map<String, String> params,
    			Map<String, String> headers, String charSet) {
    		if (StringUtils.isBlank(requestUrl)) {
    			return "";
    		}
    		if (StringUtils.isBlank(charSet)) {
    			charSet = "UTF-8";
    		}
    		URL url = null;
    		URLConnection conn = null;
    		BufferedReader br = null;
    
    		try {
    			// 创建连接
    			url = new URL(requestUrl + "?" + requestParamsBuild(params));
    			if (isHttps) {
    				conn = getHttpsUrlConnection(url);
    			} else {
    				conn = (HttpURLConnection) url.openConnection();
    			}
    
    			// 设置请求头通用属性
    
    			// 指定客户端能够接收的内容类型
    			conn.setRequestProperty("Accept", "*/*");
    
    			// 设置连接的状态为长连接
    			conn.setRequestProperty("Connection", "keep-alive");
    
    			// 设置发送请求的客户机系统信息
    			conn.setRequestProperty("User-Agent",
    					"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36");
    
    			// 设置请求头自定义属性
    			if (null != headers && headers.size() > 0) {
    
    				for (Map.Entry<String, String> entry : headers.entrySet()) {
    					conn.setRequestProperty(entry.getKey(), entry.getValue());
    				}
    			}
    
    			// 设置其他属性
    			// conn.setUseCaches(false);//不使用缓存
    			// conn.setReadTimeout(10000);// 设置读取超时时间
    			// conn.setConnectTimeout(10000);// 设置连接超时时间
    
    			// 建立实际连接
    			conn.connect();
    
    			// 读取请求结果
    			br = new BufferedReader(new InputStreamReader(conn.getInputStream(), charSet));
    			String line = null;
    			StringBuilder sb = new StringBuilder();
    			while ((line = br.readLine()) != null) {
    				sb.append(line);
    			}
    			return sb.toString();
    		} catch (Exception exception) {
    			return "";
    		} finally {
    			try {
    				if (br != null) {
    					br.close();
    				}
    			} catch (Exception e) {
    				e.printStackTrace();
    			}
    		}
    
    	}
    
    	public static String requestParamsBuild(Map<String, String> map) {
    		String result = "";
    		if (null != map && map.size() > 0) {
    			StringBuffer sb = new StringBuffer();
    			for (Map.Entry<String, String> entry : map.entrySet()) {
    				try {
    					String value = URLEncoder.encode(entry.getValue(), "UTF-8");
    					sb.append(entry.getKey() + "=" + value + "&");
    				} catch (UnsupportedEncodingException e) {
    					e.printStackTrace();
    				}
    			}
    
    			result = sb.substring(0, sb.length() - 1);
    		}
    		return result;
    	}
    
    	private static HttpsURLConnection getHttpsUrlConnection(URL url) throws Exception {
    		HttpsURLConnection httpsConn = (HttpsURLConnection) url.openConnection();
    		// 创建SSLContext对象,并使用我们指定的信任管理器初始化
    		TrustManager[] tm = { new X509TrustManager() {
    			public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
    				// 检查客户端证书
    			}
    
    			public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
    				// 检查服务器端证书
    			}
    
    			public X509Certificate[] getAcceptedIssuers() {
    				// 返回受信任的X509证书数组
    				return null;
    			}
    		} };
    		SSLContext sslContext = SSLContext.getInstance("SSL", "SunJSSE");
    		sslContext.init(null, tm, new java.security.SecureRandom());
    		// 从上述SSLContext对象中得到SSLSocketFactory对象
    		SSLSocketFactory ssf = sslContext.getSocketFactory();
    		httpsConn.setSSLSocketFactory(ssf);
    		return httpsConn;
    
    	}
    
    	public static byte[] getFileAsByte(boolean isHttps, String requestUrl) {
    		if (StringUtils.isBlank(requestUrl)) {
    			return new byte[0];
    		}
    		URL url = null;
    		URLConnection conn = null;
    		BufferedInputStream bi = null;
    
    		try {
    			// 创建连接
    			url = new URL(requestUrl);
    			if (isHttps) {
    				conn = getHttpsUrlConnection(url);
    			} else {
    				conn = (HttpURLConnection) url.openConnection();
    			}
    
    			// 设置请求头通用属性
    
    			// 指定客户端能够接收的内容类型
    			conn.setRequestProperty("accept", "*/*");
    
    			// 设置连接的状态为长连接
    			conn.setRequestProperty("Connection", "keep-alive");
    
    			// 设置发送请求的客户机系统信息
    			conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;SV1)");
    			// 设置其他属性
    			conn.setConnectTimeout(3000);// 设置连接超时时间
    
    			conn.setDoOutput(true);
    			conn.setDoInput(true);
    
    			// 建立实际连接
    			conn.connect();
    
    			// 读取请求结果
    			bi = new BufferedInputStream(conn.getInputStream());
    			ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    			byte[] buffer = new byte[2048];
    			int len = 0;
    			while ((len = bi.read(buffer)) != -1) {
    				outStream.write(buffer, 0, len);
    			}
    			bi.close();
    			byte[] data = outStream.toByteArray();
    			return data;
    		} catch (Exception exception) {
    			return new byte[0];
    		} finally {
    			try {
    				if (bi != null) {
    					bi.close();
    				}
    			} catch (Exception e) {
    				e.printStackTrace();
    			}
    		}
    
    	}
    
    }
    

     

  • Python3+Scrapy实现网页爬虫


    网页爬虫设计

    项目驱动,需要从网站上爬取文章,并上传至服务器,实现模拟用户发帖。

    框架采用Python3,配合爬虫框架Scrapy实现,目前只能抓取静态页,JS+Ajax动态加载的网页见下一篇博客

    GitHub地址:https://github.com/JohonseZhang/Scrapy-Spider-based-on-Python3
    求Star~

    另外,爬取类似今日头条、淘宝、京东等动态加载的网站,需要配合selenium和phantomjs框架:
    [GitHub地址]:https://github.com/JohonseZhang/python3-scrapy-spider-phantomjs-selenium
    求Star~求Star~求Star~
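
    这类页面的内容要等 JS 渲染完才出现,思路是先用浏览器内核把页面渲染出来,再从渲染结果里取数据。下面是一个极简的示意(对应当年 selenium + PhantomJS 的组合,示例地址仅作演示;新版 selenium 已移除 PhantomJS,可换成 headless Chrome,思路相同):

    from selenium import webdriver

    # 假设 phantomjs 可执行文件已在 PATH 中
    driver = webdriver.PhantomJS()
    driver.get('https://www.toutiao.com/')   # 示例地址
    html = driver.page_source                # JS 渲染后的完整页面源码,可再交给 Scrapy/BS4 解析
    driver.quit()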

    项目结构

    代码结构图:
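    (原文此处为代码结构截图。下面给出一个典型 Scrapy 项目的目录结构示意,其中 urlSettings.py、contentSettings.py、mysqlUtils.py 等为本文自定义文件,具体摆放位置以实际项目为准)

    DgSpider/
        scrapy.cfg
        DgSpider/
            __init__.py
            items.py
            pipelines.py
            settings.py
            urlSettings.py
            contentSettings.py
            mysqlUtils.py
            spiders/
                __init__.py
                UrlSpider.py
                ContentSpider.py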

    创建项目

    • 进入指定文件夹,右击空白处>在此处打开命令行窗口
    • 创建项目
    scrapy startproject DgSpider

    主要代码文件说明

    • 爬虫主类 :UrlSpider.py、ContentSpider.py
      项目包含2个爬虫主类,分别用于爬取文章列表页所有文章的URL、文章详情页具体内容
    • 内容处理类 :pipelines.py
      处理内容
    • 传输字段类 :items.py
      暂存爬取的数据
    • 设置文件 :settings.py
      用于主要的参数配置
    • 数据库操作:mysqlUtils.py
      连接、操作数据库

    代码实现

    • UrlSpider.py
    # -*- coding: utf-8 -*-
    
    import scrapy
    from DgSpider.items import DgspiderUrlItem
    from scrapy.selector import Selector
    from DgSpider import urlSettings
    
    
    class DgUrlSpider(scrapy.Spider):
        print('Spider DgUrlSpider Starting...')
    
        # 爬虫名 必须静态指定
        # name = urlSettings.SPIDER_NAME
        name = 'DgUrlSpider'
    
        # 设定域名
        allowed_domains = [urlSettings.DOMAIN]
    
        # 爬取地址
        url_list = []
        """一般来说,列表页第一页不符合规则,单独append"""
        url_list.append(urlSettings.START_LIST_URL)
        loop = urlSettings.LIST_URL_RULER_LOOP
        for i in range(1, loop):
            url = urlSettings.LIST_URL_RULER_PREFIX + str(i) + urlSettings.LIST_URL_RULER_SUFFIX
            url_list.append(url)
        start_urls = url_list
    
        # 爬取方法
        def parse(self, response):
    
            # sel : 页面源代码
            sel = Selector(response)
    
            item_url = DgspiderUrlItem()
            url_item = []
    
            # XPATH获取url
            url_list = sel.xpath(urlSettings.POST_URL_XPATH).extract()
    
            # 消除http前缀差异
            for url in url_list:
                url = url.replace('http:', '')
                url_item.append('http:' + url)
    
            # list去重
            url_item = list(set(url_item))
            item_url['url'] = url_item
    
            yield item_url
    
    • ContentSpider.py
    # -*- coding: utf-8 -*-
    
    import scrapy
    from DgSpider.mysqlUtils import dbhandle_geturl
    from DgSpider.items import DgspiderPostItem
    from scrapy.selector import Selector
    from scrapy.http import Request
    from DgSpider import contentSettings
    from DgSpider import urlSettings
    from DgSpider.mysqlUtils import dbhandle_update_status
    
    
    class DgContentSpider(scrapy.Spider):
        print('Spider DgContentSpider Starting...')
    
        result = dbhandle_geturl(urlSettings.GROUP_ID)
    
        url = result[0]
        spider_name = result[1]
        site = result[2]
        gid = result[3]
        module = result[4]
    
        # 爬虫名 必须静态指定
        # name = contentSettings.SPIDER_NAME
        name = 'DgContentSpider'
    
        # 设定爬取域名范围
        allowed_domains = [site]
    
        # 爬取地址
        # start_urls = ['http://www.mama.cn/baby/art/20140829/774422.html']
        start_urls = [url]
    
        start_urls_tmp = []
        """构造分页序列,一般来说遵循规则 url.html,url_2.html,url_3.html,并且url.html也写为url_1.html"""
        for i in range(6, 1, -1):
            start_single = url[:-5]
            start_urls_tmp.append(start_single+"_"+str(i)+".html")
    
        # 更新状态
        """对于爬去网页,无论是否爬取成功都将设置status为1,避免死循环"""
        dbhandle_update_status(url, 1)
    
        # 爬取方法
        def parse(self, response):
            item = DgspiderPostItem()
    
            # sel : 页面源代码
            sel = Selector(response)
    
            item['url'] = DgContentSpider.url
    
            # 对于title, <div><h1><span aaa><span>标题1</h1></div>,使用下列方法取得
            data_title_tmp = sel.xpath(contentSettings.POST_TITLE_XPATH)
            item['title'] = data_title_tmp.xpath('string(.)').extract()
    
            item['text'] = sel.xpath(contentSettings.POST_CONTENT_XPATH).extract()
    
            yield item
    
            if self.start_urls_tmp:
                url = self.start_urls_tmp.pop()
                yield Request(url, callback=self.parse)
    
    • pipelines.py
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    # If you have many pipelines, all should be init here
    # and use IF to judge them
    #
    # DOUGUO Spider pipelines
    # @author zhangjianfei
    # @date 2017/04/13
    
    import re
    import urllib.request
    from DgSpider import urlSettings
    from DgSpider import contentSettings
    from DgSpider.mysqlUtils import dbhandle_insert_content
    from DgSpider.uploadUtils import uploadImage
    from DgSpider.mysqlUtils import dbhandle_online
    from DgSpider.mysqlUtils import dbhandle_update_status
    from bs4 import BeautifulSoup
    from DgSpider.PostHandle import post_handel
    from DgSpider.commonUtils import get_random_user
    from DgSpider.commonUtils import get_linkmd5id
    
    
    class DgPipeline(object):
        # post构造reply
        cs = []
    
        # 帖子title
        title = ''
    
        # 帖子文本
        text = ''
    
        # 当前爬取的url
        url = ''
    
        # 随机用户ID
        user_id = ''
    
        # 图片flag
        has_img = 0
    
        # get title flag
        get_title_flag = 0
    
        def __init__(self):
            DgPipeline.user_id = get_random_user(contentSettings.CREATE_POST_USER)
    
        # process the data
        def process_item(self, item, spider):
            self.get_title_flag += 1
    
            # pipeline for content
            if spider.name == contentSettings.SPIDER_NAME:
    
                # 获取当前网页url
                DgPipeline.url = item['url']
    
                # 获取post title
                if len(item['title']) == 0:
                    title_tmp = ''
                else:
                    title_tmp = item['title'][0]
    
                # 替换标题中可能会引起 sql syntax 的符号
                # 对于分页的文章,只取得第一页的标题
                if self.get_title_flag == 1:
    
                    # 使用beautifulSoup格式化标题
                    soup_title = BeautifulSoup(title_tmp, "lxml")
                    title = ''
                    # 对于bs之后的html树形结构,不使用.prettify(),对于bs, prettify后每一个标签自动换行,造成多个、
                    # 多行的空格、换行,使用stripped_strings获取文本
                    for string in soup_title.stripped_strings:
                        title += string
    
                    title = title.replace("'", "”").replace('"', '“')
                    DgPipeline.title = title
    
                # 获取post正文内容
                if len(item['text']) == 0:
                    text_temp = ''
                else:
                    text_temp = item['text'][0]
    
                # 获取图片
                reg_img = re.compile(r'<img.*>')
                imgs = reg_img.findall(text_temp)
                for img in imgs:
                    DgPipeline.has_img = 1
    
                    # matchObj = re.search('.*src="(.*)"{2}.*', img, re.M | re.I)
                    match_obj = re.search('.*src="(.*)".*', img, re.M | re.I)
                    img_url_tmp = match_obj.group(1)
    
                    # 去除src中的http:前缀
                    img_url_tmp = img_url_tmp.replace("http:", "")
    
                    # 对于<img src="http://a.jpg" title="a.jpg">这种情况单独处理
                    imgUrl_tmp_list = img_url_tmp.split('"')
                    img_url_tmp = imgUrl_tmp_list[0]
    
                    # 加入http
                    imgUrl = 'http:' + img_url_tmp
    
                    list_name = imgUrl.split('/')
                    file_name = list_name[len(list_name)-1]
    
                    # if os.path.exists(settings.IMAGES_STORE):
                    #     os.makedirs(settings.IMAGES_STORE)
    
                    # 获取图片本地存储路径
                    file_path = contentSettings.IMAGES_STORE + file_name
                    # 获取图片并上传至本地
                    urllib.request.urlretrieve(imgUrl, file_path)
                    upload_img_result_json = uploadImage(file_path, 'image/jpeg', DgPipeline.user_id)
                    # 获取上传之后返回的服务器图片路径、宽、高
                    img_u = upload_img_result_json['result']['image_url']
                    img_w = upload_img_result_json['result']['w']
                    img_h = upload_img_result_json['result']['h']
                    img_upload_flag = str(img_u)+';'+str(img_w)+';'+str(img_h)
    
                    # 在图片前后插入字符标记
                    text_temp = text_temp.replace(img, '[dgimg]' + img_upload_flag + '[/dgimg]')
    
                # 使用beautifulSoup格式化HTML
                soup = BeautifulSoup(text_temp, "lxml")
                text = ''
                # 对于bs之后的html树形结构,不使用.prettify(),对于bs, prettify后每一个标签自动换行,造成多个、
                # 多行的空格、换行
                for string in soup.stripped_strings:
                    text += string + '\n'
    
                # 将英文双引号替换为中文双引号,避免 mysql syntax 错误
                DgPipeline.text = self.text + text.replace('"', '“')
    
                # 对于分页的文章,每一页之间加入换行
                # DgPipeline.text += (DgPipeline.text + '\n')
    
            # pipeline for url
            elif spider.name == urlSettings.SPIDER_NAME:
                db_object = dbhandle_online()
                cursor = db_object.cursor()
    
                for url in item['url']:
                    linkmd5id = get_linkmd5id(url)
                    spider_name = contentSettings.SPIDER_NAME
                    site = urlSettings.DOMAIN
                    gid = urlSettings.GROUP_ID
                    module = urlSettings.MODULE
                    status = '0'
                    sql_search = 'select md5_url from dg_spider.dg_spider_post where md5_url="%s"' % linkmd5id
                    sql = 'insert into dg_spider.dg_spider_post(md5_url, url, spider_name, site, gid, module, status) ' \
                          'values("%s", "%s", "%s", "%s", "%s", "%s", "%s")' \
                          % (linkmd5id, url, spider_name, site, gid, module, status)
                    try:
                        # 判断url是否存在,如果不存在,则插入
                        cursor.execute(sql_search)
                        result_search = cursor.fetchone()
                        if result_search is None or result_search[0].strip() == '':
                            cursor.execute(sql)
                            result = cursor.fetchone()
                            db_object.commit()
                    except Exception as e:
                        print(">>> catch exception !")
                        print(e)
                        db_object.rollback()
    
            return item
    
        # spider开启时被调用
        def open_spider(self, spider):
            pass
    
    # spider 关闭时被调用
        def close_spider(self, spider):
            if spider.name == contentSettings.SPIDER_NAME:
                # 数据入库:235
                url = DgPipeline.url
                title = DgPipeline.title
                content = DgPipeline.text
                user_id = DgPipeline.user_id
                dbhandle_insert_content(url, title, content, user_id, DgPipeline.has_img)
    
                # 更新status状态为1(已经爬取过内容)
                """此项已在spider启动时设置"""
                # dbhandle_update_status(url, 1)
    
                # 处理文本、设置status、上传至dgCommunity.dg_post
                # 如果判断has_img为1,那么上传帖子
                if DgPipeline.has_img == 1:
                    if title.strip() != '' and content.strip() != '':
                        spider.logger.info('has_img=1,title and content is not null! Uploading post into db...')
                        post_handel(url)
                    else:
                        spider.logger.info('has_img=1,but title or content is null! ready to exit...')
                    pass
                else:
                    spider.logger.info('has_img=0, changing status and ready to exit...')
                    pass
    
            elif spider.name == urlSettings.SPIDER_NAME:
                pass
    
    
    • items.py
    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    # douguo Spider Item
    # @author zhangjianfei
    # @date 2017/04/07
    import scrapy
    
    class DgspiderUrlItem(scrapy.Item):
        url = scrapy.Field()
    
    class DgspiderPostItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()
        text = scrapy.Field()
    
    • settings.py
      这个文件只需要更改或加上特定的配置项
    BOT_NAME = 'DgSpider'
    
    SPIDER_MODULES = ['DgSpider.spiders']
    NEWSPIDER_MODULE = 'DgSpider.spiders'
    
    # 注册PIPELINES
    ITEM_PIPELINES = {
        'DgSpider.pipelines.DgPipeline': 1
    }
    • mysqlUtils.py
    import pymysql
    import pymysql.cursors
    import os
    
    
    def dbhandle_online():
        host = '192.168.1.235'
        user = 'root'
        passwd = 'douguo2015'
        charset = 'utf8'
        conn = pymysql.connect(
            host=host,
            user=user,
            passwd=passwd,
            charset=charset,
            use_unicode=False
        )
        return conn
    
    
    def dbhandle_local():
        host = '192.168.1.235'
        user = 'root'
        passwd = 'douguo2015'
        charset = 'utf8'
        conn = pymysql.connect(
            host=host,
            user=user,
            passwd=passwd,
            charset=charset,
            use_unicode=True
            # use_unicode=False
        )
        return conn
    
    
    def dbhandle_geturl(gid):
        host = '192.168.1.235'
        user = 'root'
        passwd = 'douguo2015'
        charset = 'utf8'
        conn = pymysql.connect(
            host=host,
            user=user,
            passwd=passwd,
            charset=charset,
            use_unicode=False
        )
        cursor = conn.cursor()
        sql = 'select url,spider_name,site,gid,module from dg_spider.dg_spider_post where status=0 and gid=%s limit 1' % gid
        try:
            cursor.execute(sql)
            result = cursor.fetchone()
            conn.commit()
        except Exception as e:
            print("***** exception")
            print(e)
            conn.rollback()
    
        if result is None:
            os._exit(0)
        else:
            url = result[0]
            spider_name = result[1]
            site = result[2]
            gid = result[3]
            module = result[4]
            return url.decode(), spider_name.decode(), site.decode(), gid.decode(), module.decode()
    
    
    def dbhandle_insert_content(url, title, content, user_id, has_img):
        host = '192.168.1.235'
        user = 'root'
        passwd = 'douguo2015'
        charset = 'utf8'
        conn = pymysql.connect(
            host=host,
            user=user,
            passwd=passwd,
            charset=charset,
            use_unicode=False
        )
        cur = conn.cursor()
    
    # 如果标题或者内容为空,那么程序将退出,该篇文章将会作废并将status设置为1,爬虫继续向下运行获取新的URL
        if content.strip() == '' or title.strip() == '':
            sql_fail = 'update dg_spider.dg_spider_post set status="%s" where url="%s" ' % ('1', url)
            try:
                cur.execute(sql_fail)
                result = cur.fetchone()
                conn.commit()
            except Exception as e:
                print(e)
                conn.rollback()
            os._exit(0)
    
        sql = 'update dg_spider.dg_spider_post set title="%s",content="%s",user_id="%s",has_img="%s" where url="%s" ' \
              % (title, content, user_id, has_img, url)
    
        try:
            cur.execute(sql)
            result = cur.fetchone()
            conn.commit()
        except Exception as e:
            print(e)
            conn.rollback()
        return result
    
    
    def dbhandle_update_status(url, status):
        host = '192.168.1.235'
        user = 'root'
        passwd = 'douguo2015'
        charset = 'utf8'
        conn = pymysql.connect(
            host=host,
            user=user,
            passwd=passwd,
            charset=charset,
            use_unicode=False
        )
        cur = conn.cursor()
        sql = 'update dg_spider.dg_spider_post set status="%s" where url="%s" ' \
              % (status, url)
        try:
            cur.execute(sql)
            result = cur.fetchone()
            conn.commit()
        except Exception as e:
            print(e)
            conn.rollback()
        return result
    
    
    def dbhandle_get_content(url):
        host = '192.168.1.235'
        user = 'root'
        passwd = 'douguo2015'
        charset = 'utf8'
        conn = pymysql.connect(
            host=host,
            user=user,
            passwd=passwd,
            charset=charset,
            use_unicode=False
        )
        cursor = conn.cursor()
        sql = 'select title,content,user_id,gid from dg_spider.dg_spider_post where status=1 and url="%s" limit 1' % url
        try:
            cursor.execute(sql)
            result = cursor.fetchone()
            conn.commit()
        except Exception as e:
            print("***** exception")
            print(e)
            conn.rollback()
    
        if result is None:
            os._exit(1)
    
        title = result[0]
        content = result[1]
        user_id = result[2]
        gid = result[3]
        return title.decode(), content.decode(), user_id.decode(), gid.decode()
    
    
    # 获取爬虫初始化参数
    def dbhandle_get_spider_param(url):
        host = '192.168.1.235'
        user = 'root'
        passwd = 'douguo2015'
        charset = 'utf8'
        conn = pymysql.connect(
            host=host,
            user=user,
            passwd=passwd,
            charset=charset,
            use_unicode=False
        )
        cursor = conn.cursor()
        sql = 'select title,content,user_id,gid from dg_spider.dg_spider_post where status=0 and url="%s" limit 1' % url
        result = ''
        try:
            cursor.execute(sql)
            result = cursor.fetchone()
            conn.commit()
        except Exception as e:
            print("***** exception")
            print(e)
            conn.rollback()
        title = result[0]
        content = result[1]
        user_id = result[2]
        gid = result[3]
        return title.decode(), content.decode(), user_id.decode(), gid.decode()
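
    顺带一提,上面这些函数都是手工拼接 SQL 字符串,所以 pipelines 里才需要把引号替换成中文引号来规避语法错误,同时也有注入风险。下面以 dbhandle_update_status 为例,给出一个用 pymysql 参数化查询改写的示意(仅演示思路,连接参数沿用上文;放在 mysqlUtils.py 中即可复用文件顶部的 import pymysql):

    def dbhandle_update_status_safe(url, status):
        """参数化查询示意:占位符由驱动负责转义,不必手工替换引号"""
        conn = pymysql.connect(
            host='192.168.1.235',
            user='root',
            passwd='douguo2015',
            charset='utf8'
        )
        try:
            with conn.cursor() as cur:
                cur.execute(
                    'update dg_spider.dg_spider_post set status=%s where url=%s',
                    (status, url))
            conn.commit()
        except Exception as e:
            print(e)
            conn.rollback()
        finally:
            conn.close()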
    
    • 一些特别的常量及参数,也是用py文件加入

      urlSettings.py:

    # 爬取域名
    DOMAIN = 'eastlady.cn'
    
    # 爬虫名
    """ URL爬虫模块名,不可变 """
    SPIDER_NAME = 'DgUrlSpider'
    
    GROUP_ID = '33'
    
    MODULE = '999'
    
    # 文章列表页起始爬取URL
    START_LIST_URL = 'http://www.eastlady.cn/emotion/pxgx/1.html'
    
    # 文章列表循环规则
    LIST_URL_RULER_PREFIX = 'http://www.eastlady.cn/emotion/pxgx/'
    LIST_URL_RULER_SUFFIX = '.html'
    LIST_URL_RULER_LOOP = 30
    
    # 文章URL爬取规则XPATH
    POST_URL_XPATH = '//div[@class="article_list"]/ul/li/span[1]/a[last()]/@href'

    contentSettings.py:

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for DgSpider project
    
    # 图片储存
    IMAGES_STORE = 'D:\\pics\\jfss\\'
    
    # 爬取域名
    DOMAIN = 'nrsfh.com'
    
    # 图片域名前缀
    DOMAIN_HTTP = "http:"
    
    # 随机发帖用户
    CREATE_POST_USER = '37619,18441390'
    
    # 爬虫名
    SPIDER_NAME = 'DgContentSpider'
    
    # 文章URL爬取规则XPATH
    POST_TITLE_XPATH = '//div[@class="title"]'
    POST_CONTENT_XPATH = '//div[@class="bodycss"]'
    

    启动爬虫

    进入爬虫代码所在的文件夹,右击:在此打开命令行窗口,先执行:

    scrapy crawl DgUrlSpider

    爬取所有文章的URL并入库。
    再执行:

    scrapy crawl DgContentSpider

    从数据库中读取URL,抓取网页内容,入库

    当然,也可以写一个Windows批处理脚本,持续不断地执行scrapy crawl DgContentSpider:

    @echo DOUGUO window Spider
    cd D:\Scrapy\DgSpider
    for /l %%i in (1,1,7000) do scrapy crawl DgContentSpider
    :end
    @echo SUCCESS! PRESS ANY KEY TO EXIT!
    @Pause>nul

    当然,这种方式比较笨拙,最好还是启用cmdline,加入多线程,这里不说明
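
    如果不想写批处理,也可以用一小段 Python 脚本反复拉起爬虫子进程,效果和上面的批处理相同(示意写法,路径和循环次数沿用上面批处理中的假设):

    import subprocess

    # 每次以子进程方式执行一遍 scrapy crawl,结束后再拉起下一次
    for _ in range(7000):
        subprocess.run(['scrapy', 'crawl', 'DgContentSpider'], cwd=r'D:\Scrapy\DgSpider')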

    处理完上面的所有步骤,就能成功地抓取到网页数据(原文此处为抓取结果截图)。

  • 特殊网页爬虫——VBA开发文档


    特殊网页爬虫——VBA开发文档 

    作者:AntoniotheFuture

    关键词:VBA,Access,网页爬虫,网抓

    开发平台:Access

    平台版本上限:2010

    平台版本下限:尚未出现

    开发语言:VBA

    简介:目前在一家保险公司上班,统计数据需要经常从一个公司的网页系统中下载报表。操作本身比较简单,但要操作的东西太多,比较烦人。对于日常数据的提取,我就想着如果可以定制任务就好了,正好我之前也有一点点网页爬虫的经验,于是着手开始写。这个网页有很多“不太友好”的地方,比如:登录进去之后会启动新窗口(可能是公司出于信息安全考虑);要点击的控件花样百出,不易定位;日期格式不尽相同,有些是MM-DD-YYYY,有些是YYYY-MM-DD HH:MM;而且网页内存在iframe,用传统的网页爬虫无法实现。在参阅了无数资料之后,今天终于开发出来了,现在分享给大家,共同进步。


    功能描述:

    1. 可以预先按步骤设定任务,完全模拟网页中的手工操作,对各种控件进行操作,只需轻轻点击开始,网页即可自动填表,提交,等待报表在网页中产生即可。
    2. 可以重新获取IE的控制权,防止新窗口出现后丢失窗口。
    3. 可以根据预设参数获取时间和日期,如:下一个工作日的前一天的23:59:59。
    4. 加入“工作日表”,可在里面提前设定“补假”,“加班”等特殊工作日,准确判断下一工作日。

    表设计:

        首先是报表列表,用于定位网页中报表页面


    因为网页中有多处重复的HTMLname,而且无法用其他方法定位控件,特加入“开始搜索位置”用于控件的查找


    然后是控件表,用于定位表单页面中的控件,还可以根据预先设定的控件类型做不同的动作。



    任务表,用于记录任务基本信息,比较简单



    任务流程表,在窗体中定制的流程将会记录到这个表中:


    下面是窗体部分:

        控件详情和报表详情窗口,没什么特殊,可用于快速添加网页控件信息。





    任务详情窗体,整合了任务创建,流程设置,登陆信息输入和执行功能

    新建任务后,在增加流程按钮的左边输入要操作的控件和要输入的值类型和值的本身,完成整个任务定制后,保存即可执行,系统将会打开IE窗口并执行相应操作,省去不少时间,还能避免手动输入出错。



     工作日表,用于记录工作日和更改工作日



    下面是部分SQL查询:

    报表列表查询 

    SELECT 报表列表.ID, 报表列表.报表名称
    FROM 报表列表;

    更改工作日类型子窗体
    用于查找下一个工作日

    SELECT 工作日表.*
    FROM 工作日表
    WHERE (((工作日表.工作日)>=DateAdd("d",-7,Date())))
    ORDER BY 工作日表.工作日;

    任务流程查询
    用于在任务详情界面显示流程

    SELECT 任务流程表.任务ID, 任务流程表.流程数, 任务流程表.打开报表, 任务流程表.表ID, [报表列表]![报表名称] AS 表名, 任务流程表.控件ID, [element]![名称] AS 控件名, 任务流程表.控件值类型, 任务流程表.控件值
    FROM (任务流程表 LEFT JOIN 报表列表 ON 任务流程表.表ID = 报表列表.ID) LEFT JOIN element ON 任务流程表.控件ID = element.ID
    WHERE (((任务流程表.任务ID)=[Forms]![任务详情]![ID]));


    任务流程转VBA

    与VBA对接,包含了执行任务过程中所需的所有控件数据。

    SELECT 任务流程表.ID, 任务流程表.任务ID, 任务流程表.流程数, 任务流程表.打开报表, 报表列表.报表名称, 报表列表.层级, 报表列表.一级, 报表列表.开始搜索位置, 报表列表.二级, 报表列表.是否使用二级网页位置, 报表列表.二级网页位置, 报表列表.三级, 报表列表.四级, element.名称, element.控件类型, element.值, element.数据类型, element.HTMLname, element.HTMLID, element.时间类型, 任务流程表.控件值类型, 任务流程表.控件值
    FROM (任务流程表 LEFT JOIN 报表列表 ON 任务流程表.表ID = 报表列表.ID) LEFT JOIN element ON 任务流程表.控件ID = element.ID;


    下面是VBA代码部分
      

    更改工作日类型


    '批量修改工作日
    Private Sub Command20_Click()
    Dim STemp2 As String 
    Dim i
    If IsNull(Me.Text0) Then
       MsgBox ("请输入开始日期!")
       Exit Sub
    ElseIf IsNull(Me.Text4) Then
       MsgBox ("请输入结束日期!")
       Exit Sub
    ElseIf IsNull(Me.List40) Then
       MsgBox ("请选择更改类型!")
    Else  
    Dim Rs2 As ADODB.Recordset
    Set Rs2 = New ADODB.Recordset 
    STemp2 = "select * From 工作日表 where 工作日 between #" & Me![Text0]& "# and #" & Me![Text4] & "#"
    Rs2.Open STemp2, CurrentProject.Connection,adOpenKeyset, adLockOptimistic 
    For i = 1 To Rs2.RecordCount
       Rs2("类型") = Me![List40]
       Rs2.Update
       Rs2.MoveNext
    Next
    Me.Refresh
    MsgBox ("成功将"& i - 1 & "天更改为" & Me![List40])
    Exit Sub
    End If
    Exit Sub
    Rs2.Close
    Set Rs2 = Nothing 
    End Sub

    自动登录并获取网页

    用于对付窗口弹出问题


    Private Sub Command268_Click()
    'On Error Resume Next
    '定义变量
    Dim IE As Object
    Dim webs, webs2, webs3, webs4, webs5, dmt,dmt1, dmt2, usrno, elements, element1, xxx
    Dim vtag   '网页对象
    Dim loop1, loop2, loop3   '循环计数器
    Dim objIE As Object, myHWND
    Dim dWinFolder As New ShellWindows, t
    Dim Czpmxurl As String, Czpmxname As String
    Dim Czpmxhwnd As Long, aa        '窗口句柄
    Dim cifno$, cifcname$, ResultLink$ 
    'text9 = 用户名 text11= 密码
    'IE清除缓存&打开登录界面 
    Call DeleteCacheURLList
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Navigate "example.com"
    IE.Visible = True     '若=0 False不显示 ,=1 True 显示
    IE.Silent = True
    Do While IE.Busy Or IE.ReadyState <> 4
       DoEvents
    Loop
    delay Me.Combo17  
    Set dmt = IE.Document
    IE.Document.getElementById("j_username").Value = Me.Text9
    IE.Document.getElementById("j_password").Value = Me.Text11
    delay 2
    IE.Document.getElementById("j_password").focus
    SendKeys "{enter}"
    Do While IE.Busy Or IE.ReadyState <> 4
       DoEvents
    Loop
    delay Me.Combo17 + 3
       Czpmxhwnd = FindWindow(vbNullString, "来自网页的消息")      '根据窗口标题查找,找到后返回句柄
       If Czpmxhwnd <> 0 Then
           aa = SetForegroundWindow(Czpmxhwnd)   '将网页调到前台
           delay 1
           SendKeys "{ENTER}", True
       End If  
    delay 1
    Call Command271_Click
    End Sub


    任务执行

    根据设定的任务,按流程对网页中控件进行操作


    Private Sub Command271_Click()
    '定义变量
    Dim IE As Object
    Dim webs, webs2, webs3, webs4, webs5, dmt,dmt1, dmt2, dmt3, dmt4, usrno, elements, element1, xxx, departmentNoHTML
    Dim vtag, workdaytype  '网页对象、工作日类型
    Dim loop1, loop2, loop3, loop4  '循环计数器  1=网页对象查找,2= ,3=工作日确定,4=流程进行
    Dim objIE As Object, myHWND
    Dim dWinFolder As New ShellWindows, t
    Dim Czpmxurl As String, Czpmxname As String
    Dim Czpmxhwnd As Long, aa        '窗口句柄
    Dim cifno$, cifcname$, ResultLink$
    
    Dim today0 '今天零点
    Dim monthday10000  '当月零点
    Dim nworkday '下一工作日
    Dim nworkdaypday2359 '下一工作日前一天23点59分
    Dim nworkday7  '下一工作日7点
    
    Dim STemp3, STemp4 As String
    Dim Rs3 As ADODB.Recordset
    Dim Rs4 As ADODB.Recordset
    Set Rs3 = New ADODB.Recordset
    Set Rs4 = New ADODB.Recordset
    
    workdaytype = "正常"
    today0 = Format(Date & " 00:00:00", "YYYY/MM/DD HH:MM:SS")
    monthday10000 = Format(DateSerial(Year(Date), Month(Date), 1) & " 00:00:00", "YYYY/MM/DD HH:MM:SS")
    
    STemp3 = "select * From 工作日表 where 类型 = '" & workdaytype & "'" & " order by 工作日"
    Rs3.Open STemp3, CurrentProject.Connection,adOpenKeyset, adLockOptimistic
    For loop3 = 0 To Rs3.RecordCount
       If DateDiff("d", Date, Rs3("工作日"))> 0 Then
           nworkday = Rs3("工作日")
           Exit For
       ElseIf loop3 = Rs3.RecordCount Then
           MsgBox ("请更新工作日表!")
           Exit Sub
           Exit For
       Else
           Rs3.MoveNext
       End If
    Next
    
    nworkdaypday2359 = Format(DateAdd("d", -1, nworkday) & " 23:59:59", "YYYY/MM/DD HH:MM:SS")
    nworkday7 = Format(nworkday & " 07:00:00", "YYYY/MM/DD HH:MM:SS")
    
    Do
       For Each objIE In dWinFolder
               If InStr(1, objIE.LocationURL, "elis-lcs.paic") > 0 Then
                    Czpmxname =objIE.LocationName            '标题
                    Czpmxurl =objIE.LocationURL              '链接
                    Exit Do   '通过链接objIE.LocationURL包含的关键字查询,或用objIE.LocationName即窗口标题包含的关键字来查询
               End If
       Next
           DoEvents
    Loop
    
       Set IE = objIE  '转换ie窗口控制权
       Do Until IE.ReadyState = 4 And IE.Busy = False
           DoEvents
       Loop
       Set dmt = IE.Document
    STemp4 = "select * From 任务流程转VBA where 任务ID = " & Me![任务ID] & " order by 流程数"
    Rs4.Open STemp4, CurrentProject.Connection,adOpenKeyset, adLockOptimistic
    For loop4 = 0 To Rs4.RecordCount - 1
       If Rs4("打开报表") = True Then
    继续:
           Set elements = dmt.all.tags("a")
           Debug.Print IE.ReadyState
           For loop1 = 0 To elements.length - 1
               If elements.Item(loop1).innerText = Rs4("一级")Then
                    elements.Item(loop1).Click
                    Exit For
               End If
           Next
    '特殊重名控件
                For loop1 = Rs4("开始搜索位置") To elements.length - 1
                     If elements.Item(loop1).innerText = Rs4("二级") Then
                       elements.Item(loop1).FireEvent ("onmouseover")
                        Exit For
                    End If
               Next
    
           delay 0.5
    
           If Rs4("层级") = 3 Then
               Set elements = dmt.all.tags("a")
               Debug.Print IE.ReadyState
               For loop1 = 0 To elements.length - 1
               If elements.Item(loop1).innerText = Rs4("三级")Then
                    elements.Item(loop1).Click
                    Exit For
               End If
               Next
           ElseIf Rs4("层级") = 4 Then
               Set elements = dmt.all.tags("a")
               Debug.Print IE.ReadyState
               For loop1 = 0 To elements.length - 1
                    Ifelements.Item(loop1).innerText = Rs4("三级")Then
                       elements.Item(loop1).FireEvent ("onmouseover")
                        Exit For
                    End If
               Next
    
               For loop1 = 0 To elements.length - 1
                If elements.Item(loop1).innerText =Rs4("四级") Then
                    elements.Item(loop1).Click
                    Exit For
               End If
               Next
               delay 1
           Else
               MsgBox ("请在报表列表中添加报表层级!!!")
               Exit Sub
           End If
           delay 5
    
           GoTo 报表操作
       Else                                                                                   '打开报表——结束
    
    网页表单填写操作:
           Set dmt1 = IE.Document.frames(1).Document  'getElementsByTagName("INPUT")(0)
           Set elements = dmt1.all.tags("INPUT")       'Or "SELECT"
           If Rs4("控件类型") = "文本框" Then
               For loop1 = 0 To elements.length - 1
               If IsNull(Rs4("HTMLname")) = False Then
                    If elements.Item(loop1).Name =Rs4("HTMLname") Then
    
    ID匹配:
                        Select Case Rs4("控件值类型")
                        Case "预先制定值"
                           elements.Item(loop1).Value = Rs4("控件值")
                        Case "当时"
                           elements.Item(loop1).Value = Format(Date & " " & Time(),Rs4("时间类型"))
                        Case "手动输入"
                           elements.Item(loop1).Value = InputBox("请输入"& Rs4("报表名称") & "中" & Rs4("名称") &"的值:(" & Rs4("时间类型") & ")", "请输入")
                        Case "当月0点"
                            elements.Item(loop1).Value= Format(monthday10000, Rs4("时间类型"))
                        Case "今天0点"
                           elements.Item(loop1).Value = Format(today0, Rs4("时间类型"))
                        Case "下一工作日前一天23点59分"
                            elements.Item(loop1).Value= Format(nworkdaypday2359, Rs4("时间类型"))
                        Case "下一工作日7点"
                           elements.Item(loop1).Value = Format(nworkday7, Rs4("时间类型"))
                        Case "本月份"
                           elements.Item(loop1).Value = Format(Date, Rs4("时间类型"))
                        End Select
                        Exit For
                    End If
               Else
                    If elements.Item(loop1).ID =Rs4("HTMLID") Then
                        GoTo ID匹配
                    End If
               End If
                Next
           ElseIf Rs4("控件类型") = "复选框" Then
               For loop1 = 0 To elements.length - 1
                    If elements.Item(loop1).Value =Rs4("值") Then
                        elements.Item(loop1).Click
                        Exit For
                    End If
               Next
           ElseIf Rs4("控件类型") = "单选框" Then
               For loop1 = 0 To elements.length - 1
                    If elements.Item(loop1).Name =Rs4("HTMLname") Then
                        elements.Item(loop1).Click
                       Exit For
                    End If
               Next
           ElseIf Rs4("控件类型") = "按钮" Then
               For loop1 = 0 To elements.length - 1
                    If elements.Item(loop1).Value =Rs4("值") Then
                       elements.Item(loop1).FireEvent ("onclick")
                        delay 2
                        Exit For
                    End If
               Next
           ElseIf Rs4("控件类型") = "下拉框" Then
               Set elements = dmt1.all.tags("select")
               For loop1 = 0 To elements.length - 1
                    If IsNull(Rs4("HTMLname"))= False Then
                        Ifelements.Item(loop1).Name = Rs4("HTMLname") Then
    ID匹配2:
                           elements.Item(loop1).Value = Rs4("控件值")
                            Exit For
                        End If
                    Else
                        If elements.Item(loop1).ID= Rs4("HTMLID") Then
                        GoTo ID匹配2
                        End If
                    End If
               Next
           End If
           Rs4.MoveNext
       End If
    
    下一步:
    Next
    Me.Refresh
    Exit Sub
    Rs3.Close
    Rs4.Close
    Set Rs3 = Nothing
    Set Rs4 = Nothing
    End Sub

    任务控件添加

    用于在任务详情界面中添加需要操作的控件。


    Private Sub Command45_Click()
    Dim STemp As String
    Dim Rs As ADODB.Recordset
    Set Rs = New ADODB.Recordset
    STemp = "select * From 任务流程表 where 任务ID = " & Me![任务ID]
    Rs.Open STemp, CurrentProject.Connection,adOpenKeyset, adLockOptimistic
    Rs.AddNew
    Rs("任务ID")= Me![任务ID]
    Rs("流程数")= Rs.RecordCount + 1
    Rs("表ID")= Me.Combo60
    Rs("表名")= Me.Combo60.Column(1)
    Rs("控件ID")= Me.Combo66
    Rs("控件名")= Me.Combo66.Column(1)
    Rs("控件值类型")= Me.Combo100
    Rs("控件值")= Text76
    Rs("打开报表")= Me.Check319
    Rs.Update
    Me.Refresh
    Exit Sub
    Rs.Close
    Set Rs = Nothing
    End Sub

    寻找已打开IE

    Declare Function FindWindow Lib "user32" Alias "FindWindowA" (ByVal lpClassName As String, ByVal lpWindowName As String) As Long
    Declare Function SetForegroundWindow Lib "user32" (ByVal HWnd As Long) As Long

    清除IE缓存


    Private Const ERROR_CACHE_FIND_FAIL As Long = 0
    Private Const ERROR_CACHE_FIND_SUCCESS As Long = 1
    Private Const ERROR_FILE_NOT_FOUND As Long = 2
    Private Const ERROR_ACCESS_DENIED As Long = 5
    Private Const ERROR_INSUFFICIENT_BUFFER As Long = 122
    Private Const MAX_PATH As Long = 260
    Private Const MAX_CACHE_ENTRY_INFO_SIZE As Long = 4096
    Private Const LMEM_FIXED As Long = &H0
    Private Const LMEM_ZEROINIT As Long = &H40
    Private Const LPTR As Long = (LMEM_FIXED Or LMEM_ZEROINIT)
    Private Const NORMAL_CACHE_ENTRY As Long = &H1
    Private Const EDITED_CACHE_ENTRY As Long = &H8
    Private Const TRACK_OFFLINE_CACHE_ENTRY As Long = &H10
    Private Const TRACK_ONLINE_CACHE_ENTRY As Long = &H20
    Private Const STICKY_CACHE_ENTRY As Long = &H40
    Private Const SPARSE_CACHE_ENTRY As Long = &H10000
    Private Const COOKIE_CACHE_ENTRY As Long = &H100000
    Private Const URLHISTORY_CACHE_ENTRY As Long = &H200000
    Private Const URLCACHE_FIND_DEFAULT_FILTER As Long = NORMAL_CACHE_ENTRY Or _
                                                         COOKIE_CACHE_ENTRY Or _
                                                         URLHISTORY_CACHE_ENTRY Or _
                                                         TRACK_OFFLINE_CACHE_ENTRY Or _
                                                         TRACK_ONLINE_CACHE_ENTRY Or _
                                                         STICKY_CACHE_ENTRY
    Private Type FILETIME
      dwLowDateTime As Long
      dwHighDateTime As Long
    End Type
    Private Type INTERNET_CACHE_ENTRY_INFO
      dwStructSize As Long
      lpszSourceUrlName As Long
      lpszLocalFileName As Long
      CacheEntryType As Long
      dwUseCount As Long
      dwHitRate As Long
      dwSizeLow As Long
      dwSizeHigh As Long
      LastModifiedTime As FILETIME
      ExpireTime As FILETIME
      LastAccessTime As FILETIME
      LastSyncTime As FILETIME
      lpHeaderInfo As Long
      dwHeaderInfoSize As Long
      lpszFileExtension As Long
      dwExemptDelta As Long
    End Type
    Private Declare Function FindFirstUrlCacheEntry Lib "wininet" Alias "FindFirstUrlCacheEntryA" (ByVal lpszUrlSearchPattern As String, lpFirstCacheEntryInfo As Any, lpdwFirstCacheEntryInfoBufferSize As Long) As Long
    Private Declare Function FindNextUrlCacheEntry Lib "wininet" Alias "FindNextUrlCacheEntryA" (ByVal hEnumHandle As Long, lpNextCacheEntryInfo As Any, lpdwNextCacheEntryInfoBufferSize As Long) As Long
    Private Declare Function FindCloseUrlCache Lib "wininet" (ByVal hEnumHandle As Long) As Long
    Private Declare Function DeleteUrlCacheEntry Lib "wininet" Alias "DeleteUrlCacheEntryA" (ByVal lpszUrlName As String) As Long
    Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (pDest As Any, pSource As Any, ByVal dwLength As Long)
    Private Declare Function lstrcpyA Lib "kernel32" (ByVal RetVal As String, ByVal Ptr As Long) As Long
    Private Declare Function lstrlenA Lib "kernel32" (ByVal Ptr As Any) As Long
    Private Declare Function LocalAlloc Lib "kernel32" (ByVal uFlags As Long, ByVal uBytes As Long) As Long
    Private Declare Function LocalFree Lib "kernel32" (ByVal hMem As Long) As Long
    Public Sub DeleteCacheURLList()
      Dim icei As INTERNET_CACHE_ENTRY_INFO
      Dim hFile As Long
      Dim cachefile As String
      Dim posUrl As Long
      Dim posEnd As Long
      Dim dwBuffer As Long
      Dim pntrICE As Long
      hFile = FindFirstUrlCacheEntry(0&, ByVal 0, dwBuffer)
      If (hFile = ERROR_CACHE_FIND_FAIL) And _
         (Err.LastDllError = ERROR_INSUFFICIENT_BUFFER) Then
         pntrICE = LocalAlloc(LMEM_FIXED, dwBuffer)
         If pntrICE <> 0 Then
            CopyMemory ByVal pntrICE, dwBuffer, 4
            hFile = FindFirstUrlCacheEntry(vbNullString, _
                                           ByVal pntrICE, _
                                           dwBuffer)
            If hFile <> ERROR_CACHE_FIND_FAIL Then
               Do
                   CopyMemory icei, ByVal pntrICE, Len(icei)
                   If (icei.CacheEntryType And _
                       NORMAL_CACHE_ENTRY) = NORMAL_CACHE_ENTRY Then
                      cachefile = GetStrFromPtrA(icei.lpszSourceUrlName)
                      Call DeleteUrlCacheEntry(cachefile)
                   End If
                   Call LocalFree(pntrICE)
                   dwBuffer = 0
                   Call FindNextUrlCacheEntry(hFile, ByVal 0, dwBuffer)
                   'allocate and assign the memory to the pointer
                   pntrICE = LocalAlloc(LMEM_FIXED, dwBuffer)
                   CopyMemory ByVal pntrICE, dwBuffer, 4
                   DoEvents
               Loop While FindNextUrlCacheEntry(hFile, ByVal pntrICE, dwBuffer)
            End If 'hFile
         End If 'pntrICE
      End If 'hFile
      Call LocalFree(pntrICE)
      Call FindCloseUrlCache(hFile)
    End Sub
    Private Function GetStrFromPtrA(ByVal lpszA As Long) As String
      GetStrFromPtrA = String$(lstrlenA(ByVal lpszA), 0)
      Call lstrcpyA(ByVal GetStrFromPtrA, ByVal lpszA)
    End Function


  • 网页爬虫工具BeautifulSoup使用总结


    网页爬虫工具BeautifulSoup

    在使用爬虫工具爬取网页的内容时,经常会出现网页格式不规范、标签不完整等等问题,导致在抓取的过程中出现内容无法爬取、内容中含有html标签等等影响结果的错误

    安装、引入

    • 安装
      pip install beautifulsoup4
    • 引入模块
      from bs4 import BeautifulSoup

    主要方法、使用规则

    • 生成beautifulSoup对象
      soup = BeautifulSoup(html)

      或者打开本地HTML
      soup = BeautifulSoup(open('index.html'))

      在Python3中应该使用这样的写法:
      soup = BeautifulSoup(html, "lxml")

      输出soup:
      <html><head>我是head</head><title>我是title</title><body><p>我是一个p</p><p>我也是一个p</p></body></html>

    • 格式化对象
      arr = soup.prettify()

      简单来说,prettify()方法只是让soup对象看上去像树形的xml而已,它们的内容是相同的,只是后者让标签之间换了行。
      格式化之后我们得到的内容应该是:

      <html>
          <head>我是head</head>
          <title>我是title</title>
          <body>
              <p>我是一个p</p>
              <p>我也是一个p</p>
          </body>     
      </html>
    • 解析Soup对象

      对于soup之后的树形结构,我们使用以下方法来获取某个Tag:

      print(soup.title)
      :<title>我是title</title>

      使用下列方法获取文本内容:

      print(soup.title.string)
      : 我是title

      如何获取所有内容呢?

      for string in soup.strings:
          print(string)
      :\r\n
       我是head
       \r\n
       \r\n
       我是title
       \r\n
       \r\n
       我是一个p
       \r\n
       \r\n
       我也是一个p
       \r\n

      对于空行、换行我们当然是需要过滤的:

      for string in soup.stripped_strings:
          print(string)
      
      : 我是head
        我是title
        我是一个p
        我也是一个p
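
      实际解析时,除了像上面这样按标签名逐个访问,更常用 find_all() / find() 按标签和属性批量查找(下面是一个简单示意,其中的标签名和 class 值都是假设的):

      from bs4 import BeautifulSoup

      soup = BeautifulSoup(html, "lxml")

      # 找出所有 <p> 标签并打印其文本
      for p in soup.find_all('p'):
          print(p.get_text())

      # 按属性查找:class 为 "title" 的第一个 div
      div = soup.find('div', class_='title')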

    以上就是大概的用法了,推荐博客:静觅 » Python爬虫利器二之Beautiful Soup的用法
