  • Scrapy Crawler Framework
    2022-02-17 16:03:36

    Scrapy Crawler Framework


    1. Overview

    Scrapy is an open-source framework written to crawl website data and extract structured data. Its uses are broad: besides web crawling, it can be applied to data mining, data monitoring, automated testing, and more. Scrapy is an asynchronous processing framework built on Twisted, with a clear architecture and strong extensibility, so it can be adapted flexibly to all kinds of needs.

    The Scrapy workflow involves the following components (the sketch after this list shows how the pluggable ones are wired up):

    § Scrapy Engine: processes the data flow of the whole system and triggers events; it is the core of the framework.

    § Scheduler: receives requests from the engine, adds them to a queue, and hands a request back whenever the engine asks for the next one. Think of it as taking one request URL at a time out of a URL queue while de-duplicating request URLs.

    § Downloader: downloads Web resources from the network.

    § Spiders: crawl the needed information from the specified pages.

    § Item Pipeline: processes the scraped data, e.g. cleaning, validating, and saving it.

    § Downloader Middlewares: sit between the Scrapy engine and the downloader and mainly handle the requests and responses passing between them.

    § Spider Middlewares: sit between the spiders and the engine and mainly handle the spiders' response input and request output.

    § Scheduler Middlewares: sit between the engine and the scheduler and mainly handle the requests and responses passing between them.
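
    Custom middlewares and pipelines are plugged into this data flow through the project settings. Below is a minimal, hedged sketch of the relevant settings.py entries; the setting keys (DOWNLOADER_MIDDLEWARES, SPIDER_MIDDLEWARES, ITEM_PIPELINES) are real Scrapy settings, while the scrapyDemo class paths are hypothetical placeholders:

    # settings.py -- wiring custom components into the data flow.
    # The numbers set the order in which components of the same kind are applied.
    DOWNLOADER_MIDDLEWARES = {
        'scrapyDemo.middlewares.ScrapydemoDownloaderMiddleware': 543,  # hypothetical class
    }
    SPIDER_MIDDLEWARES = {
        'scrapyDemo.middlewares.ScrapydemoSpiderMiddleware': 543,  # hypothetical class
    }
    ITEM_PIPELINES = {
        'scrapyDemo.pipelines.ScrapydemoPipeline': 300,  # hypothetical class
    }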

    2. Setting Up the Scrapy Crawler Framework

    My environment is macOS with PyCharm as the third-party IDE. In the terminal, enter the command "pip install scrapy".

    liuxiaowei@MacBookAir spiders % pip install scrapy
    

    Note

    While Scrapy is being installed, the lxml and pyOpenSSL modules are installed into the Python environment along with it.

    3. Basic Usage of Scrapy

    3.1 Creating a Scrapy Project

    Create a folder to hold the project under any path you like, for example "/Users/liuxiaowei/PycharmProjects/爬虫练习/Scrapy爬虫框架". Open a command-line window there and enter "scrapy startproject scrapyDemo" to create a project named "scrapyDemo", as shown below:

    (venv) liuxiaowei@MacBookAir Scrapy爬虫框架 % scrapy startproject scrapyDemo
    New Scrapy project 'scrapyDemo', using template directory '/Users/liuxiaowei/PycharmProjects/爬虫练习/venv/lib/python3.9/site-packages/scrapy/templates/project', created in:
        /Users/liuxiaowei/PycharmProjects/爬虫练习/Scrapy爬虫框架/scrapyDemo
    
    You can start your first spider with:
        cd scrapyDemo
        scrapy genspider example example.com
    

    Open the newly created scrapyDemo project; in the project tree on the left you will see a directory structure like the one sketched below:
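
    The original post shows the structure as a screenshot; a hedged plain-text sketch of the layout that scrapy startproject generates looks like this:

    scrapyDemo/
        scrapy.cfg            # deployment configuration file
        scrapyDemo/
            __init__.py       # initialization file
            items.py          # item definitions
            middlewares.py    # spider / downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/
                __init__.py   # spider modules live in this package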

    The files in the directory structure are described as follows:

    § spiders (folder): holds the spider files in which the crawling rules are written.

    § __init__.py: the package initialization file.

    § items.py: defines the data model and holds the processed data.

    § middlewares.py: defines the middlewares used while crawling, including SpiderMiddleware and DownloaderMiddleware.

    § pipelines.py: implements data cleaning, validation, and persistence.

    § settings.py: the configuration file of the whole framework, covering crawler settings, request headers, middlewares, and so on.

    § scrapy.cfg: the project deployment file, which records the path of the project's settings file and related information.

    3.2 Creating a Spider

    To create a spider, first create a spider module file and place it in the spiders folder. A spider module is a class that crawls data from one or more websites; it must inherit from scrapy.Spider, which provides the start_requests() method to issue the initial network requests and the parse() method to process the returned results. The commonly used attributes and methods of scrapy.Spider are listed below (a sketch after this list shows them in context):

    § name: a string that names the spider. Scrapy looks spiders up by this name, so it must be unique, although multiple instances of the same spider may be created. When crawling a single site, the site's name is usually used as the spider name.

    § allowed_domains: the list of domains the spider is allowed to crawl; when OffsiteMiddleware is enabled, URLs whose domain is not in this list are not crawled.

    § start_urls: the initial list of URLs; if no specific URLs are given, the spider starts crawling from this list.

    § custom_settings: a dict of settings specific to this spider that overrides the project-wide configuration; because it has to be applied before instantiation, it must be defined as a class variable.

    § settings: a Settings object through which the project-wide configuration can be read.

    § logger: a Python logger created with the spider's name.

    § start_requests(): generates the initial requests and must return an iterable. By default it builds GET requests from the URLs in start_urls; to request pages with POST instead, override this method using FormRequest().

    § parse(): the default method Scrapy uses to process a response whose request specifies no callback. It handles the response and returns the extracted data and/or follow-up requests as an iterable of Request or Item objects.

    § closed(): called when the spider closes; it replaces signal listeners and is the place to release resources or perform other cleanup.
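
    A minimal, hedged sketch that shows the attributes and methods above in context; the spider name, domain, and setting values are hypothetical:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = 'example'                      # unique spider name
        allowed_domains = ['example.com']     # off-site URLs are skipped
        start_urls = ['http://example.com/']  # default GET requests come from here
        custom_settings = {                   # class variable: overrides project settings
            'DOWNLOAD_DELAY': 1,
        }

        def parse(self, response):
            # self.settings reads the merged project settings,
            # self.logger is this spider's own logger
            self.logger.info('Crawled %s', response.url)

        def closed(self, reason):
            # called once when the spider shuts down
            self.logger.info('Spider closed: %s', reason)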

    3.2.1 Crawling a Page and Saving It as an HTML File

    Taking the test page http://quotes.toscrape.com (shown as a screenshot in the original post) as an example, we crawl the page and save its HTML source into files inside the project folder.

    Create a spider file named "crawl.py" in the spiders folder. In it, define a QuotesSpider class that inherits from scrapy.Spider, override start_requests() to issue the network requests, and override parse() to write the fetched HTML code into files. Example code:

    #_*_coding:utf-8_*_
    # Author   : liuxiaowei
    # Created  : 2/17/22 11:18 AM
    # File     : crawl.py
    # IDE      : PyCharm
    # Import the framework
    import scrapy
    
    class QuotesSpider(scrapy.Spider):
        # Define the spider name
        name = 'quotes_1'
    
        def start_requests(self):
            # Target URLs to crawl
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            # One request is sent per URL in the list
            for url in urls:
                # Send the request
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            # Get the page number from the URL
            page = response.url.split('/')[-2]
            # Build the file name from the page number
            filename = 'quotes-%s.html' % page
            # Open the file for writing (it is created if it does not exist)
            with open(filename, 'wb') as f:
                # Write the fetched HTML into the file
                f.write(response.body)
            # Log the name of the saved file
            self.log('Saved file %s' % filename)
    

    To run a spider created with Scrapy, enter "scrapy crawl quotes_1" in a command window, where "quotes_1" is the spider name defined above. Since I use PyCharm, the command goes into the Terminal panel at the bottom; after the run finishes the output looks like this:

    liuxiaowei@MacBookAir spiders % scrapy crawl quotes_1  # command that runs the spider
    2022-02-17 11:23:47 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
    2022-02-17 11:23:47 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
    2022-02-17 11:23:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2022-02-17 11:23:47 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'scrapyDemo',
     'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
     'ROBOTSTXT_OBEY': True,
     .................    # intermediate output omitted
    2022-02-17 11:23:49 [quotes_1] DEBUG: Saved file quotes-1.html
    2022-02-17 11:23:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    2022-02-17 11:23:49 [quotes_1] DEBUG: Saved file quotes-2.html
    2022-02-17 11:23:49 [scrapy.core.engine] INFO: Closing spider (finished)
    2022-02-17 11:23:49 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    ............ # intermediate output omitted
    2022-02-17 11:23:49 [scrapy.core.engine] INFO: Spider closed (finished)
    

    3.2.2 Sending a POST Request with FormRequest()

    Example code:

    #_*_coding:utf-8_*_
    # Author   : liuxiaowei
    # Created  : 2/17/22 12:43 PM
    # File     : POST请求.py
    # IDE      : PyCharm
    
    # Import the framework
    import scrapy
    # Import the json module
    import json
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes_2"
        # Form parameters as a dict
        data = {'1': '能力是有限的, 而努力是无限的。',
                '2': '星光不问赶路人, 时光不负有心人。'}
        def start_requests(self):
            return [scrapy.FormRequest('http://httpbin.org/post', formdata=self.data, callback=self.parse)]
    
        # Handle the response
        def parse(self, response):
            # Convert the response data to a dict
            response_dict = json.loads(response.text)
            # Print the converted response data
            print(response_dict)
    

    The output is:

    liuxiaowei@MacBookAir spiders % scrapy crawl quotes_2    # command that runs the spider
    2022-02-17 12:53:01 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
    2022-02-17 12:53:01 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
    2022-02-17 12:53:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2022-02-17 12:53:01 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'scrapyDemo',
     'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
     'ROBOTSTXT_OBEY': True,
     'SPIDER_MODULES': ['scrapyDemo.spiders']}
    2022-02-17 12:53:01 [scrapy.extensions.telnet] INFO: Telnet Password: 6965cfb5ccb132d6
    2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats']
    2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2022-02-17 12:53:01 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2022-02-17 12:53:01 [scrapy.core.engine] INFO: Spider opened
    2022-02-17 12:53:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2022-02-17 12:53:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2022-02-17 12:53:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
    2022-02-17 12:53:02 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://httpbin.org/post> (referer: None)
    {'args': {}, 'data': '', 'files': {}, 'form': {'1': '能力是有限的, 而努力是无限的。', '2': '星光不问赶路人, 时光不负有心人。'}, 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en', 'Content-Length': '286', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Scrapy/2.5.1 (+https://scrapy.org)', 'X-Amzn-Trace-Id': 'Root=1-620dd4ae-3eaa8de12c3f3606567f0039'}, 'json': None, 'origin': '122.143.185.159', 'url': 'http://httpbin.org/post'}
    2022-02-17 12:53:02 [scrapy.core.engine] INFO: Closing spider (finished)
    2022-02-17 12:53:02 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 772,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 1,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 1214,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'elapsed_time_seconds': 1.026007,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2022, 2, 17, 4, 53, 2, 396943),
     'log_count/DEBUG': 2,
     'log_count/INFO': 10,
     'memusage/max': 54693888,
     'memusage/startup': 54693888,
     'response_received_count': 2,
     'robotstxt/request_count': 1,
     'robotstxt/response_count': 1,
     'robotstxt/response_status_count/200': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2022, 2, 17, 4, 53, 1, 370936)}
    2022-02-17 12:53:02 [scrapy.core.engine] INFO: Spider closed (finished)
    

    Note

    Besides launching the spider by entering "scrapy crawl quotes_2" in a command window, Scrapy provides an API for starting spiders from within a program: the CrawlerProcess class. First pass the project settings when initializing CrawlerProcess, then pass the spider name to the crawl() method, and finally start the spider with the start() method. The code is as follows:

    # Import the CrawlerProcess class
    from scrapy.crawler import CrawlerProcess
    # Import the helper that reads the project settings
    from scrapy.utils.project import get_project_settings
    
    # Program entry point
    if __name__ == "__main__":
        # Create a CrawlerProcess object, passing in the project settings
        process = CrawlerProcess(get_project_settings())
        # Name the spider to start
        process.crawl('quotes_2')
        # Start the spider
        process.start()
    

    The output is:

    /Users/liuxiaowei/PycharmProjects/爬虫练习/venv/bin/python /Users/liuxiaowei/PycharmProjects/爬虫练习/Scrapy爬虫框架/scrapyDemo/scrapyDemo/spiders/POST请求.py
    2022-02-17 13:02:16 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
    2022-02-17 13:02:16 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
    2022-02-17 13:02:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2022-02-17 13:02:16 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'scrapyDemo',
     'NEWSPIDER_MODULE': 'scrapyDemo.spiders',
     'ROBOTSTXT_OBEY': True,
     'SPIDER_MODULES': ['scrapyDemo.spiders']}
    2022-02-17 13:02:16 [scrapy.extensions.telnet] INFO: Telnet Password: 7aa61c26ffb3372a
    2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage',
     'scrapy.extensions.logstats.LogStats']
    2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2022-02-17 13:02:16 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2022-02-17 13:02:16 [scrapy.core.engine] INFO: Spider opened
    2022-02-17 13:02:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2022-02-17 13:02:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2022-02-17 13:02:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/robots.txt> (referer: None)
    2022-02-17 13:02:17 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://httpbin.org/post> (referer: None)
    {'args': {}, 'data': '', 'files': {}, 'form': {'1': '能力是有限的, 而努力是无限的。', '2': '星光不问赶路人, 时光不负有心人。'}, 'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en', 'Content-Length': '286', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Scrapy/2.5.1 (+https://scrapy.org)', 'X-Amzn-Trace-Id': 'Root=1-620dd6d9-1e241f7e7f705c1172c103b5'}, 'json': None, 'origin': '122.143.185.159', 'url': 'http://httpbin.org/post'}
    2022-02-17 13:02:17 [scrapy.core.engine] INFO: Closing spider (finished)
    2022-02-17 13:02:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 772,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 1,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 1214,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'elapsed_time_seconds': 1.030657,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2022, 2, 17, 5, 2, 17, 995487),
     'log_count/DEBUG': 2,
     'log_count/INFO': 10,
     'memusage/max': 48164864,
     'memusage/startup': 48164864,
     'response_received_count': 2,
     'robotstxt/request_count': 1,
     'robotstxt/response_count': 1,
     'robotstxt/response_status_count/200': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2022, 2, 17, 5, 2, 16, 964830)}
    2022-02-17 13:02:17 [scrapy.core.engine] INFO: Spider closed (finished)
    
    Process finished with exit code 0
    

    Caution

    If running a spider project created with Scrapy fails with SyntaxError: invalid syntax, the reason is that Python 3.7 turned "async" into a reserved keyword. To resolve this kind of error, open the file Python3.7/Lib/site-packages/twisted/conch/manhole.py and rename every "async" identifier in it to something that is not a keyword, e.g. "async_".

    3.3 Extracting Data

    The Scrapy framework can select a specific part of an HTML document with a CSS or XPath expression and extract the corresponding data. CSS (Cascading Style Sheets) controls the layout, fonts, colors, backgrounds, and other effects of an HTML page. XPath is a language for locating information in XML documents by elements and attributes.

    3.3.1 Extracting Data with CSS

    When extracting a piece of data from an HTML document with CSS, you can target it by tag name. For example, to get the data of the title tag in the sample page used earlier, run:

    response.css('title').extract()
    

    Example code:

    #_*_coding:utf-8_*_
    # Author   : liuxiaowei
    # Created  : 2/17/22 2:18 PM
    # File     : css提取数据.py
    # IDE      : PyCharm
    
    # Import the framework
    import scrapy
    
    class QuotesSpider(scrapy.Spider):
        # Define the spider name
        name = 'quotes_3'
    
        def start_requests(self):
            # Target URLs to crawl
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            # One request is sent per URL in the list
            for url in urls:
                # Send the request
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            # Extract the title tag
            title = response.css('title').extract()
            # Print the title
            print(title)
    

    The result is as follows:

    liuxiaowei@MacBookAir spiders % scrapy crawl quotes_3       
    2022-02-17 14:25:03 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
    2022-02-17 14:25:03 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
    2022-02-17 14:25:03 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2022-02-17 14:25:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    ['<title>Quotes to Scrape</title>']
    2022-02-17 14:25:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    ['<title>Quotes to Scrape</title>']
    2022-02-17 14:25:05 [scrapy.core.engine] INFO: Spider closed (finished)
    

    Note

    CSS extraction returns a list of the nodes matched by the CSS expression, so to pull the text out of the tag you can use either of the following:

    response.css('title::text').extract_first()
    or
    response.css('title::text')[0].extract()
    

    The output is:

    2022-02-17 14:32:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    ['<title>Quotes to Scrape</title>'] 
     Quotes to Scrape
    

    3.3.2 Extracting Data with XPath

    When extracting a piece of data from an HTML document with an XPath expression, the data is addressed according to XPath syntax. For example, to get the information in the title tag as before, run:

    response.xpath('//title/text()').extract_first()
    

    The following example uses XPath to extract several pieces of information from the test page used above:

    #_*_coding:utf-8_*_
    # Author   : liuxiaowei
    # Created  : 2/17/22 2:37 PM
    # File     : crawl_Xpath.py
    # IDE      : PyCharm
    
    # Import the framework
    import scrapy
    
    class QuotesSpider(scrapy.Spider):
        # Define the spider name
        name = 'quotes'
    
        def start_requests(self):
            # Target URLs to crawl
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            # One request is sent per URL in the list
            for url in urls:
                # Send the request
                yield scrapy.Request(url=url, callback=self.parse)
    
        # Handle the response
        def parse(self, response):
            # Get every quote block
            for quote in response.xpath(".//*[@class='quote']"):
                # Extract the quote text
                text = quote.xpath('.//*[@class="text"]/text()').extract_first()
                # Extract the author
                author = quote.xpath('.//*[@class="author"]/text()').extract_first()
                # Extract the tags
                tags = quote.xpath('.//*[@class="tag"]/text()').extract()
                # Print the information as a dict
                print(dict(text=text, author=author, tags=tags))
    

    The output is:

    liuxiaowei@MacBookAir spiders % scrapy crawl quotes
    2022-02-17 14:38:57 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
    2022-02-17 14:38:57 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
    2022-02-17 14:38:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2022-02-17 14:38:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2022-02-17 14:38:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    {'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
    {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
    {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
    {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
    {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
    {'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
    {'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
    {'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
    {'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
    {'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
    2022-02-17 14:38:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    {'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
    {'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
    {'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
    {'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
    {'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
    {'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
    {'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
    {'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
    {'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
    {'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
    2022-02-17 14:38:59 [scrapy.core.engine] INFO: Closing spider (finished)
    2022-02-17 14:38:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    2022-02-17 14:38:59 [scrapy.core.engine] INFO: Spider closed (finished)
    

    Note

    Scrapy's selector objects also provide a .re() method, which extracts data with a regular expression; call it as response.xpath().re() and pass the regular expression to .re(), as sketched below.
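
    A short, hedged sketch of .re() on the same test page; the regular expression simply strips the curly quotes around each quote text:

    # Inside a parse() callback: .re() applies the regex to every matched node
    # and returns a list of strings (the capture group, if one is given)
    texts = response.xpath('//span[@class="text"]/text()').re(r'“(.*)”')
    print(texts)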

    3.3.3 Extracting Data Across Pages

    To harvest information from the whole site rather than from a fixed list of pages, the spider has to follow the pagination links. For example, to collect every author name on the test site:

    #_*_coding:utf-8_*_
    # Author   : liuxiaowei
    # Created  : 2/17/22 2:57 PM
    # File     : 翻页提取数据.py
    # IDE      : PyCharm
    
    # Import the framework
    import scrapy
    
    class QuotesSpider(scrapy.Spider):
        # Define the spider name
        name = 'quotes_4'
    
        def start_requests(self):
            # Target URLs to crawl
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            # One request is sent per URL in the list
            for url in urls:
                # Send the request
                yield scrapy.Request(url=url, callback=self.parse)
    
        # Handle the response
        def parse(self, response):
            # Equivalent CSS selector: div.quote
            # Get every quote block
            for quote in response.xpath('.//*[@class="quote"]'):
                # Extract the author
                author = quote.xpath('.//*[@class="author"]/text()').extract_first()
                # Print the author name
                print(author)
            # Follow the "next page" link
            for href in response.css('li.next a::attr(href)'):
                yield response.follow(href, self.parse)
    
    

    The output is:

    # author names from page 1
    2022-02-17 15:00:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    Albert Einstein
    J.K. Rowling
    Albert Einstein
    Jane Austen
    Marilyn Monroe
    Albert Einstein
    André Gide
    Thomas A. Edison
    Eleanor Roosevelt
    Steve Martin
    # below: some of the author names from page 10
    2022-02-17 15:03:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
    J.K. Rowling
    Jimi Hendrix
    J.M. Barrie
    E.E. Cummings
    Khaled Hosseini
    Harper Lee
    Madeleine L'Engle
    Mark Twain
    Dr. Seuss
    George R.R. Martin
    2022-02-17 15:03:54 [scrapy.core.engine] INFO: Closing spider (finished)
    

    3.3.4 Creating Items

    Crawling web data is the process of extracting structured data from unstructured sources. For example, the parse() method of QuotesSpider already obtains the text, author, and tags values; to package them as structured data, use the Item class that Scrapy provides. An Item object is a simple container that holds the scraped data and offers a dictionary-like API with convenient syntax for declaring its fields; Items are declared with a simple class-definition syntax and Field objects. When the scrapyDemo project was created, an items.py file was generated automatically in the project tree for defining the Item class that stores the data; it must inherit from scrapy.Item. Example code:

    import scrapy
    
    class ScrapydemoItem(scrapy.Item):
        # define the fields for your item here like:
        # the quote text
        text = scrapy.Field()
        # the author
        author = scrapy.Field()
        # the tags
        tags = scrapy.Field()
    

    The spider that fills in this Item looks like this:

    #_*_coding:utf-8_*_
    # Author   : liuxiaowei
    # Created  : 2/17/22 3:22 PM
    # File     : 包装结构化数据.py
    # IDE      : PyCharm
    
    import scrapy  # Import the framework
    from scrapyDemo.items import ScrapydemoItem  # Import the ScrapydemoItem class
    
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes_5"  # Define the spider name
    
        def start_requests(self):
            # Target URLs to crawl
            urls = [
                'http://quotes.toscrape.com/page/1/',
                'http://quotes.toscrape.com/page/2/',
            ]
            # One request is sent per URL in the list
            for url in urls:
                # Send the request
                yield scrapy.Request(url=url, callback=self.parse)
    
        # Handle the response
        def parse(self, response):
            # Get every quote block
            for quote in response.xpath(".//*[@class='quote']"):
                # Extract the quote text
                text = quote.xpath(".//*[@class='text']/text()").extract_first()
                # Extract the author
                author = quote.xpath(".//*[@class='author']/text()").extract_first()
                # Extract the tags
                tags = quote.xpath(".//*[@class='tag']/text()").extract()
                # Create the Item object
                item = ScrapydemoItem(text=text, author=author, tags=tags)
                yield item  # Yield the item
    
    

    The output is:

    liuxiaowei@MacBookAir spiders % scrapy crawl  quotes_5
    2022-02-17 15:30:04 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapyDemo)
    2022-02-17 15:30:04 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.9 (v3.9.9:ccb0e6a345, Nov 15 2021, 13:29:20) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
    2022-02-17 15:30:04 [scrapy.core.engine] INFO: Spider opened
    2022-02-17 15:30:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2022-02-17 15:30:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
    2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'Albert Einstein',
     'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
     'text': '“The world as we have created it is a process of our thinking. It '
             'cannot be changed without changing our thinking.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'J.K. Rowling',
     'tags': ['abilities', 'choices'],
     'text': '“It is our choices, Harry, that show what we truly are, far more '
             'than our abilities.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'Albert Einstein',
     'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
     'text': '“There are only two ways to live your life. One is as though nothing '
             'is a miracle. The other is as though everything is a miracle.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'Jane Austen',
     'tags': ['aliteracy', 'books', 'classic', 'humor'],
     'text': '“The person, be it gentleman or lady, who has not pleasure in a good '
             'novel, must be intolerably stupid.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'Marilyn Monroe',
     'tags': ['be-yourself', 'inspirational'],
     'text': "“Imperfection is beauty, madness is genius and it's better to be "
             'absolutely ridiculous than absolutely boring.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'Albert Einstein',
     'tags': ['adulthood', 'success', 'value'],
     'text': '“Try not to become a man of success. Rather become a man of value.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'André Gide',
     'tags': ['life', 'love'],
     'text': '“It is better to be hated for what you are than to be loved for what '
             'you are not.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'Thomas A. Edison',
     'tags': ['edison', 'failure', 'inspirational', 'paraphrased'],
     'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'Eleanor Roosevelt',
     'tags': ['misattributed-eleanor-roosevelt'],
     'text': '“A woman is like a tea bag; you never know how strong it is until '
             "it's in hot water.”"}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
    {'author': 'Steve Martin',
     'tags': ['humor', 'obvious', 'simile'],
     'text': '“A day without sunshine is like, you know, night.”'}
    2022-02-17 15:30:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Marilyn Monroe',
     'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters'],
     'text': "“This life is what you make it. No matter what, you're going to mess "
             "up sometimes, it's a universal truth. But the good part is you get "
             "to decide how you're going to mess it up. Girls will be your friends "
             "- they'll act like it anyway. But just remember, some come, some go. "
             "The ones that stay with you through everything - they're your true "
             "best friends. Don't let go of them. Also remember, sisters make the "
             "best friends in the world. As for lovers, well, they'll come and go "
             'too. And baby, I hate to say it, most of them - actually pretty much '
             "all of them are going to break your heart, but you can't give up "
             "because if you give up, you'll never find your soulmate. You'll "
             'never find that half who makes you whole and that goes for '
             "everything. Just because you fail once, doesn't mean you're gonna "
             'fail at everything. Keep trying, hold on, and always, always, always '
             "believe in yourself, because if you don't, then who will, sweetie? "
             'So keep your head high, keep your chin up, and most importantly, '
             "keep smiling, because life's a beautiful thing and there's so much "
             'to smile about.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'J.K. Rowling',
     'tags': ['courage', 'friends'],
     'text': '“It takes a great deal of bravery to stand up to our enemies, but '
             'just as much to stand up to our friends.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Albert Einstein',
     'tags': ['simplicity', 'understand'],
     'text': "“If you can't explain it to a six year old, you don't understand it "
             'yourself.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Bob Marley',
     'tags': ['love'],
     'text': '“You may not be her first, her last, or her only. She loved before '
             'she may love again. But if she loves you now, what else matters? '
             "She's not perfect—you aren't either, and the two of you may never be "
             'perfect together but if she can make you laugh, cause you to think '
             'twice, and admit to being human and making mistakes, hold onto her '
             'and give her the most you can. She may not be thinking about you '
             'every second of the day, but she will give you a part of her that '
             "she knows you can break—her heart. So don't hurt her, don't change "
             "her, don't analyze and don't expect more than she can give. Smile "
             'when she makes you happy, let her know when she makes you mad, and '
             "miss her when she's not there.”"}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Dr. Seuss',
     'tags': ['fantasy'],
     'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a '
             'necessary ingredient in living.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Douglas Adams',
     'tags': ['life', 'navigation'],
     'text': '“I may not have gone where I intended to go, but I think I have '
             'ended up where I needed to be.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Elie Wiesel',
     'tags': ['activism',
              'apathy',
              'hate',
              'indifference',
              'inspirational',
              'love',
              'opposite',
              'philosophy'],
     'text': "“The opposite of love is not hate, it's indifference. The opposite "
             "of art is not ugliness, it's indifference. The opposite of faith is "
             "not heresy, it's indifference. And the opposite of life is not "
             "death, it's indifference.”"}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Friedrich Nietzsche',
     'tags': ['friendship',
              'lack-of-friendship',
              'lack-of-love',
              'love',
              'marriage',
              'unhappy-marriage'],
     'text': '“It is not a lack of love, but a lack of friendship that makes '
             'unhappy marriages.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Mark Twain',
     'tags': ['books', 'contentment', 'friends', 'friendship', 'life'],
     'text': '“Good friends, good books, and a sleepy conscience: this is the '
             'ideal life.”'}
    2022-02-17 15:30:05 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
    {'author': 'Allen Saunders',
     'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans'],
     'text': '“Life is what happens to us while we are making other plans.”'}
    2022-02-17 15:30:05 [scrapy.core.engine] INFO: Closing spider (finished)
    
    More related content
  • I have been writing crawlers in nodejs for about half a year; a crawler can be very simple or very complex. A simple crawler crawls a single site, perhaps tens or hundreds of thousands of page requests. Today I introduce crawl-pet, a very handy crawler framework
  • Scratch means to scrape, and this Python crawler framework is called Scrapy, presumably with much the same idea in mind; let's just call it "Little Scraper". Little Scraper is an application framework designed for crawling websites and extracting structured data, applicable to a wide range of fields: data mining, information processing, and...
  • Scrapy is a powerful and very fast web crawler framework, an excellent third-party Python library and an important technical route for building web crawlers with Python. Installing Scrapy: running pip install scrapy directly at the command prompt does not seem to work; we first need to...
  • A simple, flexible, and powerful Java crawler framework. Features: 1. simple, readable, highly customizable code; 2. a simple and easy-to-use API; 3. file download and chunked crawling; 4. rich request/response content and options
  • 精通Python爬虫框架Scrapy.pdf (Mastering the Scrapy Python crawler framework)
  • Python also has many crawler frameworks, e.g. scrapy, Portia, Crawley. I personally used to prefer C# for crawlers, but as I grew familiar with nodejs I found scripting languages much better suited to this kind of work: at least you don't write so many entity classes, and scripts are generally simpler to use. In...
  • An agile, powerful, standalone, distributed crawler framework with spring boot and redisson support. SeimiCrawler aims to be the most practical crawler framework in Java; let's push it forward together. Introduction: SeimiCrawler is an agile, independently deployable, distributed Java crawler framework that...
  • The program used in "I 'stole' one million Zhihu users with a crawler in a single day, just to prove PHP is the best language in the world"
  • This article describes the common commands of the Scrapy crawler framework, shared for your reference. In Scrapy, tool commands come in two kinds: global commands and project commands. Global commands can be run anywhere without a Scrapy project, whereas project...
  • Python crawling from beginner to master: a course document on learning Python crawlers that starts from the basic scrapy crawler framework and works step by step to a complete crawler. Python crawlers are now used very widely; the document covers scrapy and other crawling techniques in detail and...
  • This article collects notes on the 8 most efficient crawler frameworks of 2020 for readers who want to study them.
  • YayCrawler is a distributed general-purpose crawler framework built on WebMagic, written in Java. There are many crawler frameworks today: simple and complex, lightweight and heavyweight. You may ask: what is this framework's advantage? Well, that is a...
  • A Go crawler framework

    2018-01-22 17:36:28
    A fast, powerful, extensible Go crawler framework. Supports robots.txt * supports custom modules * supports Item pipeline processing * supports multiple proxy protocols (socks5, http, https) * supports XPath queries over HTML/XML data * easy to pick up as a framework.
  • Sets up a Python environment and the scrapy crawler framework with the PyCharm tool on win10, tutorial and plug-ins included; download according to your environment, works with every version, just install by following the tutorial
  • An open-source Java crawler framework

    2016-03-14 12:20:35
    Requires a Maven build; IDEA recommended
  • Java爬虫框架(20210809123939).pdf
  • Scrapy works like an assembly line, and the Item Pipeline is the last station on the line. It is optional, disabled by default, and must be activated before use. If needed, multiple Item Pipeline components can be defined; the data visits each component in turn and each performs its own processing...
  • Project tech stack: technology / location / notes. Front end: ./gulpfile.js for front-end automation; ./public/libs/js with modules split by page; for performance the front end also handles various kinds of data per scenario... a behavior-style crawler framework wrapped in-house by Yuntianhe. Project code overview
  • Cola is a distributed crawler framework; users only need to write a few specific functions and need not worry about the details of distributed execution.
  • Brand-new top-level Python crawler core projects and frameworks taught hands-on. The course is about building projects: Python core projects without padding, in 5 major parts, the first being the web-crawler prelude stage for warm-up, overview, and explanation. The second...
  • Web Crawlers: the Scrapy Crawler Framework

    2022-03-11 17:33:15
    Introduces the principles and basic usage of the Scrapy crawler framework

    The Scrapy Crawler Framework

    Copyright: Jingmin Wei, Pattern Recognition and Intelligent System, School of Artificial Intelligence, Huazhong University of Science and Technology

    Link to the web crawler column



    This tutorial mainly follows the China University MOOC course "Python Web Crawling and Information Extraction" and serves as my personal study notes.

    I ran into some problems while studying; I recorded and fixed them all by hand, so every piece of code here works. Solutions to some common problems, drawn from other blogs, are summarized as well.

    This tutorial is not for commercial use and is for study reference only. Please contact me if you wish to repost it.

    Reference

    Web crawler MOOC

    Data analysis MOOC

    Liao Xuefeng's Python tutorial

    Introduction to the Scrapy Crawler Framework

    Scrapy is a fast and powerful web crawler framework.


    On Windows: run cmd "as administrator" and execute

    pip3 install scrapy
    pip3 --time-out=2000 install scrapy
    

    After installing, run a quick test: scrapy -h


    Scrapy is not a function library but a crawler framework.

    A crawler framework is a collection of software structures and components that implement the crawling function. It is a semi-finished product that helps users build professional web crawlers.

    The Scrapy Framework's Data Flow


    The architecture is distributed across components and is known as the "5+2" structure

    The three paths of the data flow:

    1 The Engine obtains a crawl request (Request) from a Spider

    2 The Engine forwards the request to the Scheduler for scheduling

    3 The Engine obtains the next request to crawl from the Scheduler

    4 The Engine sends the request through the middleware to the Downloader

    5 After fetching the page, the Downloader forms a response (Response) and passes it through the middleware to the Engine

    6 The Engine sends the received response through the middleware to the Spider for processing

    7 The Spider processes the response and produces scraped Items and new crawl requests (Requests) for the Engine

    8 The Engine sends scraped items to the Item Pipeline (the framework's exit)

    9 The Engine sends crawl requests to the Scheduler

    Entry and exit of the data flow:

    The Engine controls the data flow between all modules, continuously obtaining crawl requests from the Scheduler until no requests remain

    Framework entry: the Spider's initial crawl requests

    Framework exit: the Item Pipeline


    Engine

    (1) Controls the data flow between all modules

    (2) Triggers events according to conditions

    No user modification needed

    Downloader

    Downloads pages according to requests; no user modification needed

    Scheduler

    Schedules and manages all crawl requests; no user modification needed

    Downloader Middleware

    Purpose: user-configurable control between the Engine, Scheduler, and Downloader

    Function: modify, discard, or add requests or responses

    Users may write configuration code

    Spider

    (1) Parses the responses (Response) returned by the Downloader

    (2) Produces scraped items

    (3) Produces additional crawl requests (Request)

    Users must write configuration code

    Item Pipelines

    (1) Process the items produced by Spiders in pipeline fashion

    (2) Consist of a sequence of operations, like an assembly line; each operation is an Item Pipeline type

    (3) Possible operations include cleaning, validating, and de-duplicating the HTML data in items, and storing the data in a database

    Users must write configuration code (a minimal sketch follows below)
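
    A minimal, hedged sketch of an Item Pipeline component; the class name and the cleaning logic are hypothetical, but process_item(self, item, spider) is the method Scrapy calls for every item, and a component is activated through the ITEM_PIPELINES setting:

    from scrapy.exceptions import DropItem

    class CleanTextPipeline:
        # Hypothetical pipeline: drops empty items and strips whitespace
        def process_item(self, item, spider):
            if not item.get('text'):
                raise DropItem('missing text')  # dropped items skip later pipelines
            item['text'] = item['text'].strip()
            return item  # returned items continue to the next component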

    Spider Middleware

    Purpose: reprocess requests and scraped items

    Function: modify, discard, or add requests or scraped items

    Users may write configuration code


    (Figure: the "5+2" structure)

    Comparing the Requests Library and the Scrapy Framework

    Similarities:

    Both can request and crawl pages; they are the two main technical routes for Python crawlers

    Both are highly usable, well documented, and easy to get started with

    Neither handles JavaScript, form submission, or CAPTCHAs out of the box (though both can be extended)

    Differences:

    Requests                                   Scrapy
    page-level crawling                        site-level crawling
    a function library                         a framework
    weak concurrency, lower performance        good concurrency, higher performance
    focus on page download                     focus on crawler structure
    flexible customization                     general customization easy, deep customization hard
    very easy to pick up                       slightly harder to learn

    Which technical route should you choose for a crawler?

    For very small jobs: the requests library

    For not-so-small jobs: the Scrapy framework

    For highly customized needs (regardless of scale): build your own framework, requests > Scrapy

    Common Scrapy Commands

    Scrapy is a professional crawler framework designed for continuous operation, and it provides a command line for operating it

    On Windows, open a cmd console

    scrapy -h
    

    Command-line format:

    scrapy <command> [options] [args]
    

    The main command goes in the <command> part

    Command        Description                        Format
    startproject   create a new project               scrapy startproject <name> [dir]
    genspider      create a spider                    scrapy genspider [options] <name> <domain>
    settings       get the crawler configuration      scrapy settings [options]
    crawl          run a spider                       scrapy crawl <spider>
    list           list all spiders in the project    scrapy list
    shell          start the URL debugging shell      scrapy shell [url]

    Why does Scrapy use the command line to create and run crawlers?

    A command line (rather than a GUI) is easier to automate and suits scripted control. Scrapy is essentially a tool for programmers, so functionality matters more than interface.

    A Scrapy Crawler Example

    Demo HTML page: http://python123.io/ws/demo.html

    File name: demo.html

    Using the Scrapy framework mainly means writing configuration-style code

    Step 1: create a Scrapy project

    Pick a directory (E:\新桌面\python\练习代码\scrapy爬虫框架), then run the following command:

    scrapy startproject python123demo
    


    The generated project tree (shown as screenshots in the original post) is the standard layout produced by scrapy startproject.


    Step 2: generate a Scrapy spider inside the project

    Enter the project directory (E:\新桌面\python\练习代码\scrapy爬虫框架\python123demo), then run the following command:

    scrapy genspider demo python123.io
    


    This command:

    (1) generates a spider named demo

    (2) adds the code file demo.py under the spiders directory

    The command only generates demo.py; the file could also be written by hand

    Step 3: configure the generated spider

    Configure: (1) the initial URL(s); (2) how to parse the fetched page

    Open demo.py


    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class DemoSpider(scrapy.Spider):    # inherits from scrapy.Spider
        name = 'demo'   # spider name
        #allowed_domains = ['python123.io']  # only links under this domain may be crawled
        #start_urls = ['http://python123.io/']   # the initial page(s) to crawl
        start_urls = ['http://python123.io/ws/demo.html']  # the two lines above are not needed for this program
    
        def parse(self, response):
            fname = response.url.split('/')[-1]   # save the object returned in response to a file
            with open(fname, 'wb') as f:
                f.write(response.body)
            self.log('Save file %s' % fname)
    
    Step 4: run the spider and fetch the page

    In the command line (E:\新桌面\python\练习代码\scrapy爬虫框架\python123demo), run the following command:

    scrapy crawl demo
    

    When it finishes, the demo.html file has been created in the project directory


    The full version of the demo.py code:

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class DemoSpider(scrapy.Spider):    # inherits from scrapy.Spider
        name = 'demo'   # spider name
    
        def start_requests(self):
            urls = ['http://python123.io/ws/demo.html']
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)
                # generator style: a request is produced only when it is asked for
    
        def parse(self, response):
            fname = response.url.split('/')[-1]   # save the object returned in response to a file
            with open(fname, 'wb') as f:
                f.write(response.body)
            self.log('Save file %s' % fname)
    

    This brings us to the next topic: the yield keyword.

    The yield Keyword

    yield makes a generator: a generator produces one value at a time (at each yield statement); the function is then frozen and produces the next value when woken up again

    A generator is a function that keeps producing values.

    >>> def gen(n):
    	for i in range(n):
    		yield i**2
    
    >>> for i in gen(5):
    	print(i,"",end="")
    
    	
    0 1 4 9 16 
    


    Advantages of a generator over listing all values at once:

    It saves storage space, responds faster, and is more flexible to use; for contrast, a list-based version is sketched below.
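
    A hedged sketch of the same function written without yield; the list version must hold all n values in memory before returning anything:

    def square(n):
        # list version: every value exists at once
        return [i**2 for i in range(n)]

    for i in square(5):
        print(i, "", end="")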

    Basic Use of Scrapy Crawlers

    Step 1: create a project and a Spider template

    Step 2: write the Spider

    Step 3: write the Item Pipeline

    Step 4: tune the configuration strategy

    The Request Class

    class scrapy.http.Request()
    

    A Request object represents an HTTP request. It is generated by a Spider and executed by the Downloader. (A construction sketch follows the table.)

    Attribute / method   Description
    .url                 the URL the request targets
    .method              the request method: 'GET', 'POST', etc.
    .headers             dictionary-style request headers
    .body                the request body, string type
    .meta                user-added extra information, used to pass information between Scrapy modules
    .copy()              copies the request
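
    A short, hedged sketch of building a Request with these attributes; the URL, header value, and meta key are placeholders:

    import scrapy

    request = scrapy.Request(
        url='http://quotes.toscrape.com/page/1/',
        method='GET',
        headers={'User-Agent': 'my-crawler'},  # hypothetical UA string
        meta={'page': 1},                      # passed along to the response
    )
    print(request.url, request.method)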

    The Response Class

    class scrapy.http.Response()
    

    A Response object represents an HTTP response. It is generated by the Downloader and processed by a Spider. (An access sketch follows the table.)

    Attribute / method   Description
    .url                 the URL the response corresponds to
    .status              the HTTP status code, 200 by default
    .headers             the response header information
    .body                the response content, string type
    .flags               a set of flags
    .request             the Request object that produced this Response
    .copy()              copies the response
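
    A short, hedged sketch of reading these attributes inside a parse() callback:

    def parse(self, response):
        # status code, headers, and body of the fetched page
        print(response.status)        # e.g. 200
        print(response.headers)       # dictionary-style headers
        print(len(response.body))     # raw content (bytes in practice)
        print(response.request.url)   # the Request that produced this Response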

    The Item Class

    class scrapy.item.Item()
    

    An Item object represents a piece of information extracted from an HTML page. It is generated by a Spider and processed by the Item Pipeline.

    Item behaves like a dict and supports dict-style operations, as sketched below.
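
    A small, hedged sketch of dict-style access, reusing the ScrapydemoItem fields defined in the first article above:

    from scrapyDemo.items import ScrapydemoItem

    item = ScrapydemoItem()
    item['text'] = 'A day without sunshine is like, you know, night.'
    item['author'] = 'Steve Martin'
    print(item['author'], dict(item))  # an Item converts to a plain dict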

    Scrapy supports several methods of extracting information from HTML:

    • Beautiful Soup, lxml, re, XPath Selector, CSS Selector

    Basic Use of CSS Selectors

    <HTML>.css('a::attr(href)').extract()
    

  • An example that uses Django and Scrapy to run sentiment analysis on Zol hardware reviews and decide whether to buy; mainly demonstrates how Django and Scrapy work together and how to trigger the spider from a web page through the Scrapyd API
  • This article shares a tutorial on setting up the python scrapy crawler framework on Windows, for readers who want to learn it.
  • A crawler framework based on asyncio, aiohttp, and uvloop
  • Scrapy is a crawler framework implemented in pure Python; users only need to customize and develop a few modules to easily build a crawler that fetches page content and all kinds of images, which is very convenient. Scrapy uses the Twisted asynchronous networking library to handle communication; the architecture is clear and includes various...
  • 哔哩搜索 (a Baidu netdisk search engine) is a Baidu cloud share crawler project developed in node.js, and also a simple, efficient nodejs crawler model. There are several such open-source projects on github, but they provide only the crawler part; this project adds, on top of the crawler, the ability to save...
  • A Python Crawler Framework

    2021-01-20 03:09:37
    A Python crawler framework # -*- coding: UTF-8 -*- # -Author-= JamesBen # Email: 1597757775@qq.com import requests def get_HTMLText(url): try : headers = \ { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64...
