  • Covers how to use Scrapy in Python to grab a site's sitemap information, with tips on using the Scrapy framework; a useful reference for those who need it
  • Covers how to save Scrapy's console output to a text file in Python; worth a look as a reference
  • Covers how to assign a random user-agent to every request when collecting data with Scrapy in Python; very practical for Scrapy data collection
  • Covers how to make Scrapy requests masquerade as HTTP/1.1 in Python, with a worked analysis of the technique; very practical
  • Walks through crawling the 阳光热线问政平台 site with Scrapy in Python; the example code is explained in detail and is useful for study or work
  • Covers the implementation code for crawling all images from a site with the Scrapy framework in Python and saving them locally
  • Covers how to drop oversized pages while collecting data with Scrapy in Python, i.e. how to limit the download of pages that are too large; very practical
  • A Python Scrapy crawler. I heard the 妹子图 site was quite popular, so I crawled the whole site and collected roughly 8,000+ images last week. Sharing it here. Core spider code: # -*- coding: utf-8 -*- from scrapy.selector import Selector import scrapy from scrapy....
  • python: using Scrapy to crawl image resources from the 唯美女生 (vmgirls) site. The pictures are nice and the crawl has some difficulty; in the end Scrapy fetched over 15,000 photos from the site.... If this infringes any rights, contact me and it will be removed immediately. ================================== attached source import...

    Python: using Scrapy to crawl image resources from the 唯美女生 (vmgirls) site. The pictures are nice and the crawl has some difficulty; in the end Scrapy fetched more than 15,000 photos from the site.... If this infringes any rights, contact me and it will be removed immediately.

     

    ================================== Source code below

     

    import scrapy
    import random
    import time
    import json
    import urllib.request
    from scrapy.http.request import Request


    class VmpicSpider(scrapy.Spider):
        name = 'vmPic'

        def start_requests(self):
            # Browser-like headers so the site does not reject the requests
            self.headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
                'Upgrade-Insecure-Requests': '1',
                'Connection': 'keep-alive',
                'TE': 'Trailers'
            }
            urls = [
                'https://www.vmgirls.com/archives.html',
                # 'http://quotes.toscrape.com/page/2/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, headers=self.headers, callback=self.parse)

        def parse(self, response):
            # Collect the link to every article on the archive page
            links = response.xpath('//ul[@class="al_post_list"]/li/a/@href').getall()
            for link in links:
                next_page = response.urljoin(link)
                yield response.follow(next_page, headers=self.headers, callback=self.parse2)

        def parse2(self, response):
            # The gallery markup varies between articles, so try several XPath patterns in turn
            pic_list = response.xpath('//*[@class="nc-light-gallery"]/a/@href').getall()
            if len(pic_list) == 0:
                pic_list = response.xpath('//*[@class="nc-light-gallery"]/p/img/@src').getall()
            if len(pic_list) == 0:
                pic_list = response.xpath('//*[@class="nc-light-gallery"]/img/@src').getall()
            # There are actually three more page layouts below; they were removed to protect the site.
            # If you need them for study, work out the remaining formats from the site's structure yourself.
            title = response.xpath('/html/head/title').get()
            title = title[len('<title>'):len(title) - len('</title>')]

            fileName = r"D:\学习资料\img2\\文件内容\\" + title + ".html"

            # with open(fileName, 'wb') as f:
            #     f.write(response.body)
            for index, item in enumerate(pic_list):
                next_page = response.urljoin(item)
                fileName = title + str(index)
                yield scrapy.Request(url=next_page, meta={'filename': fileName}, headers=self.headers,
                                     callback=self.parse3, errback=self.errorBack)

        def parse3(self, response):
            # Write the downloaded image bytes to disk
            fileName = r"D:\学习资料\img2\\" + response.meta['filename'] + ".jpeg"
            self.logger.error("*****************************fileName****************************************")
            self.logger.error(fileName)
            self.logger.error("*****************************fileName****************************************")
            with open(fileName, 'wb') as f:
                f.write(response.body)
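
    The requests above register an errback, self.errorBack, which the excerpt does not define. A minimal sketch of what such a handler might look like (the body is an assumption, not the original author's code):

        # Hypothetical errback; the original article does not show its body
        def errorBack(self, failure):
            # Log the failed request so broken image URLs are easy to spot
            self.logger.error('Request failed: %s', repr(failure))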

    ================= Results on display — very satisfying

     

    Source code download link:

    Python: crawling the image resources of the 唯美女生 (vmgirls) site with Scrapy

  • Using the Scrapy framework in Python to crawl data and save it to a CSV file (Python crawler in practice, 4). 1. The Scrapy framework: Scrapy is a Python crawling framework that combines data parsing, data processing, and data storage. 2. Installing Scrapy. 1. Install the dependency packages ...

    Using the Scrapy framework in Python to crawl data and save it to a CSV file (Python crawler in practice, 4)

    1. The Scrapy framework

      Scrapy is a crawling framework for Python that combines data parsing, data processing, and data storage into one framework.

    2. Installing Scrapy

    1. Install the dependency packages


    yum install gcc libffi-devel python-devel openssl-devel -y

    yum install libxslt-devel -y

     2. Install Scrapy

    pip install scrapy
    pip install twisted==13.1.0

     Note: Scrapy and Twisted have compatibility issues. If the installed Twisted version is too new, running scrapy startproject project_name will raise an error; installing twisted==13.1.0 fixes it.

    3. Crawling data with Scrapy and saving it to CSV

    3.1. Crawl target: get the data of the popular collections on Jianshu. The site is https://www.jianshu.com/recommendations/collections; clicking "热门" (hot) shows the page we need to crawl. The page loads its data asynchronously with AJAX. Press F12, open Network — XHR, and page through to find the page URL https://www.jianshu.com/recommendations/collections?page=2&order_by=hot; changing the value after page= gives access to the other pages.

    3.2. What to crawl

      The fields to crawl for each collection are: collection name, collection description, number of articles included, and number of followers. Scrapy uses XPath to pull out the required data. While writing the spider, you can first test the XPath expressions manually with lxml and only write them into the Scrapy code once they are confirmed; the difference is that in Scrapy you need to call extract() to get the data out.
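
    For example, a minimal sketch (not from the original article) of the difference between checking an XPath by hand with lxml and extracting the same data with Scrapy's selector:

    from lxml import html
    from scrapy.selector import Selector

    snippet = '<div class="col-xs-8"><div><a><h4>Python</h4></a></div></div>'

    # Manual check with lxml: xpath() returns plain strings directly
    tree = html.fromstring(snippet)
    print(tree.xpath('//div[@class="col-xs-8"]/div/a/h4/text()'))           # ['Python']

    # In Scrapy the same expression returns selectors; extract() yields the strings
    sel = Selector(text=snippet)
    print(sel.xpath('//div[@class="col-xs-8"]/div/a/h4/text()').extract())  # ['Python']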

    3.3 Create the crawler project


    [root@HappyLau jianshu_hot_topic]# scrapy startproject jianshu_hot_topic

    # The project directory structure looks like this:
    [root@HappyLau python]# tree jianshu_hot_topic
    jianshu_hot_topic
    ├── jianshu_hot_topic
    │   ├── __init__.py
    │   ├── __init__.pyc
    │   ├── items.py
    │   ├── items.pyc
    │   ├── middlewares.py
    │   ├── pipelines.py
    │   ├── pipelines.pyc
    │   ├── settings.py
    │   ├── settings.pyc
    │   └── spiders
    │       ├── collection.py
    │       ├── collection.pyc
    │       ├── __init__.py
    │       ├── __init__.pyc
    │       ├── jianshu_hot_topic_spider.py    # created manually; holds the data-extraction spider
    │       └── jianshu_hot_topic_spider.pyc
    └── scrapy.cfg

    2 directories, 16 files
    [root@HappyLau python]#

     3.4 Code


    1. items.py: defines the data fields to be crawled

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html

    import scrapy
    from scrapy import Item
    from scrapy import Field

    class JianshuHotTopicItem(scrapy.Item):
        '''
        Inherits the attributes and methods of the parent class scrapy.Item;
        this class defines the fields of the data to be crawled.
        '''
        collection_name = Field()
        collection_description = Field()
        collection_article_count = Field()
        collection_attention_count = Field()
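
    As a quick illustration (not in the original article), a scrapy.Item behaves like a dict, so the fields defined above are assigned and read by key:

    item = JianshuHotTopicItem()
    item['collection_name'] = 'Python'
    item['collection_article_count'] = '1128'     # placeholder value for illustration
    print(dict(item))   # {'collection_name': 'Python', 'collection_article_count': '1128'}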

     

    2. spiders/jianshu_hot_topic_spider.py: the data-extraction logic, implemented with XPath

    [root@HappyLau jianshu_hot_topic]# cat spiders/jianshu_hot_topic_spider.py
    #_*_ coding:utf8 _*_

    import random
    from time import sleep
    from scrapy.spiders import CrawlSpider
    from scrapy.selector import Selector
    from scrapy.http import Request
    from jianshu_hot_topic.items import JianshuHotTopicItem

    class jianshu_hot_topic(CrawlSpider):
        '''
        Crawl Jianshu collection data and extract the target fields from each page.
        '''
        name = "jianshu_hot_topic"
        start_urls = ["https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"]

        def parse(self, response):
            '''
            @params: response -- extract the target fields from the response
            '''
            item = JianshuHotTopicItem()
            selector = Selector(response)
            collections = selector.xpath('//div[@class="col-xs-8"]')
            for collection in collections:
                collection_name = collection.xpath('div/a/h4/text()').extract()[0].strip()
                collection_description = collection.xpath('div/a/p/text()').extract()[0].strip()
                collection_article_count = collection.xpath('div/div/a/text()').extract()[0].strip().replace('篇文章','')
                collection_attention_count = collection.xpath('div/div/text()').extract()[0].strip().replace("人关注",'').replace("· ",'')
                item['collection_name'] = collection_name
                item['collection_description'] = collection_description
                item['collection_article_count'] = collection_article_count
                item['collection_attention_count'] = collection_attention_count

                yield item

            # Follow the remaining pages, sleeping a few seconds between requests
            urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(str(i)) for i in range(3,11)]
            for url in urls:
                sleep(random.randint(2,7))
                yield Request(url, callback=self.parse)

     

    3. pipelines.py: defines how the data is stored. The storage logic written here could save the data to MySQL, MongoDB, plain files, CSV, Excel, and so on; below, saving to CSV is used as the example:

    [root@HappyLau jianshu_hot_topic]# cat pipelines.py
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import csv

    class JianshuHotTopicPipeline(object):
        def process_item(self, item, spider):
            f = file('/root/zhuanti.csv', 'a+')   # Python 2: open the CSV in append mode
            writer = csv.writer(f)
            writer.writerow((item['collection_name'], item['collection_description'], item['collection_article_count'], item['collection_attention_count']))
            return item
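
    The pipeline above is Python 2 style (file() no longer exists in Python 3) and never closes the file handle. A minimal sketch of an equivalent pipeline for Python 3, assuming the same item fields and output path:

    import csv

    class JianshuHotTopicPipeline(object):
        def open_spider(self, spider):
            # Open the CSV once per crawl; newline='' avoids blank rows on Windows
            self.f = open('/root/zhuanti.csv', 'a+', encoding='utf-8', newline='')
            self.writer = csv.writer(self.f)

        def process_item(self, item, spider):
            self.writer.writerow((item['collection_name'],
                                  item['collection_description'],
                                  item['collection_article_count'],
                                  item['collection_attention_count']))
            return item

        def close_spider(self, spider):
            self.f.close()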

     

    4. Edit the settings file:

    ITEM_PIPELINES = {
        'jianshu_hot_topic.pipelines.JianshuHotTopicPipeline': 300,
    }

     3.5 Run the Scrapy spider

      Go back to the directory where the Scrapy project was created and run scrapy crawl spider_name, as follows:


    [root@HappyLau jianshu_hot_topic]# pwd
    /root/python/jianshu_hot_topic
    [root@HappyLau jianshu_hot_topic]# scrapy crawl jianshu_hot_topic
    2018-02-24 19:12:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: jianshu_hot_topic)
    2018-02-24 19:12:23 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 13.1.0, Python 2.7.5 (default, Aug  4 2017, 00:39:18) [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], pyOpenSSL 0.13.1 (OpenSSL 1.0.1e-fips 11 Feb 2013), cryptography 1.7.2, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
    2018-02-24 19:12:23 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu_hot_topic.spiders', 'SPIDER_MODULES': ['jianshu_hot_topic.spiders'], 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0', 'BOT_NAME': 'jianshu_hot_topic'}

     Check the data in /root/zhuanti.csv to confirm it worked.

     

    4. Problems encountered

    1. Twisted version incompatibility, caused by installing too new a version; installing Twisted 13.1.0 fixes it.

    2. Chinese data could not be written and raised an 'ascii' codec error; setting Python's default encoding to UTF-8 fixed it, as follows:


    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> reload(sys)
    <module 'sys' (built-in)>
    >>> sys.setdefaultencoding('utf8')
    >>> sys.getdefaultencoding()
    'utf8'
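
    Note that the reload(sys) / setdefaultencoding trick only exists in Python 2. On Python 3, a cleaner fix (a sketch, not from the original article) is simply to open the CSV with an explicit encoding:

    # Python 3: pass the encoding explicitly instead of changing the default
    with open('/root/zhuanti.csv', 'a+', encoding='utf-8', newline='') as f:
        f.write('简书,热门专题\n')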

     3. The spider could not fetch the site's data because of the request headers; add a USER_AGENT variable in settings.py, for example:


    USER_AGENT="Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"

     When using Scrapy you may hit runs that fail or produce unexpected results. The logs it prints are very detailed; reading the log output, combined with the code and a bit of searching, is usually enough to solve the problem.

  • Create the project: scrapy startproject tencent. Create the spider: scrapy genspider tc careers.tencent.com. tc.py: # -*- coding: utf-8 -*- import scrapy import json class TcSpider(scrapy.Spider): name = 'tc' allowed_...
  • Covers the pitfalls of sending POST requests with Scrapy; the editor found it quite good and shares it here as a reference. Come and take a look.
  • Python使用scrapy爬虫

    2020-06-19 17:13:12
    首先创建scrapy 准备工作 pipev install scrapy scrapy startproject ZhihuLove cd进入项目的spider目录 scrapy genspider ZhihuLove 'www.zhihu.com' 然后用 vscode 或者 pycharm打开项目 创建完成后会...

    First, create the Scrapy project.

    Preparation:

    pip install scrapy

    scrapy startproject ZhihuLove

    cd into the project's spider directory

    scrapy genspider ZhihuLove 'www.zhihu.com'

    Then open the project with vscode or pycharm.

     

    Once creation finishes, the project scaffolding is generated automatically.

    Now you can happily start writing the code for the data you need.

    Crawling the data is mostly a matter of working through the page's nodes.

    First find the outer node: question = response.xpath('//div[@class="Question-main"]'), since the page is a nested div structure.

    Then the list is infos = question.xpath('./div//div[@class="List-item"]'); each list entry is a List-item node.

    To get the individual fields, keep drilling down into the corresponding child nodes.

    Finally, store the data in the database of your choice; a parse sketch follows below.
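
    A minimal sketch of what that parse method might look like, based on the Question-main / List-item structure described above (the inner field XPath and the answer_text key are illustrative assumptions, not taken from the original project):

    def parse(self, response):
        question = response.xpath('//div[@class="Question-main"]')   # outer container
        infos = question.xpath('./div//div[@class="List-item"]')     # one node per answer
        for info in infos:
            yield {
                # drill into the child nodes of each List-item for the fields you need
                'answer_text': ''.join(info.xpath('.//span[@class="RichText"]//text()').getall()),
            }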

    Note: you also need to add the disguise (request header) configuration:

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for ZhihuLove project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'ZhihuLove'
    
    SPIDER_MODULES = ['ZhihuLove.spiders']
    NEWSPIDER_MODULE = 'ZhihuLove.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'ZhihuLove (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    
    # Configure the default request headers
    DEFAULT_REQUEST_HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    }
    # Enable the item pipeline
    ITEM_PIPELINES = {
        'ZhihuLove.pipelines.ZhihulovePipeline': 300,
    }

    The project is at https://github.com/ArdWang/ZhihuLove — give it a star if you like it.

  • Previous chapters covered how to crawl web page data with Scrapy; today we look at how to crawl images. Downloading images uses the ImagesPipeline class. First, the workflow: 1. In a spider, collect the image URLs and store them — this is what the testSpider class in our project...

    Previous chapters covered how to crawl web page data with Scrapy; today we look at how to crawl images.

    Downloading images uses the ImagesPipeline class. First, the workflow:

    1. In a spider, collect the image URLs and store them. In our project this is what the testSpider class in test_spider.py does.

    2. The item is returned from the spider and enters the item pipeline.

    3. In the pipeline, the image URLs gathered in step 1 are scheduled and downloaded by Scrapy's scheduler and downloader.

    4. When the downloads finish, a list of results is returned, containing the download path, the original URL, and the image checksum.

    Those are the four rough steps; now let's see how the code implements them.

    1. First, configure the download pipeline, the download path, and the download parameters in settings.py

    ITEM_PIPELINES = {
    #    'test1.pipelines.Test1Pipeline': 300,
        'scrapy.pipelines.images.ImagesPipeline': 1,
    }

    IMAGES_STORE = 'E:\\scrapy_project\\test1\\image'
    IMAGES_EXPIRES = 90
    IMAGES_MIN_HEIGHT = 100
    IMAGES_MIN_WIDTH = 100

    IMAGES_STORE sets the path where images are saved, IMAGES_EXPIRES sets how long downloaded images are kept (in days), and IMAGES_MIN_HEIGHT / IMAGES_MIN_WIDTH set the minimum image dimensions.

    2. With the settings in place, we write the spider, i.e. step 1: getting the image URLs. We take http://699pic.com/people.html as the example site (摄图网, a Chinese stock-photo site full of photography images). Looking at the page structure, the image addresses are stored in

    the data-original attribute inside <div class="swipeboxex"><div class="list"><a><img>.

    First, define the following fields in items.py (a sketch is shown below).
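
    The original article showed the item definition as a screenshot that is not preserved here; a minimal sketch, assuming the default field names that ImagesPipeline expects (the class name Test1Item is an assumption):

    import scrapy

    class Test1Item(scrapy.Item):
        # Field names that ImagesPipeline looks for by default
        image_urls = scrapy.Field()   # list of image URLs collected by the spider
        images = scrapy.Field()       # filled in by the pipeline after download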

    Based on this page structure, the code in test_spider.py collects the image URLs and saves them into the item; a sketch is shown below.
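
    The spider code was also shown as a screenshot; a minimal sketch under the assumptions above (the spider name, the Test1Item import path, and the exact class attributes are illustrative; the element class names follow the article's description):

    import scrapy
    from test1.items import Test1Item   # assumes the item sketched above lives in test1/items.py

    class testSpider(scrapy.Spider):
        name = 'test_spider'
        start_urls = ['http://699pic.com/people.html']

        def parse(self, response):
            item = Test1Item()
            # the thumbnails keep their real address in the data-original attribute
            item['image_urls'] = response.xpath(
                '//div[@class="swipeboxex"]//div[@class="list"]//a/img/@data-original').extract()
            yield item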

    3. Once the image URLs have been collected in step 2, the item enters the pipeline. Open pipelines.py and first import ImagesPipeline:

    from scrapy.pipelines.images import ImagesPipeline

    Then simply make Test1Pipeline inherit from ImagesPipeline; no code needs to be written inside it:

    class Test1Pipeline(ImagesPipeline):
        pass

    Two functions of ImagesPipeline are worth a closer look: get_media_requests and item_completed. Here is how they are implemented:

    def get_media_requests(self, item, info):
        return [Request(x) for x in item.get(self.images_urls_field, [])]

     

    As the code shows, get_media_requests takes the image URLs from the item and issues a Request to fetch each URL.

    The item_completed function:

    def item_completed(self, results, item, info):
        if isinstance(item, dict) or self.images_result_field in item.fields:
            item[self.images_result_field] = [x for ok, x in results if ok]
        return item

    Once the images have been downloaded, their paths, URLs, and checksums are saved into the item.

     

    Now run the code; the run log from the log file is pasted below:

    2017-06-09 22:38:17 [scrapy] INFO: Scrapy 1.1.0 started (bot: test1)

    2017-06-09 22:38:17 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'test1.spiders', 'IMAGES_MIN_HEIGHT': 100, 'SPIDER_MODULES': ['test1.spiders'], 'BOT_NAME': 'test1', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', 'LOG_FILE': 'log', 'IMAGES_MIN_WIDTH': 100}

    2017-06-09 22:38:18 [scrapy] INFO: Enabled extensions:

    ['scrapy.extensions.logstats.LogStats',

    'scrapy.extensions.telnet.TelnetConsole',

    'scrapy.extensions.corestats.CoreStats']

    2017-06-09 22:38:18 [scrapy] INFO: Enabled downloader middlewares:

    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

    'scrapy.downloadermiddlewares.retry.RetryMiddleware',

    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',

    'scrapy.downloadermiddlewares.stats.DownloaderStats']

    2017-06-09 22:38:18 [scrapy] INFO: Enabled spider middlewares:

    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',

    'scrapy.spidermiddlewares.referer.RefererMiddleware',

    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

    'scrapy.spidermiddlewares.depth.DepthMiddleware']

    2017-06-09 22:38:19 [scrapy] INFO: Enabled item pipelines:

    ['scrapy.pipelines.images.ImagesPipeline']

    2017-06-09 22:38:19 [scrapy] INFO: Spider opened

    2017-06-09 22:38:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

    2017-06-09 22:38:19 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023

    2017-06-09 22:38:19 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:19 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:19 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Crawled (200) (referer: None)

    2017-06-09 22:38:20 [scrapy] DEBUG: File (downloaded): Downloaded file from referred in

    2017-06-09 22:38:20 [scrapy] DEBUG: Scraped from <200 http://699pic.com/people.html>

    {'image_urls': [u'http://img95.699pic.com/photo/50004/2199.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00002/3435.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00038/9869.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00029/2430.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00037/9614.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/50002/2276.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00045/2871.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00015/5234.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00021/8332.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00043/3285.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/50001/6744.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/50001/1769.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00031/3314.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/50006/3243.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/50000/4373.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00013/4480.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00002/9278.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00017/0701.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00022/2328.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00019/6796.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00004/4944.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/50016/6025.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00002/3437.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00014/2406.jpg_wh300.jpg',

    u'http://img95.699pic.com/photo/00007/4890.jpg_wh300.jpg'],

    'images': [{'checksum': '09d39902660ad2e047d721f53e7a2019',

    'path': 'full/eb33d8812b7a06fbef2f57b87acca8cbd3a1d82b.jpg',

    'url': 'http://img95.699pic.com/photo/50004/2199.jpg_wh300.jpg'},

    {'checksum': 'dc87cfe3f9a3ee2728af253ebb686d2d',

    'path': 'full/63334914ac8fc79f8a37a5f3bd7c06abeffac2a8.jpg',

    'url': 'http://img95.699pic.com/photo/00002/3435.jpg_wh300.jpg'},

    {'checksum': 'b19e55369fa0a5061f48fe997b0085e5',

    'path': 'full/4f06b529a4a5fd752339fc120a1bcf89a7125da0.jpg',

    'url': 'http://img95.699pic.com/photo/00038/9869.jpg_wh300.jpg'},

    {'checksum': '786e0cfacf113d00302794c5fda7e93f',

    'path': 'full/9540903ee92ec44d59916b4a5ebcbcec32e50a67.jpg',

    'url': 'http://img95.699pic.com/photo/00029/2430.jpg_wh300.jpg'},

    {'checksum': 'c4266a539b046c1b8f66609b3fef36e2',

    'path': 'full/ea9bda3236f78ac319b02add631d7a713535c8d0.jpg',

    'url': 'http://img95.699pic.com/photo/00037/9614.jpg_wh300.jpg'},

    {'checksum': '0b4d75bb3289a2bda05dac84bfee3591',

    'path': 'full/1831779855e3767e547653a823a4c986869bb6df.jpg',

    'url': 'http://img95.699pic.com/photo/50002/2276.jpg_wh300.jpg'},

    {'checksum': '0c7b9e849acf00646ef06ae4ade0a024',

    'path': 'full/cac4915246f820035198c014a770c80cb078300b.jpg',

    'url': 'http://img95.699pic.com/photo/00045/2871.jpg_wh300.jpg'},

    {'checksum': '4482da90e8b468e947cd2872661c6cac',

    'path': 'full/118dc882386112ab593e86743222200bedb8752a.jpg',

    'url': 'http://img95.699pic.com/photo/00015/5234.jpg_wh300.jpg'},

    {'checksum': '12b8d957b3a1fd7a29baf41c06b17105',

    'path': 'full/ac2e45ed2716678a5045866a65d86e304e97e8ad.jpg',

    'url': 'http://img95.699pic.com/photo/00021/8332.jpg_wh300.jpg'},

    {'checksum': '44275a83945dfe4953ecc82c7105b869',

    'path': 'full/b0ec1a2775ec55931e133c46e9f1d8680bea39e6.jpg',

    'url': 'http://img95.699pic.com/photo/00043/3285.jpg_wh300.jpg'},

    {'checksum': 'e960e51ebbc4ac7bd12c974ca8a33759',

    'path': 'full/5f2a293333ea3f1fd3c63b7d56a9f82e1d9ff4d8.jpg',

    'url': 'http://img95.699pic.com/photo/50001/6744.jpg_wh300.jpg'},

    {'checksum': '08acf086571823fa739ba1b0aa5c99f3',

    'path': 'full/70b24f8e7e1c4b4d7e3fe7cd6056b1ac4904f92e.jpg',

    'url': 'http://img95.699pic.com/photo/50001/1769.jpg_wh300.jpg'},

    {'checksum': '2599ccd44c640948e5331688420ec8af',

    'path': 'full/a20b9c4eaf12a56e8b5506cc2a27d28f3e436595.jpg',

    'url': 'http://img95.699pic.com/photo/00031/3314.jpg_wh300.jpg'},

    {'checksum': '39bcb67a642f1cc9776be934df292f59',

    'path': 'full/8b3d1eee34fb752c5b293252b10f8d9793f05240.jpg',

    'url': 'http://img95.699pic.com/photo/50006/3243.jpg_wh300.jpg'},

    {'checksum': 'd2e554d618de6d53ffd76812bf135edf',

    'path': 'full/907c628df31bf6d6f2077b5f7fd37f02ea570634.jpg',

    'url': 'http://img95.699pic.com/photo/50000/4373.jpg_wh300.jpg'},

    {'checksum': '6fc5c1783080cee030858b9abb5ff6a5',

    'path': 'full/f42a1cca0f7ec657aa66eca9a751a0c4d8defbb1.jpg',

    'url': 'http://img95.699pic.com/photo/00013/4480.jpg_wh300.jpg'},

    {'checksum': '906d1b79cec6ac8a0435b2c5c9517b4a',

    'path': 'full/35853ef411058171381dc65e7e2c824a86caecbe.jpg',

    'url': 'http://img95.699pic.com/photo/00002/9278.jpg_wh300.jpg'},

    {'checksum': '3119eca5ffdf5c0bb2984d7c6dc967c0',

    'path': 'full/b294005510b8159f7508c5f084a5e0dbbfb63fbe.jpg',

    'url': 'http://img95.699pic.com/photo/00017/0701.jpg_wh300.jpg'},

    {'checksum': '7ce71cece48dcf95b86e0e5afce9985d',

    'path': 'full/035017aa993bb72f495a7014403bd558f1883430.jpg',

    'url': 'http://img95.699pic.com/photo/00022/2328.jpg_wh300.jpg'},

    {'checksum': 'ac1d9a9569353ed92baddcedd9a0d787',

    'path': 'full/71038a4d15c0e5613831d49d7a6d5901d40426ac.jpg',

    'url': 'http://img95.699pic.com/photo/00019/6796.jpg_wh300.jpg'},

    {'checksum': 'ad1732345aeb5534cb77bd5d9cffd847',

    'path': 'full/c01104e93ed52a3d62c6a5edb2970281b2d0902b.jpg',

    'url': 'http://img95.699pic.com/photo/00004/4944.jpg_wh300.jpg'},

    {'checksum': 'c3c216d12719b5c00df3a85baef90aff',

    'path': 'full/1470a42ad964c97867459645b14adc73c870f0e1.jpg',

    'url': 'http://img95.699pic.com/photo/50016/6025.jpg_wh300.jpg'},

    {'checksum': '74c37b5a6e417ecfa7151683b178e2b3',

    'path': 'full/821cea60c4ee6b388b5245c6cc7e5aa5d07dacb5.jpg',

    'url': 'http://img95.699pic.com/photo/00002/3437.jpg_wh300.jpg'},

    {'checksum': 'f991181e76a5017769140756097b18f5',

    'path': 'full/a8aa882e28f0704dd0c32d783df860f8b5617b45.jpg',

    'url': 'http://img95.699pic.com/photo/00014/2406.jpg_wh300.jpg'},

    {'checksum': '0878fe7552a6c1c5cfbd45ae151891c8',

    'path': 'full/9c147ed85db3823b1072a761e30e8f2a517e7af0.jpg',

    'url': 'http://img95.699pic.com/photo/00007/4890.jpg_wh300.jpg'}]}

    2017-06-09 22:38:20 [scrapy] INFO: Closing spider (finished)

    2017-06-09 22:38:20 [scrapy] INFO: Dumping Scrapy stats:

    {'downloader/request_bytes': 10145,

    'downloader/request_count': 26,

    'downloader/request_method_count/GET': 26,

    'downloader/response_bytes': 1612150,

    'downloader/response_count': 26,

    'downloader/response_status_count/200': 26,

    'file_count': 25,

    'file_status_count/downloaded': 25,

    'finish_reason': 'finished',

    'finish_time': datetime.datetime(2017, 6, 9, 14, 38, 20, 962000),

    'item_scraped_count': 1,

    'log_count/DEBUG': 53,

    'log_count/INFO': 7,

    'response_received_count': 26,

    'scheduler/dequeued': 1,

    'scheduler/dequeued/memory': 1,

    'scheduler/enqueued': 1,

    'scheduler/enqueued/memory': 1,

    'start_time': datetime.datetime(2017, 6, 9, 14, 38, 19, 382000)}

    2017-06-09 22:38:20 [scrapy] INFO: Spider closed (finished)

     

    From the log you can see that image_urls holds the image URLs and images holds the returned paths. For example, in the entry below the jpg is saved under the full folder:

    full/1470a42ad964c97867459645b14adc73c870f0e1.jpg

    Combined with the IMAGES_STORE value set earlier, the full path of the images is E:\scrapy_project\test1\image\full; the downloaded images can be found there.

    According to the official Scrapy documentation, get_media_requests and item_completed can also be overridden yourself. The code below does that and is nearly the same as the built-in version:

    from scrapy import Request
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline

    class Test1Pipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # issue one download request per image URL
            for image_url in item['image_urls']:
                yield Request(image_url)

        def item_completed(self, results, item, info):
            # keep only the paths of successful downloads
            image_path = [x['path'] for ok, x in results if ok]
            print image_path
            if not image_path:
                raise DropItem('items contains no images')
            item['image_paths'] = image_path
            return item
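
    To have Scrapy use this subclass instead of the stock ImagesPipeline, the pipeline entry in settings.py would point at it instead (a sketch, assuming the test1 project name seen in the log above); note that item_completed writes to item['image_paths'], so the item would also need an image_paths field:

    ITEM_PIPELINES = {
        'test1.pipelines.Test1Pipeline': 1,
    }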

     

    get_media_requests issues a request for each image URL, returning one Request per image. These requests are handled by the pipeline, and when the downloads finish the results are passed to item_completed as a list of 2-tuples.

    success is a boolean: True if the download succeeded, False if it failed.

    image_info_or_error is a dict with the following keys when success is True (it is a Twisted Failure when the download failed):

    url: the URL the image was downloaded from, i.e. the URL of the request returned by get_media_requests

    path: the path the image is stored at, e.g. full/1ca5879492b8fd606df1964ea3c1e2f4520f076f.jpg

    checksum: the MD5 hash of the image content

    item_completed must return an output that is passed on to the subsequent pipeline stages, so we store the downloaded image paths in image_path and return the item.
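
    For illustration, the results list handed to item_completed has roughly this shape (values taken from the log above):

    results = [
        (True, {'checksum': '09d39902660ad2e047d721f53e7a2019',
                'path': 'full/eb33d8812b7a06fbef2f57b87acca8cbd3a1d82b.jpg',
                'url': 'http://img95.699pic.com/photo/50004/2199.jpg_wh300.jpg'}),
        # (False, <failure>) entries appear for downloads that failed
    ]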

  • Basic Python syntax; learn to use pip3 to install missing modules. Problems encountered: missing Microsoft Visual C++ Build Tools (click to download and install); missing win32 module (pip install pywin32). Installation succeeded, time to get to work ...
  • Scrapy official site: http://doc.scrapy.org/en/latest; Scrapy Chinese maintenance site: ... Python 2 / 3. Upgrade pip: pip install --upgrade pip. Install the Scrapy framework with pip: pip ins...
  • Introduces the basic usage of Scrapy, its framework structure, and installation. 1. The makeup of the whole project. 2. A complete project involves writing four Python files, namely items.py and qutoes_spider.py (scrapy genspider qutoes_spider.py jycinema....
  • Crawling with the Scrapy framework in Python (Part 1)

    1000+ reads  2018-04-30 17:18:14
    Software environment: PyCharm 2018, Python 3.6. 1. First we need to install the scrapy module with pip install scrapy, though this approach often runs into many unknown bugs; this blog post is recommended as a reference:...
