  • 亚马逊图书爬虫.py (Amazon book crawler script)

    2019-05-30 00:18:35
    A script that crawls Amazon book listings and implements a simple crawling mechanism.
  • A code walkthrough for scraping book information from Dangdang, JD, and Amazon with Python; a useful reference for anyone with a similar need.
  • Goal: scrape Amazon book information, including the book title, cover image URL, book URL, author, publisher, publication date, price, top-level category, sub-category, and category URL. url:...

    1. Requirements

    Requirement: scrape Amazon book information.

    Goal: collect, for each book, the title, cover image URL, book URL, author, publisher, publication date, price, top-level category, sub-category, and category URL.

    url: https://www.amazon.cn/图书/b/ref=sd_allcat_books_l1?ie=UTF8&node=658390051

    Create the project:

    scrapy startproject book
    cd book
    scrapy genspider -t crawl amazon amazon.cn
    

    2. Approach

    1. Determine the rules for the Rule objects

    URL extraction can be done through the link extractor's restrict_xpaths parameter: there is no need to locate the exact URL string, only the tag that contains it.

    Note: the selected tags must not contain unrelated URLs, otherwise data extraction will fail after those unrelated pages are requested.

    Analysis of the top-level and sub-category URLs shows that they follow the same pattern, so a single Rule can cover the path from top-level category, to sub-category, to list page.
    Top-level category URLs are extracted with '//*[@id="leftNav"]/ul[1]/ul/div/li'.
    Sub-category URLs are extracted with the same '//*[@id="leftNav"]/ul[1]/ul/div/li'.
    List-page (pagination) URLs are extracted with '//*[@id="pagn"]'.
    Detail-page URLs are extracted with '//div[@class="a-row"]/div[1]/a'.
    Set up Rule objects to follow the URLs at these locations:

    rules = (
        Rule(LinkExtractor(restrict_xpaths=r'//*[@id="leftNav"]/ul[1]/ul/div/li'), follow=True),
        Rule(LinkExtractor(restrict_xpaths=r'//*[@id="pagn"]'), follow=True),
        Rule(LinkExtractor(restrict_xpaths=r'//div[@class="a-row"]/div[1]/a'), callback='parse_item', follow=True),
        )
    

    The complete spider:

    # -*- coding: utf-8 -*-
    import re
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class AmazonSpider(CrawlSpider):
        name = 'amazon'
        allowed_domains = ['amazon.cn']
        start_urls = ['https://www.amazon.cn/%E5%9B%BE%E4%B9%A6/b/ref=sd_allcat_books_l1?ie=UTF8&node=658390051']
    
        rules = (
            Rule(LinkExtractor(restrict_xpaths=r'//*[@id="leftNav"]/ul[1]/ul/div/li'), follow=True),
            Rule(LinkExtractor(restrict_xpaths=r'//*[@id="pagn"]'), follow=True),
            Rule(LinkExtractor(restrict_xpaths=r'//div[@class="a-row"]/div[1]/a'), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            """Extract the book information."""
            # title, cover image URL, book URL, author, publisher, publication date, price,
            # top-level category, sub-category, category URL
            item = {}
            name = response.xpath('//*[@id="productTitle"]/text()').extract_first()
            if name:
                # paper book
                item['name'] = name
            else:
                # e-book
                item['name'] = response.xpath('//*[@id="ebooksProductTitle"]/text()').extract_first()
            item['url'] = response.url
            item['author'] = response.xpath('//*[@id="bylineInfo"]/span[1]/a/text()').extract_first()
            # Paperback and Kindle prices use different markup, but both carry the a-color-price class;
            # the first match is the price of this book
            item['price'] = response.xpath('//span[contains(@class,"a-color-price")]/text()').extract_first().strip()

            # category information
            a_s = response.xpath('//*[@id="wayfinding-breadcrumbs_feature_div"]//a')
            # The breadcrumb depth differs depending on which category the page was reached from,
            # so each level has to be checked before it is read
            if len(a_s) >= 1:
                item['b_category_name'] = a_s[0].xpath('./text()').extract_first().strip()
                item['b_category_url'] = response.urljoin(a_s[0].xpath('./@href').extract_first())
            if len(a_s) >= 2:
                item['m_category_name'] = a_s[1].xpath('./text()').extract_first().strip()
                item['m_category_url'] = response.urljoin(a_s[1].xpath('./@href').extract_first())
            if len(a_s) >= 3:
                item['s_category_name'] = a_s[2].xpath('./text()').extract_first().strip()
                item['s_category_url'] = response.urljoin(a_s[2].xpath('./@href').extract_first())

            # Extract publisher and publication date with a regex, which works for both e-books and
            # paper books ('出版社' is the Chinese "publisher" label in the page HTML)
            publisher = re.findall('<li><b>出版社:</b>(.*?)</li>', response.text)
            # Some books have no publisher information
            if len(publisher) != 0:
                item['publisher'] = publisher[0]
                item['published_date'] = re.findall(r'\((.*?)\)', publisher[0])[0]

            yield item
    

    Then replace start_urls with a redis_key
    and change the parent class to RedisCrawlSpider:

    class AmazonSpider(RedisCrawlSpider):
        name = 'amazon'
        allowed_domains = ['amazon.cn']
        # start_urls = ['https://www.amazon.cn/%E5%9B%BE%E4%B9%A6/b/ref=sd_allcat_books_l1?ie=UTF8&node=658390051']
        redis_key = 'amazon:start_urls'
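
    RedisCrawlSpider is not shown being imported above; it comes from scrapy_redis. A minimal sketch of the adjusted imports, assuming scrapy-redis is installed (the rules and parse_item stay exactly as in the CrawlSpider version):

    # replaces "from scrapy.spiders import CrawlSpider, Rule" in the spider above
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy_redis.spiders import RedisCrawlSpider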
    

    Write the start URL from a script: after starting the Redis server, push the URL into the database.

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    
    import redis
    
    # Store the start_url under redis_key in Redis so the spiders can pick it up
    redis_Host = "127.0.0.1"
    redis_key = 'amazon:start_urls'
    
    # create the Redis connection (db is an integer database index)
    rediscli = redis.Redis(host=redis_Host, port=6379, db=0)
    
    # clear any leftover requests/dupefilter data in Redis first
    flushdbRes = rediscli.flushdb()
    print("flushdbRes = {}".format(flushdbRes))
    rediscli.lpush(redis_key, "https://www.amazon.cn/%E5%9B%BE%E4%B9%A6/b/ref=sd_allcat_books_l1?ie=UTF8&node=658390051")
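
    To confirm the push worked, the list can be inspected before the spiders start; a small optional check, assuming the same connection object as above:

    # optional sanity check: the key should now hold exactly one start URL
    print(rediscli.llen(redis_key))            # -> 1
    print(rediscli.lrange(redis_key, 0, -1))   # -> [b'https://www.amazon.cn/...']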
    

    Finally, add the Redis configuration to settings.py:

    # scrapy_redis configuration
    
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    SCHEDULER_PERSIST = True
    
    # Store scraped items in Redis; omit this if Redis storage is not needed
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }
    
    # Redis connection settings
    REDIS_URL = "redis://127.0.0.1:6379"
    

    Then the project can be started: run scrapy crawl amazon on one or more machines, and every instance will pull start URLs and pending requests from the shared Redis queue.
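
    With RedisPipeline enabled in settings, scraped items are serialized and pushed into a Redis list as well. A small sketch for reading them back, assuming the scrapy_redis default item key of "<spider name>:items" (here "amazon:items"):

    import json
    import redis

    r = redis.Redis(host="127.0.0.1", port=6379, db=0)

    # RedisPipeline pushes JSON-serialized items onto "<spider name>:items" by default
    while r.llen("amazon:items"):
        book = json.loads(r.lpop("amazon:items"))
        print(book.get("name"), book.get("price"))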

  • This article walks through code for scraping book information from Dangdang, JD, and Amazon with Python; a useful reference for anyone with a similar need. Notes: 1. The program stores data in a SQL database; edit the connection details at the top before running. 2. Requires bs4, ...

    This article walks through code for scraping book information from Dangdang, JD, and Amazon with Python; a useful reference for anyone with a similar need.

    Notes:

    1. The original note says the program uses MS SQL Server for storage, but the code below actually connects to MySQL through pymysql. Either way, edit the database connection details at the top of the program before running.

    2. Requires the bs4, requests, and pymysql (listed as pymssql in the original note) libraries.

    3. Supports multi-threading.

    from bs4 import BeautifulSoup
    import re, requests, pymysql, threading, os, traceback

    # one shared MySQL connection/cursor is used by all three crawler threads
    try:
        conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='root', db='book', charset="utf8")
        cursor = conn.cursor()
    except:
        print('Error: database connection failed')

    # return the HTML of the given page
    def getHTMLText(url):
        try:
            headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
            r = requests.get(url, headers=headers)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return ''

    # return a Soup object for the given url
    def getSoupObject(url):
        try:
            html = getHTMLText(url)
            soup = BeautifulSoup(html, 'html.parser')
            return soup
        except:
            return ''

    # get the total number of result pages for the keyword on the given book site
    def getPageLength(webSiteName, url):
        try:
            soup = getSoupObject(url)
            if webSiteName == 'DangDang':
                a = soup('a', {'name': 'bottom-page-turn'})
                return a[-1].string
            elif webSiteName == 'Amazon':
                a = soup('span', {'class': 'pagnDisabled'})
                return a[-1].string
        except:
            print('Error: failed to get the page count for {}...'.format(webSiteName))
            return -1

    class DangDangThread(threading.Thread):
        def __init__(self, keyword):
            threading.Thread.__init__(self)
            self.keyword = keyword

        def run(self):
            print('Info: starting to crawl Dangdang...')
            count = 1

            length = getPageLength('DangDang', 'http://search.dangdang.com/?key={}'.format(self.keyword))  # total pages
            tableName = 'db_{}_dangdang'.format(self.keyword)

            try:
                print('Info: creating the DangDang table...')
                cursor.execute('create table {} (id int, title text, prNow text, prPre text, link text)'.format(tableName))
                print('Info: crawling Dangdang pages...')
                for i in range(1, int(length)):
                    url = 'http://search.dangdang.com/?key={}&page_index={}'.format(self.keyword, i)
                    soup = getSoupObject(url)
                    lis = soup('li', {'class': re.compile(r'line'), 'id': re.compile(r'p')})
                    for li in lis:
                        # '单品标题' ("item title") matches a Chinese attribute value in the page HTML
                        a = li.find_all('a', {'name': 'itemlist-title', 'dd_name': '单品标题'})
                        pn = li.find_all('span', {'class': 'search_now_price'})
                        pp = li.find_all('span', {'class': 'search_pre_price'})

                        if not len(a) == 0:
                            link = a[0].attrs['href']
                            title = a[0].attrs['title'].strip()
                        else:
                            link = 'NULL'
                            title = 'NULL'

                        if not len(pn) == 0:
                            prNow = pn[0].string
                        else:
                            prNow = 'NULL'

                        if not len(pp) == 0:
                            prPre = pp[0].string
                        else:
                            prPre = 'NULL'
                        sql = "insert into {} (id,title,prNow,prPre,link) values ({},'{}','{}','{}','{}')".format(tableName, count, title, prNow, prPre, link)
                        cursor.execute(sql)
                        print('Info: saving Dangdang data, current id: {}'.format(count), end='')
                        count += 1
                    conn.commit()
            except:
                pass

    class AmazonThread(threading.Thread):
        def __init__(self, keyword):
            threading.Thread.__init__(self)
            self.keyword = keyword

        def run(self):
            print('Info: starting to crawl Amazon...')
            count = 1
            length = getPageLength('Amazon', 'https://www.amazon.cn/s/keywords={}'.format(self.keyword))  # total pages
            tableName = 'db_{}_amazon'.format(self.keyword)

            try:
                print('Info: creating the Amazon table...')
                cursor.execute('create table {} (id int, title text, prNow text, link text)'.format(tableName))

                print('Info: crawling Amazon pages...')
                for i in range(1, int(length)):
                    url = 'https://www.amazon.cn/s/keywords={}&page={}'.format(self.keyword, i)
                    soup = getSoupObject(url)
                    lis = soup('li', {'id': re.compile(r'result_')})
                    for li in lis:
                        a = li.find_all('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})
                        pn = li.find_all('span', {'class': 'a-size-base a-color-price s-price a-text-bold'})
                        if not len(a) == 0:
                            link = a[0].attrs['href']
                            title = a[0].attrs['title'].strip()
                        else:
                            link = 'NULL'
                            title = 'NULL'

                        if not len(pn) == 0:
                            prNow = pn[0].string
                        else:
                            prNow = 'NULL'

                        sql = "insert into {} (id,title,prNow,link) values ({},'{}','{}','{}')".format(tableName, count, title, prNow, link)
                        cursor.execute(sql)
                        print('Info: saving Amazon data, current id: {}'.format(count), end='')
                        count += 1
                    conn.commit()
            except:
                pass

    class JDThread(threading.Thread):
        def __init__(self, keyword):
            threading.Thread.__init__(self)
            self.keyword = keyword

        def run(self):
            print('Info: starting to crawl JD...')
            count = 1

            tableName = 'db_{}_jd'.format(self.keyword)

            try:
                print('Info: creating the JD table...')
                cursor.execute('create table {} (id int, title text, prNow text, link text)'.format(tableName))
                print('Info: crawling JD pages...')
                for i in range(1, 100):
                    url = 'https://search.jd.com/Search?keyword={}&page={}'.format(self.keyword, i)
                    soup = getSoupObject(url)
                    lis = soup('li', {'class': 'gl-item'})
                    for li in lis:
                        a = li.find_all('div', {'class': 'p-name'})
                        pn = li.find_all('div', {'class': 'p-price'})[0].find_all('i')

                        if not len(a) == 0:
                            link = 'http:' + a[0].find_all('a')[0].attrs['href']
                            title = a[0].find_all('em')[0].get_text()
                        else:
                            link = 'NULL'
                            title = 'NULL'

                        if len(link) > 128:
                            link = 'TooLong'

                        if not len(pn) == 0:
                            prNow = '¥' + pn[0].string
                        else:
                            prNow = 'NULL'
                        sql = "insert into {} (id,title,prNow,link) values ({},'{}','{}','{}')".format(tableName, count, title, prNow, link)
                        cursor.execute(sql)
                        print('Info: saving JD data, current id: {}'.format(count), end='')
                        count += 1
                    conn.commit()
            except:
                pass

    def closeDB():
        global conn, cursor
        conn.close()
        cursor.close()

    def main():
        print('Info: before using this program, create an empty database named "book" and edit the connection settings at the top')
        keyword = input("Info: enter the keyword to crawl: ")

        dangdangThread = DangDangThread(keyword)
        amazonThread = AmazonThread(keyword)
        jdThread = JDThread(keyword)
        dangdangThread.start()
        amazonThread.start()
        jdThread.start()
        dangdangThread.join()
        amazonThread.join()
        jdThread.join()
        closeDB()
        print('Crawling finished, closing....')
        os.system('pause')

    main()
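
    One fragile spot in the code above is that the INSERT statements are built with str.format, so a title containing a single quote will break the SQL. pymysql supports parameterized queries, which sidesteps this; a small sketch of how the Dangdang insert could be written instead (a hypothetical refactor, not from the original post):

    # Parameterized version of the insert used in the threads above.
    # pymysql escapes the values itself, so quotes in titles cannot break the SQL;
    # only the table name still has to be interpolated.
    sql = "insert into {} (id, title, prNow, prPre, link) values (%s, %s, %s, %s, %s)".format(tableName)
    cursor.execute(sql, (count, title, prNow, prPre, link))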
    Sample screenshots: partial results for the keyword "Android" (exported to Excel).
  • scrapy_redis distributed crawler for Amazon books. While learning about distributed crawling, I did a small exercise on Amazon's book catalogue. URL: https://www.amazon.cn/gp/book/all_category/ref=sv_b_0 Preparation: install the redis database (tutorials are easy to find online) ...

    scrapy_redis distributed crawler for Amazon books

    • While learning about distributed crawling, I did a small exercise on Amazon's book catalogue
    • URL: https://www.amazon.cn/gp/book/all_category/ref=sv_b_0
    • Preparation
      • Install the redis database (tutorials are easy to find online)
      • Install Scrapy and scrapy-redis
      • pip install scrapy (if it fails, search for a fix; a Visual C++ build environment is needed on Windows)
      • pip install scrapy-redis

    Crawl flow

    • From the category page, follow each top-level category to its sub-categories and scrape the book entries on their list pages

    Main code

    • settings
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for amazon project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'amazon'
    
    SPIDER_MODULES = ['amazon.spiders']
    NEWSPIDER_MODULE = 'amazon.spiders'
    
    
    # scrapy_redis components
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    SCHEDULER_PERSIST = True
    
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400,
    }
    
    REDIS_HOST = "127.0.0.1"
    REDIS_PORT = 6379
    REDIS_PARAMS = {
        'password': 'root',
    }
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 0.5
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'amazon.middlewares.AmazonSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'amazon.middlewares.AmazonDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    #ITEM_PIPELINES = {
    #    'amazon.pipelines.AmazonPipeline': 300,
    #}
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    
    • spiders
    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy_redis.spiders import RedisSpider
    from copy import deepcopy
    
    
    class BookSpider(RedisSpider):
        name = 'book'
        allowed_domains = ['amazon.cn']
        # start_urls = ['http://amazon.cn/']
        redis_key = "amazon_book"
    
        def parse(self, response):
            div_list = response.xpath('//div[@id="content"]/div[@class="a-row a-size-base"]')
    
            for div in div_list:
                item = {}
    
                item['first_title'] = div.xpath('./div[1]/h5/a/@title').extract_first()
                td_list = div.xpath('./div[2]//td')
    
                for td in td_list:
                    item['second_title'] = td.xpath('./a/@title').extract_first()
                    item['second_url'] = td.xpath('./a/@href').extract_first()
    
                    if item['second_url']:
                        # one of the URLs is not a full address, so check before requesting it
                        if "http://www.amazon.cn/" in item['second_url']:
                            yield scrapy.Request(
                                url=item['second_url'],
                                callback=self.parse_book_list,
                                meta={'item': deepcopy(item)}
                            )
    
        def parse_book_list(self, response):
            item = response.meta['item']
    
            li_list = response.xpath('//div[@id="mainResults"]/ul/li')
    
            for li in li_list:
                item['book_name'] = li.xpath('.//div[@class="a-row a-spacing-small"]/div[1]/a/@title').extract_first()
                item['book_author'] = li.xpath('.//div[@class="a-row a-spacing-small"]/div[2]/span/text()').extract()
                item['book_type'] = li.xpath('.//div[@class="a-column a-span7"]/div[@class="a-row a-spacing-none"][1]//text()').extract_first()
                item['book_price'] = li.xpath('.//div[@class="a-column a-span7"]/div[@class="a-row a-spacing-none"][2]/a//text()').extract_first()
    
                print(item)
    
            # pagination: "下一页" is the Chinese "next page" label in the page markup
            next_url = response.xpath('(//a[text()="下一页"]|//a[@title="下一页"])/@href').extract_first()
            if next_url:
                next_url = "https://www.amazon.cn" + next_url
    
                yield scrapy.Request(
                    url=next_url,
                    callback=self.parse_book_list,
                    meta={'item': item}
                )
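
    Note: as written, parse_book_list only prints each item. For the RedisPipeline enabled in settings to store the results, the item has to be yielded back to the engine; a minimal change (a suggestion, not part of the original post):

                # at the end of the inner loop in parse_book_list,
                # replace print(item) with:
                yield item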
    
    
    
    • Run

      • On the Master, push the start URL: lpush amazon_book "https://www.amazon.cn/gp/book/all_category/ref=sv_b_0"
      • On each Slave, start the spider: scrapy crawl book
    • Partial results (screenshot omitted)

    Summary

    • Amazon's page HTML is quite regular, but that very regularity makes XPath extraction a little awkward in places.
    • Most URLs are complete, but the URLs in the "Kindle今日特价书" (Kindle daily deals) section are not,
    • so a CrawlSpider-based approach to this site might actually work better.
  • Python Amazon book crawler

    1,000+ reads 2018-02-23 15:54:51
    encoding=utf8 import requests import time ...article=str(article).replace('亚马逊编辑推荐:','') print article def main(): html=get_detail() parse_detail(html) if name == 'main': main()

    # encoding=utf8
    # (the original was Python 2; lightly adapted here to Python 3: print(), urllib.parse)

    import time
    import urllib.parse

    import requests
    from requests.exceptions import RequestException
    from bs4 import BeautifulSoup


    def get_detail():
        times = int(time.time())
        datas = {
            'ref_': 'dp_apl_pc_loaddesc',
            'asin': 'B00JZ96ZI8',
            'cacheTime': times,
            'merchantId': 'A1AJ19PSB66TGU',
            'deviceType': 'web'
        }
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
        }
        url = 'https://www.amazon.cn/gp/product-description/ajaxGetProuductDescription.html?' + urllib.parse.urlencode(datas)
        # the request headers are required
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.text
        except RequestException:
            print('Error requesting the description page')
            # return None


    def parse_detail(html):
        # build a BeautifulSoup object and parse with lxml
        soup = BeautifulSoup(html, 'lxml')
        # table of contents
        directory = soup.select('#s_content_4 > p')[0]
        # editorial recommendation; '亚马逊编辑推荐:' is the Chinese "Amazon editorial picks" label in the HTML
        article = soup.select('#s_content_0 > p')[0]
        article = str(article).replace('亚马逊编辑推荐:', '')
        print(article)


    def main():
        html = get_detail()
        parse_detail(html)


    if __name__ == '__main__':
        main()


  • 1. Preparation: start Docker and run Splash with docker run -p 8050:8050 scrapinghub/splash; decide on the crawl target; create the database tables. 2. Create the project
  • Amazon has anti-crawling measures, so the request header needs at least one field; this example adds a User-Agent, though requests still fail fairly often and have to be retried. # _*_coding:utf-8_*_ # created by Zhang Q.L.on 2018/5/7 0007 from atexit import reg...
  • closeDB() print('\nCrawling finished, closing....') os.system('pause') main() Sample screenshots: partial results for the keyword "Android" (exported to Excel). Summary: that is all of this article's code example for scraping book information from Dangdang, JD, and Amazon with Python...
  • A Scrapy crawler for JD book reviews
  • # get the total number of result pages for the keyword on the given book site def getPageLength(webSiteName,url): try: soup = getSoupObject(url) if webSiteName == 'DangDang': a = soup('a',{'name':'bottom-page-turn'}) return a[-1].string ...
  • This article walks through code for scraping book information from Dangdang, JD, and Amazon with Python; a useful reference for anyone learning Python crawlers or with a similar need. Notes: 1. The program uses MS SQL Server for storage; before running, manually...
  • Summary: that is all of this article's code example for scraping book information from Dangdang, JD, and Amazon with Python; hopefully it helps. Interested readers can browse the rest of this site; corrections are welcome in the comments. Thanks for the support!
  • Reposted from: https://blog.51cto.com/catinsunshine/715929
  • ...closing....') os.system('pause') main() Sample screenshots: partial results for the keyword "Android" (exported to Excel). Summary: that is all of this article's code example for scraping book information from Dangdang, JD, and Amazon with Python; hopefully it helps....
