Featured content
Downloads
Q&A
  • Amazon crawler

    2018-04-24 17:28:42
    An Amazon crawler that scrapes product reviews, prices, and other information and saves them in CSV format...
  • Amazon crawler. Requirements: given a keyword, scrape Amazon product listings and write them to a database. 1. Determine how many list pages there are; 2. Build and fetch the list-page URLs; 3. Collect the detail-page URLs from the list pages; 4. Extract the required fields from each detail page ...

    Amazon crawler

    Crawler requirement: given a keyword, scrape Amazon product listings and write the results to a database.

    1. Determine how many list pages there are.

    2. Build and fetch the list-page URLs.

    3. Collect the detail-page URLs from the list pages.

    4. Extract the required fields from each detail page.

    import requests
    from lxml import etree
    import urllib3
    import time
    from Database import Database
    import socket
    import random
    import json
    import ssl
    import os
    
    ssl._create_default_https_context = ssl._create_unverified_context
    urllib3.disable_warnings()
    headers = {
        "authority": "www.amazon.com",
        "referer": "https://www.amazon.com/",
        "cookie": "session-id=135-9270034-7902044; session-id-time=2082787201l; i18n-prefs=USD; ubid-main=133-6329801-5373634; lc-main=en_US; x-amz-captcha-1=1611561400323067; x-amz-captcha-2=A5mJb102s77jmJPXHmDTkw==; session-token=NglmrU6O168Bqrx5lTGDGYMT/SEPDr9oHKh6tOadX2whsc9nbcGpv0Sq6IbsWH3HsZeM0356/n/4hEMfVHaSRZp9AbitEPua6hu2BJqjWUum8UbFtF0lPXlS0dBb4RdzqFtuQY038nDZ4HGb5ELj/13C2LDghkkYrJ8r8efe8FR2CctuJFol/fN11G5PIAQi; skin=noskin; csm-hit=tb:G1YDRWT651WPV3D9SX2V+s-Y79DQEGWVSS7F4EEJ7AR|1611623253884&t:1611623253884&adb:adblk_yes",
        "rtt": "150",
        "downlink": "9.7",
        "ect": "4g",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
        "path": "/s?k=canvas+print&ref=nb_sb_noss_1",
    }
    
    
    # Build the search URL from a keyword
    def get_url(keyword):
        url = "https://www.amazon.com/s?k=" + keyword.replace(" ", "+") + "&page="
        return url
    
    
    # Get the number of list pages from the pagination bar
    def get_page_url_num(url):
        res = requests.get(url, headers=headers, verify=False, timeout=100)
        html = etree.HTML(res.text)
        num = int(html.xpath('//ul[@class="a-pagination"]//li[last()-1]//text()')[0])
        return num
    
    
    # Build the list of paginated search URLs
    def get_page_url_list(start_num, end_num, keyword):
        basic_url = "https://www.amazon.com/s?k=" + keyword.replace(" ", "+") + "&page="
        # basic_url = "https://www.amazon.com/s?k=water+shoes&page="
        page_url_list = []
        for i in range(start_num, end_num):
            url = basic_url + str(i + 1)
            page_url_list.append(url)
        return page_url_list
    
    
    # Collect detail-page URLs (and their main image URLs) from the list pages
    def get_detail_url_dict(page_url_list):
        detail_url_list = []
        imag_url_list = []
        basic = "https://www.amazon.com"
        exception_detail_url_list = []
        for i in page_url_list:
            try:
                res = requests.get(i, headers=headers, verify=False, timeout=100)
                # print(res.text)
                time.sleep(random.randint(3, 10))
                res.close()
                socket.setdefaulttimeout(30)
                html = etree.HTML(res.text)
                detail_url = html.xpath(
                    '//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href')
                detail_url = [basic + i for i in detail_url]
                detail_url_list.append(detail_url)
                #             print(imag_url_list)
                imag_url = html.xpath(
                    '//div[@class="a-section aok-relative s-image-tall-aspect"]/img/@src')
                if len(imag_url) != 0:
                    pass
                else:
                    imag_url = html.xpath('//div[@class="a-section aok-relative s-image-square-aspect"]/img/@src')
                imag_url_list.append(imag_url)
            #             print(detail_url_list)
            #             imag_file_name_list = html.xpath('//div[@class="a-section a-spacing-none a-spacing-top-small"]//h2//span/text()')
            #             img_dir = dict(zip(imag_file_name_list,imag_url_list))
            #         return img_dir
            except Exception as e:
                # Record the list page that failed (detail_url may not be defined yet here)
                exception_detail_url_list.append(i)
                print(e)
                print(i)
            continue
        detail_url_list = [i for k in detail_url_list for i in k]
        imag_url_list = [i for k in imag_url_list for i in k]
        detail_url_dict = dict(zip(detail_url_list, imag_url_list))
        return detail_url_dict
    
    
    # Extract the fields we need from a detail page
    def get_data(url):
        response = requests.get(url, headers=headers, verify=False, timeout=100)
        response.close()
        socket.setdefaulttimeout(30)
        time.sleep(random.randint(3, 10))
        detail_html = etree.HTML(response.text)
        # Title
        title = detail_html.xpath('//span[@id="productTitle"]/text()')[0].strip()
        # Rating
        #     score = detail_html.xpath('//span[@id="acrPopover"]//i/span[1]/text()')[0][:3]
        score = detail_html.xpath('//span[@id="acrPopover"]//i/span[1]/text()')
        if len(score) != 0:
            score = score[0][:3]
        else:
            score = detail_html.xpath('//span[@class="a-size-medium a-color-base"]//text()')
            if len(score) != 0:
                score = score[0][:3]
            else:
                score = 0
        # Number of ratings
        #     score_num = int(
        #         detail_html.xpath('//span[@id="acrCustomerReviewText"]/text()')[0].replace(r",", "").replace("ratings",
        #                                                                                                      "").replace(
        #             "rating", "").strip())
        score_num = detail_html.xpath('//span[@id="acrCustomerReviewText"]/text()')
        if len(score_num) != 0:
            score_num = int(score_num[0].replace(r",", "").replace("ratings", "").replace("rating", "").strip())
        else:
            # score_num = detail_html.xpath('//span[@class="a-size-base a-color-secondary"]//text()')[0].replace(
            #     "global ratings", "").replace("global rating", "").strip()
            # score_num = int(score_num)
            score_num = detail_html.xpath('//span[@class="a-size-base a-color-secondary"]//text()')
            if len(score_num) != 0:
                score_num = score_num[0].replace("global ratings", "").replace("global rating", "").strip()
                score_num = int(score_num)
            else:
                score_num = 0
        # Price
        price = detail_html.xpath('//span[@id="priceblock_ourprice"]/text()')
        if len(price) != 0:
            price = price[0]
        else:
            price = detail_html.xpath('//span[@id="priceblock_saleprice"]/text()')
            if len(price) != 0:
                price = price[0]
            else:
                price = detail_html.xpath('//span[@id="priceblock_dealprice"]/text()')[0]
        # Color (left commented out)
        # color = detail_html.xpath(
        #     '//ul[@class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesRectangle imageSwatches"]//li//img/@alt')
        # Brand
        brand = detail_html.xpath('//a[@id="bylineInfo"]/text()')[0].replace("Visit the ", "").replace(" Store",
                                                                                                       "").replace(
            "Brand: ", "").strip()
        # Product details (date first available, ASIN, dimensions, etc.)
        product_info_key = detail_html.xpath('//div[@id="detailBullets_feature_div"]//li/span//span[1]//text()')
        product_info_key = [j.replace("\n", "").replace(":", "") for j in product_info_key]
        product_info_values = detail_html.xpath('//div[@id="detailBullets_feature_div"]//li/span//span[2]//text()')
        str_xpath = '//table[@id="productDetails_detailBullets_sections1"]//tr[{}]/td/text()'
        str_xpath_list = []
        if len(product_info_key) == 0:
            product_info_key = detail_html.xpath('//table[@id="productDetails_detailBullets_sections1"]//tr/th//text()')
            product_info_key = [j.replace("\n", "").replace(":", "") for j in product_info_key]
            print(product_info_key)
            for i in range(len(product_info_key)):
                str_xpath_list.append(str_xpath.format(str(i + 1)))
                product_info_values.append(detail_html.xpath(str_xpath.format(str(i + 1))))
            product_info_values = [i[len(i) - 1].replace("\n", "").replace("out of 5 stars", "").strip() for i in
                                   product_info_values]
        product_info = dict(zip(product_info_key, product_info_values))
        if "Date First Available" in product_info.keys():
            publish_time = product_info["Date First Available"]
            publish_time = publish_time.replace(" ", "_").replace(",", "").replace("January", "01").replace("February",
                                                                                                            "02").replace(
                "March", "03").replace("April", "04").replace("May", "05").replace("June", "06").replace("July",
                                                                                                         "07").replace(
                "August", "08").replace("September", "09").replace("October", "10").replace("November", "11").replace(
                "December", "12").split("_")
            publish_time = str(publish_time[2] + "/" + publish_time[0] + "/" + publish_time[1])
        else:
            publish_time = ""
        if "ASIN" in product_info.keys():
            asin = product_info["ASIN"]
        else:
            asin = ""
        #     print("ASIN:"+asin)
        if "Product Dimensions" in product_info.keys():
            product_dimensions = product_info["Product Dimensions"]
        else:
            product_dimensions = ""
        #     print("Product Dimensions:"+product_dimensions)
        if "Item model number" in product_info.keys():
            item_model_number = product_info["Item model number"]
        else:
            item_model_number = ""
        if "Department" in product_info.keys():
            department = product_info["Department"]
        else:
            department = ""
        if "Manufacturer" in product_info.keys():
            manufacturer = product_info["Manufacturer"]
        else:
            manufacturer = ""
        if "Is Discontinued By Manufacturer" in product_info.keys():
            Is_Discontinued_By_Manufacturer = product_info["Is Discontinued By Manufacturer"]
        else:
            Is_Discontinued_By_Manufacturer = ""
        # Sub-category ranks (left commented out)
        # detail_rank_key = detail_html.xpath('//span[@class="zg_hrsr_ladder"]//a//text()')
        # detail_rank_values = detail_html.xpath('//ul[@class="zg_hrsr"]//li//span[@class="zg_hrsr_rank"]//text()')
        # detail_rank_values = [int(i.replace("#", "")) for i in detail_rank_values]
        # detail_rank = dict(zip(detail_rank_key, detail_rank_values))
        # Main category rank
        main_rank_key = detail_html.xpath('//li[@id="SalesRank"]/text()')
        if len(main_rank_key) != 0:
            main_rank_key = [i.replace("\n", "").replace(r"(", "").replace(r")", "").strip() for i in main_rank_key]
            while "" in main_rank_key:
                main_rank_key.remove("")
            main_rank_key = main_rank_key[0].split(" ")
            main_rank_values = main_rank_key[0].replace("#", "")
            main_rank_key = main_rank_key[1:]
            main_rank_key = " ".join(main_rank_key)
            rank = int(main_rank_values.replace(",", ""))
            rank_type = main_rank_key.replace("in", "").strip()
        else:
            main_rank_key = detail_html.xpath(
                '//ul[@class="a-unordered-list a-nostyle a-vertical a-spacing-none detail-bullet-list"][1]//li[1]/span[1]/text()')
            if len(main_rank_key) != 0:
                main_rank_key = [i.replace("\n", "").replace(r"(", "").replace(r")", "").strip() for i in main_rank_key]
                while "" in main_rank_key:
                    main_rank_key.remove("")
                main_rank_key = main_rank_key[0].split(" ")
                main_rank_values = main_rank_key[0].replace("#", "")
                main_rank_key = main_rank_key[1:]
                main_rank_key = " ".join(main_rank_key)
                rank = int(main_rank_values.replace(",", ""))
                rank_type = main_rank_key.replace("in", "").strip()
            else:
                main_rank_key = detail_html.xpath(
                    '//table[@id="productDetails_detailBullets_sections1"]//tr/td//span/span[1]/text()')
                if len(main_rank_key) != 0:
                    main_rank_key = [i.replace("\n", "").replace(r"(", "").replace(r")", "").strip() for i in main_rank_key]
                    while "" in main_rank_key:
                        main_rank_key.remove("")
                    main_rank_key = main_rank_key[0].split(" ")
                    main_rank_values = main_rank_key[0].replace("#", "")
                    main_rank_key = main_rank_key[1:]
                    main_rank_key = " ".join(main_rank_key)
                    rank = int(main_rank_values.replace(",", ""))
                    rank_type = main_rank_key.replace("in", "").strip()
                else:
                    rank = 0
                    rank_type = ""
        #     print("类目:"+rank_type)
        #     feature_key = detail_html.xpath('//div[@id="cr-summarization-attributes-list"]//div[@class="a-fixed-right-grid a-spacing-base"]')[0:3]
        #     feature_value = detail_html.xpath('//div[@id="cr-summarization-attributes-list"]//div[@class="a-fixed-right-grid a-spacing-base"]//text()')
        #     feature =dict(zip(feature_key,feature_value))
        info_list = [
            title, score, score_num, price, brand, publish_time, asin, product_dimensions, item_model_number,
            department,
            manufacturer, Is_Discontinued_By_Manufacturer, rank, rank_type, url]
        return info_list
    
    
    if __name__ == '__main__':
        start_time = time.time()
        keyword = "wall tapestry"
        url = get_url(keyword)
        #     print(url)
        #     num = get_page_url_num(url)
        #     print(num)
        sql = "insert into wall_tapestry( title, score, score_num, price, brand, publish_time, asin, product_dimensions, item_model_number,department,manufacturer,is_discontinued_by_Manufacturer , main_rank, rank_type, url,search_rank,picture_url) values(% s, % s, % s, % s, % s, % s, % s, % s, % s, % s, % s, % s, % s, % s, % s, % s, % s)"
        #     inser_img_sql = "insert into exercise_ball(image) values( %s)"
        db = Database()
        page_url_list = get_page_url_list(0, 7, keyword)
        print(page_url_list)
        detail_url_dict = get_detail_url_dict(page_url_list)
        detail_url_list = list(detail_url_dict.keys())
        print(detail_url_list)
        # Detail pages that could not be crawled correctly
        exception_url_list = []
        exception_question = []
        for i in detail_url_list:
            try:
                # Main image URL
                picture_url = detail_url_dict[i]
                # Position of this listing in the search results
                search_rank = detail_url_list.index(i) + 1
                info_list = get_data(i)
                info_list.append(search_rank)
                info_list.append(picture_url)
                tuple1 = tuple(info_list)
                print(tuple1)
                db.insert(sql, tuple1)
            except Exception as e:
                exception_url_list.append(i)
                exception_question.append(e)
                print(i)
                print(e)
            continue
        exception = dict(zip(exception_url_list, exception_question))
        print(exception)
        # Close the database connection
        db.close()
        end_time = time.time()
        print("累计花费:" + str(round((end_time - start_time) / 60)) + "分钟")
    
    
    
  • Writing an Amazon crawler with "Requests" + "bs4": we finally get to pulling data with Python. Some sellers ask: why Python? Aren't there already Chrome extensions and other simpler methods? Yes, but none of them reach the level of crawling whatever you point them at, without fear of being blocked by the target site...

    Writing an Amazon crawler with "Requests" + "bs4"

    We finally get to pulling data with "Python". Some sellers ask: why Python? Aren't there already Chrome extensions and other simpler methods? Yes, but none of them reach the level of crawling whatever you point them at, wherever you point them, without fear of being blocked by the target site.

    As one of the most popular programming languages, "Python" not only has a rich and powerful ecosystem of libraries, it is also nicknamed a "glue language" because it can easily tie together modules written in other languages (especially C/C++). Writing a crawler with it means standing on the shoulders of giants: much of the work is already done for us, we just pull it from a library.

    Without further ado, here is how to do it.

    Environment setup. Step 1: Install Python. Here we use Python 3.6.6, which can be downloaded directly from the links below.

    Windows version:

    https://www.python.org/ftp/python/3.6.6/python-3.6.6.exe

    macOS version:

    https://www.python.org/ftp/python/3.6.6/python-3.6.6-macosx10.9.pkg

    For other versions, visit the official Python website:

    https://www.python.org/downloads/release/python-366/

    First check "Add Python 3.6 to PATH" and click "Customize installation".


    Then check "Install for all users" and click "Install".


    After installation, check that Python was installed successfully: open a cmd window and type python. If you see output similar to the screenshot, Python is installed correctly.

    Step 2: Install PyCharm.

    PyCharm is an application that provides a Python development environment and helps us write and debug code more comfortably.

    Windows version:

    https://download.jetbrains.com/python/pycharm-professional-2018.2.exe

    macOS version:

    https://download.jetbrains.com/python/pycharm-professional-2018.2.dmg

    Detailed installation steps are described here:

    https://www.cnblogs.com/dcpeng/p/9031405.html

    After downloading, double-click the PyCharm installer; installation is straightforward, basically just click Next all the way through.

    Step 3: Configure PyCharm. Open PyCharm and complete the configuration by following the sequence of screenshots.

    Step 4: Create a new project (follow the sequence of screenshots).

    That completes the PyCharm + Python 3.6 environment setup.

    Writing the first crawler. Crawling is best learned step by step, so today we start with something simple to build understanding and prepare for scraping Amazon data later.

    Analyzing the target site

    Today's target site is Amazon US, https://www.amazon.com. Open it in Chrome, search for the keyword "iphone", and inspect the page structure.


    After a quick look, the structure around the target data is easy to see:


    The <ul> tag with id s-results-list-atf contains a number of <li> tags, and each <li> tag holds the information for one product. So all we need to do is request the page, grab the source of the <ul> tag with id s-results-list-atf, and then parse out the data we want ourselves.

    A crawler has four main parts. 1. Request the data:

    First we install "Requests" for Python.

    In the "black_Friday" file we just created, enter:

        import requests                # import requests
        from bs4 import BeautifulSoup  # import BeautifulSoup from bs4


    Put the cursor on the red-underlined requests, press "Alt" + "Enter", choose "Install package requests", and wait for the module to install; the red squiggle disappears when it is done.

    Install the "bs4" module the same way. Then:

        url = 'https://www.amazon.com/s/keywords=iphone'
        response = requests.get(url)

    2. Get the data

    "response" is a variable that holds the data the target site returns to us.

    The following line prints the returned data to the console:

        print(response.text)


    3. Parse the data. The returned data looks like a mess; how do we find what we want in such a big blob? This is where the "bs4" module we imported comes in. Beautiful Soup is a Python library for extracting data from HTML or XML files. It lets you navigate, search, and modify the document through your parser of choice, and it can save you hours or even days of work.

    First create a "BeautifulSoup" object, which we name response_soup:

        response_soup = BeautifulSoup(response.text, 'html.parser')

    Here "response.text" is the returned data and "html.parser" is the parser to use.

        result_list = response_soup.find('ul', id='s-results-list-atf').find_all("li")

    This finds the <ul> tag with id s-results-list-atf in response_soup, then finds all the <li> tags inside it.

        for li in result_list:
            print(li)
            print("=" * 60)

    We can loop over and print each <li> tag to check that it matches the data we want.


    1. ASIN

    With Chrome's inspector we can see that each <li> tag's "data-asin" attribute is the product's "ASIN".


        asin = li['data-asin']

    This extracts each product's "ASIN".

    2. Price


    The product price sits in a tag whose class is a-size-base a-color-base. Find that tag and take its text to get the price:

        price = li.find('span', 'a-size-base a-color-base').text

    3. Star


    The product's star rating sits in a tag whose class is a-icon-alt. Find that tag and take its text to get the rating.

        star = li.find('span','a-icon-alt').text

    With that we have crawled the ASIN, price, and star rating of every product on the page.

    4. Save the data

    Use the csv library to save the crawled data in CSV format:

        import csv  # import the csv library

    Define a list to hold each product's data:

        info_list = []

    Append the ASIN, price, and star rating to the list:

        info_list.append(asin)

        info_list.append(price)

        info_list.append(star)

    Open the CSV file (it is created automatically if it does not exist in the current directory). Here we call it "iphone.csv":

        csvFile = open('./iphone.csv', 'a', newline='')

    Create a writer, write the row, and close the CSV file:

        writer = csv.writer(csvFile)

        writer.writerow(info_list)

        csvFile.close()

    Full code:

        import requests                # import requests
        from bs4 import BeautifulSoup  # import BeautifulSoup from bs4
        import csv

        url = 'https://www.amazon.com/s/keywords=iphone'
        response = requests.get(url)

        response_soup = BeautifulSoup(response.text, 'html.parser')

        result_list = response_soup.find('ul', id='s-results-list-atf').find_all("li")
        for li in result_list:
            info_list = []
            try:
                price = li.find('span', 'a-offscreen').text
            except:
                price = li.find('span', 'a-size-base a-color-base').text
            asin = li['data-asin']
            star = li.find('span', 'a-icon-alt').text
            print(asin)
            print(price)
            print(star)

            info_list.append(asin)
            info_list.append(price)
            info_list.append(star)

            csvFile = open('./iphone.csv', 'a', newline='')
            writer = csv.writer(csvFile)
            writer.writerow(info_list)
            csvFile.close()
            print("=" * 60)

    What it looks like when run (screenshots):

    That is all for this crawler article. Thanks for reading.

        附 「Requests」&「bs4」的中文操作文档:

        Requests:

        http://docs.python-requests.org/zh_CN/latest/user/quickstart.html

        bs4:

        http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/


  • After joining a foreign-trade company in 2018 I have been doing crawler work and handled many requests for scraping Amazon data. Along the way I found that many companies need data scraped, but developing and maintaining a dedicated crawler system is expensive and there was no tool on the market that fit...

    After joining a foreign-trade company in 2018 I have been doing crawler work and have handled many requests for scraping Amazon data. Along the way I found that many companies need data scraped, but developing and maintaining a dedicated crawler system is expensive, and there was no tool on the market that met these needs.

    So I decided to build a tool focused on scraping Amazon data. After comparing approaches: a standalone crawler needs a lot of proxies to force the data out, while Chrome extension development, which I recently got into, lets the tool reuse the browser's local cookies, which in many ways makes it easier to get around Amazon's anti-scraping measures than a standalone crawler.
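
    For comparison, a standalone requests-based crawler typically has to rotate proxies. A minimal illustrative sketch of that pattern (the proxy list and retry handling are placeholders, not part of the extension described here):

    import random
    import requests

    # Placeholder proxy list; in practice this would come from a paid pool or a local proxy service.
    PROXIES = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]

    def fetch(url, retries=3):
        """Try the request through a randomly chosen proxy, retrying on failure."""
        for _ in range(retries):
            proxy = random.choice(PROXIES)
            try:
                resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
                if resp.status_code == 200:
                    return resp.text
            except requests.RequestException:
                continue
        return None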

    The first version only supports review data on the US site; I will add support for more when I have time.

    Usage

    1. Requirements

    Use Chrome or the 360 Speed Browser.

    2. Steps

    • Enable developer mode

    In the browser, go to chrome://extensions/ and turn on developer mode.
    Chrome: once it is turned on, the extension ID will appear.

    • Use it

    Go to https://www.amazon.com/

    Multiple ASINs can simply be pasted in at once.

    • Known issues

    If there are many ASINs the download will be slow, but a progress indicator is shown below.

    If the data download button cannot be clicked, click the refresh button.

    Link: https://pan.baidu.com/s/1laXCkEi6lLJAUU49LqlB-w  Password: ulg5

    Developers:
    ybbzbb
    Simi

  • Amazon crawler in Python

    1,000+ views  2018-08-04 22:29:12
    An assignment from an internship application. Final result shown below. The implementation has two parts: scraping book IDs, then crawling the detailed data. 1: import requests import re from pyquery import PyQuery as pq # fetch a proxy def get_proxy(): ...).cont...

    An assignment I was given while applying for an internship:

    Final result: (screenshot)

    The implementation is split into two parts: scraping the book IDs, then crawling the detailed data.

    1:

    import requests
    import re
    from pyquery import PyQuery as pq

    # Fetch a proxy from the local proxy pool service
    def get_proxy():
        return str(requests.get("http://127.0.0.1:5010/get/").content)[2:-1]

    # requests wrapper that goes through the proxy
    def url_open(url):
        header = {'User-Agent': 'Mozilla/5.0 ', 'X-Requested-With': 'XMLHttpRequest'}
        global proxy
        try:
            if proxy:
                print('Using proxy', proxy)
                proxies = {'http':'http://'+proxy}
                #print(proxies)
                response = requests.get(url=url, headers=header, proxies=proxies)
            else:
                response = requests.get(url=url, headers=header)
            if response.status_code == 200:
                return response.text
            if response.status_code == 503:
                print('503')
                proxy = get_proxy()
                if proxy:
                    return url_open(url)
                else:
                    print('Failed to obtain a proxy')
                    return None
        except Exception:
            proxy=get_proxy()
            return url_open(url)

    ########### Extract the entry links for the literature sub-categories ################
    html='href="/s/ref=lp_144180071_nr_n_0fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144201071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">文学名家</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_1?href="/s/ref=lp_144180071_nr_n_1?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144206071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">作品集</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_2?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144212071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">散文随笔</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_3?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144222071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">诗歌词曲</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_4?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144235071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">民间文学</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_5?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144228071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">纪实文学</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_6?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144218071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">影视文学</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_7?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144234071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">戏剧与曲艺</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_8?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144200071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">文学史</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_9?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144181071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">文学理论</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" 
href="/s/ref=lp_144180071_nr_n_10?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144187071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">文学评论与鉴赏</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_11?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144242071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">期刊杂志</span></a></span></li><li><span class="a-list-item"><a class="a-link-normal s-ref-text-link" href="/s/ref=lp_144180071_nr_n_12?fst=as%3Aoff&amp;rh=n%3A116087071%2Cn%3A%21116088071%2Cn%3A116169071%2Cn%3A144180071%2Cn%3A144243071&amp;bbn=144180071&amp;ie=UTF8&amp;qid=1533176532&amp;rnid=144180071"><span class="a-size-small a-color-base">文学作品导读'
    doc=pq(html)
    pages_list=[]
    for each in re.findall('rh=(.*?)&amp',html):
        pages_list.append('https://www.amazon.cn/s/rh='+each)

    count = 0  # used as the txt file name
    asin_re = re.compile('data-asin="(.*?)" class')  # regex to parse out book_asin
    for page_url in pages_list:
        print(page_url)
        html = url_open(page_url)
        doc = pq(html)
        if doc('#pagn > span.pagnDisabled').text():
            page_count = int(doc('#pagn > span.pagnDisabled').text())  # parse how many pages this category has; if that fails, fall back to 400
        else:
            page_count = 400
        count += 1
        with open(str(count)+'.txt', 'a', encoding='utf-8') as f:  # create the txt file
            err_count=0
            for i in range(1, page_count + 1):
                print('Crawling book_asin on page %d' % i)
                url = page_url + '&page='+str(i)
                html = url_open(url)
                print(url)
                if html != None:
                    err_count = 0
                    data_asin = re.findall(asin_re, html)
                    print(data_asin)

                    for each in data_asin:    # write to file
                        f.write(each)
                        f.write('\n')
                else:
                    err_count += 1
                    # If the page count could not be parsed (defaulting to 400) and we keep hitting
                    # empty pages, assume the category is finished after 20 consecutive failures.
                    if err_count >= 20:
                        break
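
    Part 1 writes the ASINs into numbered files (1.txt, 2.txt, ...), while Part 2 below reads a single asin.txt. A small sketch of the merge and de-duplication step that is implied but not shown (file names follow the code above):

    import glob

    # Merge the numbered txt files produced above into one de-duplicated asin.txt.
    seen = set()
    with open('asin.txt', 'w', encoding='utf-8') as out:
        for path in sorted(glob.glob('[0-9]*.txt')):
            with open(path, encoding='utf-8') as f:
                for line in f:
                    asin = line.strip()
                    if asin and asin not in seen:
                        seen.add(asin)
                        out.write(asin + '\n')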

     

    2:

    import requests
    from fake_useragent import UserAgent
    import pymysql
    from multiprocessing import Process,Queue,Lock
    from pyquery import PyQuery as pq
    import time
    import random


    ua = UserAgent()   # instantiate; used later to generate random browser request headers

    # # used for debugging / troubleshooting
    # def get(url,i=2):
    #     headers = {
    #         'Accept': 'text/html,*/*',
    #         'Accept-Encoding': 'gzip, deflate, br',
    #         'Accept-Language': 'zh-CN,zh;q=0.9',
    #         'Connection': 'keep-alive',
    #     
    #         'Host': 'www.amazon.cn',
    #         'Referer': 'https://www.amazon.cn/gp/aw/s/ref=is_pn_1?rh=n%3A658390051%2Cn%3A%21658391051%2Cn%3A658394051%2Cn%3A658509051&page=1',
    #         'User-Agent': ua.random,
    #         'X-Requested-With': 'XMLHttpRequest'
    #     }
    #     if i>0:
    #         try:
    #             response = requests.get(url=url, headers=headers,timeout=1)
    #             print(response.status_code)
    #             response.encoding='utf-8'
    #             return response.text
    #         except :
    #             get(url, i=i - 1)
    #     else:return None

    def get_proxy():
        return str(requests.get("http://127.0.0.1:5010/get/").content)[2:-1]

    def title_parse(title):  # Amazon book titles are too long, so trim them
        jd_title = []
        for each in title:
            if each != "(":
                jd_title.append(each)
            else:
                break

        jd_title = ''.join(jd_title)
        return jd_title

    def price_parse(price):  # clean up the Amazon price
        amazon_price=[]
        for each in price:
            if each != "¥":
                amazon_price.append(each)
            else:
                break

        amazon_price = ''.join(amazon_price)
        return amazon_price

    # Request function for Amazon
    def url_open1(url):
        header = {
            'Accept': 'text/html,*/*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Host': 'www.amazon.cn',
            'Referer': 'https: // www.amazon.cn /',
            'User-Agent': ua.random,
            'X-Requested-With': 'XMLHttpRequest'
        }
        global proxy
        try:
            if proxy:
                print('Using proxy', proxy)
                proxies = {'http':'http://'+proxy}
                #print(proxies)
                response = requests.get(url=url, headers=header, proxies=proxies)
            else:
                response = requests.get(url=url, headers=header)
            if response.status_code == 200:
                response.encoding='utf-8'
                return response.text
            if response.status_code == 503:
                print('503')
                proxy = get_proxy()
                if proxy:
                    return url_open1(url)
                else:
                    print('Failed to obtain a proxy')
                    return None
        except Exception:
            proxy=get_proxy()
            return url_open1(url)

    # Request function for JD.com
    def url_open2(url):
        header = {
            'User-Agent': ua.random,
        }
        global proxy
        try:
            if proxy:
                print('Using proxy', proxy)
                proxies = {'http': 'http://' + proxy}
                # print(proxies)
                response = requests.get(url=url, headers=header, proxies=proxies)
            else:
                response = requests.get(url=url, headers=header)
            if response.status_code == 200:
                response.encoding = 'utf-8'
                return response.text
            if response.status_code == 503:
                print('503')
                proxy = get_proxy()
                if proxy:
                    return url_open2(url)
                else:
                    print('Failed to obtain a proxy')
                    return None
        except Exception:
            proxy = get_proxy()
            return url_open2(url)

    # The core spider: parses the Amazon and JD detail pages and stores the data
    def spider(q,lock):
        # MySQL connection
        conn = pymysql.connect(host='localhost', port=3306, user='root', password='******', db='amazon', charset='utf8')
        cursor = conn.cursor()
        while True:
            lock.acquire()
            asin = q.get(block=False)[:-1]
            lock.release()
            url = 'https://www.amazon.cn/gp/product/{a}'.format(a=asin)
            print(url)
            html = url_open1(url)
            if html == None:  # sometimes None is returned; this guard prevents a crash
                continue
            doc = pq(html)
            title = doc('#ebooksProductTitle.a-size-extra-large').text()  # book title
            amazon_price = doc('a .a-size-small.a-color-price').text()[1:]  # paperback price (CNY)
            amazon_price = price_parse(amazon_price)
            #e_price = doc('#tmmSwatches > ul > li.swatchElement.selected > span > span:nth-child(4) > span > a').text()[1:-2]  # e-book price
            amazon_comments = doc('#acrCustomerReviewText.a-size-base').text()[:-5]  # number of reviews
            jd_search_title = title_parse(title)
            url = 'https://search.jd.com/Search?keyword={a}&enc=utf-8'.format(a=jd_search_title)
            html = url_open2(url)
            if html==None:
                continue
            doc = pq(html)
            jd_price = doc('#J_goodsList > ul > li:nth-child(1) > div > div.p-price > strong > i').text()  # JD price
            its = doc('.gl-warp.clearfix li div .p-commit strong a').items()  # the review count is a bit trickier
            try:  # guard against next() raising when the generator is empty
                its.__next__()  # the data we need is the second item of the generator, so call next once first
                jd_comments = its.__next__().text()
            except:
                jd_comments=None
            print(amazon_comments, amazon_price, title)
            print(jd_price,jd_comments)
            date = time.strftime("%Y-%m-%d", time.localtime())  # crawl date

            # save to MySQL
            cursor.execute("INSERT INTO data(book_asin,title,amazon_price,amazon_comments,jd_price,jd_comments,update_date) VALUES ('{0}','{1}','{2}','{3}','{4}','{5}','{6}');".format(asin,title,amazon_price,amazon_comments,jd_price,jd_comments,date))
            conn.commit()

            time.sleep(random.random())       # sleep 0-1 seconds
        conn.close()


    if __name__=='__main__':
        q = Queue()    # the data volume is small, so use a queue to hand work to the worker processes
        lock = Lock()
        with open('asin.txt', 'r')as f:
            AsinList = f.readlines()
        for each in AsinList[6000:]:        # keeps getting 503s; adjust this slice to avoid re-crawling items already done
            q.put(each)

        # multiple processes are fast at first, but get blocked after a short while
        p1 = Process(target=spider, args=(q, lock))
        # p2 = Process(target=spider, args=(q, lock))
        # p3 = Process(target=spider, args=(q, lock))
        # p4 = Process(target=spider, args=(q, lock))
        # p5 = Process(target=spider, args=(q, lock))
        # p6 = Process(target=spider, args=(q, lock))
        # p7 = Process(target=spider, args=(q, lock))
        # p8 = Process(target=spider, args=(q, lock))
        p1.start()
        # p2.start(), p3.start(), p4.start(), p5.start(), p6.start(), p7.start(), p8.start()
        p1.join()
        # p2.join(), p3.join(), p4.join(), p5.join(), p6.join(), p7.join(), p8.join()
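
    One design note: the INSERT in spider() is built with str.format, so a title containing a quote character can break the statement. pymysql supports parameterized queries; a minimal sketch of the same insert written that way, using the cursor and variables already defined in spider():

        sql = ("INSERT INTO data(book_asin, title, amazon_price, amazon_comments, "
               "jd_price, jd_comments, update_date) VALUES (%s, %s, %s, %s, %s, %s, %s)")
        cursor.execute(sql, (asin, title, amazon_price, amazon_comments,
                             jd_price, jd_comments, date))
        conn.commit()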


     

     

  • So I wrote a simple Java-based crawler that monitors Amazon book prices and automatically emails a given address whenever a book goes on a particularly good sale. Implementation idea: this post only sketches the approach; for the full project see the end of the article. A simple wrapper...
  • After getting past Amazon's captcha check, the crawler ran for a day and then started returning 503 errors. At first waiting a while was enough to resume crawling, but the waits kept getting longer. 2. Opening amazon.com directly in a browser still works fine. 3. I used Kuaidaili (a proxy service)...
  • Getting the movie title: once the productId is available, the project still needs another product attribute, the movie title, which is missing from the original review.txt. The project can only obtain it by crawling the Amazon page for each productId; the movie title and productId are then written to an Excel file...
  • I searched online for answers to the end-of-chapter exercises in Chapter 1 of Core Python Programming (3rd edition) and basically could not find answers to questions 1-32, so I wrote this script myself and am sharing it. The script extracts the synopsis and basic information of the book you query.
  • 亚马逊图书爬虫.py

    2019-05-30 00:18:35
    This file crawls Amazon book listings and implements a simple crawling mechanism.
  • # -*- coding:utf-8 -*-  import urllib, urllib2, re  from lxml import etree  import requests  import random  import json  import conf  import time, datetime  import os, sys  sys.path.append('./')  def randHeader(): head_conne...
  • Amazon product collection crawler

    1,000+ views · Hot discussion · 2019-05-09 16:26:08
    Today I worked on an Amazon product crawler. The requirement: given a keyword, crawl the images and descriptions of the first 100 products. I had heard beforehand that this is a pain, but only after actually doing it did I realize how much of a pain it really is. First I tried the normal python + requests approach; at the start everything went smoothly, no problems...
  • Python Amazon book crawler

    1,000+ views  2018-02-23 15:54:51
    encoding=utf8  import requests  import time ... article = str(article).replace('亚马逊编辑推荐:', '')  print article  def main(): html = get_detail()  parse_detail(html)  if name == 'main': main()
  • # generate a random header (preview, truncated; a completed sketch follows below)
    def randHeader():
        head_connection = ['Keep-Alive', 'close']
        head_accept = ['text/html, application/xhtml+xml, */*']
        head_accept_language = ['zh-CN,fr-FR;q=0.5', 'en-US,en;...
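
    The last excerpt above is cut off; a completed, illustrative sketch of the random-header idea it describes (the header value pools are examples, not the original author's lists):

    import random

    def rand_header():
        head_connection = ['Keep-Alive', 'close']
        head_accept = ['text/html, application/xhtml+xml, */*']
        head_accept_language = ['zh-CN,fr-FR;q=0.5', 'en-US,en;q=0.5']
        head_user_agent = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
        ]
        # Pick one value from each pool to build a randomized request header.
        return {
            'Connection': random.choice(head_connection),
            'Accept': random.choice(head_accept),
            'Accept-Language': random.choice(head_accept_language),
            'User-Agent': random.choice(head_user_agent),
        }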
