  • Scrapy spider with a MySQL pipeline: items are not being saved to the database (Q&A)

    I am new to Scrapy; I have the following spider code:

    class Example_spider(BaseSpider):
        name = "example"
        allowed_domains = ["www.example.com"]

        def start_requests(self):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new")

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            urls = hxs.select('//div[@class="bookListingBookTitle"]/a/@href').extract()
            for i in urls:
                yield Request(urljoin("http://www.example.com/", i[1:]), callback=self.parse_url)

        def parse_url(self, response):
            hxs = HtmlXPathSelector(response)
            main = hxs.select('//div[@id="bookshelf-bg"]')
            items = []
            for i in main:
                item = Exampleitem()
                item['book_name'] = i.select('div[@class="slickwrap full"]/div[@id="bookstore_detail"]/div[@class="book_listing clearfix"]/div[@class="bookstore_right"]/div[@class="title_and_byline"]/p[@class="book_title"]/text()')[0].extract()
                item['price'] = i.select('div[@id="book-sidebar-modules"]/div[@class="add_to_cart_wrapper slickshadow"]/div[@class="panes"]/div[@class="pane clearfix"]/div[@class="inner"]/div[@class="add_to_cart 0"]/form/div[@class="line-item"]/div[@class="line-item-price"]/text()').extract()
                items.append(item)
            return items

    And the pipeline code is:

    class examplePipeline(object):

        def __init__(self):
            self.dbpool = adbapi.ConnectionPool('MySQLdb',
                    db='blurb',
                    user='root',
                    passwd='redhat',
                    cursorclass=MySQLdb.cursors.DictCursor,
                    charset='utf8',
                    use_unicode=True
                )

        def process_item(self, spider, item):
            # run db query in thread pool
            assert isinstance(item, Exampleitem)
            query = self.dbpool.runInteraction(self._conditional_insert, item)
            query.addErrback(self.handle_error)
            return item

        def _conditional_insert(self, tx, item):
            print "db connected-=========>"
            # create record if doesn't exist.
            tx.execute("select * from example_book_store where book_name = %s", (item['book_name']) )
            result = tx.fetchone()
            if result:
                log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
            else:
                tx.execute("""INSERT INTO example_book_store (book_name,price)
                            VALUES (%s,%s)""",
                            (item['book_name'],item['price'])
                )
                log.msg("Item stored in db: %s" % item, level=log.DEBUG)

        def handle_error(self, e):
            log.err(e)

    After running this I get the following error:

    exceptions.NameError: global name 'Exampleitem' is not defined

    I get the above error when I add the line below to the process_item method:

    assert isinstance(item, Exampleitem)

    and without this line I get:

    exceptions.TypeError: 'Example_spider' object is not subscriptable

    Can anyone make this code run and make sure that all the items are saved to the database?

    Solution

    Try the following code in your pipeline:

    import sys
    import MySQLdb
    import hashlib
    from scrapy.exceptions import DropItem
    from scrapy.http import Request

    class MySQLStorePipeline(object):
        def __init__(self):
            self.conn = MySQLdb.connect('host', 'user', 'passwd',
                                        'dbname', charset="utf8",
                                        use_unicode=True)
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            try:
                self.cursor.execute("""INSERT INTO example_book_store (book_name, price)
                            VALUES (%s, %s)""",
                           (item['book_name'].encode('utf-8'),
                            item['price'].encode('utf-8')))
                self.conn.commit()
            except MySQLdb.Error, e:
                print "Error %d: %s" % (e.args[0], e.args[1])

            return item

  • Asynchronous writes in Scrapy

    2018-09-18 09:18:06
    A custom image-storage pipeline is built on top of Scrapy's own ImagesPipeline: you only need to override the methods of ImagesPipeline that determine the save path and file name of each image. class ImagePipeline(ImagesPipeline): # iterate over the image urls, then ...
    import pymysql
    from twisted.enterprise import adbapi


    class MYSQLTwistedPipeline(object):

        def __init__(self, pool):
            self.dbpool = pool

        @classmethod
        # def from_crawler(cls, crawler):
        def from_settings(cls, settings):
            """
            This method name is fixed: when the spider starts, Scrapy calls it
            automatically to load the configuration.
            :param settings:
            :return:
            """
            params = dict(
                host=settings["MYSQL_HOST"],
                port=settings["MYSQL_PORT"],
                db=settings["MYSQL_DB"],
                user=settings["MYSQL_USER"],
                passwd=settings["MYSQL_PASSWD"],
                charset=settings["MYSQL_CHARSET"],
                cursorclass=pymysql.cursors.DictCursor
            )
            # Create a database connection pool; it can hold several Connection objects.
            # First argument: the name of the database driver module.
            # Second argument: the connection parameters.
            db_connect_pool = adbapi.ConnectionPool("pymysql", **params)
            # Instantiate this pipeline class with the pool.
            obj = cls(db_connect_pool)
            return obj

        def process_item(self, item, spider):
            """
            Hand the write off to the connection pool, which runs it in a worker thread.
            :param item:
            :param spider:
            :return:
            """
            # First argument: the function executed in the thread.
            # Second argument: the data to save.
            result = self.dbpool.runInteraction(self.insert, item)
            # Attach an errback to result so errors are reported.
            result.addErrback(self.error)
            return item

        def error(self, reason):
            print(f"Error: {reason}\n")

        def insert(self, cursor, item):
            insert_data = "insert into bole(title, content_string, content_type, vote_num, collection_num, comment_num, content_time, img_path)values ('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')" % (item["title"], item["content_string"], item["content_type"], item["vote_num"], item["collection_num"], item["comment_num"], item["content_time"], item["img_path"])
            cursor.execute(insert_data)
            # No commit() needed: runInteraction commits when this function returns.
    
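    The insert() above builds its SQL with Python string formatting, so a title or content field containing a quote character will break the statement. A safer variant, sketched under the same assumptions (a pymysql-backed pool and a bole table with these columns), passes the values as query parameters and lets the driver do the escaping:

        def insert(self, cursor, item):
            # Parameterized variant: the driver escapes the values itself.
            sql = (
                "INSERT INTO bole (title, content_string, content_type, vote_num, "
                "collection_num, comment_num, content_time, img_path) "
                "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
            )
            cursor.execute(sql, (
                item["title"], item["content_string"], item["content_type"],
                item["vote_num"], item["collection_num"], item["comment_num"],
                item["content_time"], item["img_path"],
            ))
            # Still no commit() needed: runInteraction handles the commit.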
  • Asynchronous writes to MySQL with Scrapy

    2019-02-21 16:07:53
    Python 3, asynchronous writes to MySQL. pipelines.py # pipelines.py from .settings import MY_SETTINGS from pymysql import cursors # twisted network framework # API interface from twisted.enterprise import adbapi class ...

    Python 3: asynchronous writes to MySQL

     


     

    pipelines.py

    # pipelines.py

    import os
    import json
    import hashlib

    from .settings import MY_SETTINGS
    from pymysql import cursors
    # twisted networking framework
    # adbapi: its asynchronous database API
    from twisted.enterprise import adbapi
    # Bloomfilter is an external helper class (not shown in the post) that must
    # provide test(), add() and save(); a minimal stand-in is sketched after this block.


    class SaveToMysqlAsynPipeline(object):
        # Read the configuration from the settings file
        @classmethod
        def from_settings(cls, settings):
            asyn_mysql_settings = MY_SETTINGS
            asyn_mysql_settings['cursorclass'] = cursors.DictCursor
            dbpool = adbapi.ConnectionPool("pymysql", **asyn_mysql_settings)
            return cls(dbpool)

        def __init__(self, dbpool):
            self.dbpool = dbpool
            if os.path.exists("job.state"):
                bloom = Bloomfilter("job.state")
            else:
                bloom = Bloomfilter(1000000)
            self.bloom = bloom
            query = self.dbpool.runInteraction(self.db_create)
            query.addErrback(self.db_create_err)

        def db_create(self, cursor):
            cursor.execute("""
                    CREATE TABLE IF NOT EXISTS `job` (
                        job_id INTEGER PRIMARY KEY AUTO_INCREMENT,
                        job_name text COMMENT 'job title',
                        job_money text COMMENT 'salary',
                        max_money FLOAT COMMENT 'maximum salary',
                        min_money FLOAT COMMENT 'minimum salary',
                        job_date text COMMENT 'posting date',
                        company_name text COMMENT 'company name',
                        job_place text COMMENT 'work location',
                        job_city text COMMENT 'city',
                        job_area text COMMENT 'district',
                        job_education text COMMENT 'required education',
                        job_fuli text COMMENT 'benefits',
                        job_from text COMMENT 'source website',
                        job_type text COMMENT 'job type',
                        job_detail_href text COMMENT 'detail page URL',
                        job_state text COMMENT 'hash of the job record'
                    )
                    """)

        def db_create_err(self, failure):
            print('---------------------------', failure)

        def process_item(self, item, spider):
            query = self.dbpool.runInteraction(self.db_insert, item)
            query.addErrback(self.handle_error, item)
            return item

        def handle_error(self, failure, item):
            print('============================', failure, item)

        def db_insert(self, cursor, item):
            # Hash the whole item so changed records can be detected later.
            job_state = json.dumps(dict(item))
            hl = hashlib.md5()
            hl.update(job_state.encode(encoding='utf-8'))
            job_state = hl.hexdigest()

            if not self.bloom.test(item['job_detail_href']):
                print("inserting new record ========================")
                cursor.execute("""
                       INSERT INTO job ( job_name, job_money, max_money, min_money, job_date, company_name, job_place, job_city, job_area, job_education, job_fuli, job_from, job_type, job_detail_href, job_state ) VALUES ( %s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s )
                       """, (item['job_name'], item['job_money'], item['max_money'], item['min_money'], item['job_date'],
                             item['company_name'], item['job_place'], item['job_city'], item['job_area'],
                             item['job_education'], item['job_fuli'], item['job_from'], item['job_type'],
                             item['job_detail_href'], job_state))
                self.bloom.add(item['job_detail_href'])
                self.bloom.save("job.state")
            else:
                cursor.execute("""SELECT job_state from job WHERE job_detail_href=%s""", (item['job_detail_href'],))
                result = cursor.fetchone()
                if result and result['job_state'] != job_state:
                    print("updating existing record =========================")
                    # Note: this UPDATE does not refresh the stored job_state hash.
                    cursor.execute("""
                          UPDATE job set job_name=%s, job_money=%s, max_money=%s, min_money=%s, job_date=%s, company_name=%s, job_place=%s, job_city=%s, job_area=%s, job_education=%s, job_fuli=%s, job_from=%s, job_type=%s WHERE job_detail_href=%s
                          """, (item['job_name'], item['job_money'], item['max_money'], item['min_money'], item['job_date'],
                                item['company_name'], item['job_place'], item['job_city'], item['job_area'], item['job_education'],
                                item['job_fuli'], item['job_from'], item['job_type'], item['job_detail_href']))
                else:
                    print("no update needed =========================")
            return item

        def open_spider(self, spider):
            pass

        def close_spider(self, spider):
            pass
    
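    The pipeline above depends on a Bloomfilter class that the post does not show; it needs a constructor accepting either a saved state file or a capacity, plus test(), add() and save() methods. Purely as a set-backed stand-in to make the code runnable for testing (not a real Bloom filter, and an assumption rather than the original author's class):

    import pickle

    class Bloomfilter(object):
        """Set-backed stand-in with the interface the pipeline expects."""

        def __init__(self, source):
            if isinstance(source, str):
                # load a previously saved state file
                with open(source, "rb") as f:
                    self._seen = pickle.load(f)
            else:
                # `source` is a capacity hint; this naive version ignores it
                self._seen = set()

        def test(self, key):
            return key in self._seen

        def add(self, key):
            self._seen.add(key)

        def save(self, path):
            with open(path, "wb") as f:
                pickle.dump(self._seen, f)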

    settings.py

    # settings.py
    
    # Add the MySQL configuration in settings.py
    MY_SETTINGS = {
        "host": "localhost",
        "user": "root",
        "passwd": "123456",
        "db": "jobs",
        "port": 3306,
        "charset": "utf8",
        'use_unicode': True,
    }
    # Find ITEM_PIPELINES in the settings file and enable this pipeline
    ITEM_PIPELINES = {
       'Ccxi_Spider.pipelines.SaveToMysqlAsynPipeline': 300,
    }
    

  • Crawling with Scrapy and writing to MongoDB

    2018-05-06 13:04:54
    Start the MongoDB service: open the MongoDB install directory and go into the bin folder: mongod -dbpath F:\mongod\data\db. Open another command-line window (keep the current one open) and go into the bin folder... import scrapy import pymongo from bs4 import Bea...

    Starting the MongoDB service

        Open the MongoDB install directory and go into the bin folder:

    mongod -dbpath F:\mongod\data\db

        In another command-line window (do not close the current one), go into the bin folder:

    mongo

    Method 1:

        Spider file:

    #import modules
    import bs4
    import scrapy
    import pymongo
    from bs4 import BeautifulSoup
    from pymongo import MongoClient

    class UniversityRankSpider(scrapy.Spider):
        name = "university-rank"  #name of spider
        start_urls = ['http://gaokao.xdf.cn/201702/10612921.html',]  #url of website

        def parse(self, response):  #parse function
            content = response.xpath("//tbody").extract()[0]
            soup = BeautifulSoup(content, "lxml")  #use BeautifulSoup
            table = soup.find('tbody')
            count = 0
            lst = []   # list to save data from the table
            for tr in table.children:  #BeautifulSoup grammar
                if isinstance(tr, bs4.element.Tag):
                    td = tr('td')
                    if count >= 2:  #ignore the header rows
                        lst.append([td[i]('p')[0].string.replace('\n','').replace('\t','') for i in range(8)])
                    count += 1

            conn = MongoClient('mongodb://localhost:27017/')  #connect mongodb
            db = conn.testdb

            for item in lst:  #insert data into university_rank table
                # insert() is deprecated in newer PyMongo; insert_one()/insert_many() are the modern equivalents
                db.university_rank.insert([
                {'rank':'%s'%item[0], 'university':'%s'%item[1], 'address':'%s'%item[2], 'local_rank':'%s'%item[3],
                     'total grade':'%s'%item[4], 'type':'%s'%item[5], 'star rank':'%s'%item[6], 'class':'%s'%item[7]},
            ])

            print('Successfully downloaded data from the website and wrote it to the mongodb database!')

    Method 2:

        

    # pipelines to insert the data into mongodb
    import pymongo
    from scrapy.conf import settings
    
    class BankPipeline(object):
        def __init__(self):
            # connect database
            self.client = pymongo.MongoClient(host=settings['MONGO_HOST'], port=settings['MONGO_PORT'])
    
            # using name and password to login mongodb
            # self.client.admin.authenticate(settings['MONGO_USER'], settings['MONGO_PSW'])
    
            # handle of the database and collection of mongodb
            self.db = self.client[settings['MONGO_DB']]
            self.coll = self.db[settings['MONGO_COLL']] 
    
        def process_item(self, item, spider):
            postItem = dict(item)
            self.coll.insert(postItem)
            return item
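
    The pipeline above reads its connection details from the project settings. A sketch of the corresponding settings.py entries; the key names are the ones the code looks up, while the values and the pipeline's module path are placeholders to adapt:

    # settings.py (placeholder values)
    MONGO_HOST = 'localhost'
    MONGO_PORT = 27017
    MONGO_DB = 'testdb'
    MONGO_COLL = 'bank'
    # MONGO_USER = 'user'        # only needed if authentication is enabled
    # MONGO_PSW = 'password'

    ITEM_PIPELINES = {
        'yourproject.pipelines.BankPipeline': 300,   # hypothetical module path
    }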



  • Writing JSON data with the Scrapy framework

    2018-06-15 10:01:41
    # -*- coding: utf-8 -*- import scrapy from ..items import BookItem class NovelSpider(scrapy.Spider): name = 'novel' allowed_domains = ['readnovel.com'] start_urls = ['https://www.readnove...
  • A general-purpose database insert statement for Scrapy crawlers

    2018-06-07 11:52:05
    Writing data with one statement    '''   # zip(iterables) -- pack together  a = [1,2,3]   b = [4,5,6]   c = [7,8,9,10]   zip(a,b) ---> [(1,4),(2,5),(3,6)]   zip(a,c) ---> [(1,7),(2,8),(3,9)], the extra elements are dropped ''' cols... (a hedged sketch of such a generic insert follows below)
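
    Building on the zip idea above, one hedged sketch of such a generic insert, assuming a pymysql cursor and a table whose column names match the item's field names (table name and connection details are placeholders):

    import pymysql

    def insert_item(cursor, table, item):
        """Generic insert: derive the column list and placeholders from the item itself."""
        data = dict(item)
        cols = ', '.join(data.keys())
        placeholders = ', '.join(['%s'] * len(data))
        sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, cols, placeholders)
        cursor.execute(sql, list(data.values()))

    # usage (hypothetical connection details):
    # conn = pymysql.connect(host='localhost', user='root', passwd='...', db='spider')
    # insert_item(conn.cursor(), 'example_table', item)
    # conn.commit()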
  • Configuring Scrapy to write item data into JSON: saving data in Scrapy is handled by pipelines.py. Following on from the previous posts, the common export formats and methods that ship with Scrapy are documented at https://docs.scrapy.org/en/latest/topics/exporters.html; if the built-in ones don't suit you...
  • The CSV exported by Scrapy is sorted alphabetically, but I want the columns in my own order. Solution: in settings, add this line: FEED_EXPORT_FIELDS = [“name”, “title”, “info”]. The result is written as follows:
  • # Define the types in the project's items.py and insert the data; the goal is to avoid having to check which item object was passed in, since the item passed in is ... class ImgItem(scrapy.Item): # the scraped image link src = scrapy.Field() url = scrapy.Field() title = scrap...
  • When crawling with scrapy, an error was raised while writing text into a csv file. Cause: '\xa0' is a space in Unicode, but encoding it with gbk raises an error. Solution: string.replace(u'\xa0', u' '). Reference: ...
  • Asynchronous database writes in Scrapy

    2018-07-09 21:30:07
    # pymysql's commit() and execute() submit data to the database synchronously, while Scrapy parses data asynchronously across multiple threads, so Scrapy parses data far faster than it can be written to the database. If writes are too slow, the database will...
  • While crawling Baidu with Scrapy, I added Selenium in the downloader middleware to return the fully loaded page for parsing, but the pipeline could not write the scraped data to a file. Investigation: the pipelines.py file is set up, the pipeline is enabled in settings.py, and in the spider file the...
  • Multithreaded write operations in Scrapy

    2018-07-09 22:38:38
    Multithreaded MySQL writes # pymysql's commit() and execute() submit data to the database synchronously, while Scrapy's parsing is asynchronous and multithreaded, so Scrapy parses data far faster than it can be written to the database. If the data...
  • # Write the j-th product of the assembled current page into a json file: i = json.dumps(dict(goods), ensure_ascii=False) line = i + '\n' self.file.write(line) # return the item return item def close_spider(self,spider):...
  • Writing data to Elasticsearch with Scrapy

    2018-03-19 15:03:16
    Writing data from Scrapy into Elasticsearch. Installing Elasticsearch: here we install elasticsearch-rtf (a Chinese-language distribution of Elasticsearch that bundles plugins for Chinese text, convenient for beginners to learn and test). Here is the link on GitHub; you can use git... (a hedged pipeline sketch follows below)
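
    The excerpt above stops before the post's own code, but as a rough sketch of the general approach, assuming the official elasticsearch Python client and an arbitrarily chosen index name, a minimal pipeline might look like this:

    from elasticsearch import Elasticsearch

    class ElasticsearchPipeline(object):
        """Minimal sketch: index each scraped item as a document in Elasticsearch."""

        def __init__(self):
            # placeholder address; point this at your own cluster
            self.es = Elasticsearch(['http://localhost:9200'])

        def process_item(self, item, spider):
            # 'scrapy_items' is an example index name, not one from the post
            self.es.index(index='scrapy_items', body=dict(item))
            return item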
  • I'm taking a web course recently, and one assignment is writing Scrapy crawl results to a txt file; I tried many approaches and finally got it working. To write json or csv, add -o spidername.csv or -o spidername.json when running the spider file, where spidername is the name of the spider you created. The json file...
  • Writing a json file with Scrapy

    2016-07-12 09:28:33
    pipelines.py class GuomeiPipeline(object): def __init__(self): self.file = codecs.open('aa.json', 'w', encoding='utf-8') def process_item(self, item, spider): line = json.
  • Serialized output with Scrapy: Scrapy supports multiple serialization formats and storage backends. If you simply want to output the data or store it in a file, you can use the ready-made classes Scrapy provides. Item Exporters: to use an Item ... (a hedged exporter-pipeline sketch follows below)
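
    For example, a pipeline that hands every item to Scrapy's built-in JsonItemExporter; a minimal sketch, with the output file name chosen arbitrarily:

    from scrapy.exporters import JsonItemExporter

    class JsonExportPipeline(object):
        """Minimal sketch using Scrapy's built-in JsonItemExporter."""

        def open_spider(self, spider):
            self.file = open('items.json', 'wb')   # example file name
            self.exporter = JsonItemExporter(self.file)
            self.exporter.start_exporting()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

        def close_spider(self, spider):
            self.exporter.finish_exporting()
            self.file.close()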
  • High-concurrency database writes with Scrapy

    2019-05-02 21:56:53
    Preface: an indispensable part of crawling is storing the data, and usually the first choice for that data is... Scrapy is a crawler framework built on the Twisted library, and Twisted already provides an asynchronous way to write to the database; the configuration is simple: define a ... in pipelines.py
  • Scrapy cannot save data

    2017-07-13 20:54:31
    After the whole Scrapy crawling project is set up and items and the pipeline are configured, you may find that nothing can be saved through the Pipeline; at that point you need to edit settings.py. In Scrapy's settings the pipeline switch is commented out by default; remove the comment and it can then take effect...
  • Scrapy beginner tutorial: writing to a database

    2016-04-20 23:41:48
    A Scrapy beginner tutorial on writing to a database
  • Writing logs: first, my spider's name = article: scrapy crawl article -s LOG_FILE=wiki.log. Output in different formats: scrapy crawl article -o articles.csv -t csv scrapy crawl article -o articles.json -t json scrapy ...
  • Preface: when writing a Scrapy project you sometimes need to write data into a csv file. The built-in FEED approach is as follows in settings.py (system: Ubuntu 14): FEED_URI = 'file:///home/eli/Desktop/qtw.csv' FEED_FORMAT = 'CSV'. No extra pipeline class is needed...
  • Writing data to mongoDB with a Scrapy pipeline

    2019-01-12 17:09:56
    """Write the item into MongoDB""" def __init__(self): """Get the mongoDB connection, the doubanDetail database, and the newBookDetail collection""" client = pymongo.MongoClient('localhost', 27017) db = client.doubanDetail self....
