Storing Scrapy crawl data locally and in a database


Today I'm writing down how to store data scraped with Scrapy both locally and in a database. It isn't complicated, but the code looks pretty much the same every time I write it, so I'm recording it here so it can be reused directly later. ^o^


1. Local storage

Set up pipelines.py

  
 
import json


class Ak17Pipeline(object):
    def __init__(self):
        # open the file the items will be written to
        self.file = open('ak17.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        result = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.file.write(result)
        return item

    def close_spider(self, spider):
        self.file.close()
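
One thing the original notes skip: a pipeline also has to be enabled in settings.py before Scrapy will call it. A minimal sketch, assuming the project package is called ak17 (the dotted path and the priority value 300 are placeholders, adjust them to your own project):

ITEM_PIPELINES = {
    'ak17.pipelines.Ak17Pipeline': 300,
}

The same goes for the MongoDB and MySQL pipelines below; register whichever ones you actually want to run.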

2. MongoDB database storage

Set up settings.py

 
   
  
# MongoDB settings
MONGO_HOST = "127.0.0.1"     # database host
MONGO_PORT = 27017           # port
MONGO_DBNAME = "ak17"        # database name
MONGO_COLNAME = "ak"         # collection name

Set up pipelines.py
  
 
from pymongo import MongoClient
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class MongoPipeline(object):
    """
    Save items into MongoDB.
    """
    def __init__(self):
        # read the connection parameters from settings
        host = settings['MONGO_HOST']
        port = settings['MONGO_PORT']
        dbname = settings['MONGO_DBNAME']
        colname = settings['MONGO_COLNAME']
        # connect to the server
        self.client = MongoClient(host=host, port=port)
        # select the database
        self.database = self.client[dbname]
        # select the collection
        self.col = self.database[colname]

    def process_item(self, item, spider):
        # insert one document per item
        data = dict(item)
        self.col.insert_one(data)
        return item

    def close_spider(self, spider):
        # close the connection
        self.client.close()
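
A side note: instead of importing the project settings at module level, the pipeline can receive them through Scrapy's from_crawler hook, similar to how the MySQL pipeline below uses from_settings. A rough sketch of the same MongoDB pipeline written that way (this variant is mine, not from the original article):

from pymongo import MongoClient


class MongoPipeline(object):
    """Same behaviour, but the settings are injected by Scrapy."""

    def __init__(self, host, port, dbname, colname):
        self.host = host
        self.port = port
        self.dbname = dbname
        self.colname = colname

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and passes the crawler, which exposes the settings
        s = crawler.settings
        return cls(s.get('MONGO_HOST'), s.getint('MONGO_PORT'),
                   s.get('MONGO_DBNAME'), s.get('MONGO_COLNAME'))

    def open_spider(self, spider):
        # open the connection when the spider starts
        self.client = MongoClient(host=self.host, port=self.port)
        self.col = self.client[self.dbname][self.colname]

    def process_item(self, item, spider):
        self.col.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()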

3. MySQL database storage

Set up settings.py

 
   
  
# MySQL settings (the key names must match what the pipeline reads below)
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'root'
MYSQL_PORT = 3306
MYSQL_DBNAME = 'xiciip'
CHARSET = 'utf8'

Set up pipelines.py
  
 
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class WebcrawlerScrapyPipeline(object):
    '''Pipeline that saves items into MySQL.
       1. Configure it in settings.py (ITEM_PIPELINES).
       2. yield item in your spider and the pipeline runs automatically.'''

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        '''1. @classmethod declares a class method, as opposed to the usual instance method.
           2. Its first argument is cls (the class itself) rather than self (an instance of the class),
           3. so it can be called on the class, e.g. C.f(), much like a static method in Java.'''
        # read the database parameters configured in settings.py
        dbparams = dict(
            host=settings['MYSQL_HOST'],
            port=settings['MYSQL_PORT'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',  # set the charset, otherwise Chinese text may come out garbled
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=False,
        )
        # ** expands the dict into keyword arguments, i.e. host=..., db=..., ...
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)
        # the pool is handed to the class, so it is reachable as self.dbpool
        return cls(dbpool)

    # called by Scrapy for every item
    def process_item(self, item, spider):
        # run the insert asynchronously on the connection pool
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        # attach the error handler
        query.addErrback(self._handle_error, item, spider)
        return item

    # write into the database; the SQL statement lives here
    def _conditional_insert(self, tx, item):
        sql = ("insert into jsbooks(author,title,url,pubday,comments,likes,rewards,views) "
               "values(%s,%s,%s,%s,%s,%s,%s,%s)")
        params = (item['author'], item['title'], item['url'], item['pubday'],
                  item['comments'], item['likes'], item['rewards'], item['reads'])
        tx.execute(sql, params)

    # error handler
    def _handle_error(self, failure, item, spider):
        print(failure)
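
For reference, the item fields used in the insert above would map onto an Item class roughly like the one below. The class name JsbookItem is made up here, inferred from the parameter list rather than taken from the original project:

import scrapy


class JsbookItem(scrapy.Item):
    # hypothetical item definition matching the fields referenced above;
    # the real project's items.py may differ
    author = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    pubday = scrapy.Field()
    comments = scrapy.Field()
    likes = scrapy.Field()
    rewards = scrapy.Field()
    reads = scrapy.Field()    # written into the "views" column by the SQL above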

