Using Scrapy to Crawl News (I)

Source: Internet
Author: User
Scrapy Item Pipeline Learning notes

The item pipeline is primarily used to collect the data items produced by a spider (web crawler) and write them to a database or file.

Execution mode

After the spider obtains an item, it passes the item to the item pipeline for subsequent processing.
The item pipeline class path is configured in settings.py, and the Scrapy framework invokes the item pipeline class from there.
To be invoked correctly, the item pipeline class must implement the methods required by the framework; users only need to focus on implementing these methods.

Example

The following file implements a simple item pipeline class that further processes the crawled news data and writes it to a file. The purpose of each method is described in its docstring.
1. File: pipelines.py

Notes:
1. The __init__ method's signature is not restricted; it can take whatever parameters you like, as long as the from_crawler class method can call it to create the instance.
2. The methods called by the framework (process_item, open_spider, close_spider, from_crawler) have fixed parameter lists, which guarantees that the framework can invoke them correctly.

# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exceptions import DropItem


class News2FileFor163Pipeline(object):
    """
    Pipeline: process items given by the spider.
    """

    def __init__(self, filepath, filename):
        """Init for the pipeline class."""
        self.fullname = filepath + '/' + filename
        self.id = 0
        return

    def process_item(self, item, spider):
        """
        Process each of the items from the spider.
        Example: check if the item is OK or raise a DropItem exception.
        Example: do some processing before writing into the database.
        Example: check if the item already exists and drop it.
        """
        for element in ("url", "source", "title", "editor", "time", "content"):
            if item[element] is None:
                raise DropItem("Invalid item, url: %s" % str(item["url"]))
        self.fs.write("news id: %s" % self.id)
        self.fs.write("\n")
        self.id += 1
        self.fs.write("url: %s" % item["url"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("source: %s" % item["source"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("title: %s" % item["title"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("editor: %s"
                      % item["editor"][0].strip().encode('UTF-8').split(':')[1])
        self.fs.write("\n")
        time_string = item["time"][0].strip().split()
        datetime = time_string[0] + ' ' + time_string[1]
        self.fs.write("time: %s" % datetime.encode('UTF-8'))
        self.fs.write("\n")
        content = ""
        for para in item["content"]:
            content += para.strip().replace('\n', '').replace('\t', '')
        self.fs.write("content: %s" % content.encode('UTF-8'))
        self.fs.write("\n")
        return item

    def open_spider(self, spider):
        """
        Called when the spider is opened.
        Do something before the pipeline starts processing items.
        Example: load settings or create a connection to the database.
        """
        self.fs = open(self.fullname, 'w+')
        return

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        Do something after the pipeline has processed all items.
        Example: close the database.
        """
        self.fs.flush()
        self.fs.close()
        return

    @classmethod
    def from_crawler(cls, crawler):
        """
        Return a pipeline instance.
        Example: initialize the pipeline object from the crawler's settings and components.
        """
        return cls(crawler.settings.get('ITEM_FILE_PATH'),
                   crawler.settings.get('ITEM_FILE_NAME'))
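For reference, the pipeline above expects an item carrying the six fields it checks. The original items.py is not shown in this part of the series, so the class below is only a sketch matching those field names (the class name NewsItem is an assumption):

# Hypothetical items.py matching the fields the pipeline reads;
# only the field names are taken from the pipeline code above.
import scrapy

class NewsItem(scrapy.Item):
    url = scrapy.Field()      # article URL
    source = scrapy.Field()   # news source
    title = scrapy.Field()    # article title
    editor = scrapy.Field()   # responsible editor line
    time = scrapy.Field()     # publication time
    content = scrapy.Field()  # list of paragraph strings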
Related pipeline configuration in the project's settings.py:
# Configure Item Pipelines
# http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'NewsSpiderMan.pipelines.News2FileFor163Pipeline': 300,  # pipeline order value (0-1000)
}
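Because from_crawler in the pipeline reads two custom keys from the crawler settings, those keys also need to be defined in settings.py. A minimal sketch, assuming the key names used above; the values are placeholders for your own output location:

# Custom settings read by News2FileFor163Pipeline.from_crawler()
# (placeholder values; point them at your own output path and file name)
ITEM_FILE_PATH = '/tmp'
ITEM_FILE_NAME = 'news163.txt'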
Further requirements

If the amount of crawled data is large, processing it in an item pipeline and writing it to a database is the best approach, as sketched below.
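As a rough illustration of that idea, here is a minimal sketch of a pipeline that writes the same news fields to a SQLite database instead of a flat file. The table name, schema, and the NEWS_DB_PATH setting are assumptions for illustration, not part of the original project:

# Hypothetical pipeline sketch: persist news items in SQLite instead of a text file.
# Table/column names and the NEWS_DB_PATH setting are assumed for illustration.
import sqlite3

from scrapy.exceptions import DropItem


class News2DBPipeline(object):
    """Write each news item into a SQLite database."""

    def __init__(self, db_path):
        self.db_path = db_path

    @classmethod
    def from_crawler(cls, crawler):
        # Read the database location from settings.py (assumed setting name).
        return cls(crawler.settings.get('NEWS_DB_PATH', 'news.db'))

    def open_spider(self, spider):
        # Create the connection and table when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS news "
            "(url TEXT PRIMARY KEY, source TEXT, title TEXT, "
            " editor TEXT, time TEXT, content TEXT)")

    def close_spider(self, spider):
        # Commit pending rows and close the connection when the spider finishes.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        if not item.get("url"):
            raise DropItem("Invalid item: missing url")
        content = "".join(p.strip() for p in item["content"])
        self.conn.execute(
            "INSERT OR REPLACE INTO news VALUES (?, ?, ?, ?, ?, ?)",
            (item["url"][0].strip(), item["source"][0].strip(),
             item["title"][0].strip(), item["editor"][0].strip(),
             item["time"][0].strip(), content))
        return item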
