Scrapy Item Pipeline Learning notes
The Item Pipeline is primarily used to collect the data items produced by a web crawl (spider) and write them to a database or file.

Execution model
After the spider extracts an item, the item is passed to the Item Pipeline for subsequent processing.
The item pipeline class path is configured in settings, and the Scrapy framework invokes the item pipeline class. For the framework to invoke it correctly, the item pipeline class must implement the methods the framework requires; users only need to focus on implementing these methods.

Example
The following file implements a simple item pipeline class that further processes the scraped news data and writes it to a file. The functionality of each method is explained in its docstring.

1. File: pipelines.py
Precautions:
1. The initializer's signature is completely free: it need not take any particular parameters, as long as the from_crawler class method can call it to produce an instance.
2. The methods invoked by the framework have fixed parameter lists (this guarantees that the framework can call them correctly).
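The interplay of these two rules can be sketched without Scrapy: from_crawler is just a factory classmethod whose signature the framework fixes, while __init__ is free to take whatever the factory passes it. FakeCrawler below is a stand-in for Scrapy's real Crawler object, not Scrapy API.

```python
class FakeCrawler(object):
    """Stand-in for scrapy's Crawler: only exposes a settings mapping."""
    def __init__(self, settings):
        self.settings = settings


class FilePipeline(object):
    def __init__(self, filepath, filename):
        # Free-form initializer: any parameters are fine, as long as
        # from_crawler knows how to supply them.
        self.fullname = filepath + '/' + filename

    @classmethod
    def from_crawler(cls, crawler):
        # Fixed signature: the framework always calls
        # Pipeline.from_crawler(crawler) and expects an instance back.
        return cls(crawler.settings.get('ITEM_FILE_PATH'),
                   crawler.settings.get('ITEM_FILE_NAME'))


crawler = FakeCrawler({'ITEM_FILE_PATH': '/tmp', 'ITEM_FILE_NAME': 'news.txt'})
pipeline = FilePipeline.from_crawler(crawler)
print(pipeline.fullname)  # /tmp/news.txt
```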
# -*- coding: utf-8 -*-

# Define your item pipelines here.
# Don't forget to add your pipeline to the ITEM_PIPELINES setting.
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exceptions import DropItem


class News2FileFor163Pipeline(object):
    """Pipeline: process items given by the spider."""

    def __init__(self, filepath, filename):
        """Init for the pipeline class."""
        self.fullname = filepath + '/' + filename
        self.id = 0

    def process_item(self, item, spider):
        """Process each of the items from the spider.

        Example: check whether the item is OK or raise a DropItem exception.
        Example: do some processing before writing into the database.
        Example: check whether the item already exists and drop it.
        """
        for element in ("url", "source", "title", "editor", "time", "content"):
            if item[element] is None:
                raise DropItem("Invalid item, url: %s" % str(item["url"]))
        self.fs.write("news id: %s" % self.id)
        self.fs.write("\n")
        self.id += 1
        self.fs.write("url: %s" % item["url"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("source: %s" % item["source"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("title: %s" % item["title"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("editor: %s"
                      % item["editor"][0].strip().encode('UTF-8').split(':')[1])
        self.fs.write("\n")
        time_string = item["time"][0].strip().split()
        datetime = time_string[0] + ' ' + time_string[1]
        self.fs.write("time: %s" % datetime.encode('UTF-8'))
        self.fs.write("\n")
        content = ""
        for para in item["content"]:
            content += para.strip().replace('\n', ' ').replace('\t', ' ')
        self.fs.write("content: %s" % content.encode('UTF-8'))
        self.fs.write("\n")
        return item

    def open_spider(self, spider):
        """Called when the spider is opened.

        Do something before the pipeline starts processing items.
        Example: read settings or create a connection to the database.
        """
        self.fs = open(self.fullname, 'w+')

    def close_spider(self, spider):
        """Called when the spider is closed.

        Do something after the pipeline has processed all items.
        Example: close the database.
        """
        self.fs.flush()
        self.fs.close()

    @classmethod
    def from_crawler(cls, crawler):
        """Return a pipeline instance.

        Example: initialize the pipeline object from the crawler's
        settings and components.
        """
        return cls(crawler.settings.get('ITEM_FILE_PATH'),
                   crawler.settings.get('ITEM_FILE_NAME'))
Crawl-related configuration in the project's settings.py:

# Configure item pipelines
# http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'NewsSpiderMan.pipelines.News2FileFor163Pipeline': 300,  # order value 0-1000, lower runs first
}
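Since from_crawler reads ITEM_FILE_PATH and ITEM_FILE_NAME from the crawler's settings, those two keys must also be defined in settings.py; the values below are only illustrative.

```python
# Custom settings consumed by News2FileFor163Pipeline.from_crawler().
# The path and filename here are examples; use your own locations.
ITEM_FILE_PATH = '/tmp/news'
ITEM_FILE_NAME = 'news163.txt'
```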
Further requirements
When the amount of scraped data is large, using an item pipeline to process the data and write it to a database is the way to go.
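As a sketch of that database variant, here is a minimal pipeline that writes items into SQLite using the same open_spider/process_item/close_spider hooks. The class name, table schema, and plain-dict items are illustrative, not part of the original project, and the sketch runs without Scrapy installed.

```python
import sqlite3


class News2DbPipeline(object):
    """Minimal sketch: persist news items to SQLite instead of a flat file."""

    def __init__(self, db_path):
        self.db_path = db_path

    def open_spider(self, spider=None):
        # Create the connection once per crawl, not once per item.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS news "
            "(url TEXT PRIMARY KEY, title TEXT, content TEXT)")

    def process_item(self, item, spider=None):
        # INSERT OR IGNORE also covers the "check if item exists and drop"
        # case: a duplicate url is silently skipped.
        self.conn.execute(
            "INSERT OR IGNORE INTO news (url, title, content) VALUES (?, ?, ?)",
            (item["url"], item["title"], item["content"]))
        return item

    def close_spider(self, spider=None):
        self.conn.commit()
        self.conn.close()


# Usage sketch with an in-memory database and a plain dict as the item:
pipe = News2DbPipeline(":memory:")
pipe.open_spider()
pipe.process_item({"url": "http://example.com/1", "title": "hello", "content": "world"})
# pipe.close_spider() would commit and close; omitted so the table stays queryable.
```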