After an item has been collected by a spider, it is sent to the Item Pipeline, where several components process it one after another, in a defined order.
Each item pipeline component (sometimes referred to simply as an "item pipeline") is a Python class that implements a simple method. It receives an item, performs some action on it, and also decides whether the item should continue through the pipeline or be dropped and no longer processed.
Here are some typical uses of item pipelines:
- Clean up HTML data
- Validate crawled data (check that items contain certain fields)
- Check for duplicates (and drop them)
- Store the scraped results in a database
Write your own item pipeline
Writing your own item pipeline is simple: each item pipeline component is a standalone Python class that must implement the following method:
process_item(self, item, spider)
This method is called for every item pipeline component. It must either return an item (or an object of any subclass) or raise a DropItem exception; a dropped item is no longer processed by subsequent pipeline components.
Parameters:
- item (Item object) – the item being scraped
- spider (Spider object) – the spider that scraped the item
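As a minimal sketch (not part of the original example), a process_item() implementation that validates scraped data might look like this; the class name and the field names name and price are only illustrative assumptions:

```python
from scrapy.exceptions import DropItem

class RequiredFieldsPipeline(object):
    """Hypothetical pipeline: drop items that are missing required fields."""

    required_fields = ('name', 'price')  # illustrative field names

    def process_item(self, item, spider):
        for field in self.required_fields:
            if not item.get(field):
                # raising DropItem stops further pipeline processing of this item
                raise DropItem("Missing field %r in %s" % (field, item))
        # returning the item passes it on to the next pipeline component
        return item
```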
In addition, you can implement the following methods:
open_spider(self, spider)
This method is called when the spider is opened.
close_spider(self, spider)
This method is called when the spider is closed.
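For illustration only (this sketch is not from the original text), a pipeline could use these two hooks to set up state when the crawl starts and report it when the crawl ends; the class name is made up:

```python
class ItemCountPipeline(object):
    """Hypothetical pipeline: count the items processed during a spider run."""

    def open_spider(self, spider):
        # called once, when the spider is opened
        self.count = 0

    def close_spider(self, spider):
        # called once, when the spider is closed
        spider.logger.info("Processed %d items", self.count)

    def process_item(self, item, spider):
        self.count += 1
        return item
```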
from_crawler(cls, crawler)
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.
Parameters:
- crawler (Crawler object) – the crawler that uses this pipeline
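As a small sketch of this hook (the setting name DROP_EMPTY_ITEMS and the class name are invented for illustration), from_crawler() typically reads settings and passes them to the constructor:

```python
from scrapy.exceptions import DropItem

class ConfigurablePipeline(object):
    """Hypothetical pipeline configured through the project settings."""

    def __init__(self, drop_empty):
        self.drop_empty = drop_empty

    @classmethod
    def from_crawler(cls, crawler):
        # DROP_EMPTY_ITEMS is a made-up setting name used only for this example
        return cls(drop_empty=crawler.settings.getbool('DROP_EMPTY_ITEMS', True))

    def process_item(self, item, spider):
        if self.drop_empty and not dict(item):
            raise DropItem("Empty item: %s" % item)
        return item
```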
Item Pipeline Sample
Let's take a look at the following hypothetical pipeline, which adjusts the price attribute of items that do not include VAT (price_excludes_vat attribute), and drops items that have no price:
```python
from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
```

Write item to JSON file
The following pipeline stores all scraped items (from all spiders) in a single items.jl file, with one item per line, serialized in JSON format:
```python
import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
```
The purpose of JsonWriterPipeline is simply to show how to write an item pipeline. If you want to store all scraped items in a JSON file, you should use the Feed exports instead.
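For example, the same JSON Lines output can be produced by the feed exports through the project settings; the file name below is only an example, and newer Scrapy versions use the FEEDS setting instead of these two:

```python
# settings.py -- example values; FEED_FORMAT/FEED_URI are the older feed-export settings
FEED_FORMAT = 'jsonlines'
FEED_URI = 'items.jl'
```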
Write Item to MongoDB
```python
import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = item.__class__.__name__
        self.db[collection_name].insert(dict(item))
        return item
```
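The pipeline above reads its connection parameters from the project settings, so something like the following would go into settings.py; the values shown are only examples:

```python
# settings.py -- example values for the settings read in from_crawler()
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'items'
```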
Duplicates filter
A filter that looks for duplicate items and drops those that have already been processed. Let's say our items have a unique id, but our spider returns multiple items with the same id:
```python
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
```

Enable an Item Pipeline component
To enable an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:
```python
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
```
The integer values you assign to the classes in this setting determine the order in which they run: items pass through the pipelines from lower-valued to higher-valued classes. It is customary to define these numbers in the 0-1000 range.