After an item has been collected by a spider, it is sent to the Item Pipeline, where several components process it one after another, in a defined order.
Each item pipeline component (sometimes referred to simply as an "item pipeline") is a Python class that implements a simple method. It receives an item, performs some action on it, and also decides whether the item should continue through the pipeline or be dropped and no longer processed.
Here are some typical uses of item pipelines:
- Clean up HTML data
- Validate crawled data (check that items contain certain fields)
- Check for duplicates (and drop them)
- Store the scraped results in a database
Write your own item pipeline
Writing your own item pipeline is simple: each item pipeline component is a standalone Python class that must implement the following method:
process_item(self, item, spider)
This method is called for every item pipeline component. It must either return an item (or an object of any subclass) or raise a DropItem exception; a dropped item is no longer processed by subsequent pipeline components.
Parameters:
- item (Item object) – the item being scraped
- spider (Spider object) – the spider that scraped the item
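As a minimal sketch (not part of the original example), a process_item() implementation that validates scraped data might look like this; the class name and the field names name and price are only illustrative assumptions:

```python
from scrapy.exceptions import DropItem

class RequiredFieldsPipeline(object):
    """Hypothetical pipeline: drop items that are missing required fields."""

    required_fields = ('name', 'price')  # illustrative field names

    def process_item(self, item, spider):
        for field in self.required_fields:
            if not item.get(field):
                # raising DropItem stops further pipeline processing of this item
                raise DropItem("Missing field %r in %s" % (field, item))
        # returning the item passes it on to the next pipeline component
        return item
```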
In addition, you can implement the following methods:
open_spider(self, spider)
This method is called when the spider is opened.
close_spider(self, spider)
This method is called when the spider is closed.
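For illustration only (this sketch is not from the original text), a pipeline could use these two hooks to set up state when the crawl starts and report it when the crawl ends; the class name is made up:

```python
class ItemCountPipeline(object):
    """Hypothetical pipeline: count the items processed during a spider run."""

    def open_spider(self, spider):
        # called once, when the spider is opened
        self.count = 0

    def close_spider(self, spider):
        # called once, when the spider is closed
        spider.logger.info("Processed %d items", self.count)

    def process_item(self, item, spider):
        self.count += 1
        return item
```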
from_crawler(cls, crawler)
If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.
Parameters:
- crawler (Crawler object) – the crawler that uses this pipeline
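As a small sketch of this hook (the setting name DROP_EMPTY_ITEMS and the class name are invented for illustration), from_crawler() typically reads settings and passes them to the constructor:

```python
from scrapy.exceptions import DropItem

class ConfigurablePipeline(object):
    """Hypothetical pipeline configured through the project settings."""

    def __init__(self, drop_empty):
        self.drop_empty = drop_empty

    @classmethod
    def from_crawler(cls, crawler):
        # DROP_EMPTY_ITEMS is a made-up setting name used only for this example
        return cls(drop_empty=crawler.settings.getbool('DROP_EMPTY_ITEMS', True))

    def process_item(self, item, spider):
        if self.drop_empty and not dict(item):
            raise DropItem("Empty item: %s" % item)
        return item
```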
Item Pipeline Sample
Let's take a look at the following hypothetical pipeline, which adjusts the price attribute of items that do not include VAT (price_excludes_vat attribute), and drops items that have no price:
```python
from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
```

Write item to JSON file
The following pipeline stores all scraped items (from all spiders) in a single items.jl file, with one item per line, serialized in JSON format:
```python
import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
```
The purpose of JsonWriterPipeline is simply to show how to write an item pipeline. If you want to store all scraped items in a JSON file, you should use the Feed exports instead.
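For example, the same JSON Lines output can be produced by the feed exports through the project settings; the file name below is only an example, and newer Scrapy versions use the FEEDS setting instead of these two:

```python
# settings.py -- example values; FEED_FORMAT/FEED_URI are the older feed-export settings
FEED_FORMAT = 'jsonlines'
FEED_URI = 'items.jl'
```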
Write Item to MongoDB
```python
import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = item.__class__.__name__
        self.db[collection_name].insert(dict(item))
        return item
```
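The pipeline above reads its connection parameters from the project settings, so something like the following would go into settings.py; the values shown are only examples:

```python
# settings.py -- example values for the settings read in from_crawler()
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'items'
```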
Duplicates filter
A filter that looks for duplicate items and drops those that have already been processed. Let's say our items have a unique id, but our spider returns multiple items with the same id:
```python
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
```

Enable an Item Pipeline component
To enable an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:
```python
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
```
The integer values you assign to the classes in this setting determine the order in which they run: items pass through the pipelines from lower-valued to higher-valued classes. It is customary to define these numbers in the 0-1000 range.