Scrapy Item Pipeline Learning notes
The Item Pipeline is primarily used to collect the data items produced by a web crawl (spider) and write them to a database or file.

Execution model
After the spider extracts an item, the item is passed to the Item Pipeline for subsequent processing.
The item pipeline class path is configured in settings, and the Scrapy framework invokes the item pipeline class. For the framework to invoke it correctly, the item pipeline class must implement the methods the framework requires; users only need to focus on implementing these methods.

Example
The following file implements a simple item pipeline class that further processes the scraped news data and writes it to a file. The functionality of each method is explained in its docstring.

1. File: pipelines.py
Precautions:
1. The initializer's signature is completely free: it need not take any particular parameters, as long as the from_crawler class method can call it to produce an instance.
2. The methods invoked by the framework have fixed parameter lists (this guarantees that the framework can call them correctly).
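The interplay of these two rules can be sketched without Scrapy: from_crawler is just a factory classmethod whose signature the framework fixes, while __init__ is free to take whatever the factory passes it. FakeCrawler below is a stand-in for Scrapy's real Crawler object, not Scrapy API.

```python
class FakeCrawler(object):
    """Stand-in for scrapy's Crawler: only exposes a settings mapping."""
    def __init__(self, settings):
        self.settings = settings


class FilePipeline(object):
    def __init__(self, filepath, filename):
        # Free-form initializer: any parameters are fine, as long as
        # from_crawler knows how to supply them.
        self.fullname = filepath + '/' + filename

    @classmethod
    def from_crawler(cls, crawler):
        # Fixed signature: the framework always calls
        # Pipeline.from_crawler(crawler) and expects an instance back.
        return cls(crawler.settings.get('ITEM_FILE_PATH'),
                   crawler.settings.get('ITEM_FILE_NAME'))


crawler = FakeCrawler({'ITEM_FILE_PATH': '/tmp', 'ITEM_FILE_NAME': 'news.txt'})
pipeline = FilePipeline.from_crawler(crawler)
print(pipeline.fullname)  # /tmp/news.txt
```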
# -*- coding: utf-8 -*-

# Define your item pipelines here.
# Don't forget to add your pipeline to the ITEM_PIPELINES setting.
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exceptions import DropItem


class News2FileFor163Pipeline(object):
    """Pipeline: process items given by the spider."""

    def __init__(self, filepath, filename):
        """Init for the pipeline class."""
        self.fullname = filepath + '/' + filename
        self.id = 0

    def process_item(self, item, spider):
        """Process each of the items from the spider.

        Example: check whether the item is OK or raise a DropItem exception.
        Example: do some processing before writing into the database.
        Example: check whether the item already exists and drop it.
        """
        for element in ("url", "source", "title", "editor", "time", "content"):
            if item[element] is None:
                raise DropItem("Invalid item, url: %s" % str(item["url"]))
        self.fs.write("news id: %s" % self.id)
        self.fs.write("\n")
        self.id += 1
        self.fs.write("url: %s" % item["url"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("source: %s" % item["source"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("title: %s" % item["title"][0].strip().encode('UTF-8'))
        self.fs.write("\n")
        self.fs.write("editor: %s"
                      % item["editor"][0].strip().encode('UTF-8').split(':')[1])
        self.fs.write("\n")
        time_string = item["time"][0].strip().split()
        datetime = time_string[0] + ' ' + time_string[1]
        self.fs.write("time: %s" % datetime.encode('UTF-8'))
        self.fs.write("\n")
        content = ""
        for para in item["content"]:
            content += para.strip().replace('\n', ' ').replace('\t', ' ')
        self.fs.write("content: %s" % content.encode('UTF-8'))
        self.fs.write("\n")
        return item

    def open_spider(self, spider):
        """Called when the spider is opened.

        Do something before the pipeline starts processing items.
        Example: read settings or create a connection to the database.
        """
        self.fs = open(self.fullname, 'w+')

    def close_spider(self, spider):
        """Called when the spider is closed.

        Do something after the pipeline has processed all items.
        Example: close the database.
        """
        self.fs.flush()
        self.fs.close()

    @classmethod
    def from_crawler(cls, crawler):
        """Return a pipeline instance.

        Example: initialize the pipeline object from the crawler's
        settings and components.
        """
        return cls(crawler.settings.get('ITEM_FILE_PATH'),
                   crawler.settings.get('ITEM_FILE_NAME'))
Crawl-related configuration in the project's settings.py:

# Configure item pipelines
# http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'NewsSpiderMan.pipelines.News2FileFor163Pipeline': 300,  # order value 0-1000, lower runs first
}
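Since from_crawler reads ITEM_FILE_PATH and ITEM_FILE_NAME from the crawler's settings, those two keys must also be defined in settings.py; the values below are only illustrative.

```python
# Custom settings consumed by News2FileFor163Pipeline.from_crawler().
# The path and filename here are examples; use your own locations.
ITEM_FILE_PATH = '/tmp/news'
ITEM_FILE_NAME = 'news163.txt'
```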
Further requirements
When the amount of scraped data is large, using an item pipeline to process the data and write it to a database is the way to go.
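As a sketch of that database variant, here is a minimal pipeline that writes items into SQLite using the same open_spider/process_item/close_spider hooks. The class name, table schema, and plain-dict items are illustrative, not part of the original project, and the sketch runs without Scrapy installed.

```python
import sqlite3


class News2DbPipeline(object):
    """Minimal sketch: persist news items to SQLite instead of a flat file."""

    def __init__(self, db_path):
        self.db_path = db_path

    def open_spider(self, spider=None):
        # Create the connection once per crawl, not once per item.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS news "
            "(url TEXT PRIMARY KEY, title TEXT, content TEXT)")

    def process_item(self, item, spider=None):
        # INSERT OR IGNORE also covers the "check if item exists and drop"
        # case: a duplicate url is silently skipped.
        self.conn.execute(
            "INSERT OR IGNORE INTO news (url, title, content) VALUES (?, ?, ?)",
            (item["url"], item["title"], item["content"]))
        return item

    def close_spider(self, spider=None):
        self.conn.commit()
        self.conn.close()


# Usage sketch with an in-memory database and a plain dict as the item:
pipe = News2DbPipeline(":memory:")
pipe.open_spider()
pipe.process_item({"url": "http://example.com/1", "title": "hello", "content": "world"})
# pipe.close_spider() would commit and close; omitted so the table stays queryable.
```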