One. Scrapy architecture and data flow
Explanation:
1. The components:
- Engine (Scrapy Engine)
- Scheduler
- Downloader
- Spiders
- Item Pipeline
- Downloader middlewares
- Spider middlewares
- Scheduler middlewares
2. The data flow (the green lines in the architecture diagram):
- Starting from the initial URLs, the Scheduler hands requests to the Downloader, which downloads the pages.
- Once a page is downloaded, the response is handed to the Spider for parsing.
- The Spider produces two kinds of results: links that need further crawling (for example, a "next page" link), which are passed back to the Scheduler; and data to be saved, which is sent to the Item Pipeline for post-processing (detailed analysis, filtering, storage, and so on).
- Various middlewares can be plugged into the data-flow channels to perform whatever processing is needed.
Two. Initializing a Scrapy project
Command: scrapy startproject qqnews
PS: The actual crawling logic is written in the spiders directory.
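For reference, the command above generates a project skeleton roughly like the following (layout may vary slightly between Scrapy versions):

```text
qqnews/
├── scrapy.cfg            # deploy configuration
└── qqnews/               # the project's Python module
    ├── __init__.py
    ├── items.py          # Item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # Item Pipelines
    ├── settings.py       # project settings
    └── spiders/          # the real crawlers live here
        └── __init__.py
```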
Three. Scrapy component: Spider
Crawl process:
1. Initialize a list of start URLs to request, and specify the callback function to be invoked with each downloaded response.
2. In the parse callback, parse the response and return dicts, Item objects, Request objects, or an iterable of them.
3. Inside the callback, use Selectors to parse the page content and generate the parsed result Items.
4. The Items returned are typically persisted to a database (using an Item Pipeline) or saved to a file using Feed exports.
Example of a standard project structure:
1. items.py: define an Item class for each kind of data to be scraped.
2. spiders/: import the Item classes and populate them with scraped data.
3. pipelines.py: clean, validate, store to a database, filter, and do other follow-up processing.
Common Item Pipeline scenarios:
- Cleaning up HTML data
- Validating scraped data (checking that an Item contains certain fields)
- Checking for duplicates (and discarding them)
- Storing scraped items in a database
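The duplicate-check scenario can be sketched as a small pipeline. So that this sketch runs standalone without Scrapy installed, a local exception class stands in for `scrapy.exceptions.DropItem`; the `id` field is an assumed key.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs standalone."""


class DuplicatesPipeline:
    """Drops any item whose 'id' field has already been seen."""

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item["id"] in self.ids_seen:
            # Raising DropItem tells Scrapy to discard this item.
            raise DropItem(f"Duplicate item found: {item!r}")
        self.ids_seen.add(item["id"])
        return item
```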
4. Scrapy component Item Pipeline. The following methods are often implemented:
- open_spider(self, spider): executed when the spider is opened
- close_spider(self, spider): executed when the spider is closed
- from_crawler(cls, crawler): a classmethod that can access core components such as the settings and signals, and register hook functions into Scrapy
The pipeline's real processing logic: define a Python class that implements the method process_item(self, item, spider), which returns a dict or Item, or raises a DropItem exception to discard the item.
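Putting the lifecycle methods and process_item together, here is a pipeline sketch that writes each item as one JSON line. It is written in plain Python (no Scrapy import needed to run it); the output file name is an assumption.

```python
import json


class JsonWriterPipeline:
    """Open a file when the spider starts, write one JSON line per item,
    and close the file when the spider finishes."""

    def open_spider(self, spider):
        # Called once when the spider is opened.
        self.file = open("items.jl", "w", encoding="utf-8")  # assumed file name

    def close_spider(self, spider):
        # Called once when the spider is closed.
        self.file.close()

    def process_item(self, item, spider):
        # The real processing logic: serialize, persist, and return the item
        # so that later pipelines can keep processing it.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```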
5. settings.py: declare which pipeline classes are enabled.
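Pipelines are enabled in settings.py through the ITEM_PIPELINES dict; the integer values (conventionally 0-1000) set the order in which pipelines run, lower first. The class path below is a hypothetical example assuming the qqnews project name:

```python
# settings.py (fragment): lower numbers run first.
ITEM_PIPELINES = {
    "qqnews.pipelines.JsonWriterPipeline": 300,  # hypothetical class path
}
```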
Updates are ongoing. You are welcome to follow my WeChat official account, Lhworld.
Python crawler knowledge points (4): the Scrapy framework