Python Crawler Knowledge Point Four: The Scrapy Framework

One. Scrapy Architecture and Data Flow

Explanation:

1. The main components:

• Engine (Scrapy Engine)
• Scheduler
• Downloader
• Spiders
• Item Pipeline
• Downloader Middlewares
• Spider Middlewares
• Scheduler Middlewares

2. How the data flows (the green lines in the standard Scrapy architecture diagram):

• Crawling starts from the initial URLs; the Scheduler hands each request to the Downloader to fetch.
• Once a page is downloaded, the response is handed to the Spider for parsing.
• The Spider produces two kinds of results: links that need further crawling (for example a "next page" link), which are passed back to the Scheduler, and data that needs to be saved, which is sent to the Item Pipeline for post-processing (detailed parsing, filtering, storage, and so on).
• Various middlewares can be installed along this data flow to perform whatever extra processing is required.
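
As a concrete illustration of plugging middleware into this flow, here is a minimal downloader-middleware sketch; the class name, header value, and settings entry are invented for the example, not taken from the original post:

class CustomHeaderMiddleware:
    # Hypothetical downloader middleware that adds a default header to every request.

    def process_request(self, request, spider):
        # Called for each request that flows from the engine to the downloader.
        # Returning None tells Scrapy to keep processing the request as usual.
        request.headers.setdefault("User-Agent", "qqnews-crawler/0.1 (example)")
        return None

# Enabled in settings.py (module path and priority are illustrative):
# DOWNLOADER_MIDDLEWARES = {"qqnews.middlewares.CustomHeaderMiddleware": 543}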

Two. Initializing a Scrapy Project

Command: scrapy startproject qqnews

PS: The actual crawling code is written in the spiders/ directory.
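
For reference, the command above generates roughly the following layout (exact files vary a little between Scrapy versions):

qqnews/
    scrapy.cfg            # deploy/config file
    qqnews/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # the real crawling code goes here
            __init__.py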

Three. Scrapy Component: Spider

Crawl process:

1. Initialize the list of request URLs and specify the callback function to run on each downloaded response.
2. In the parse callback, parse the response and return dictionaries, Item objects, Request objects, or an iterable of these.
3. Inside the callback, use selectors to parse the page content and generate the parsed result items.
4. The returned items are typically persisted to a database (via an Item Pipeline) or saved to a file using Feed Exports.
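
A minimal spider sketch following these four steps; the spider name, start URL, and CSS selectors below are made up for illustration and would need to match the real site:

import scrapy


class QQNewsSpider(scrapy.Spider):
    name = "qqnews"
    # Step 1: the initial request URLs (illustrative URL).
    start_urls = ["https://news.qq.com/"]

    def parse(self, response):
        # Steps 2-3: use selectors in the callback to parse the page and yield results.
        for article in response.css("div.article"):           # hypothetical selector
            yield {
                "title": article.css("h2::text").get(),
                "url": article.css("a::attr(href)").get(),
            }
        # A "next page" link is returned as a new Request and goes back to the scheduler.
        next_page = response.css("a.next::attr(href)").get()  # hypothetical selector
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Step 4, persisting the returned items, is handled by an Item Pipeline or Feed Exports, as described below.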

Example of a standard project structure:

1. items.py: define the fields for each kind of data to be scraped (see the sketch after this list).

2. The spider imports the item class and fills item instances with the scraped data.

3. The pipeline handles follow-up processing: cleaning, validation, filtering, storing in a database, and so on.
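
A sketch of points 1 and 2; the item class and field names are invented for the example:

# items.py: declare the fields for each kind of scraped data.
import scrapy


class NewsItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    publish_date = scrapy.Field()


# In the spider, import the item class and fill an instance with parsed data:
#     from qqnews.items import NewsItem
#
#     item = NewsItem()
#     item["title"] = response.css("h1::text").get()
#     item["url"] = response.url
#     yield item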

Common Item Pipeline scenarios:
• Cleaning up HTML data
• Validating scraped data (checking that an Item contains certain fields)
• Checking for duplicates (and discarding them)
• Storing scraped items in a database

Four. Scrapy Component: Item Pipeline

The following methods are often implemented:
• open_spider(self, spider): executed when the spider is opened
• close_spider(self, spider): executed when the spider is closed
• from_crawler(cls, crawler): gives access to core components such as the settings and signals, and lets you register hook functions with Scrapy

The actual processing logic of a pipeline:

Define a Python class that implements the method process_item(self, item, spider). It should return a dictionary or an Item, or raise a DropItem exception to discard the item.
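
Putting the methods above together, a pipeline might look roughly like this; the class name, field names, and duplicate-checking logic are placeholders, not code from the original post:

from scrapy.exceptions import DropItem


class NewsPipeline:
    @classmethod
    def from_crawler(cls, crawler):
        # Gives access to core components such as settings and signals.
        return cls()

    def open_spider(self, spider):
        # Executed when the spider is opened, e.g. open a database connection.
        self.seen_urls = set()

    def close_spider(self, spider):
        # Executed when the spider is closed, e.g. close the connection.
        pass

    def process_item(self, item, spider):
        # Validate, drop duplicates, then hand the item on (or raise DropItem).
        if not item.get("title"):
            raise DropItem("missing title")
        if item.get("url") in self.seen_urls:
            raise DropItem("duplicate item: %s" % item.get("url"))
        self.seen_urls.add(item.get("url"))
        return item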

Five. Enabling Pipelines in settings

Which pipelines are used (and in what order) is declared in the project's settings.
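
A minimal sketch of the corresponding settings.py entry, assuming the hypothetical NewsPipeline from the previous section; the integer (0-1000) controls execution order, with lower values running first:

# settings.py
ITEM_PIPELINES = {
    "qqnews.pipelines.NewsPipeline": 300,
}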

This series is still being updated; you are welcome to follow my WeChat official account, Lhworld.
