One. Scrapy architecture and data flow
Explanation:
1. The components:
- Engine (Scrapy Engine)
- Scheduler
- Downloader
- Spiders
- Item Pipeline
- Downloader middlewares
- Spider middlewares
- Scheduler middlewares
2. The data flow (the green lines in the architecture diagram):
- Starting from the initial URLs, the Scheduler hands requests to the Downloader, which downloads the pages.
- Once a page is downloaded, the response is handed to the Spider for parsing.
- The Spider produces two kinds of results: links that need further crawling (for example, a "next page" link), which are passed back to the Scheduler; and data to be saved, which is sent to the Item Pipeline for post-processing (detailed analysis, filtering, storage, and so on).
- Various middlewares can be plugged into the data-flow channels to perform whatever processing is needed.
Two. Initializing a Scrapy project
Command: scrapy startproject qqnews
PS: The actual crawling logic is written in the spiders directory.
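For reference, the command above generates a project skeleton roughly like the following (layout may vary slightly between Scrapy versions):

```text
qqnews/
├── scrapy.cfg            # deploy configuration
└── qqnews/               # the project's Python module
    ├── __init__.py
    ├── items.py          # Item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # Item Pipelines
    ├── settings.py       # project settings
    └── spiders/          # the real crawlers live here
        └── __init__.py
```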
Three. Scrapy component: Spider
Crawl process:
1. Initialize a list of start URLs to request, and specify the callback function to be invoked with each downloaded response.
2. In the parse callback, parse the response and return dicts, Item objects, Request objects, or an iterable of them.
3. Inside the callback, use Selectors to parse the page content and generate the parsed result Items.
4. The Items returned are typically persisted to a database (using an Item Pipeline) or saved to a file using Feed exports.
Example of a standard project structure:
1. items.py: define an Item class for each kind of data to be scraped.
2. spiders/: import the Item classes and populate them with scraped data.
3. pipelines.py: clean, validate, store to a database, filter, and do other follow-up processing.
Common Item Pipeline scenarios:
- Cleaning up HTML data
- Validating scraped data (checking that an Item contains certain fields)
- Checking for duplicates (and discarding them)
- Storing scraped items in a database
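The duplicate-check scenario can be sketched as a small pipeline. So that this sketch runs standalone without Scrapy installed, a local exception class stands in for `scrapy.exceptions.DropItem`; the `id` field is an assumed key.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs standalone."""


class DuplicatesPipeline:
    """Drops any item whose 'id' field has already been seen."""

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item["id"] in self.ids_seen:
            # Raising DropItem tells Scrapy to discard this item.
            raise DropItem(f"Duplicate item found: {item!r}")
        self.ids_seen.add(item["id"])
        return item
```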
4. Scrapy component Item Pipeline. The following methods are often implemented:
- open_spider(self, spider): executed when the spider is opened
- close_spider(self, spider): executed when the spider is closed
- from_crawler(cls, crawler): a classmethod that can access core components such as the settings and signals, and register hook functions into Scrapy
The pipeline's real processing logic: define a Python class that implements the method process_item(self, item, spider), which returns a dict or Item, or raises a DropItem exception to discard the item.
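Putting the lifecycle methods and process_item together, here is a pipeline sketch that writes each item as one JSON line. It is written in plain Python (no Scrapy import needed to run it); the output file name is an assumption.

```python
import json


class JsonWriterPipeline:
    """Open a file when the spider starts, write one JSON line per item,
    and close the file when the spider finishes."""

    def open_spider(self, spider):
        # Called once when the spider is opened.
        self.file = open("items.jl", "w", encoding="utf-8")  # assumed file name

    def close_spider(self, spider):
        # Called once when the spider is closed.
        self.file.close()

    def process_item(self, item, spider):
        # The real processing logic: serialize, persist, and return the item
        # so that later pipelines can keep processing it.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```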
5. settings.py: declare which pipeline classes are enabled.
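Pipelines are enabled in settings.py through the ITEM_PIPELINES dict; the integer values (conventionally 0-1000) set the order in which pipelines run, lower first. The class path below is a hypothetical example assuming the qqnews project name:

```python
# settings.py (fragment): lower numbers run first.
ITEM_PIPELINES = {
    "qqnews.pipelines.JsonWriterPipeline": 300,  # hypothetical class path
}
```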
Updates are ongoing. You are welcome to follow my WeChat official account, Lhworld.
Python crawler knowledge points (4): the Scrapy framework