Many people learning the Python programming language also want to learn web crawling, and some specialize in it. So how should you learn Python crawler technology? Today let's talk about the very popular Python crawling framework Scrapy and how to use Python to crawl data. Learning Scrapy's architecture first makes the tool much easier to use.
I. Overview
The general architecture of Scrapy consists of its main components and the system's data processing flow (shown by the green arrows in the architecture diagram). The role of each component and the data processing process are explained below.
II. Components
1. Scrapy Engine
The Scrapy engine controls the data processing flow of the entire system and triggers transactions. More detail is given in the data processing flow section below.
2. Scheduler
The scheduler accepts requests from the Scrapy engine, sorts them into queues, and returns them to the engine when the engine asks for them.
3. Downloader
The downloader's main job is to fetch web pages and return their content to the spiders.
4. Spiders
Spiders are user-defined classes in Scrapy that parse web pages and extract content from the responses of crawled URLs. Each spider can handle one domain name or a group of domain names; in other words, a spider defines the crawl and parse rules for a particular site.
The entire crawl cycle of a spider works like this (a minimal spider sketch follows the list):
First, initial requests are made for the first URLs to crawl, each with a callback function to be invoked when the response returns. These first requests are made by calling the start_requests() method, which by default generates requests from the URLs in start_urls and uses the parse() method as their callback.
In the callback function, you parse the web page response and return item objects, request objects, or an iterable of both. The returned requests also carry a callback; Scrapy downloads them and then handles their responses with the specified callback.
In the callback function you parse the content of the site, usually with XPath selectors (though you can also use BeautifulSoup, lxml, or any other tool you prefer), and generate parsed data items.
Finally, the items returned from the spiders are typically persisted through the item pipeline.
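To make this cycle concrete, here is a minimal spider sketch. The spider name, the target site, and the XPath expressions are illustrative assumptions for this example, not part of the original article.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal example spider; the name, URL and selectors are illustrative."""
    name = "quotes"                                   # assumed spider name
    start_urls = ["https://quotes.toscrape.com/"]     # assumed practice site

    def parse(self, response):
        # Parse the response with XPath selectors and yield data items.
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "text": quote.xpath(".//span[@class='text']/text()").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
            }

        # Follow the "next page" link; the new request carries its own callback.
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```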
5. Item Pipeline
The item pipeline's main responsibility is to process the items that spiders extract from web pages; its main tasks are cleaning, validating, and storing the data. When a page has been parsed by a spider, the resulting items are sent to the item pipeline and processed by several components in a specific order. Each item pipeline component is a Python class with a few simple methods. The components receive the items, run their methods on them, and decide whether each item continues to the next step in the pipeline or is simply dropped and processed no further.
The steps typically performed by an item pipeline are (a minimal pipeline sketch follows this list):
Clean HTML data
Validate the parsed data (check that the item contains the required fields)
Check for duplicate data (and drop duplicates)
Store the parsed data in the database
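As a rough sketch of those steps, the pipeline below validates a required field and drops duplicates. The field name "text" and the class name are assumptions made for the example.

```python
from scrapy.exceptions import DropItem


class ValidateAndDeduplicatePipeline:
    """Illustrative pipeline: validate a required field and drop duplicates."""

    def __init__(self):
        self.seen = set()  # remember values already processed in this run

    def process_item(self, item, spider):
        # Validate: the item must contain the required "text" field (assumed name).
        if not item.get("text"):
            raise DropItem("missing required field: text")

        # De-duplicate: drop items whose "text" value was already seen.
        if item["text"] in self.seen:
            raise DropItem("duplicate item: %r" % item["text"])
        self.seen.add(item["text"])

        # Returning the item passes it on to the next pipeline component.
        return item
```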
6. Downloader Middlewares
The downloader middleware is a hook framework that sits between the Scrapy engine and the downloader; it mainly processes the requests and responses passing between them. It provides a simple mechanism for plugging in custom code to extend Scrapy's functionality, and it acts as a lightweight, low-level system for globally altering Scrapy's requests and responses.
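For illustration, a downloader middleware might look like the sketch below, which sets a request header and logs response status codes; the class name and header value are assumptions, not from the original article.

```python
class CustomHeaderMiddleware:
    """Illustrative downloader middleware: tweak requests, inspect responses."""

    def process_request(self, request, spider):
        # Called for each request passing from the engine to the downloader.
        request.headers.setdefault("User-Agent", "example-crawler/1.0")  # assumed value
        return None  # None means: continue normal processing

    def process_response(self, request, response, spider):
        # Called for each response passing from the downloader back to the engine.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response
```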
7. Spider Middlewares
Spider middleware is a hook framework between the Scrapy engine and the spiders; its main job is to process the response input sent to the spiders and the request and item output they return. It provides a simple mechanism for plugging in custom code to extend Scrapy's functionality: you can insert custom code to handle the responses sent to spiders and the requests and items that spiders produce.
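A sketch of such a spider middleware is shown below; it simply counts the items a spider yields. The class name and the stats key are invented for this example.

```python
class ItemCountMiddleware:
    """Illustrative spider middleware: count items produced by spider callbacks."""

    def process_spider_output(self, response, result, spider):
        # Called with the items and requests returned by a spider callback.
        for element in result:
            if isinstance(element, dict):  # treat plain dicts as items here
                spider.crawler.stats.inc_value("custom/item_count")
            yield element  # pass everything through unchanged
```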
8. Scheduler Middlewares
Scheduler middleware sits between the Scrapy engine and the scheduler; it mainly processes the requests and responses passing between them, and it provides another place to plug in custom code to extend Scrapy's functionality.
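Downloader middlewares, spider middlewares, and item pipelines like the sketches above are enabled through the project settings. The module paths and order numbers below are placeholders used only to show the wiring.

```python
# settings.py (sketch): enable the example components.
# The paths and the order numbers (which control ordering) are placeholders.

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CustomHeaderMiddleware": 543,
}

SPIDER_MIDDLEWARES = {
    "myproject.middlewares.ItemCountMiddleware": 543,
}

ITEM_PIPELINES = {
    "myproject.pipelines.ValidateAndDeduplicatePipeline": 300,
}
```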
III. Data processing flow
Scrapy's entire data processing flow is controlled by the Scrapy engine and mainly runs as follows (an end-to-end sketch follows the list):
1. The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URLs to crawl.
2. The engine gets the first URLs to crawl from the spider and schedules them as requests in the scheduler.
3. The engine asks the scheduler for the next URLs to crawl.
4. The scheduler returns the next URLs to crawl to the engine, and the engine sends them to the downloader through the downloader middleware.
5. When the web page has been downloaded by the downloader, the response content is sent back to the engine through the downloader middleware.
6. The engine receives the response from the downloader and sends it through the spider middleware to the spider for processing.
7. The spider processes the response and returns crawled items and new requests to the engine.
8. The engine sends the crawled items to the item pipeline and sends the new requests to the scheduler.
9. The process repeats from the second step until there are no more requests in the scheduler, and then the engine closes the domain.
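To see this flow end to end, the engine, scheduler, downloader, and pipelines can all be driven from a small script using Scrapy's CrawlerProcess. The spider name and URL below are placeholders for the example.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class DemoSpider(scrapy.Spider):
    """Tiny illustrative spider; the name and URL are placeholders."""
    name = "demo"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Step 7 of the flow: the spider turns the response into items.
        yield {"title": response.xpath("//title/text()").get()}


if __name__ == "__main__":
    # CrawlerProcess wires up the engine, scheduler, downloader and pipelines,
    # then runs the data processing loop described above.
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(DemoSpider)
    process.start()  # blocks until there are no more requests in the scheduler
```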
IV. Driver
Scrapy is written on top of Twisted, a popular event-driven Python networking framework, and uses non-blocking (asynchronous) processing.
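The event-driven, non-blocking style that Twisted provides can be illustrated with a small standalone sketch (this is plain Twisted, not Scrapy code; the callback name and delay are arbitrary).

```python
from twisted.internet import defer, reactor


def on_page_downloaded(body):
    # Callback fired when the simulated "download" finishes;
    # nothing blocks while waiting for it.
    print("got:", body)
    reactor.stop()


d = defer.Deferred()
d.addCallback(on_page_downloaded)

# Simulate an asynchronous download finishing after one second.
reactor.callLater(1, d.callback, "<html>example body</html>")

reactor.run()  # the event loop drives all pending work without blocking threads
```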
That concludes this overview and understanding of the Python open-source crawler framework Scrapy.