First of all, I have to admit that the title ("Standard Crawler, a feast from the father of Python!") is a bit of clickbait. What this article really does is analyze the crawl project from 500 Lines or Less; the project lives at https://github.com/aosabook/500lines. Interested readers should take a look: it is a collection of very high-quality open source projects that is said to be turned into a book, although judging by the commit history, that book will not be published any time soon. This article is rather rough, so please do point out any mistakes...
A web crawler starts from the URLs of one or more initial pages, obtains the URLs on those pages, and, while crawling, keeps extracting new URLs from the current page into a queue until some stop condition of the system is satisfied. It is simplest to understand a crawler as a while loop with a termination condition (a minimal sketch of this idea follows): as long as the condition has not triggered, the crawler keeps taking a URL from the queue, sends a request to fetch the page data, then parses the URLs on the current page and iterates again. In the crawl project this process is carried out by the Crawler class. It uses neither strict breadth-first nor depth-first crawling: when the current request fails, the current task is suspended through Python's asyncio and rescheduled later, so the traversal order is driven by network availability, which can loosely be compared to an A* search keyed on connectivity.
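To make that while-loop view concrete, here is a minimal synchronous sketch of the idea. It is not code from the crawl project; the regular expression, the helper structure, and the max_pages cutoff are illustrative choices of mine.

import re
import urllib.parse
import urllib.request

def simple_crawl(seed_urls, max_pages=100):
    """Keep fetching until the todo set drains or a page budget is hit."""
    todo = set(seed_urls)
    done = set()
    while todo and len(done) < max_pages:          # the termination condition
        url = todo.pop()
        try:
            with urllib.request.urlopen(url) as resp:
                body = resp.read().decode('utf-8', errors='replace')
        except Exception:
            continue                               # skip pages that fail to load
        done.add(url)
        # parse the current page for new URLs and queue the unseen ones
        for href in re.findall(r'(?i)href=["\']?([^\s"\'<>]+)', body):
            link, _frag = urllib.parse.urldefrag(urllib.parse.urljoin(url, href))
            if link not in done:
                todo.add(link)
    return done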
In crawl itself it works as follows. An initialized Crawler object holds three collections of URLs: todo, the URLs that have not yet been crawled; busy, the URLs currently being handled by fetchers; and done, the URLs whose page crawl is complete. The core of the crawler is an infinite loop: the crawler first takes a URL from the todo collection, then initializes a Fetcher object to retrieve that page and the URLs on it, and finally the task scheduler executes the URL request task. The code for this process is shown below.
@asyncio.coroutine
def crawl(self):
    """Run the crawler until all finished."""
    with (yield from self.termination):
        while self.todo or self.busy:
            if self.todo:
                url, max_redirect = self.todo.popitem()
                fetcher = Fetcher(url,
                                  crawler=self,
                                  max_redirect=max_redirect,
                                  max_tries=self.max_tries,
                                  )
                self.busy[url] = fetcher
                fetcher.task = asyncio.Task(self.fetch(fetcher))
            else:
                yield from self.termination.wait()
        self.t1 = time.time()
Of course, a crawler does not consist of an infinite loop alone. In crawl, the outer layers need other modules to support its operation, including network connections, URL acquisition, and task scheduling, and the scheduling framework of the whole crawl project is laid out below.
First, a ConnectionPool is created when the Crawler is initialized:
self.pool = ConnectionPool(max_pool, max_tasks)
The pool keeps two attributes, connections and queue, which respectively hold the set of open connections and the queue used for later scheduling. A Connection stores the host, the port number, and whether SSL is used, and the underlying connection is obtained through asyncio.open_connection().
self.connections = {}  # {(host, port, ssl): [Connection, ...], ...}
self.queue = []        # [Connection, ...]
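As a rough sketch of how such a pool hands out and recycles connections, built on asyncio.open_connection(), something like the following would work. The class and method names here are my own and not necessarily the project's exact API.

import asyncio

class SimpleConnectionPool:
    """Illustrative pool keyed by (host, port, ssl), mirroring the two attributes above."""

    def __init__(self):
        self.connections = {}   # {(host, port, ssl): [(reader, writer), ...], ...}
        self.queue = []         # [((host, port, ssl), (reader, writer)), ...]

    @asyncio.coroutine
    def get_connection(self, host, port, ssl=False):
        key = (host, port, ssl)
        idle = self.connections.get(key)
        if idle:
            conn = idle.pop()                   # reuse an idle connection for this host
            self.queue.remove((key, conn))
            return conn
        # no idle connection for this host: open a fresh one
        reader, writer = yield from asyncio.open_connection(host, port, ssl=ssl)
        return reader, writer

    def recycle_connection(self, host, port, ssl, conn):
        key = (host, port, ssl)
        self.connections.setdefault(key, []).append(conn)
        self.queue.append((key, conn))          # the queue lets old connections be pruned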
When the task runs, the crawl method is first loaded into the event loop by loop.run_until_complete(crawler.crawl()). The initialization shown earlier builds the connection pool ConnectionPool to hold connection objects; a connection is obtained from it, and page data is then crawled through the fetch method of the Fetcher object. A single URL request task is handled by one Fetcher, and scheduling is done through asyncio.Task. The fetch method produces a suspendable generator, which is handed to asyncio.Task for execution.
Thanks to the yield from and asyncio.coroutine statements, this method becomes a generator during execution: if it is suspended while Fetcher.fetch() is running, it is picked up again later by the scheduler.
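The pattern is easier to see in isolation. The following self-contained sketch (my own example, written in the same old-style yield from / @asyncio.coroutine idiom the project uses) shows a coroutine being wrapped in an asyncio.Task and driven by run_until_complete:

import asyncio

@asyncio.coroutine
def fetch(url):
    """Stand-in for a network fetch; the yield from is where the coroutine suspends."""
    yield from asyncio.sleep(0.1)       # the event loop regains control here
    return 'body of %s' % url

@asyncio.coroutine
def crawl():
    # Wrapping fetch() in a Task schedules it on the loop; yield from waits for the result.
    task = asyncio.Task(fetch('http://example.com/'))
    body = yield from task
    print('got %d bytes' % len(body))

loop = asyncio.get_event_loop()
loop.run_until_complete(crawl())        # runs until crawl() finishes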
Fetcher.fetch() is the core method of the crawler. It is responsible for fetching page data from the network and loading new URLs into the todo collection. It keeps attempting to fetch the page and stops once the number of attempts reaches the limit. Successfully fetched HTML data, the external links found in it, and any redirect links are all stored. When the number of redirects for a URL reaches its upper limit, the link operation for that URL is stopped and an error log is written. After that, different handling is applied depending on the status of the page.
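A simplified, self-contained sketch of that retry-and-redirect bookkeeping might look like the following. Here fetch_once() is a hypothetical placeholder for the real network request, and the control flow is only an approximation of what Fetcher.fetch() does, not the project's code.

import asyncio
import logging

logger = logging.getLogger(__name__)

@asyncio.coroutine
def fetch_once(url):
    # Placeholder for the real request; returns (status, body, redirect_url).
    yield from asyncio.sleep(0)
    return 200, '<html></html>', None

@asyncio.coroutine
def fetch(url, max_tries=4, max_redirect=10):
    """Retry a URL up to max_tries times and follow at most max_redirect redirects."""
    tries = 0
    while tries < max_tries:
        try:
            status, body, redirect_url = yield from fetch_once(url)
        except Exception as exc:
            tries += 1                                   # transient failure: try again
            logger.warning('try %d for %r raised %r', tries, url, exc)
            continue
        if redirect_url is not None:
            if max_redirect <= 0:
                logger.error('redirect limit reached for %r', url)
                return None
            url, max_redirect = redirect_url, max_redirect - 1
            continue                                     # chase the redirect target
        return body                                      # success: hand back the page
    logger.error('%r failed after %d tries', url, max_tries)
    return None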
The following code is the region of crawling.py from line 333 to the end of the corresponding method; it chooses a different way of processing based on the page status. The URL information on the page is obtained with a regular expression that selects the strings introduced by href. The core of the URL extraction is shown below:
# Replace href with (?:href|src) to follow image links.
self.urls = set(re.findall(r'(?i)href=["\']?([^\s"\'<>]+)', body))
if self.urls:
    logger.warn('got %r distinct urls from %r', len(self.urls), self.url)
self.new_urls = set()
for url in self.urls:
    url = urllib.parse.urljoin(self.url, url)
    url, frag = urllib.parse.urldefrag(url)
    if self.crawler.add_url(url):
        self.new_urls.add(url)
As the code shows, the regular-expression matches are stored in the urls set and processed one by one in a for loop: each URL is resolved against the current page, stripped of its fragment, and, if the crawler accepts it through add_url(), added to the todo collection and recorded in new_urls.
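For reference, the two urllib.parse calls used in that loop behave like this on a made-up link:

import urllib.parse

base = 'http://example.com/docs/index.html'
href = '../about.html#team'

absolute = urllib.parse.urljoin(base, href)
print(absolute)                 # http://example.com/about.html#team

url, frag = urllib.parse.urldefrag(absolute)
print(url, frag)                # http://example.com/about.html team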
Building on the analysis above, a further look at the main file crawl.py gives the overall architecture of the crawler:
In the main file, argparse.ArgumentParser is first used to parse the command line and configure how console input controls the program; in a Windows environment, an IOCP-based event loop is selected as the event loop object. The main method first obtains the dictionary of command-line data via parse_args and prints a hint if no root attribute was given. The log level is then configured, which determines the minimum severity that gets output; anything below that level is suppressed.
When the program is entered through the main method, it first initializes the crawler from the command-line arguments, then obtains the event loop object from asyncio and calls run_until_complete on it, which keeps running until the program finishes.
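Putting the pieces of this startup path together, a condensed, runnable sketch could look like the following. The option names, the DummyCrawler stand-in, and the exact log-level list are my assumptions for illustration, not the project's literal code.

import argparse
import asyncio
import logging
import sys

ARGS = argparse.ArgumentParser(description='Web crawler (illustrative argument setup)')
ARGS.add_argument('roots', nargs='*', help='Root URLs to start crawling from')
ARGS.add_argument('--max_tries', type=int, default=4, help='Retries per URL')
ARGS.add_argument('-v', '--verbose', action='count', default=1, help='Raise log verbosity')

class DummyCrawler:
    """Stand-in for the project's Crawler class so the sketch runs end to end."""
    def __init__(self, roots, max_tries=4):
        self.roots, self.max_tries = roots, max_tries

    @asyncio.coroutine
    def crawl(self):
        yield from asyncio.sleep(0)        # the real crawl() loops until todo drains

def main():
    args = ARGS.parse_args()
    if not args.roots:
        print('Use --help for command line help')
        return
    levels = [logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]
    logging.basicConfig(level=levels[min(args.verbose, len(levels) - 1)])
    if sys.platform == 'win32':            # IOCP-based event loop on Windows
        loop = asyncio.ProactorEventLoop()
        asyncio.set_event_loop(loop)
    else:
        loop = asyncio.get_event_loop()
    crawler = DummyCrawler(args.roots, max_tries=args.max_tries)
    try:
        loop.run_until_complete(crawler.crawl())   # run until the crawl is complete
    finally:
        loop.close()

if __name__ == '__main__':
    main()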
In addition, reporting.py is used to print the execution state of the current tasks: fetcher_report(fetcher, stats, file=None) prints the working status of one URL (the url attribute of the Fetcher), while report(crawler, file=None) prints the working status of all completed URLs in the whole project.
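In rough terms, and with the stats handling elided, those two reporting helpers could be sketched like this; the attribute names on fetcher and crawler are assumptions based on the description above, not the module's actual internals.

import sys

def fetcher_report(fetcher, stats, file=None):
    """Sketch: print the working status of a single Fetcher's URL."""
    print(fetcher.url, getattr(fetcher, 'status', None), file=file or sys.stdout)

def report(crawler, file=None):
    """Sketch: print the status of every URL the crawler has finished with."""
    stats = {}                                # aggregate counters would accumulate here
    for fetcher in crawler.done.values():     # done is assumed to map URL -> Fetcher
        fetcher_report(fetcher, stats, file=file)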
At this point, the basic framework of crawl is in place. The Python-language features used in this program that are not easy to understand, along with some of the core modules it relies on, will be described in the next post, "Standard crawler analysis, streamlining is not easy!".