A Preliminary Study of a Standard Crawler: A Big Meal from the Father of Python!


First of all, I have to admit that the title is a bit of clickbait. The essence of this article is an analysis of a crawler project of 500 lines or less. The address of this project is... The article is somewhat disorganized, so please point out any mistakes you find.

A web crawler starts from the URLs of one or more initial web pages. While fetching pages, it continually extracts new URLs from the current page and puts them into a queue, until some stop condition is met. A web crawler can therefore be understood as a while loop with a termination condition: as long as the condition is not triggered, the crawler keeps sending requests for page data to each URL it has collected, parses the URLs on the page it gets back, and continues iterating. In the crawl project, the Crawler class completes this process. It does not follow a strict breadth-first or depth-first strategy; when the current request fails, it suspends the current task through Python's coroutine mechanism so the task can be rescheduled later. This can loosely be understood as an A*-style search driven by network availability. The running process is as follows.
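To make the "while loop with a termination condition" idea concrete, here is a minimal synchronous sketch. It is not code from the project; the fetch and extract_links callables are hypothetical and supplied by the caller.

    from collections import deque

    def simple_crawl(start_urls, fetch, extract_links, max_pages=100):
        """Minimal crawler skeleton: a queue plus a while loop with a stop condition.

        fetch(url) returns a page body; extract_links(body) returns the URLs
        found in it. Both are assumed callables supplied by the caller.
        """
        todo = deque(start_urls)   # URLs not yet crawled
        done = set()               # URLs already crawled
        while todo and len(done) < max_pages:   # the termination condition
            url = todo.popleft()
            if url in done:
                continue
            body = fetch(url)
            done.add(url)
            for link in extract_links(body):
                if link not in done and link not in todo:
                    todo.append(link)
        return done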

An initialized crawler object holds the root url; a todo set, which stores the URLs that have not yet been crawled; a busy set, which stores the URLs whose fetches are still in progress; and a done set, which stores the URLs whose pages have already been crawled. The core of the crawler is an endless loop: it first takes a url from the todo set, then initializes a Fetcher object to collect the URLs on that page, and finally schedules a task to execute the url request. The code for this process is as follows.

 
 
    @asyncio.coroutine
    def crawl(self):
        """Run the crawler until all finished."""
        with (yield from self.termination):
            while self.todo or self.busy:
                if self.todo:
                    url, max_redirect = self.todo.popitem()
                    fetcher = Fetcher(url,
                                      crawler=self,
                                      max_redirect=max_redirect,
                                      max_tries=self.max_tries,
                                      )
                    self.busy[url] = fetcher
                    fetcher.task = asyncio.Task(self.fetch(fetcher))
                else:
                    yield from self.termination.wait()
        self.t1 = time.time()
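The loop above blocks on self.termination.wait() when nothing is left in todo, so it relies on a companion Crawler.fetch coroutine to move a finished URL from busy to done and then notify the condition. The following is a simplified sketch reconstructed from that description, not a verbatim copy of the project's method (the real one also limits concurrency):

    @asyncio.coroutine
    def fetch(self, fetcher):
        """Run one Fetcher, record the result, and wake up the crawl loop."""
        yield from fetcher.fetch()            # the actual network work
        with (yield from self.termination):   # self.termination is an asyncio.Condition
            self.done[fetcher.url] = fetcher  # the URL is finished
            del self.busy[fetcher.url]        # it is no longer in flight
            self.termination.notify()         # let crawl() re-check todo and busy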

A crawler obviously does not consist of an endless loop alone. In the outer layer of crawl, other modules support its operation, handling network connections, url retrieval, task scheduling, and other work. The scheduling framework of the whole crawl project is as follows:

When the crawler is created and initialized, a ConnectionPool is built first:

 
 
    self.pool = ConnectionPool(max_pool, max_tasks)

The pool keeps connections and queue attributes, which hold the set of connections and the connection queue used for later scheduling. Each connection stores its host, port, and ssl flag and is obtained through asyncio.open_connection.

 
 
    self.connections = {}  # {(host, port, ssl): [Connection, ...], ...}
    self.queue = []        # [Connection, ...]
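A rough sketch of how such a pool can hand out and recycle connections follows. The method names get_connection and recycle_connection, the Connection wrapper's constructor, and its key attribute are assumptions for illustration, not necessarily the project's exact API:

    @asyncio.coroutine
    def get_connection(self, host, port, ssl):
        """Reuse an idle connection for (host, port, ssl) or open a new one."""
        key = (host, port, ssl)
        idle = self.connections.get(key)
        if idle:                                  # an idle connection exists: reuse it
            conn = idle.pop()
            self.queue.remove(conn)
            return conn
        reader, writer = yield from asyncio.open_connection(host, port, ssl=ssl)
        return Connection(self, host, port, ssl, reader, writer)

    def recycle_connection(self, conn):
        """Put a finished connection back so later requests to the same host can reuse it."""
        self.connections.setdefault(conn.key, []).append(conn)
        self.queue.append(conn)                   # the queue orders connections for eviction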

When a task is executed, the crawl method is first loaded into the event loop with loop.run_until_complete(crawler.crawl()). Connection objects are stored in the ConnectionPool built by the statement above and retrieved from it when the Fetcher object's fetch method crawls data. Each url request task is handled by a Fetcher and scheduled through asyncio.Task: the fetch method yields a suspended generator, which is handed to asyncio.Task for execution.

The yield from statements and the asyncio.coroutine decorator turn this method into a generator at execution time; when it is suspended inside fetcher.fetch(), the scheduler takes over.
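As a self-contained toy illustration of this pre-async/await style (not project code): the decorated function is a generator, yield from suspends it, asyncio.Task schedules it, and run_until_complete drives the loop. The @asyncio.coroutine decorator matches the Python 3.4-era asyncio this project targets; on current Python you would write async def and await instead.

    import asyncio

    @asyncio.coroutine
    def fake_fetch(url):
        """Pretend to fetch a URL; the yield from is a suspension point."""
        yield from asyncio.sleep(0.1)   # the scheduler runs other tasks while we wait
        return 'body of %s' % url

    @asyncio.coroutine
    def crawl_once(urls):
        """Schedule one Task per URL and wait for all of them."""
        tasks = [asyncio.Task(fake_fetch(u)) for u in urls]
        return (yield from asyncio.gather(*tasks))

    loop = asyncio.get_event_loop()
    print(loop.run_until_complete(crawl_once(['http://a', 'http://b'])))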

Fetcher.fetch() is the core method of the crawler. It obtains page data from the network and loads newly found URLs into the todo set. The method keeps requesting page data and stops when the number of attempts reaches the upper limit. The fetched html data, external links, and redirect links are stored. When the maximum number of redirects for a url is reached, the operation stops and an error log is written. After that, different handling is applied depending on the page status.
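A heavily simplified sketch of the retry-and-redirect logic described above follows; it is reconstructed from this description, and helpers such as self.request() and response.get_redirect_url(), plus the exact status handling, are assumptions:

    @asyncio.coroutine
    def fetch(self):
        """Try the request up to max_tries times, then follow redirects or keep the body."""
        while self.tries < self.max_tries:
            self.tries += 1
            try:
                self.response = yield from self.request()    # assumed helper: one HTTP round trip
                break
            except Exception as exc:
                self.exceptions.append(exc)                  # remember the error and retry
        else:
            logger.error('%r failed after %r tries', self.url, self.max_tries)
            return
        if self.response.status in (300, 301, 302, 303, 307):   # a redirect
            if self.max_redirect > 0:
                next_url = self.response.get_redirect_url()      # assumed helper
                self.crawler.add_url(next_url, self.max_redirect - 1)
            else:
                logger.error('redirect limit reached for %r', self.url)
        elif self.response.status == 200:
            self.body = yield from self.response.read()          # links are parsed from this later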

The following code comes from the crawling.py file, starting around line 333 and running to the end of the corresponding method. Different handling is chosen based on the page status, and url information is extracted from the page with a regular expression; here, strings introduced by href are selected. The core url-extraction code is as follows:

 
 
    # Replace href with (?:href|src) to follow image links.
    self.urls = set(re.findall(r'(?i)href=["\']?([^\s"\'<>]+)', body))
    if self.urls:
        logger.warn('got %r distinct urls from %r', len(self.urls), self.url)
        self.new_urls = set()
        for url in self.urls:
            url = unescape(url)
            url = urllib.parse.urljoin(self.url, url)
            url, frag = urllib.parse.urldefrag(url)
            if self.crawler.add_url(url):
                self.new_urls.add(url)

The code clearly shows that the regular-expression matches are stored in the urls set and processed one by one in the for loop; each accepted result is added, through crawler.add_url, to the todo set of the crawler object that owns the current fetcher, and recorded in new_urls.
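For readers unfamiliar with the two urllib.parse calls used above, here is what they do to a typical raw href value (a standalone example, unrelated to the project's data):

    import urllib.parse

    base = 'http://example.com/docs/index.html'
    href = '../about.html#team'

    absolute = urllib.parse.urljoin(base, href)   # resolve relative to the current page
    print(absolute)                               # http://example.com/about.html#team

    url, frag = urllib.parse.urldefrag(absolute)  # strip the #fragment part
    print(url, frag)                              # http://example.com/about.html team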

Building on the previous analysis, the main file crawl.py can now be examined to obtain the overall crawler architecture:

In the main file, argparse.ArgumentParser is used first to parse the command line and configure how data is read from and controlled on the console, and on Windows the IOCP-based event loop is selected. The main method first obtains the dictionary of command-line data through parse_args; if no root URL is provided, a prompt is printed. It then configures the logging level, which controls what gets written to the log: messages below the minimum level are not output.
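The pattern described here might look roughly like the sketch below. The option names (roots, --iocp, -v) and the level mapping are assumptions chosen for illustration, not the project's exact flags:

    import argparse
    import asyncio
    import logging
    import sys

    ARGS = argparse.ArgumentParser(description='Web crawler (sketch)')
    ARGS.add_argument('roots', nargs='*', help='root URLs to start crawling from')
    ARGS.add_argument('--iocp', action='store_true', help='use IOCP event loop on Windows')
    ARGS.add_argument('-v', '--verbose', action='count', default=1, help='raise the log level')

    def configure(args):
        if not args.roots:
            print('Use --help for command line help')
            sys.exit(1)
        # Map the verbosity count to a logging level; messages below it are dropped.
        levels = [logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]
        logging.basicConfig(level=levels[min(args.verbose, len(levels) - 1)])
        # On Windows, the IOCP-based ProactorEventLoop can replace the default selector loop.
        if args.iocp and sys.platform == 'win32':
            loop = asyncio.ProactorEventLoop()
            asyncio.set_event_loop(loop)
        return asyncio.get_event_loop()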

When the program enters through the main entry function, the Crawler is initialized from the command-line arguments, the asyncio event loop object is obtained, and run_until_complete is executed; this runs until the crawl finishes.
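Putting the pieces together, the entry point presumably reduces to something like the following; the Crawler constructor arguments shown here are assumptions:

    def main():
        args = ARGS.parse_args()          # ARGS and configure() are from the sketch above
        loop = configure(args)
        crawler = Crawler(args.roots, max_tasks=100)   # assumed signature
        try:
            loop.run_until_complete(crawler.crawl())   # runs until todo and busy are both empty
        finally:
            loop.close()

    if __name__ == '__main__':
        main()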

In addition, reporting.py prints the execution status of the tasks. fetcher_report(fetcher, stats, file=None) prints the working status of a single url, taken from the fetcher's url attribute; report(crawler, file=None) prints the status of all URLs completed by the whole project.

At this point, the basic framework of crawl has been laid out. The Python language features used in this program that are not easy to understand, along with some of the core modules it relies on, will be covered in the next blog post, "Standard crawler analysis, simplified and not easy!".
