Using Python pyspider as an example to analyze how a search engine's web crawler is implemented.


In this article, we will analyze a web crawler.

A web crawler is a tool that scans network content and records whatever it finds useful. It opens many web pages, analyzes the content of each one to find the data it is interested in, stores that data in a database, and then performs the same operation on other web pages.

If the page being analyzed contains links, the crawler will follow them and analyze more pages.

Search engines are implemented based on this principle.

For this article I chose pyspider, a stable and relatively young open-source project written by binux.

Note: pyspider keeps monitoring the web. It assumes that web pages change over time, so after a while it will revisit the same page.

Overview

pyspider consists of four components: a scheduler, a fetcher, a processor, and a monitoring component.

The scheduler accepts tasks and decides what to do with them. There are several possibilities: it can discard a task (for example, because that specific page has just been crawled) or assign it a priority.

Once each task has a priority, it is passed to the fetcher, which retrieves the web page. This process is complicated in practice, but logically simple.

Once a network resource has been fetched, the content processor is responsible for extracting the useful information. It runs a user-written Python script, which is not isolated in a sandbox. Its job also includes catching exceptions and logs and handling them appropriately.

Finally, the crawler pyspider has a monitoring component.

pyspider provides an exceptionally powerful web UI that lets you edit and debug your scripts, manage the whole crawling process, monitor ongoing tasks and, finally, output the results.

Projects and tasks

In pyspider, we have the concepts of projects and tasks.

A task refers to a separate page that needs to be retrieved and analyzed from the website.

A project is a larger entity: it covers all the pages the crawler touches, the Python script needed to analyze the web pages, and the database used to store the data.

In pyspider, we can run multiple projects at the same time.
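To make the distinction concrete, below is a minimal example of the kind of Python script a project contains, adapted from the style of pyspider's own quickstart documentation (the URL is a placeholder; the crawl rules in a real project will differ):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    """One project: a script describing how to crawl and parse a site."""
    crawl_config = {}

    @every(minutes=24 * 60)                # re-schedule the start page once a day
    def on_start(self):
        # every call to self.crawl() creates a new task for the scheduler
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)         # a fetched page is considered fresh for 10 days
    def index_page(self, response):
        # follow every outgoing link found on the page: more tasks
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dictionary is stored as the result of this task
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }

The @config(age=...) decorator is what implements the re-crawling behaviour mentioned above: a page older than the given age is considered outdated and will be fetched again.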

Code Structure Analysis

Root directory

The following folders can be found in the root directory:

  • data, an empty folder that stores the data generated by the crawler.
  • docs, which contains the project documentation (mostly markdown files).
  • pyspider, which contains the actual code of the project.
  • tests, which contains quite a lot of test code.

Here I will focus on some important files in the root directory:

  • .travis.yml, a great continuous-integration setup. How do you show that your project actually works? Testing on your own machine with one fixed version of a library is not enough.
  • Dockerfile, also a great tool! If I want to try the project on my machine, I only need to run Docker; I do not have to install anything manually. It is a good way to get developers involved in your project.
  • LICENSE, required for any open-source project. If you maintain one, do not forget this file.
  • requirements.txt: in the Python world, this file lists the Python packages that have to be installed on your system to run the software. It is required in any Python project.
  • run.py, the main entry point of the software.
  • setup.py, the Python script that installs the pyspider project on your system.

The root directory alone already shows that the project is developed in a very professional way. If you are working on any open-source program, I hope you can reach this level.

Folder pyspider

Let's go deeper and analyze the actual code together.

Inside this folder there are more folders: the logic of the whole program has been split up to make it easier to manage and extend.

These folders are: database, fetcher, libs, processor, result, scheduler, and webui.

In this folder, we can also find the main entry point of the entire project: run.py.

File run.py

This file first takes care of all the chores needed for the crawler to run successfully, and eventually spawns all the necessary computing units. Scrolling down, we find the entry point of the whole project: cli().

Function cli()

This function looks complicated, but stay with me: it is not as bad as it seems. Its main purpose is to create all the connections to the databases and the message system. It mostly parses the command-line arguments and builds one large dictionary containing everything we need. Finally, the real work starts with a call to the function all().
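pyspider's real cli() is built on top of the click library; the sketch below only illustrates the pattern (the option names, defaults, and the omitted database/queue wiring are illustrative, not pyspider's actual ones):

import click

@click.command()
@click.option('--taskdb', default='sqlite+taskdb:///data/task.db',
              help='where to keep task state (illustrative default)')
@click.option('--message-queue', default='',
              help='connection url of the queue linking the components')
@click.option('--run-in', default='subprocess',
              type=click.Choice(['subprocess', 'thread']),
              help='how the components should be spawned')
def cli(**kwargs):
    # collect every parsed option (and, in the real code, every opened
    # database and queue connection) into one big dictionary
    g = dict(kwargs)
    click.echo('starting components with config: %r' % g)
    # the real cli() would now hand g over to all()

if __name__ == '__main__':
    cli()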

Function all()

A web crawler performs a lot of I/O operations, so a good idea is to spawn separate threads or subprocesses to manage all of this work. That way, you can extract useful information from a previous page while waiting for the network to deliver the current HTML page.

The function all() decides whether to use subprocesses or threads, and then launches all the necessary functions in them. pyspider spawns as many threads as its logic modules need, including the webui. When we finish the project and close the webui, every process is shut down cleanly and gracefully.
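The sketch below shows this pattern in a hedged form; the component entry points are stand-ins rather than pyspider's real functions, and the actual all() also wires in queues, databases, and signal handling:

import threading
import multiprocessing

# Stand-in entry points; in pyspider these would start the real scheduler,
# fetcher, processor and webui with the shared configuration dictionary g.
def run_scheduler(g): pass
def run_fetcher(g): pass
def run_processor(g): pass
def run_webui(g): pass

def all(g, run_in='thread'):
    # choose the concurrency primitive: threads or subprocesses
    runner = threading.Thread if run_in == 'thread' else multiprocessing.Process

    workers = []
    for target in (run_scheduler, run_fetcher, run_processor, run_webui):
        w = runner(target=target, args=(g,))
        w.start()
        workers.append(w)

    try:
        for w in workers:                  # block until every component exits
            w.join()
    finally:
        for w in workers:                  # clean shutdown on the way out
            if w.is_alive() and hasattr(w, 'terminate'):
                w.terminate()              # only subprocesses have terminate()
            w.join()

Note that only subprocesses can be terminated from the outside; threads are expected to exit on their own, which is why the real components all check a quit flag inside their loops.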

Now our crawler is running. Let's explore it more deeply.

Scheduler

The scheduler reads tasks from two different queues (newtask_queue and status_queue) and adds tasks to a third queue (out_queue), which is later read by the fetcher.

The first thing the scheduler does is load all the tasks that need to be completed from the database. Then it starts an infinite loop. Several methods will be called in this loop:

1. _update_projects(): tries to update the settings of the various projects, for example when we want to adjust the crawl speed while the crawler is running.

2. _check_task_done(): analyzes the completed tasks and saves them to the database; it takes these tasks from status_queue.

3. _check_request(): if the content processor has asked for more pages to be analyzed, those pages are queued in newtask_queue; this function takes the new tasks from that queue.

4. _check_select(): adds new web pages to the fetcher's queue.

5. _check_delete(): deletes the tasks and projects that the user has marked for deletion.

6. _try_dump_cnt(): records the number of completed tasks to a file. This is necessary to avoid losing data if the program crashes.
 

def run(self):
    while not self._quit:
        try:
            time.sleep(self.LOOP_INTERVAL)
            self._update_projects()
            self._check_task_done()
            self._check_request()
            while self._check_cronjob():
                pass
            self._check_select()
            self._check_delete()
            self._try_dump_cnt()
            self._exceptions = 0
        except KeyboardInterrupt:
            break
        except Exception as e:
            logger.exception(e)
            self._exceptions += 1
            if self._exceptions > self.EXCEPTION_LIMIT:
                break
            continue

The loop also checks for exceptions raised while running, and for requests to stop the Python process.
 

finally:
    # exit components run in subprocess
    for each in threads:
        if not each.is_alive():
            continue
        if hasattr(each, 'terminate'):
            each.terminate()
        each.join()

Fetcher

The purpose of the fetcher is to retrieve network resources.

pyspider can handle plain HTML text pages as well as AJAX-based pages. Only the fetcher has to be aware of this difference, which is the only reason it matters here. We will focus on plain HTML fetching, but most of the ideas carry over easily to the AJAX fetcher.

The idea is in some ways similar to the scheduler: we have an input queue, an output queue, and a big loop. For every element in the input queue, the fetcher generates a request and puts the result into the output queue.

That sounds simple, but there is a big problem: the network is usually extremely slow. If all computation blocked while waiting for a web page, the whole process would crawl. The solution is equally simple: do not block computation while waiting for the network. The idea is to fire off many requests at once and wait for the responses asynchronously. Once a response comes back, another function, the callback, is invoked, and that callback handles the response in the most appropriate way.

All the complex asynchronous scheduling in pyspider is handled by another excellent open-source project, Tornado:

http://www.tornadoweb.org/en/stable/
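To see the callback idea in isolation, here is a small, self-contained Tornado example in the older callback style that pyspider relies on (newer Tornado releases dropped the callback argument in favour of coroutines; the URLs are placeholders):

from tornado import httpclient, ioloop

pending = ['http://example.com/a', 'http://example.com/b']   # placeholder URLs

def handle_response(response):
    # runs as soon as this particular response arrives,
    # while the other requests may still be in flight
    print(response.effective_url, len(response.body or b''))
    pending.pop()
    if not pending:                        # all responses are back: stop the loop
        ioloop.IOLoop.current().stop()

client = httpclient.AsyncHTTPClient()
for url in list(pending):
    client.fetch(url, handle_response)     # returns immediately, nothing blocks
ioloop.IOLoop.current().start()            # the event loop waits for the responses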

Now that we have this picture in mind, let's dig deeper and see how it is actually implemented.
 

def run(self):
    def queue_loop():
        if not self.outqueue or not self.inqueue:
            return
        while not self._quit:
            try:
                if self.outqueue.full():
                    break
                task = self.inqueue.get_nowait()
                task = utils.decode_unicode_obj(task)
                self.fetch(task)
            except queue.Empty:
                break

    tornado.ioloop.PeriodicCallback(queue_loop, 100, io_loop=self.ioloop).start()
    self._running = True
    self.ioloop.start()

Function run()

The run() function is the big loop of the fetcher.

The run() function defines another function, queue_loop(), which takes all the tasks from the input queue and fetches them. It also watches for the interrupt signal. queue_loop() is passed as a parameter to Tornado's class PeriodicCallback, which, as you can guess, calls queue_loop() at regular intervals. queue_loop() in turn calls another function that brings us one step closer to actually retrieving the web resource: fetch().

Function fetch(self, task, callback=None)

A network resource is retrieved with either phantomjs_fetch() or the simpler http_fetch(); fetch() only decides which of the two is the right way to retrieve the resource. Next, let's look at http_fetch().
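The dispatch itself is little more than a check on the task's fetch options. The helper below is an illustrative rephrasing of that idea, not pyspider's actual fetch(); the 'fetch_type' field follows the convention used in pyspider's task dictionaries:

def dispatch_fetch(fetcher, task, callback):
    # decide which fetch path a task should take
    url = task.get('url')
    if task.get('fetch', {}).get('fetch_type') in ('js', 'phantomjs'):
        # pages that need JavaScript rendering go through PhantomJS
        return fetcher.phantomjs_fetch(url, task, callback)
    # everything else is fetched over plain HTTP
    return fetcher.http_fetch(url, task, callback)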

Function http_fetch(self, url, task, callback)
 

def http_fetch(self, url, task, callback):
    '''HTTP fetcher'''
    fetch = copy.deepcopy(self.default_options)
    fetch['url'] = url
    fetch['headers']['User-Agent'] = self.user_agent

    def handle_response(response):
        ...
        return task, result

    try:
        request = tornado.httpclient.HTTPRequest(header_callback=header_callback, **fetch)
        if self.async:
            self.http_client.fetch(request, handle_response)
        else:
            return handle_response(self.http_client.fetch(request))

Finally, this is where the real work happens. The code of this function is a bit long, but it has a clear structure and is easy to read.

At the beginning, the function sets the headers of the fetch request, such as the User-Agent, and the timeout. Then it defines a function to process the response, handle_response(), which we will analyze later. Finally, it builds a Tornado request object and sends it. Note how the same function is used to handle the response in both the asynchronous and the non-asynchronous case.

Let's step back and look at what handle_response() does.

Function handle_response(response)
 

def handle_response(response):
    result = {}
    result['orig_url'] = url
    result['content'] = response.body or ''
    callback('http', task, result)
    return task, result

This function stores all the relevant information about the response in a dictionary, such as the URL, the status code, and the actual response body, and then invokes the callback. Here the callback is a small method: send_result().

Function send_result(self, type, task, result)
 

def send_result(self, type, task, result):
    if self.outqueue:
        self.outqueue.put((task, result))

This last function puts the result into the output queue, where it waits to be read by the content processor.

Processor

The purpose of the processor is to analyze the pages that have been fetched. It is also one big loop, but it has three output queues (status_queue, newtask_queue, and result_queue) and only one input queue (inqueue).

Let's take a closer look at the loop in the function run().

Function run(self)
 

def run(self):
    while not self._quit:  # surrounding loop, elided in the original excerpt
        try:
            task, response = self.inqueue.get(timeout=1)
            self.on_task(task, response)
            self._exceptions = 0
        except KeyboardInterrupt:
            break
        except Exception as e:
            self._exceptions += 1
            if self._exceptions > self.EXCEPTION_LIMIT:
                break
            continue

This function has little code and is easy to understand. It takes the next task to be analyzed from the queue and analyzes it with on_task(task, response). The loop listens for the interrupt signal: as soon as we send that signal to Python, the loop terminates. It also counts the exceptions it raises; if there are too many, the loop ends.

Function on_task(self, task, response)
 

def on_task(self, task, response):
    response = rebuild_response(response)
    project = task['project']
    project_data = self.project_manager.get(project, updatetime)
    ret = project_data['instance'].run(...)  # arguments elided in this excerpt

    status_pack = {
        'taskid': task['taskid'],
        'project': task['project'],
        'url': task.get('url'),
        ...
    }
    self.status_queue.put(utils.unicode_obj(status_pack))

    if ret.follows:
        self.newtask_queue.put(
            [utils.unicode_obj(newtask) for newtask in ret.follows])

    for project, msg, url in ret.messages:
        self.inqueue.put(({...}, {...}))

    return True

The on_task() function is the method that does the actual work.

It uses the incoming task to find the project the task belongs to, then runs the project's custom script and analyzes the response that the script returns. If everything goes well, it builds a dictionary containing all the information obtained from the web page and puts that dictionary into status_queue, where the scheduler will pick it up later.

If the analyzed page contains new links that need to be processed, they are put into newtask_queue and will be picked up by the scheduler later.

Next, if necessary, pyspider sends messages to other projects.

Finally, if any errors occurred, for example a page returning an error, the error information is written to the log.

End!
