In this article, we will analyze a web crawler.
A web crawler is a tool that scans web content and records the useful parts of it. It can open a number of pages, analyze the contents of each one to find the interesting data, store that data in a database, and then do the same with other pages.
If the page the crawler is analyzing contains links, the crawler will follow them and analyze more pages.
Search engines are built on exactly this principle.
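To make this concrete, here is a minimal, purely illustrative crawler loop (not Pyspider code). It assumes the third-party requests library is installed and uses a crude regular-expression parser just to keep the example short:

```python
# A toy crawler: fetch a page, extract its title ("interesting data"),
# store it, and queue the links found on the page.
import re
from collections import deque
from urllib.parse import urljoin

import requests

def crawl(seed_url, max_pages=10):
    queue = deque([seed_url])
    seen = {seed_url}
    results = {}                      # stand-in for a real database

    while queue and len(results) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue

        # "useful information": here, just the page title
        match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
        results[url] = match.group(1).strip() if match else ""

        # follow the links found in the page
        for href in re.findall(r'href="(.*?)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

    return results

if __name__ == "__main__":
    print(crawl("https://example.com", max_pages=3))
```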
For this article I picked a stable, relatively young open source project: Pyspider, written by Binux.
Note: Pyspider continuously monitors the web. It assumes that web pages change over time, so it will revisit the same page again after a while.
Overview
The Pyspider crawler consists mainly of four components: a scheduler, a fetcher, a content processor, and a monitoring component.
The scheduler accepts tasks and decides what to do with them. There are several possibilities: it can discard a task (perhaps that particular page has just been crawled), or assign it a different priority.
Once each task's priority has been determined, it is passed to the fetcher, which retrieves the web page. The process is complicated in practice, but logically simple.
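As an illustration of these two decisions (discard duplicates, order by priority), here is a toy scheduler sketch built on Python's heapq; it is not Pyspider's actual implementation, and all names are made up for the example:

```python
import heapq

class SimpleScheduler:
    """Toy scheduler: deduplicates tasks and hands them out by priority."""

    def __init__(self):
        self._seen = set()      # task ids we have already accepted
        self._heap = []         # (-priority, counter, taskid, task) tuples
        self._counter = 0       # tie-breaker to keep heap ordering stable

    def put(self, taskid, task, priority=0):
        if taskid in self._seen:        # e.g. the page has just been crawled
            return False                # discard the duplicate task
        self._seen.add(taskid)
        # heapq is a min-heap, so negate priority to pop high priority first
        heapq.heappush(self._heap, (-priority, self._counter, taskid, task))
        self._counter += 1
        return True

    def get(self):
        if not self._heap:
            return None
        _, _, taskid, task = heapq.heappop(self._heap)
        return taskid, task

scheduler = SimpleScheduler()
scheduler.put("t1", {"url": "http://example.com/a"}, priority=1)
scheduler.put("t2", {"url": "http://example.com/b"}, priority=5)
scheduler.put("t1", {"url": "http://example.com/a"})  # duplicate, discarded
print(scheduler.get())  # t2 comes out first: it has the higher priority
```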
Once the resources on the network have been fetched, the content processor is responsible for extracting the useful information. It runs a user-written Python script, which is not isolated in a sandbox. The processor is also responsible for catching exceptions and log messages and managing them appropriately.
Finally, Pyspider has a monitoring component.
Pyspider provides an exceptionally powerful web interface (Web UI) that lets you edit and debug your scripts, manage the whole crawling process, monitor ongoing tasks, and export the results.
Projects and Tasks
In Pyspider, we have the concept of projects and tasks.
A task is a single page that needs to be retrieved from a site and analyzed.
A project is the larger entity: it includes all the pages the crawler covers, the Python script needed to analyze those pages, the database used to store the data, and so on.
In Pyspider we can run multiple projects at the same time.
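To make the distinction concrete, a Pyspider project script looks roughly like the template the Web UI generates (example.com is just a placeholder here). Every self.crawl() call creates a new task; the script, its settings, and the stored results together form the project.

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # seed task: the first page of the project
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # every link found here becomes a new task for the scheduler
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dict is the "useful information" stored as a result
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```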
Code Structure Analysis
Root directory
The folders you can find in the root directory are:
- data, an empty folder; it is where the data generated by the crawler is stored.
- docs, which contains the project documentation (Markdown files).
- pyspider, which contains the actual code of the project.
- tests, which contains quite a lot of test code.
Here I will focus on some important files:
- .travis.yml, a great example of continuous integration. How do you make sure your project actually works? Testing on your own machine with a fixed version of each library is not enough.
- Dockerfile, another great tool! If I want to try the project on my machine, I just run Docker and don't need to install anything manually. It is a great way to get developers involved in your project.
- LICENSE, required for any open source project; if you maintain one, do not forget this file.
- requirements.txt, which in the Python world lists the packages that must be installed on your system to run the software. It is required in any Python project.
- run.py, the main entry point of the software.
- setup.py, a Python script that installs the Pyspider project on your system.
That covers the root directory of the project, and the root directory alone shows that the project was developed in a very professional way. If you are developing an open source project of your own, I hope you can reach this level.
Folder pyspider
Let's go a bit deeper and analyze the actual code together.
Inside it we find other folders: the logic behind the whole software has been split up to make it easier to manage and extend.
These folders are: database, fetcher, libs, processor, result, scheduler, and webui.
In this folder we can also find the main entry point of the whole project, run.py.
File run.py
This file first completes all the necessary housekeeping to make sure the crawler runs successfully, and then spawns all the necessary computational units. Scrolling down, we can see the entry point of the whole project: the function cli().
function cli()
This function may look complicated, but bear with me and you will find it is not as complex as it seems. The primary purpose of cli() is to create all the connections to the databases and the messaging system. It mainly parses the command-line arguments and builds a large dictionary with everything we need. Finally, the real work begins with a call to the function all().
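The snippet below is only a schematic sketch of that flow, not Pyspider's actual cli(); all names, defaults, and helpers here are illustrative:

```python
# Parse options, open the shared resources, collect everything in one big
# dictionary, and (in the real code) hand that dictionary over to all().
import argparse
import queue
import sqlite3

def cli(argv=None):
    parser = argparse.ArgumentParser(description="toy crawler launcher")
    parser.add_argument("--taskdb", default=":memory:")
    parser.add_argument("--resultdb", default=":memory:")
    parser.add_argument("--queue-size", type=int, default=100)
    args = parser.parse_args(argv)

    # the "large dictionary with all the things we need"
    context = {
        "taskdb": sqlite3.connect(args.taskdb),
        "resultdb": sqlite3.connect(args.resultdb),
        "newtask_queue": queue.Queue(maxsize=args.queue_size),
        "status_queue": queue.Queue(maxsize=args.queue_size),
    }
    return context  # the real cli() would pass this on to all()

if __name__ == "__main__":
    print(sorted(cli([])))
```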
function all()
A web crawler performs a lot of IO operations, so a good idea is to spawn different threads or subprocesses to manage all of this work. That way you can extract useful information from a previous page while waiting for the network to return the current HTML page.
The function all() decides whether to run subprocesses or threads, and then calls all the necessary functions in different threads or subprocesses. Pyspider then spawns an adequate number of threads for every logical module of the crawler, including the Web UI. When we finish the project and shut down the Web UI, every process is closed cleanly and gracefully.
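A minimal sketch of that pattern, with stand-in component loops instead of Pyspider's real scheduler, fetcher, processor and Web UI, might look like this:

```python
import threading
import time

def component(name, stop_event):
    # stand-in for a scheduler / fetcher / processor / webui main loop
    while not stop_event.is_set():
        time.sleep(0.1)
    print(name, "stopped cleanly")

def run_all():
    stop_event = threading.Event()
    names = ["scheduler", "fetcher", "processor", "webui"]
    threads = [threading.Thread(target=component, args=(n, stop_event), name=n)
               for n in names]
    for t in threads:
        t.start()
    try:
        time.sleep(1)          # pretend the crawler is doing its work
    except KeyboardInterrupt:
        pass
    finally:
        stop_event.set()       # tell every component to finish
        for t in threads:
            t.join()           # wait for a clean exit

if __name__ == "__main__":
    run_all()
```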
Now that our crawler is running, let's explore it a bit more deeply.
Scheduler
The scheduler fetches tasks from two different queues (newtask_queue and status_queue) and adds tasks to another queue (out_queue), which is later read by the fetcher.
The first thing the scheduler does is load from the database all the tasks that need to be done. After that, it starts an infinite loop in which several methods are called:
1. _update_projects(): tries to update the settings of the various projects, for example when we want to adjust the crawl speed while the crawler is running.
2. _check_task_done(): parses completed tasks and saves them to the database; it takes the tasks from the status_queue.
3. _check_request(): if the content processor asked for more pages to be analyzed and placed them in the newtask_queue, this function fetches the new tasks from that queue.
4. _check_select(): adds new web pages to the fetcher's queue.
5. _check_delete(): deletes tasks and projects that have been flagged by the user.
6. _try_dump_cnt(): records in a file how many tasks have been completed. This is necessary to avoid losing data if the program crashes.
```python
def run(self):
    while not self._quit:
        try:
            time.sleep(self.LOOP_INTERVAL)
            self._update_projects()
            self._check_task_done()
            self._check_request()
            while self._check_cronjob():
                pass
            self._check_select()
            self._check_delete()
            self._try_dump_cnt()
            self._exceptions = 0
        except KeyboardInterrupt:
            break
        except Exception as e:
            logger.exception(e)
            self._exceptions += 1
            if self._exceptions > self.EXCEPTION_LIMIT:
                break
            continue
```
The loop also checks for exceptions raised during the run, and for whether we have asked Python to stop processing.
```python
finally:
    # exit components run in subprocess
    for each in threads:
        if not each.is_alive():
            continue
        if hasattr(each, 'terminate'):
            each.terminate()
        each.join()
```
Fetcher
The purpose of the fetcher is to retrieve network resources.
Pyspider can handle both plain HTML pages and AJAX-based pages. It is important to understand that only the fetcher is aware of this difference. We will focus only on plain HTML fetching, but most of the ideas carry over easily to the AJAX fetcher.
The idea here is in some ways similar to the scheduler: we have two queues, one for input and one for output, and a big loop. For every element in the input queue, the fetcher generates a request and puts the result into the output queue.
It sounds simple, but there is a big problem. The network is usually extremely slow, and if every computation blocked while waiting for a web page, the whole process would run extremely slowly. The solution is simple: do not block computation while waiting for the network. The idea is to send out a large number of requests, a good share of them at the same time, and wait asynchronously for the responses to come back. As soon as a response comes back, another callback function is invoked, and that callback handles the response in the most appropriate way.
All of the complex asynchronous scheduling in Pyspider is handled by another excellent open source project, Tornado (http://www.tornadoweb.org/en/stable/).
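As a small illustration of this "send many requests, handle responses as they arrive" style, here is a sketch using Tornado's asynchronous HTTP client. It assumes Tornado 6.x and network access to the example URLs, and is not Pyspider's own fetcher code:

```python
import asyncio
from tornado.httpclient import AsyncHTTPClient

async def fetch_all(urls):
    client = AsyncHTTPClient()

    async def fetch_one(url):
        try:
            response = await client.fetch(url)
            # this plays the role of the callback handling the response
            print(url, "->", len(response.body), "bytes")
        except Exception as exc:
            print(url, "failed:", exc)

    # issue many requests at once and wait for the responses asynchronously
    await asyncio.gather(*(fetch_one(u) for u in urls))

if __name__ == "__main__":
    asyncio.run(fetch_all(["https://example.com", "https://www.python.org"]))
```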
Now that we have a good idea in mind, let's explore how this is implemented in more depth.
```python
def run(self):
    def queue_loop():
        if not self.outqueue or not self.inqueue:
            return
        while not self._quit:
            try:
                if self.outqueue.full():
                    break
                task = self.inqueue.get_nowait()
                task = utils.decode_unicode_obj(task)
                self.fetch(task)
            except queue.Empty:
                break

    tornado.ioloop.PeriodicCallback(queue_loop, 100, io_loop=self.ioloop).start()
    self._running = True
    self.ioloop.start()
```
function run()
The function run() is the big loop of the fetcher.
The function run() defines another function, queue_loop(), which takes every task in the input queue and fetches it; it also listens for the interrupt signal. The function queue_loop() is passed as a parameter to Tornado's PeriodicCallback class, which, as you might guess, invokes queue_loop() at a fixed interval. queue_loop() in turn calls another function that brings us one step closer to actually retrieving the web resource: fetch().
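Here is a tiny, self-contained illustration of the PeriodicCallback mechanism described above (the 100 ms interval and the stop-after-five-ticks logic are only for the demo, and the callback merely prints instead of draining a queue):

```python
import tornado.ioloop

def main():
    loop = tornado.ioloop.IOLoop.current()
    counter = {"n": 0}

    def queue_loop():
        # in the real fetcher this would drain the input queue and call fetch()
        counter["n"] += 1
        print("tick", counter["n"])
        if counter["n"] >= 5:
            loop.stop()

    tornado.ioloop.PeriodicCallback(queue_loop, 100).start()  # every 100 ms
    loop.start()

if __name__ == "__main__":
    main()
```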
function fetch(self, task, callback=None)
A resource on the network is retrieved either with the function phantomjs_fetch() or with the simpler http_fetch(); fetch() only decides which is the right way to retrieve that resource. Next let's look at http_fetch().
function http_fetch(self, url, task, callback)
```python
def http_fetch(self, url, task, callback):
    '''HTTP fetcher'''
    fetch = copy.deepcopy(self.default_options)
    fetch['url'] = url
    fetch['headers']['User-Agent'] = self.user_agent

    def handle_response(response):
        ...
        return task, result

    try:
        request = tornado.httpclient.HTTPRequest(header_callback=header_callback, **fetch)
        if self.async:
            self.http_client.fetch(request, handle_response)
        else:
            return handle_response(self.http_client.fetch(request))
    except Exception:
        # error handling elided in this excerpt
        raise
```
Finally, this is where the real work gets done. The code of this function is a bit long, but it has a clear structure and is easy to read.
At the beginning, the function sets the headers of the fetch request, such as the User-Agent, the timeout, and so on. Then it defines a function for handling the response: handle_response(), which we will analyze later. Finally it builds a Tornado request object and sends it. Notice how the same handle_response() function is used to process the response in both the asynchronous and the non-asynchronous case.
Let's step back and analyze what handle_response() does.
function handle_response(response)
```python
def handle_response(response):
    result = {}
    result['orig_url'] = url
    result['content'] = response.body or ''
    ...
    callback('http', task, result)
    return task, result
```
This function collects all the relevant information about a response into a dictionary, such as the URL, the status code and the actual response body, and then invokes the callback. Here the callback is a small method: send_result().
function send_result(self, type, task, result)
```python
def send_result(self, type, task, result):
    if self.outqueue:
        self.outqueue.put((task, result))
```
This last function puts the result into the output queue, where it waits to be read by the content processor.
Content Processor
The purpose of the content processor is to analyze the pages that have been fetched. Its process is again a big loop, but this time there are three output queues (status_queue, newtask_queue and result_queue) and only one input queue (inqueue).
Let's take a closer look at the loop in the function run().
function run(self)
```python
def run(self):
    while not self._quit:
        try:
            task, response = self.inqueue.get(timeout=1)
            self.on_task(task, response)
            self._exceptions = 0
        except KeyboardInterrupt:
            break
        except Exception as e:
            self._exceptions += 1
            if self._exceptions > self.EXCEPTION_LIMIT:
                break
            continue
```
The code of this function is relatively short and easy to understand: it simply takes the next task to analyze from the queue and analyzes it with the on_task(task, response) function. The loop listens for the interrupt signal: as soon as we send that signal to Python, the loop terminates. Finally, the loop counts the number of exceptions it has raised; too many exceptions will terminate the loop.
function on_task(self, task, response)
```python
def on_task(self, task, response):
    response = rebuild_response(response)
    project = task['project']
    project_data = self.project_manager.get(project, updatetime)
    ret = project_data['instance'].run(...)

    status_pack = {
        'taskid': task['taskid'],
        'project': task['project'],
        'url': task.get('url'),
        ...
    }
    self.status_queue.put(utils.unicode_obj(status_pack))

    if ret.follows:
        self.newtask_queue.put(
            [utils.unicode_obj(newtask) for newtask in ret.follows])

    for project, msg, url in ret.messages:
        self.inqueue.put(({...}))

    return True
```
The function on_task() is where the real work happens.
It uses the incoming task to find the project it belongs to. It then runs the project's custom script. Finally, it parses the response returned by the custom script. If everything goes well, a dictionary containing all the information obtained from the page is created. That dictionary is put into the status_queue, from which the scheduler will later pick it up again.
If the analyzed page contains new links that need to be processed, they are put into the newtask_queue, to be used later by the scheduler.
If necessary, Pyspider will also send results to other projects.
Finally, if any error occurs, for example a page returning an error, the error message is added to the log.
End!