In this article, we will analyze a web crawler.
A web crawler is a tool that scans a network and records any useful information it finds. It opens a number of pages, analyzes the contents of each one to extract the interesting data, stores that data in a database, and then does the same with other pages.
If the page being analyzed contains links, the crawler follows them and analyzes more pages.
Search engines are built on this very principle.
For this article I picked a stable yet still "young" open source project, pyspider, implemented by Binux.
Note: pyspider is designed to monitor the network continuously; it assumes that pages change over time, so after a while it will revisit the same page.
Overview
The crawler pyspider consists mainly of four components: a scheduler, a fetcher, a content handler (processor), and a monitoring component.
The scheduler accepts tasks and decides what to do with them. There are several possibilities: it can discard a task (perhaps that particular page has just been crawled) or assign it a different priority.
Once the priority of each task has been determined, the tasks are passed to the fetcher, which retrieves the pages. The process is intricate, but logically quite simple.
When the resources on the network have been fetched, the content handler is responsible for extracting the useful information. It runs a user-written Python script, which is not isolated in a sandbox. Its responsibilities also include catching exceptions and logs and managing them appropriately.
Finally, there is a monitoring component in the crawler pyspider.
The crawler Pyspider provides an unusually powerful web interface, which allows you to edit and debug your scripts, manage the entire crawl process, monitor ongoing tasks, and ultimately output results.
Projects and Tasks
In Pyspider, we have the concept of projects and tasks.
A task refers to a separate page that needs to be retrieved from the Web site and analyzed.
A project refers to a larger entity: it includes all the pages the crawler deals with, the Python script needed to analyze the pages, the database used to store the data, and so on.
In Pyspider we can run multiple projects at the same time.
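To make the "Python script" part concrete, here is roughly the kind of user script a project contains. It is essentially the quick-start sample from pyspider's own documentation, with a placeholder start URL: on_start() seeds the crawl, index_page() follows the links it finds, detail_page() returns the data to store, and the @every/@config decorators control how often pages are revisited, which is how the periodic re-crawling mentioned earlier is expressed.

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # seed the project with a starting page (placeholder URL)
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # every link found on the page becomes a new task
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dictionary is what ends up in the result storage
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }
```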
Code Structure Analysis
Root directory
The folders that can be found in the root directory are:
- data, an empty folder; this is where the data generated by the crawler is stored.
- docs, which contains the documentation for the project, written in markdown.
- pyspider, which contains the actual code of the project.
- tests, which contains quite a bit of test code.
Here I will focus on some important files:
- .travis.yml, a great sign of continuous integration. How do you make sure your project actually works? Testing it only on your own machine, with a fixed version of the libraries, is not enough.
- Dockerfile, another great tool! If I want to try a project on my machine, I just run Docker and don't need to install anything by hand; it is a great way to get developers involved in your project.
- LICENSE, necessary for any open source project; if you maintain an open source project yourself, don't forget this file.
- requirements.txt, which in the Python world lists the Python packages that must be installed on your system to run the software; it is required in any Python project.
- run.py, the main entry point of the software.
- setup.py, a Python script that installs pyspider on your system.
Having analyzed just the root directory of the project, we can already tell that it was developed in a very professional way. If you are developing any open source program, I hope you reach this level of care.
Folder pyspider
Let's go a bit deeper and analyze the actual code together.
Inside it we find other folders: the logic of the whole software has been split up to make it easier to manage and extend.
These folders are: database, fetcher, libs, processor, result, scheduler, and webui.
In this folder we can also find the main entry point of the whole project, run.py.
File run.py
This file first takes care of all the chores needed to make sure the crawler runs successfully, and eventually it spawns all the necessary components. Scrolling down, we find the entry point of the whole project: cli().
Function cli()
This function seems very complicated, but bear with me: it is not as complex as it looks. The main purpose of cli() is to create all the connections to the databases and to the messaging system. It mainly parses the command-line arguments and builds one big dictionary with everything we need. Finally, the real work starts with a call to the function all().
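pyspider builds this command-line handling on top of the click library. The sketch below only illustrates the pattern, with made-up option names and defaults rather than pyspider's real ones: every parsed option ends up in one big dictionary that the subcommands can reach through the click context.

```python
import click


@click.group(invoke_without_command=True)
@click.option('--taskdb', default='sqlite+taskdb:///data/task.db',
              help='database url for tasks (name and default are illustrative)')
@click.option('--message-queue', default=None,
              help='url of the message queue (illustrative)')
@click.pass_context
def cli(ctx, **kwargs):
    # collect every parsed option into one dictionary shared by all subcommands
    ctx.obj = dict(kwargs)
    if ctx.invoked_subcommand is None:
        # no subcommand given: start everything, like the default behaviour described above
        ctx.invoke(all_components)


@cli.command('all')
@click.pass_context
def all_components(ctx):
    click.echo('starting all components with config: %r' % ctx.obj)


if __name__ == '__main__':
    cli()
```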
Function all()
A web crawler performs a lot of IO operations, so a good idea is to spawn different threads or subprocesses to handle all of this work. That way, you can extract useful information from a previous page while waiting for the network to deliver the current HTML page.
The function all() decides whether to run subprocesses or threads, and then calls all the necessary functions in different threads or subprocesses. At this point pyspider spawns as many threads as needed for all the logical modules of the crawler, including the webui. When we finish the project and close the webui, every process is shut down cleanly.
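The thread-versus-subprocess choice boils down to a pattern like the following. This is only an illustrative sketch (run_in() and the fake components are made up, not pyspider's API): every component exposes a run loop, and it is started either in a thread or in a separate process.

```python
import multiprocessing
import threading


def run_in(component, use_process=False):
    # start a crawler component either as a thread or as a separate process
    worker_cls = multiprocessing.Process if use_process else threading.Thread
    worker = worker_cls(target=component)
    worker.start()
    return worker


def fake_scheduler():
    print('scheduler loop would run here')


def fake_fetcher():
    print('fetcher loop would run here')


if __name__ == '__main__':
    workers = [run_in(fake_scheduler), run_in(fake_fetcher, use_process=True)]
    for worker in workers:
        worker.join()  # wait for every component before shutting down cleanly
```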
Now our crawler is running; let's explore it a little more deeply.
The scheduler
The scheduler gets tasks from two different queues (newtask_queue and status_queue) and puts tasks into another queue (out_queue), which is later read by the fetcher.
The first thing the scheduler does is load from the database all the tasks that need to be completed. After that, it enters an infinite loop in which several methods are called:
1. _update_projects(): tries to update the various settings; for example, we may want to adjust the crawl speed while the crawler is running.
2. _check_task_done(): analyzes the completed tasks and saves them to the database; it gets the tasks from the status_queue.
3. _check_request(): if the content handler has asked for more pages to be analyzed, those pages sit in the newtask_queue, and this function fetches new tasks from that queue.
4. _check_select(): adds new web pages to the fetcher's queue.
5. _check_delete(): deletes the tasks and projects that have been flagged by the user.
6. _try_dump_cnt(): records the number of completed tasks in a file. This is necessary to avoid losing data in case of a crash.
    def run(self):
        while not self._quit:
            try:
                time.sleep(self.LOOP_INTERVAL)
                self._update_projects()
                self._check_task_done()
                self._check_request()
                while self._check_cronjob():
                    pass
                self._check_select()
                self._check_delete()
                self._try_dump_cnt()
                self._exceptions = 0
            except KeyboardInterrupt:
                break
            except Exception as e:
                logger.exception(e)
                self._exceptions += 1
                if self._exceptions > self.EXCEPTION_LIMIT:
                    break
                continue
The loop also checks for exceptions raised during the run, and for the case in which we ask Python to stop processing.
    finally:
        # exit components run in subprocess
        for each in threads:
            if not each.is_alive():
                continue
            if hasattr(each, 'terminate'):
                each.terminate()
            each.join()
The fetcher
The purpose of the fetcher is to retrieve network resources.
pyspider is able to handle both plain HTML pages and AJAX-based pages. It is important to know that only the fetcher is aware of this difference. We will focus only on plain HTML fetching, but most of the ideas carry over easily to the AJAX fetcher.
The idea here is similar in some ways to the scheduler: we have two queues, one for input and one for output, and a big loop. For every element in the input queue, the fetcher generates a request and puts the result into the output queue.
It sounds simple, but there is a big problem. The network is usually extremely slow, and if all computation is blocked waiting for a web page, the whole process runs very slowly. The solution is equally simple: do not block the computation while waiting for the network. The idea is to send out a large number of requests, a considerable portion of them at the same time, and then wait asynchronously for the responses to come back. As soon as a response comes back, another function, a callback, is invoked to handle it in the most appropriate way.
All the complex asynchronous scheduling in the crawler pyspider is done by another excellent open source project, Tornado (http://www.tornadoweb.org/en/stable/).
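As a standalone illustration of this idea (independent of pyspider's code, and using Tornado's coroutine interface instead of the callback style shown below), several pages can be requested at once and processed as the responses arrive:

```python
from tornado import gen, httpclient, ioloop


@gen.coroutine
def fetch_all(urls):
    client = httpclient.AsyncHTTPClient()
    # all requests are issued immediately; the yield waits for the
    # responses without blocking the IOLoop
    responses = yield [client.fetch(url, raise_error=False) for url in urls]
    raise gen.Return([(r.code, len(r.body or b'')) for r in responses])


if __name__ == '__main__':
    results = ioloop.IOLoop.current().run_sync(
        lambda: fetch_all(['http://example.com/', 'http://example.org/']))
    print(results)
```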
Now that we have the general idea in mind, let's explore how it is implemented.
    def run(self):
        def queue_loop():
            if not self.outqueue or not self.inqueue:
                return
            while not self._quit:
                try:
                    if self.outqueue.full():
                        break
                    task = self.inqueue.get_nowait()
                    task = utils.decode_unicode_obj(task)
                    self.fetch(task)
                except queue.Empty:
                    break

        tornado.ioloop.PeriodicCallback(queue_loop, 100, io_loop=self.ioloop).start()
        self._running = True
        self.ioloop.start()
Function run()
The function run() is the main loop of the fetcher.
It defines an inner function, queue_loop(), which reads all the tasks from the input queue and fetches them; it also listens for the interrupt signal. queue_loop() is passed as an argument to Tornado's PeriodicCallback class, which, as you may guess, calls queue_loop() at regular intervals. queue_loop() in turn calls another function that brings us one step closer to actually retrieving the web resource: fetch().
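Since PeriodicCallback is the heartbeat of this loop, here is a minimal, self-contained usage sketch (the 100 ms interval and the poll() function are just placeholders):

```python
from tornado import ioloop


def poll():
    # in the fetcher, this is where the input queue would be checked
    print('checking the input queue...')


if __name__ == '__main__':
    loop = ioloop.IOLoop.current()
    ioloop.PeriodicCallback(poll, 100).start()  # call poll() every 100 ms
    loop.call_later(1, loop.stop)               # stop the demo after one second
    loop.start()
```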
Function fetch(self, task, callback=None)
A resource on the network is retrieved either with the function phantomjs_fetch() or with the simpler http_fetch(); fetch() only decides which one is the correct way to retrieve the resource. Next we look at the function http_fetch().
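A rough sketch of that decision is shown below. The helper and the fetch/fetch_type keys are used illustratively here (in pyspider a handler script can request JavaScript rendering through its crawl options), so treat this as the idea rather than the library's exact code:

```python
def choose_fetch_method(task):
    # illustrative helper, not pyspider's API: decide which fetcher a task needs
    fetch_options = task.get('fetch', {})
    if fetch_options.get('fetch_type') == 'js':
        return 'phantomjs_fetch'  # pages that need JavaScript rendering
    return 'http_fetch'           # plain HTML pages


print(choose_fetch_method({'url': 'http://example.com/'}))
print(choose_fetch_method({'url': 'http://example.com/', 'fetch': {'fetch_type': 'js'}}))
```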
Function http_fetch(self, url, task, callback)
    def http_fetch(self, url, task, callback):
        '''HTTP fetcher'''
        fetch = copy.deepcopy(self.default_options)
        fetch['url'] = url
        fetch['headers']['User-Agent'] = self.user_agent

        def handle_response(response):
            ...
            return task, result

        try:
            request = tornado.httpclient.HTTPRequest(header_callback=header_callback, **fetch)
            if self.async:
                self.http_client.fetch(request, handle_response)
            else:
                return handle_response(self.http_client.fetch(request))
        except Exception:
            ...
Finally, this is where the real work is done. The code for this function is a bit long, but has a clear structure and is easy to read.
At the beginning of the function, it sets the headers of the fetch request, such as the User-Agent, the timeout, and so on. Then it defines a function to handle the response, handle_response(), which we will analyze shortly. Finally we get a Tornado request object and send it. Note how the same function is used to handle the response in both the asynchronous and the non-asynchronous case.
Let's go back and analyze what the function handle_response() does.
Function handle_response(response)
    def handle_response(response):
        result = {}
        result['orig_url'] = url
        result['content'] = response.body or ''
        ...
        callback('http', task, result)
        return task, result
This function collects all the relevant information about a response into a dictionary, such as the URL, the status code and the actual response body, and then invokes the callback. The callback here is a small method: send_result().
Function send_result(self, type, task, result)
    def send_result(self, type, task, result):
        if self.outqueue:
            self.outqueue.put((task, result))
This last function puts the result into the output queue, where it waits to be read by the content handler, the processor.
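The hand-off between fetcher and processor is plain queue-based message passing. Here is a self-contained sketch of that pattern (the dictionaries are made up and much simpler than pyspider's real task format):

```python
import queue
import threading

out_queue = queue.Queue()  # stands in for the fetcher's outqueue / processor's inqueue


def fetcher():
    # the fetcher side: push (task, result) pairs, as send_result() does
    for n in range(3):
        task = {'taskid': n, 'url': 'http://example.com/%d' % n}
        result = {'orig_url': task['url'], 'content': '<html>...</html>'}
        out_queue.put((task, result))


def processor():
    # the processor side: pop pairs and analyze them, as its run() loop does
    for _ in range(3):
        task, result = out_queue.get(timeout=1)
        print('processing', task['url'], 'content length', len(result['content']))


if __name__ == '__main__':
    worker = threading.Thread(target=fetcher)
    worker.start()
    processor()
    worker.join()
```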
The content handler (processor)
The purpose of the content handler is to analyze the pages that have been fetched. Its process is, again, a big loop, but this time there are three queues on the output side (status_queue, newtask_queue and result_queue) and only one queue on the input side (inqueue).
Let's analyze the loop in the function run() in a bit more depth.
Function run(self)
    def run(self):
        while not self._quit:
            try:
                task, response = self.inqueue.get(timeout=1)
                self.on_task(task, response)
                self._exceptions = 0
            except KeyboardInterrupt:
                break
            except Exception as e:
                self._exceptions += 1
                if self._exceptions > self.EXCEPTION_LIMIT:
                    break
                continue
The code of this function is relatively short and easy to understand: it simply gets the next task to analyze from the queue and analyzes it with the on_task(task, response) function. The loop listens for the interrupt signal: as soon as we send that signal to Python, the loop terminates. Finally, the loop counts the number of exceptions it has raised; too many exceptions will terminate the loop.
Function on_task(self, task, response)
    def on_task(self, task, response):
        response = rebuild_response(response)
        project = task['project']
        project_data = self.project_manager.get(project, updatetime)
        ret = project_data['instance'].run(...)

        status_pack = {
            'taskid': task['taskid'],
            'project': task['project'],
            'url': task.get('url'),
            ...
        }
        self.status_queue.put(utils.unicode_obj(status_pack))

        if ret.follows:
            self.newtask_queue.put(
                [utils.unicode_obj(newtask) for newtask in ret.follows])

        for project, msg, url in ret.messages:
            self.inqueue.put(({...}, {...}))

        return True
The function on_task() is the method that does the real work.
It tries to find the project to which the input task belongs, using the task itself. It then runs the custom script of that project. Finally, it analyzes the response returned by the custom script. If all goes well, a dictionary is created containing all the information obtained from the web page. Finally, the dictionary is put into the queue status_queue, where it will later be picked up by the scheduler.
If there are new links to process in the analyzed page, the new links are put into the newtask_queue and will later be used by the scheduler.
Now, if necessary, Pyspider will send the results to other projects.
Finally, if an error occurs, for example if a page returns an error, the error information is added to the log.
End!