I meant to post this last night, but the blog garden (cnblogs) was being migrated again......
A web crawler (also called a spider) is, as the name suggests, a worm that crawls across the Internet. Why does this worm crawl the Internet? Simple: to collect information. In the Internet age, whoever masters information holds the initiative. I used to think that companies doing search were philanthropists, spending their own money to serve the masses. How noble, I thought, until I learned that Google's annual profit comes from advertising. As the famous saying goes: the most expensive thing on the Internet is what's free, because it is easily accepted but hard to give up. (I suspect that if most people left the search engines, they would find it hard to use the Internet at all.)
Okay, let's get to the point. It is easy to see that the fundamental task of a web crawler is to capture data from the Internet and store it in a database or the local file system for later use.
The functionality of a web crawler looks very simple, but don't underestimate what's inside that box: many of the steps involve real technical depth (of course, as a newbie myself I won't go too deep, for fear of embarrassing myself with my limited level). Below is an internal implementation diagram of a common crawler:
Usually, a complete crawler consists of two parts: a scheduling part and a worker part. The scheduling part can be seen as the general controller (the master control generally includes a scheduling thread). It handles crawler startup and initialization (configuring the number of threads, the URL set size, loading the seed URLs, restoring the current crawl progress, the set of already-crawled URLs, and other configuration information), run-time scheduling (allocating resources to the worker threads), and cleanup (the policy to adopt when the URL set runs empty, and so on). Sometimes it is also responsible for interacting with external systems (for example, in a distributed setup where one crawler acts as a worker node, the scheduling thread needs to talk to the master server). The worker threads do the hard work and complete the actual tasks. A rough skeleton of this split is sketched below.
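Here is a minimal, hypothetical sketch of the state the scheduling part might hold; the class and field names are my own assumptions, not from any particular crawler:

```java
// A minimal sketch of the scheduler side described above. The names
// (CrawlerScheduler, urlQueue, crawled) are illustrative assumptions.
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class CrawlerScheduler {
    private final int workerThreads;                            // configured number of worker threads
    private final Queue<String> urlQueue = new LinkedList<>();  // URLs waiting to be crawled
    private final Set<String> crawled = new HashSet<>();        // URLs already fetched

    public CrawlerScheduler(int workerThreads, Iterable<String> seeds) {
        this.workerThreads = workerThreads;
        for (String seed : seeds) {
            urlQueue.add(seed);                                 // load the URL seeds at startup
        }
    }

    public void run() {
        // start the worker threads, hand out URLs from urlQueue,
        // and apply some cleanup policy once the queue runs empty
    }
}
```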
We can raise the following questions (they are all superficial, but they are the basics of implementing a crawler):
1. How do we request data from the Internet?
2. How do we implement the worker threads? How many should we start? How do they interact with the scheduling thread?
3. How do we deal with the extracted URLs (the crawling policy)?
4. How do we filter URLs?
Let's discuss them one by one.
1. How do we request data from the Internet?
Before discussing this, let's first think about another question: why, as soon as we type a URL into the browser and press Enter, does the corresponding page appear? Most of the time, the browser sends a request to the server based on the URL (usually over the HTTP protocol; if you are not familiar with it, I suggest reading Xiao Jia's excellent article series: http://www.cnblogs.com/TankXiao/category/415412.html), downloads the corresponding page to the local machine, and then the browser's renderer turns the page's source code into the graphical elements we see, as shown in the figure.
From the process above, we can see that we only need to simulate the browser's request to the server. In Java this is easy in most cases: Java itself provides request objects such as HttpURLConnection, and Apache's HttpClient sub-project can also implement these functions conveniently (of course, it is even better to be familiar with sockets; in some cases, using a Socket directly is not only more flexible but can also yield big performance gains). There is plenty of material on the Internet about the specifics of making requests, but be warned that many websites have put great effort into anti-crawler measures (we programmers love writing crawlers but hate having our own sites crawled by others, and crawlers often ignore the robots.txt protocol and can be extremely destructive; I once crawled a small website into the ground, for which I am very sorry, and I thank that site for letting a newbie practice on its server......). Therefore, familiarity with HTTP and other request protocols will get you twice the result with half the effort; at the very least, common requests should pose no problem. We should firmly believe in one principle of the Internet: what you can see, you can crawl.
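As a concrete illustration, here is a minimal fetch sketch using the JDK's HttpURLConnection mentioned above. The User-Agent string and timeouts are illustrative assumptions; real sites may require more headers, cookies, or redirect handling:

```java
// A minimal sketch of simulating a browser request with HttpURLConnection.
// The User-Agent value below is an assumption, not a requirement.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageFetcher {
    public static String fetch(String pageUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestMethod("GET");
        // pretend to be a normal browser; many sites reject the default Java agent
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(10000);

        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws Exception {
        // try any page you are allowed to crawl
        System.out.println(fetch("http://example.com/"));
    }
}
```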
Finally, I recommend that fellow newbies studying this area learn to use the developer tools of Chrome/Firefox and some packet-capture tools (such as Fiddler or a network sniffer).
2. How do we implement the worker threads? How many should we start? How do they interact with the scheduling thread?
To implement a good crawler (or even a merely usable one), we must use multithreading, which is also the natural design choice for any well-performing system in Java (you may have used it many times without realizing it). I won't describe the implementation details here. For a small crawler, I think there should be at least three parts: scheduling, work, and logging (logging is very useful for keeping the crawler's state under control; an uncontrollable program sounds unpleasant). Scheduling and logging are generally one thread each, while multiple worker threads need to be configured. Here I recommend that small crawlers use Java's thread pool mechanism (why? Because it saves a lot of effort), sketched below.
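A minimal sketch of the thread-pool idea, using the JDK's ExecutorService; the worker count and the body of each worker task are placeholder assumptions:

```java
// A minimal sketch of a crawler worker pool built on ExecutorService.
// WORKER_COUNT is an illustrative value; the post argues it should come
// from a configuration file instead of being hard-coded.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CrawlerPool {
    private static final int WORKER_COUNT = 8; // assumption: tune per machine

    public static void main(String[] args) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(WORKER_COUNT);
        for (int i = 0; i < WORKER_COUNT; i++) {
            workers.submit(() -> {
                // each worker would loop: take a URL, fetch it,
                // extract links, and report progress to the log
                System.out.println(Thread.currentThread().getName() + " started");
            });
        }
        workers.shutdown();                        // stop accepting new tasks
        workers.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```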
How many worker threads should we start? I honestly don't know; it is a hard question to answer. We know that the performance of software systems ultimately converges on one word: I/O (disk I/O and network I/O), which is closely tied to hardware. The performance gap between a PC and a server-room machine is often obvious when you run the same program on both. So how do we handle this? Here I'll mention a principle that an excellent crawler should follow: high configurability. That is, anything we may want to change frequently should be made configurable, which is common practice in much open-source software. This way we can define the size of the worker thread pool in a configuration file and adjust it according to how the server behaves after deployment.
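For instance, here is one hedged sketch of reading the pool size from a configuration file; the file name crawler.properties and the key worker.threads are my own illustrative choices:

```java
// A minimal sketch of the "high configurability" principle: read the
// worker-pool size from a properties file instead of hard-coding it.
import java.io.FileInputStream;
import java.util.Properties;

public class CrawlerConfig {
    public static int loadWorkerThreads(String path) {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream(path)) {
            props.load(in);
        } catch (Exception e) {
            return 4; // fall back to a safe default if the file is missing
        }
        return Integer.parseInt(props.getProperty("worker.threads", "4"));
    }

    public static void main(String[] args) {
        int threads = loadWorkerThreads("crawler.properties");
        System.out.println("worker threads: " + threads); // adjust the file after deployment
    }
}
```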
As for how the worker threads interact with the scheduling thread, it is actually through the set of URLs to be crawled (usually a queue) and the set of URLs already crawled (used for filtering; there are better alternatives, which we'll discuss later). The scheduling thread generally maintains a global URL queue, and each worker thread fetches one or more URLs from it at a time. Here we must pay special attention to resource competition between threads over the URLs, so the URL queue needs special protection (such as synchronized access or a thread-safe queue).
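A minimal sketch of such a shared URL frontier, using a BlockingQueue for the to-crawl set and a synchronized set for the crawled set; the class name UrlFrontier is my own:

```java
// A minimal sketch of worker/scheduler interaction through a shared URL
// queue. A BlockingQueue handles the thread-safety of the "to crawl" set;
// the visited set is wrapped for safe concurrent access.
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class UrlFrontier {
    private final BlockingQueue<String> toCrawl = new LinkedBlockingQueue<>();
    private final Set<String> visited = Collections.synchronizedSet(new HashSet<>());

    // scheduler side: offer a newly extracted URL if we have not seen it yet
    public void push(String url) {
        if (visited.add(url)) {   // add() returns false if already present
            toCrawl.offer(url);
        }
    }

    // worker side: block until a URL is available
    public String take() throws InterruptedException {
        return toCrawl.take();
    }
}
```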
Well, I write too slowly; two hours have passed and this is all I've managed. That's it for today. I'll tackle the last two questions another day.
Finally, I'm sharing an article I like, which is far better than my newbie writing. I have also read about some of the basic principles of Hadoop, and many of its implementation ideas inspired me deeply. Take a look if you have time.
Shared document (copyright owned by its original author): Summary of search engine system learning and development