Web crawlers and the algorithms and data structures they use

Source: Internet
Author: User



The quality of its web crawler largely determines how good or bad a search engine is. If you don't believe that, pick any website and check how thoroughly a search engine has indexed its pages: the strength of the crawler is roughly proportional to the quality of the search engine.

1. The world's simplest crawler -- a three-line poem

Let's take a look at the simplest of simple crawlers, written in Python in just three lines.
import requests
url = "http://www.cricode.com"
r = requests.get(url)
This three-line crawler is as crisp as the three-line poem below.

A good man,

when arguing with his girlfriend,

should go in expecting to lose.
2. A proper crawler program

The simplest crawler above is incomplete and crippled, because a crawler usually needs to do the following:
    • 1) Given a set of seed URLs, the crawler fetches the pages at those seed URLs
    • 2) The crawler parses the links out of the fetched pages and adds them to the collection of URLs to crawl
    • 3) Repeat steps 1 and 2 until the specified stopping condition is reached
So a complete crawler looks roughly like this:
import requests                        # used to fetch pages
from bs4 import BeautifulSoup          # used to parse pages

seeds = ["http://www.hao123.com",      # our seed URLs
         "http://www.csdn.net",
         "http://www.cricode.com"]

count = 0        # stopping condition: once we've crawled 10,000 pages, we stop playing

while count < 10000:
    if count < len(seeds):
        r = requests.get(seeds[count])
        count = count + 1
        do_save_action(r)                    # store the fetched page
        soup = BeautifulSoup(r.content)
        urls = soup.find_all("href", ...)    # parse the page for links
        for url in urls:
            seeds.append(url)
    else:
        break
3. Now let's find the faults

The complete crawler above is fewer than 20 lines of code, yet I'm sure you can pick out 20 faults in it, because its shortcomings are that many. Here is a list of its sins:
    • 1) Our task is to crawl 10,000 pages. With the program above, a single process crawls away silently on its own; assuming each page takes 3 seconds to fetch, crawling 10,000 pages takes 30,000 seconds. We should consider opening multiple threads (or a thread pool) to crawl together, or use a distributed architecture to crawl pages concurrently.
    • 2) The seed URLs and all the URLs parsed afterwards are dumped into one list. We should design a more suitable data structure to store the URLs to be crawled, such as a queue or a priority queue.
    • 3) We treat the URLs of every website equally; in fact they should be treated differently. The principle of giving good sites priority should be considered.
    • 4) Every request we make starts from a URL, and that involves a DNS lookup to translate the hostname into an IP address. A website is usually made up of thousands of URLs, so we should consider caching the IP addresses of these domains to avoid the cost of a DNS request every single time.
    • 5) We do no de-duplication of the URLs parsed from pages; everything goes straight into the to-crawl list. In fact, many links are duplicates, so we end up doing a lot of repeated work.
    • 6) .....
4. After picking out so many faults, we feel quite accomplished. Now the real question arrives: which school trains the best excavator drivers?

Now let's discuss, one by one, how to solve the problems identified above.

1) The parallel crawling problem

There are several ways to achieve parallelism.

Multi-threading or a thread pool: a single crawler opens multiple threads internally. Run several crawlers on the same machine and we have many crawl threads working at the same time, which can greatly reduce the total crawl time.
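
As a rough sketch of the thread-pool approach (the fetch_page helper, the worker count, and the seed list below are illustrative assumptions, not part of the program above), the standard library's ThreadPoolExecutor can fetch several pages at once:

# A minimal thread-pool crawling sketch; fetch_page and the seed list are
# illustrative placeholders, not part of the original program.
import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    # Fetch one page; a real crawler would also parse out new links here.
    r = requests.get(url, timeout=10)
    return url, r.status_code, len(r.content)

seeds = ["http://www.hao123.com",
         "http://www.csdn.net",
         "http://www.cricode.com"]

# 8 worker threads fetch pages concurrently instead of one by one.
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, status, size in pool.map(fetch_page, seeds):
        print(url, status, size)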

In addition, when the crawl job is very large, one machine on one network link is certainly not enough; we have to consider a distributed crawler. Common distributed architectures include master-slave, peer-to-peer, hybrid, and so on.

Once we talk about a distributed architecture, there are many issues to consider. We need to assign tasks, and the crawlers need to communicate and cooperate to finish the job without crawling the same page repeatedly. Task assignment should be fair, so we need to think about load balancing. For load balancing, the first thing that comes to mind is hashing, for example hashing by site domain name.

After the load balancer has dispatched the tasks, don't assume the job is done. What if one of the machines goes down? Who takes over the tasks originally assigned to the dead machine? And what if more machines are added later: how should the tasks be reassigned?

A better solution is to use consistent hashing.
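
A minimal sketch of the idea, assuming md5 as the hash function and an arbitrary number of virtual replicas per node (the node names below are made up): each crawler node is placed on a ring, and a URL is assigned to the first node clockwise from its own hash, so adding or removing a machine only remaps a small share of the URLs.

# A minimal consistent-hash ring; node names and replica count are illustrative.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []                  # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each physical node gets several virtual points on the ring.
        for i in range(self.replicas):
            self.ring.append((self._hash("%s#%d" % (node, i)), node))
        self.ring.sort()

    def get_node(self, key):
        # Walk clockwise from the key's hash to the first virtual point.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h,))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
print(ring.get_node("http://www.cricode.com/some/page"))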

2) The queue of pages to crawl

How to manage the to-crawl queue is a scenario similar to how an operating system schedules processes.

Different sites differ in importance, so we can design a priority queue to hold the links of the pages to crawl. Then, each time we crawl, we take the more important pages first.

Of course, you can also borrow an operating-system scheduling strategy such as the multilevel feedback queue algorithm.
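
As a sketch, assuming each URL arrives with some site-importance score (the scores below are invented for illustration), the to-crawl queue can be a heap that always pops the most important pending URL first:

# A minimal priority-queue crawl frontier; the importance scores are illustrative.
import heapq

class CrawlFrontier:
    def __init__(self):
        self._heap = []
        self._counter = 0          # tie-breaker so equal priorities stay FIFO

    def push(self, url, priority):
        # heapq is a min-heap, so negate the priority: higher scores pop first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)

frontier = CrawlFrontier()
frontier.push("http://www.csdn.net/article/1", priority=5)
frontier.push("http://www.cricode.com/", priority=9)
frontier.push("http://www.hao123.com/", priority=1)
while frontier:
    print(frontier.pop())          # prints the highest-priority URL first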

3) DNS caching

To avoid issuing a DNS query every time, we can cache DNS results. A DNS cache is simply a hash table that stores the domain names we have already seen and their IPs.
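
A minimal sketch of such a cache, assuming the standard library's socket.gethostbyname as the resolver: a plain dict maps each hostname to the IP found on first lookup, so later URLs from the same site skip the DNS query.

# A minimal DNS cache sketch using a plain dict as the hash table.
import socket
from urllib.parse import urlparse

dns_cache = {}                      # hostname -> IP address

def resolve(url):
    host = urlparse(url).hostname
    if host not in dns_cache:       # only the first URL of a site pays for the lookup
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]

print(resolve("http://www.cricode.com/page/1"))
print(resolve("http://www.cricode.com/page/2"))   # served from the cache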

4) Page de-duplication

When it comes to de-duplication, the first thing that comes to mind is spam filtering. A classic solution for spam filtering is the Bloom filter. The principle of a Bloom filter is simple: create a large bit array, then hash each URL with several hash functions to obtain several numbers, and set the bits at those positions in the array to 1. The next time a URL arrives, hash it with the same hash functions to obtain the same kind of numbers, and just check whether all the corresponding bits in the array are 1; if they are, the URL has (almost certainly) been seen before. That solves the URL de-duplication problem. Of course, this method has a false-positive rate, but as long as the error stays within our tolerance it's fine: out of 10,000 pages I only crawled 9,999, and as for the page that was skipped, who cares!
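
A minimal sketch of that idea (the bit-array size and the number of hash functions below are arbitrary choices; a real deployment would size them from the expected URL count and the acceptable false-positive rate):

# A minimal Bloom filter for URL de-duplication; size and hash count are illustrative.
import hashlib

class BloomFilter:
    def __init__(self, size=1 << 20, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, url):
        # Derive several hash values from md5 by salting the input with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, url)).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = BloomFilter()
seen.add("http://www.cricode.com/")
print("http://www.cricode.com/" in seen)   # True
print("http://www.csdn.net/" in seen)      # False (with high probability)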

5) Problems with data storage

Data storage is also a very technical issue. Whether to use a relational database, a NoSQL store, or a purpose-built file format for storage, there is a lot to think about.
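
As one possible sketch, assuming SQLite from the standard library is acceptable and using an invented table layout, page storage could start as simply as this:

# A minimal storage sketch using SQLite; the schema is only illustrative.
import sqlite3

conn = sqlite3.connect("pages.db")
conn.execute("""CREATE TABLE IF NOT EXISTS pages (
                    url TEXT PRIMARY KEY,
                    fetched_at TEXT,
                    content BLOB)""")

def save_page(url, content):
    # INSERT OR REPLACE keeps one row per URL even if we fetch it twice.
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, datetime('now'), ?)",
                 (url, content))
    conn.commit()

save_page("http://www.cricode.com/", b"<html>...</html>")
print(conn.execute("SELECT count(*) FROM pages").fetchone()[0])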

6) Inter-process communication

A distributed crawler inevitably relies on inter-process communication. We can exchange data in an agreed-upon format to accomplish communication between processes.
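
As a minimal sketch of "an agreed-upon format" (the message fields type, worker_id, and urls are invented for illustration, not any standard), the master and the worker processes could exchange JSON messages:

# A minimal sketch of inter-process messages in an agreed JSON format;
# the field names (type, worker_id, urls) are illustrative, not a standard.
import json

def make_task_message(worker_id, urls):
    # Message sent from the master to a worker: a batch of URLs to crawl.
    return json.dumps({"type": "crawl_task",
                       "worker_id": worker_id,
                       "urls": urls})

def parse_message(raw):
    # Every process parses incoming messages back into a dict the same way.
    return json.loads(raw)

raw = make_task_message("crawler-2", ["http://www.cricode.com/", "http://www.csdn.net/"])
msg = parse_message(raw)
print(msg["type"], len(msg["urls"]), "urls for", msg["worker_id"])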

7) ...
 
Enough rambling. Now the real question arrives, and it is not which school trains the best excavator drivers, but how to actually implement all of these things! :)

In the course of implementing them, you will find there is even more to consider than this. As the old verse goes: what comes from paper always feels shallow; to truly know a thing, you must do it yourself.
