"The beauty of Mathematics", the 9th chapter of graph theory and web crawler

1 Graph Theory

The origins of graph theory can be traced back to the time of the great mathematician Euler.

A graph in graph theory is composed of nodes (vertices) and the edges (arcs) that connect these nodes.

Breadth-first search (BFS)

Depth-first search (DFS)
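Both traversals can be written in a few lines. Below is a minimal sketch in Python, assuming the graph is given as an adjacency-list dictionary (the graph and vertex names are made up for illustration):

    from collections import deque

    def bfs(graph, start):
        """Breadth-first traversal of a graph given as an adjacency list."""
        visited = {start}
        queue = deque([start])
        order = []
        while queue:
            node = queue.popleft()      # FIFO queue: visit level by level
            order.append(node)
            for neighbor in graph[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        return order

    # A small undirected graph as an adjacency list (hypothetical example)
    graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
    print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D']

Swapping queue.popleft() for queue.pop() turns the queue into a stack, which yields an iterative depth-first traversal instead.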

2 Web Crawlers

In a web crawler, we use a hash table rather than a notepad to record whether a web page has already been downloaded.

The internet is now so large that the download task cannot be completed with one or a few servers; a commercial web crawler requires thousands of servers connected by a high-speed network.
3 Two Supplementary Notes on Graph Theory

3.1 The Proof of Euler's Seven Bridges Problem

For each vertex in a graph, the number of edges connected to it is defined as its degree.

Theorem: If a graph can be traversed starting from one vertex, passing through every edge exactly once, and returning to the starting vertex, then the degree of every vertex must be even.

Proof: Suppose we can traverse every edge of the graph exactly once. Then for each vertex, every time the path enters it along one edge, it must leave along another edge; the number of times a vertex is entered equals the number of times it is left. The edges at each vertex therefore pair up, one entering with one leaving, so the number of edges connected to each vertex is even, i.e. the degree of every vertex must be even.
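The even-degree condition is easy to check mechanically. Here is a small Python sketch; the adjacency list below is a hypothetical encoding of the Königsberg bridges, with repeated entries standing for parallel bridges:

    def all_degrees_even(graph):
        """Return True if every vertex's degree is even -- the necessary
        condition for a traversal that uses each edge exactly once and
        returns to its starting vertex."""
        return all(len(neighbors) % 2 == 0 for neighbors in graph.values())

    # The four land masses of Koenigsberg; the degrees are 5, 3, 3, 3
    koenigsberg = {
        "A": ["B", "B", "C", "C", "D"],
        "B": ["A", "A", "D"],
        "C": ["A", "A", "D"],
        "D": ["A", "B", "C"],
    }
    print(all_degrees_even(koenigsberg))  # False: no such traversal exists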
3.2 Engineering Essentials of Building a Web Crawler

First, whether to use BFS or DFS.

A web crawler does not traverse the web in simple BFS or DFS order; rather, it uses a relatively complex scheme that prioritizes downloads.

The subsystem that manages this priority ranking is generally called the scheduler; whenever one page finishes downloading, it determines which page to download next.

In a crawler, though, the traversal order has more of a BFS flavor than a DFS one.
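A minimal sketch of such a scheduler, using Python's heapq module as a priority queue (the priority values and URLs are made up; a real scheduler would compute priorities from page importance, freshness, politeness constraints, and so on):

    import heapq

    class Scheduler:
        """Toy download scheduler: always hand out the highest-priority URL.
        heapq is a min-heap, so priorities are stored negated."""

        def __init__(self):
            self._heap = []

        def add(self, url, priority):
            heapq.heappush(self._heap, (-priority, url))

        def next_url(self):
            return heapq.heappop(self._heap)[1] if self._heap else None

    scheduler = Scheduler()
    scheduler.add("http://example.com/", priority=10)        # e.g. a popular home page
    scheduler.add("http://example.com/old-page", priority=1)
    print(scheduler.next_url())  # http://example.com/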

Second, page parsing and URL extraction.

If some pages clearly exist on the web but are not indexed by a search engine, one possible reason is that the crawler's parser failed to extract the URLs embedded in non-standard scripts on those pages.
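For ordinary static HTML, link extraction is straightforward; the sketch below uses Python's standard html.parser. Note that it would miss exactly the case described above: links generated at run time by scripts never appear in the raw HTML text.

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect href attributes from <a> tags in static HTML."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    page = '<html><body><a href="http://example.com/next">next</a></body></html>'
    extractor = LinkExtractor()
    extractor.feed(page)
    print(extractor.links)  # ['http://example.com/next']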

Third, the "notepad" that records which pages have been downloaded: the URL table.

To prevent a page from being downloaded more than once, we can use a hash table to record which pages have already been downloaded; when the crawler encounters such a page again, it can simply skip it.
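This bookkeeping takes only a few lines, since Python's built-in set is itself a hash table with constant average-time lookups and insertions (the URLs below are hypothetical):

    downloaded = set()  # hash table of URLs already fetched

    def should_download(url):
        """Return True the first time a URL is seen; skip it afterwards."""
        if url in downloaded:   # O(1) average-case hash lookup
            return False
        downloaded.add(url)
        return True

    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        print(url, should_download(url))
    # .../a True, .../b True, .../a False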

The difficulty is how to keep the traffic to the servers that store this hash table from becoming the bottleneck of the whole crawler system. Two techniques help:

First, establish a clear division of labor among the download servers.

Then, on the basis of this division of labor, judgments about whether a URL has been downloaded can be batched, for example by sending a whole batch of queries to the hash table at once, or updating a large set of hash-table entries in one request, as sketched below.
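A sketch of both ideas together, assuming a hypothetical cluster of NUM_SERVERS hash-table servers: a URL's hash determines which server owns it (the division of labor), and queries are grouped by owning server so each server receives one batched request instead of many small ones:

    import hashlib

    NUM_SERVERS = 1000  # hypothetical cluster size

    def server_for(url):
        """Division of labor: each server owns a fixed slice of the hash space."""
        digest = hashlib.md5(url.encode()).hexdigest()
        return int(digest, 16) % NUM_SERVERS

    def batch_queries(urls):
        """Batching: group has-this-been-downloaded queries by owning server,
        so each server gets one request per batch instead of one per URL."""
        batches = {}
        for url in urls:
            batches.setdefault(server_for(url), []).append(url)
        return batches

    urls = ["http://example.com/%d" % i for i in range(5)]
    for server, batch in batch_queries(urls).items():
        print("server", server, "->", batch)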
