"The beauty of Mathematics", the 9th chapter of graph theory and web crawler

1 Graph Theory

The origins of graph theory can be traced back to the time of the great mathematician Euler.

A graph in graph theory is composed of nodes (vertices) and the edges (arcs) that connect these nodes.

Breadth-first search (BFS)

Depth-first search (DFS)
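Both traversals can be written in a few lines. Below is a minimal sketch in Python, assuming the graph is given as an adjacency-list dictionary (the graph and vertex names are made up for illustration):

    from collections import deque

    def bfs(graph, start):
        """Breadth-first traversal of a graph given as an adjacency list."""
        visited = {start}
        queue = deque([start])
        order = []
        while queue:
            node = queue.popleft()      # FIFO queue: visit level by level
            order.append(node)
            for neighbor in graph[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        return order

    # A small undirected graph as an adjacency list (hypothetical example)
    graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
    print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D']

Swapping queue.popleft() for queue.pop() turns the queue into a stack, which yields an iterative depth-first traversal instead.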

2 Web Crawlers

In a web crawler, we use a hash table rather than a notepad to record whether a web page has already been downloaded.

The internet is now so large that the download task cannot be completed with one or a few servers; a commercial web crawler requires thousands of servers connected by a high-speed network.
3 Two Supplementary Notes on Graph Theory

3.1 The Proof of Euler's Seven Bridges Problem

For each vertex in a graph, the number of edges connected to it is defined as its degree.

Theorem: If a graph can be traversed starting from one vertex, passing through every edge exactly once, and returning to the starting vertex, then the degree of every vertex must be even.

Proof: Suppose we can traverse every edge of the graph exactly once. Then for each vertex, every time the path enters it along one edge, it must leave along another edge; the number of times a vertex is entered equals the number of times it is left. The edges at each vertex therefore pair up, one entering with one leaving, so the number of edges connected to each vertex is even, i.e. the degree of every vertex must be even.
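The even-degree condition is easy to check mechanically. Here is a small Python sketch; the adjacency list below is a hypothetical encoding of the Königsberg bridges, with repeated entries standing for parallel bridges:

    def all_degrees_even(graph):
        """Return True if every vertex's degree is even -- the necessary
        condition for a traversal that uses each edge exactly once and
        returns to its starting vertex."""
        return all(len(neighbors) % 2 == 0 for neighbors in graph.values())

    # The four land masses of Koenigsberg; the degrees are 5, 3, 3, 3
    koenigsberg = {
        "A": ["B", "B", "C", "C", "D"],
        "B": ["A", "A", "D"],
        "C": ["A", "A", "D"],
        "D": ["A", "B", "C"],
    }
    print(all_degrees_even(koenigsberg))  # False: no such traversal exists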
3.2 Engineering Essentials of Building a Web Crawler

First, whether to use BFS or DFS.

A web crawler does not traverse the web in simple BFS or DFS order; rather, it uses a relatively complex scheme that prioritizes downloads.

The subsystem that manages this priority ranking is generally called the scheduler; whenever one page finishes downloading, it determines which page to download next.

In a crawler, though, the traversal order has more of a BFS flavor than a DFS one.
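A minimal sketch of such a scheduler, using Python's heapq module as a priority queue (the priority values and URLs are made up; a real scheduler would compute priorities from page importance, freshness, politeness constraints, and so on):

    import heapq

    class Scheduler:
        """Toy download scheduler: always hand out the highest-priority URL.
        heapq is a min-heap, so priorities are stored negated."""

        def __init__(self):
            self._heap = []

        def add(self, url, priority):
            heapq.heappush(self._heap, (-priority, url))

        def next_url(self):
            return heapq.heappop(self._heap)[1] if self._heap else None

    scheduler = Scheduler()
    scheduler.add("http://example.com/", priority=10)        # e.g. a popular home page
    scheduler.add("http://example.com/old-page", priority=1)
    print(scheduler.next_url())  # http://example.com/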

Second, page parsing and URL extraction.

If some pages clearly exist on the web but are not indexed by a search engine, one possible reason is that the crawler's parser failed to extract the URLs embedded in non-standard scripts on those pages.
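For ordinary static HTML, link extraction is straightforward; the sketch below uses Python's standard html.parser. Note that it would miss exactly the case described above: links generated at run time by scripts never appear in the raw HTML text.

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect href attributes from <a> tags in static HTML."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    page = '<html><body><a href="http://example.com/next">next</a></body></html>'
    extractor = LinkExtractor()
    extractor.feed(page)
    print(extractor.links)  # ['http://example.com/next']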

Third, the "notepad" that records which pages have been downloaded: the URL table.

To prevent a page from being downloaded more than once, we can use a hash table to record which pages have already been downloaded; when the crawler encounters such a page again, it can simply skip it.
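This bookkeeping takes only a few lines, since Python's built-in set is itself a hash table with constant average-time lookups and insertions (the URLs below are hypothetical):

    downloaded = set()  # hash table of URLs already fetched

    def should_download(url):
        """Return True the first time a URL is seen; skip it afterwards."""
        if url in downloaded:   # O(1) average-case hash lookup
            return False
        downloaded.add(url)
        return True

    for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
        print(url, should_download(url))
    # .../a True, .../b True, .../a False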

The difficulty is how to keep the traffic to the servers that store this hash table from becoming the bottleneck of the whole crawler system. Two techniques help:

First, establish a clear division of labor among the download servers.

Then, on the basis of this division of labor, judgments about whether a URL has been downloaded can be batched, for example by sending a whole batch of queries to the hash table at once, or updating a large set of hash-table entries in one request, as sketched below.
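A sketch of both ideas together, assuming a hypothetical cluster of NUM_SERVERS hash-table servers: a URL's hash determines which server owns it (the division of labor), and queries are grouped by owning server so each server receives one batched request instead of many small ones:

    import hashlib

    NUM_SERVERS = 1000  # hypothetical cluster size

    def server_for(url):
        """Division of labor: each server owns a fixed slice of the hash space."""
        digest = hashlib.md5(url.encode()).hexdigest()
        return int(digest, 16) % NUM_SERVERS

    def batch_queries(urls):
        """Batching: group has-this-been-downloaded queries by owning server,
        so each server gets one request per batch instead of one per URL."""
        batches = {}
        for url in urls:
            batches.setdefault(server_for(url), []).append(url)
        return batches

    urls = ["http://example.com/%d" % i for i in range(5)]
    for server, batch in batch_queries(urls).items():
        print("server", server, "->", batch)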
