Since the advent of the World Wide Web in 1994, the number of web pages on the Internet has grown exponentially, reaching hundreds of billions of pages in little more than twenty years. How do we search this enormous web and download the pages that are valuable for a particular scenario? What strategies ensure that pages are not downloaded twice? How do we keep a crawler fetching pages with high concurrency? How do we extract the key content of a web page? These questions are the focus of this blog post.
1.1 Analysis of the structure of the World Wide Web
Think of the World Wide Web as a directed graph: each web page is a node and each hyperlink is an edge. The links that point to a given page from other pages are called "reverse links" (back links), while the links from that page to other pages are called "forward links." A crawler can therefore traverse the web in two directions: forward traversal follows forward links, and reverse traversal follows reverse links.
Researchers found that forward and reverse traversals produce very different results: starting from some pages, a traversal reaches only a tiny collection of pages, while from others it explodes into billions of pages. From these experiments they concluded that the World Wide Web has a bow-tie structure, as shown in Figure 1-1.
Figure 1-1 The bow-tie structure of the World Wide Web [1]
The structure has three main parts plus the "legs" of the bow tie. The left part consists of "catalog pages," the navigation pages we commonly encounter: starting from this part, a forward traversal can reach at least three quarters of all pages, whereas a reverse traversal reaches only a very small fraction. The central part consists of strongly interconnected core pages: from here, either forward or reverse traversal reaches roughly three quarters of all pages. The right part consists of "authority pages," which are pointed to by the core pages; these pages enjoy high recognition and are referenced by many other pages, and traversal from them is symmetric to traversal from the left part. Finally, the "legs" of the bow tie are pages that are linked from the left part to other pages, or that link directly from the left or right part to the right part, plus a small number of pages with no links to the middle, left, or right at all; from these pages, neither forward nor reverse traversal reaches more than a limited number of pages.
From this analysis we conclude that a crawler should, as far as possible, start from the left part of the bow tie, or begin its traversal from pages in the core.
1.2 Web crawler
The web crawler is one of the most important components of a web page collection system and is the foundation of a search engine. This section describes the basic concepts of web crawlers, the architecture of a distributed crawler, several page-crawling strategies, and the robots protocol.
1.2.1 Crawler Concepts
A crawler downloads a web page, analyzes the links in it, and then visits the pages those links point to, repeating this cycle until the disk is full or a human intervenes. The crawler's job therefore boils down to two things: downloading web pages and discovering URLs.
A crawler accesses web pages the same way a browser does: it interacts with the web server over the HTTP protocol. The process is mainly as follows [2]:
1. The client program (the bot) connects to a DNS server, which converts the host name into an IP address. Because bots query DNS servers very frequently, they can cause side effects similar to a denial-of-service (DoS) attack, so many crawler implementations add a DNS cache to reduce unnecessary DNS queries; commercial search engines generally run their own DNS servers.
2. After the connection is established, the client sends an HTTP request to the web server to request a page. The most common request is a GET request, for example "GET http://www.sina.com.cn/index.html HTTP/1.1", which asks the server, using version 1.1 of the HTTP protocol, to return the page index.html under www.sina.com.cn to the client. The client can also use the POST command. A crawler mostly uses GET, which returns the entire content of the page. When re-visiting a page (many pages are updated over time, so they must be revisited to obtain the latest content), the crawler should try to use the HEAD command instead: it returns only the header of the page, which contains the Last-Modified field, so pages whose Last-Modified value has not changed since the copy in the database do not need to be downloaded again (see the sketch after this list).
3. The crawler analyzes the URL links in the page, inserts them into a queue, and extracts the important content of the page into the database. The queue is FIFO: each newly discovered URL is appended to the tail, and the next URL to visit is taken from the head. This loop repeats until the queue is empty, which is the commonly used breadth-first traversal.
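As a minimal sketch of step 2, the snippet below issues a HEAD request with java.net.HttpURLConnection and reads the Last-Modified header to decide whether a page needs to be downloaded again; the URL and the stored timestamp are placeholders chosen only for illustration.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HeadCheck {
        // Returns true if the page was modified after the timestamp stored in our database.
        static boolean needsRefetch(String pageUrl, long storedLastModified) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
            conn.setRequestMethod("HEAD");                  // ask for headers only, not the page body
            conn.setRequestProperty("User-Agent", "MyCrawler/0.1");
            long lastModified = conn.getLastModified();     // 0 if the server did not send the header
            conn.disconnect();
            // If the header is missing we conservatively re-fetch the page.
            return lastModified == 0 || lastModified > storedLastModified;
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical example: the copy in our database is one day old.
            long stored = System.currentTimeMillis() - 24L * 3600 * 1000;
            System.out.println(needsRefetch("http://www.sina.com.cn/index.html", stored));
        }
    }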
These three steps implement a simple web crawler, but many problems remain, such as:
1) How to avoid visiting duplicate URLs (visiting duplicate URLs leads to infinite loops);
2) How to follow the robots protocol;
3) How to avoid "angering" the web server with overly frequent requests;
4) How to design the page acquisition strategy for the electronic-product pages in this project;
5) How to design the page classifier;
6) How to convert between web page formats;
7) How to design the storage structure for web pages, and which database to use to store a huge number of pages;
8) How to design a re-visit strategy, since web pages go stale;
9) How to crawl web pages efficiently (explained in section 1.3).
URL duplication avoidance
Avoiding repeated visits to the same URL is critical: if a URL is visited repeatedly, the crawler inevitably falls into infinite recursive visits until its resources are exhausted. The general strategy is to maintain two tables, visited_table and unvisited_table. visited_table records the URLs that have already been visited, while unvisited_table acts as a "task pool" from which the crawler continually takes the next URL to visit; together they prevent repeated visits to the same page. The work proceeds as follows:
1. Add a control process for the crawler threads. Its main job is to control the crawling of pages and to maintain the two URL tables; it acts as a controller.
2. Each time a crawler thread crawls a page, it obtains the URL from unvisited_table, downloads the page, processes it and inserts it into the database, and at the same time extracts the links in the page and submits them to the control process.
3. After receiving the URL links from the crawler, the control process compares them against visited_table. If a URL is not present, it is inserted into unvisited_table; the control process then returns a URL to the crawler, which continues crawling.
visited_table can be implemented with a hash function, so that visited_table is just a bit array. Because hash functions suffer from collisions, a Bloom filter can be used when higher precision is required. The principle of the Bloom filter is simple: it uses several different hash functions. Initialize a bit array with all bits set to 0; for a URL url1, compute several different hash functions of it and set the corresponding positions in the bit array to 1. To determine whether url2 has been visited, apply the same hash functions to url2: if any of the corresponding bits is 0, the URL has not been visited; if all of them are 1, the URL is considered visited (with a small false-positive probability). The detailed analysis can be found in the references [6][7].
Figure 1-2 Example of how the Bloom filter works
As shown in Figure 1-2, when the Bloom filter is applied to URL2, the bits it hashes to are already set to 1, so URL2 is judged to have been visited.
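A minimal sketch of such a Bloom filter in Java is shown below, assuming two simple hash functions derived from the string hash code; a production crawler would use more and better hash functions and size the bit array according to the expected number of URLs.

    import java.util.BitSet;

    public class UrlBloomFilter {
        private final BitSet bits;
        private final int size;

        public UrlBloomFilter(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        // Two illustrative hash functions; real implementations use more, and better, hashes.
        private int hash1(String url) {
            return Math.floorMod(url.hashCode(), size);
        }
        private int hash2(String url) {
            return Math.floorMod(url.hashCode() * 31 + url.length(), size);
        }

        // Mark a URL as visited by setting all of its bit positions to 1.
        public void add(String url) {
            bits.set(hash1(url));
            bits.set(hash2(url));
        }

        // A URL is reported as visited only if every bit position is 1 (false positives are possible).
        public boolean mightContain(String url) {
            return bits.get(hash1(url)) && bits.get(hash2(url));
        }
    }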
In a concrete implementation, visited_table can be designed as a global map using Java's ConcurrentHashMap, the thread-safe version of HashMap (or a Bloom filter can be implemented by hand), while unvisited_table can be designed as a global, thread-safe list or queue. The control process runs as a separate thread, and the crawler threads communicate with it through a message-queue mechanism: each time a crawler downloads a page, it puts the relevant content into the message queue and waits for the control thread to process it.
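Below is a minimal sketch of this division of labor, assuming a single controller thread and several crawler threads; ConcurrentHashMap.newKeySet() stands in for visited_table, a LinkedBlockingQueue for unvisited_table, and a second queue for the messages the crawlers send back. The names and capacities are illustrative only.

    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.*;

    public class CrawlController {
        // visited_table: thread-safe set of URLs already handed out.
        private final Set<String> visited = ConcurrentHashMap.newKeySet();
        // unvisited_table: the "task pool" the crawler threads draw from.
        private final BlockingQueue<String> unvisited = new LinkedBlockingQueue<>();
        // Message queue through which crawler threads report newly discovered links.
        private final BlockingQueue<List<String>> discovered = new LinkedBlockingQueue<>();

        public CrawlController(List<String> seeds) {
            for (String seed : seeds) {
                if (visited.add(seed)) unvisited.add(seed);
            }
        }

        // Called by crawler threads to get the next URL to fetch.
        public String nextUrl() throws InterruptedException {
            return unvisited.take();
        }

        // Called by crawler threads after parsing a page.
        public void report(List<String> links) {
            discovered.add(links);
        }

        // Controller loop: filter out URLs we have already seen and enqueue the rest.
        public void run() throws InterruptedException {
            while (true) {
                List<String> links = discovered.take();
                for (String link : links) {
                    if (visited.add(link)) {   // add() returns false if the URL was already present
                        unvisited.add(link);
                    }
                }
            }
        }
    }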
Avoiding an "angry" web server
Why would a web server become "angry"? A web server cannot withstand frequent, rapid crawler visits. If the server is not very powerful, it will spend all of its time handling crawler requests instead of requests from real users; the crawler may then be treated as a DoS attack and its IP address banned. To avoid this, the crawler should wait a few seconds between requests to the same server, giving it enough time to handle other requests, and it should also obey the robots protocol.
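One simple way to implement this politeness rule is to remember when each host was last contacted and sleep until a minimum interval has passed; the sketch below assumes a fixed delay of about three seconds per host, which is an illustrative value rather than a standard.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class PolitenessDelay {
        private static final long MIN_DELAY_MS = 3000; // illustrative: ~3 seconds per host
        private final Map<String, Long> lastAccess = new ConcurrentHashMap<>();

        // Block until at least MIN_DELAY_MS has elapsed since the last request to this host.
        public void waitForHost(String host) throws InterruptedException {
            long now = System.currentTimeMillis();
            long last = lastAccess.getOrDefault(host, 0L);
            long wait = MIN_DELAY_MS - (now - last);
            if (wait > 0) {
                Thread.sleep(wait);
            }
            lastAccess.put(host, System.currentTimeMillis());
        }
    }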
Robots protocol
The robots protocol is the way a web site communicates with search-engine crawlers: the site administrator places a robots.txt file at the root of the site, for example https://www.google.com/robots.txt.
In this file, User-agent identifies the crawler the rules apply to (for example, * means all crawlers), Disallow lists directories the crawler must not fetch, and Allow lists pages or directories it may fetch. A concrete search-engine implementation should therefore include a robots-protocol analysis module and strictly obey the protocol, crawling only the directories and pages the web host allows.
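The sketch below shows one simple, hedged way to check a URL path against robots.txt rules in Java: it collects the Disallow prefixes in the "User-agent: *" section and rejects paths that start with any of them. Real robots.txt files also support Allow rules, wildcards and per-crawler sections, which are omitted here for brevity.

    import java.util.ArrayList;
    import java.util.List;

    public class SimpleRobotsRules {
        private final List<String> disallowed = new ArrayList<>();

        // Parse only the "User-agent: *" section and remember its Disallow prefixes.
        public SimpleRobotsRules(String robotsTxt) {
            boolean inStarSection = false;
            for (String rawLine : robotsTxt.split("\n")) {
                String line = rawLine.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    inStarSection = line.substring("user-agent:".length()).trim().equals("*");
                } else if (inStarSection && line.toLowerCase().startsWith("disallow:")) {
                    String path = line.substring("disallow:".length()).trim();
                    if (!path.isEmpty()) disallowed.add(path);
                }
            }
        }

        // A path is allowed unless it starts with one of the disallowed prefixes.
        public boolean isAllowed(String path) {
            for (String prefix : disallowed) {
                if (path.startsWith(prefix)) return false;
            }
            return true;
        }
    }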
Page Capture Strategy
There are two common strategies for vertical (topic-specific) search:
1. Download pages from the entire Internet and then discard the irrelevant ones. The drawback is that this consumes an enormous amount of disk space and bandwidth, so it is not usable in a concrete implementation.
2. Exploit the fact that a page on a given topic usually links to pages on related topics. Anchor text is especially important here because it indicates the topic of the linked page, so in practice several authoritative pages on the target topic are used as seed pages.
The second approach also relies on text classification: the crawler uses a classifier to decide whether a fetched page is related to the given topic, commonly a naive Bayes classifier or a support vector machine. These two classifiers are briefly introduced below.
Classifier
1. Naive Bayesian classifier
2. Support Vector Machine
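As a hedged illustration of the first option, the sketch below implements a tiny multinomial naive Bayes classifier that scores a page's tokenized text as relevant or not relevant to the topic. The tokenization and Laplace smoothing shown here are placeholder choices; a real system would train on a labeled corpus of electronic-product pages.

    import java.util.HashMap;
    import java.util.Map;

    public class NaiveBayesTopicClassifier {
        // Word counts for the two classes: index 1 = relevant, index 0 = not relevant.
        private final Map<String, int[]> wordCounts = new HashMap<>();
        private final int[] docCounts = new int[2];
        private final int[] totalWords = new int[2];

        // Add one training document with a known label (true = relevant to the topic).
        public void train(String text, boolean relevant) {
            int c = relevant ? 1 : 0;
            docCounts[c]++;
            for (String w : text.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                wordCounts.computeIfAbsent(w, k -> new int[2])[c]++;
                totalWords[c]++;
            }
        }

        // Classify a new document by comparing log-probabilities under the two classes.
        public boolean isRelevant(String text) {
            double[] logProb = new double[2];
            int vocab = wordCounts.size();
            for (int c = 0; c < 2; c++) {
                // Class prior P(c), with add-one smoothing to guard against empty training data.
                logProb[c] = Math.log((docCounts[c] + 1.0) / (docCounts[0] + docCounts[1] + 2.0));
                for (String w : text.toLowerCase().split("\\W+")) {
                    if (w.isEmpty()) continue;
                    int[] counts = wordCounts.getOrDefault(w, new int[2]);
                    // Laplace-smoothed P(word | class).
                    logProb[c] += Math.log((counts[c] + 1.0) / (totalWords[c] + vocab + 1.0));
                }
            }
            return logProb[1] > logProb[0];
        }
    }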
Web page format Conversion
Text documents on a computer are stored in hundreds of mutually incompatible formats. Common text formats include plain text, HTML, XML, Word and PDF. If they are not handled correctly, garbled text often results, so a tool is needed that converts each text format it encounters into a common format; in this project, documents can be converted to HTML. Another problem when storing files is character encoding. The usual solution is to read the encoding declared in the page header and then parse the page with that encoding.
So, when a page is downloaded, first inspect its header information to determine the text format and character encoding, and then process it accordingly.
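A minimal sketch of this header inspection is shown below: it reads the charset parameter from the Content-Type response header and decodes the page bytes with that charset, falling back to UTF-8 when none is declared. Pages that only declare their encoding in an HTML meta tag would need an extra parsing step not shown here.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.Charset;

    public class EncodingAwareFetcher {
        // Download a page and decode it using the charset declared in the Content-Type header.
        static String fetch(String pageUrl) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
            String contentType = conn.getContentType();   // e.g. "text/html; charset=GBK"
            Charset charset = Charset.forName("UTF-8");   // fallback when no charset is declared
            if (contentType != null) {
                for (String part : contentType.split(";")) {
                    part = part.trim();
                    if (part.toLowerCase().startsWith("charset=")) {
                        charset = Charset.forName(part.substring("charset=".length()).trim());
                    }
                }
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), charset);
            }
        }
    }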
Web Storage Issues
Two questions arise when storing web pages: in what format should a page be stored, and what kind of database should be used?
1. The first question is the storage format. If each downloaded page is stored directly as its own file or record, there are two issues:
1) Each page is about 70 KB on average. Transferring 70 KB from disk is fast, perhaps under 1 millisecond, but seeking to the file may take about 10 milliseconds, so opening many scattered small files to read documents carries a large time overhead. A good solution is to store many documents together in a single file, using a self-describing format such as:
<DOC>
<DOCNO>102933432</DOCNO>
<DOCHEADER>
http://www.sina.com.cn/index.html text/html 440
HTTP/1.1 200 OK
IP: 221.236.31.210
Date: Wed, 09:32:23 GMT
Content-Encoding: gzip
Last-Modified: Tue, Jan 2016 01:52:09 GMT
Server: nginx
Content-Length: 119201
</DOCHEADER>
<!DOCTYPE html>
... (original page content) ...
</DOC>
In the example:
<DOC></DOC> delimits one web page record, which contains <DOCNO>, <DOCHEADER> and the original page content; <DOCNO></DOCNO> marks the page's number; <DOCHEADER></DOCHEADER> marks the page's header information, most of which is returned by the web server; the remainder of the record is the page content itself.
2) If the amount of data is very large, the stored files are usually compressed to save disk space.
2. The second question is the choice of storage system. One can choose a relational database or a NoSQL database. Mainstream search engines rarely use traditional relational databases to store documents, because the sheer volume of document data would crush a relational database system; the strong distributed capabilities, massive storage capacity and fault tolerance of non-relational databases have led search engines to favor them. For this project, MongoDB or a BigTable-style database system can be chosen; like MySQL, they enjoy strong community support and free, open-source availability.
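The sketch below illustrates the batched, compressed storage format described above: several page records wrapped in <DOC> markers are appended to one gzip-compressed file. The tag names follow the example in this section, while the file name and record fields are placeholders; reading the data back would scan the file sequentially or use a separate (docNo, offset) index, which is beyond this sketch.

    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class BatchDocWriter implements AutoCloseable {
        private final Writer out;

        public BatchDocWriter(String fileName) throws Exception {
            // One file holds many documents; gzip keeps the disk footprint small.
            out = new OutputStreamWriter(
                    new GZIPOutputStream(new FileOutputStream(fileName)), StandardCharsets.UTF_8);
        }

        // Append one page record in the <DOC>/<DOCNO>/<DOCHEADER> format used in this section.
        public void append(long docNo, String header, String html) throws Exception {
            out.write("<DOC>\n");
            out.write("<DOCNO>" + docNo + "</DOCNO>\n");
            out.write("<DOCHEADER>\n" + header + "\n</DOCHEADER>\n");
            out.write(html + "\n</DOC>\n");
        }

        @Override
        public void close() throws Exception {
            out.close();
        }
    }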
Page re-visit mechanism
Because many web pages are updated over time, the crawler must revisit them to obtain the latest content and keep the page repository up to date. Different web sites are updated at different frequencies, so the way pages change must be analyzed and modeled.
Research shows that changes to a web page can be modeled as a Poisson process; see the references for details [10].
There are two common types of revisit strategies:
1) Uniform re-visit strategy: the crawler revisits all crawled pages at the same frequency, so every page gets the same update opportunity regardless of how often it actually changes.
2) Individual re-visit strategy: because pages change at different rates, the crawler revisits each page at a frequency tailored to it, keeping each page's re-visit rate matched to its rate of change.
Each approach has its pros and cons. For the electronic-product search in this project, pages are updated slowly and at similar rates, so strategy 1 can be adopted.
1.3 Distributed web crawler
Most of the discussion above assumes a single node, but a single-machine system can hardly cope with searching the massive number of web pages on the Internet, which pushes us toward a distributed web page collection system. Because collecting individual pages can be treated as independent tasks, a distributed system can greatly speed up collection. This section focuses on the general architecture of a distributed web crawler and the main problems that arise in its design.
As seen earlier (Figure 1-1), each collector is paired with a controller. In the distributed collection system there are multiple collector-controller pairs plus one overall controller, as shown in Figure 1-4:
Figure 1-4 Distributed Web Collection system
Figure 1-4 shows the overall structure of the distributed web collection system, which works as follows:
1.3.1 Crawl process
The crawl process works as follows:
Figure 1-5 Crawl process schematic flowchart
Note: There are multiple crawl threads within a crawl process.
Figure 1-5 is explained as follows:
a. When a crawl thread is about to crawl a URL, the robots-protocol module first determines whether the web server allows that URL to be crawled. If so, go to b; otherwise go to c.
b. The crawl thread submits the crawled page to the page-processing module, which extracts the links in the page, organizes the page content into the format described in section 1.2.1, compresses the organized content into the page database, and returns all of the page's links to the crawl thread.
c. The crawl thread submits all links from the page-processing module (an empty set if the page was not crawled) and the current URL to the coordination process; the coordination process returns the next URL to crawl to the crawl thread, and the cycle returns to a.
In step b, the page is organized into the record format described in section 1.2.1, and the organized content is compressed before being inserted into the database.
1.3.2 Coordination process
The design of the coordination process is key: it involves assigning URLs to the crawl processes, handling non-local URLs, managing crawler threads, and so on.
The coordination processes are numbered from 0 to n-1, where n is the number of collection nodes, and each coordination process manages its own set of URLs according to the following policy:
Let URL = {url1, url2, ...} be the set of all URLs, and define HOST(url) as the domain-name part of a page address, which usually corresponds to one web server. For example, if url = http://www.scie.uestc.edu.cn/main.php?action=viewteacher, then HOST(url) = http://www.scie.uestc.edu.cn. The strategy is to establish a mapping from HOST(url) to {0, ..., n-1}: once a HOST(url) is mapped to a collection node, that node is responsible for collecting all pages under that host. The mapping function can be a hash function. Each coordination process also maintains the two tables described in section 1.2.1, visited_table and unvisited_table. The pseudo-code for the coordination process is as follows [2]:
for (;;) begin
    a. Wait either for a URL to arrive from another node, or for a crawl process governed by this coordinator to return a crawled URL and its related links.
    b. If a URL arrives from another node:
        b.1 Check whether the URL already appears in visited_table; if not, put it into unvisited_table.
    c. If a crawled URL and its hyperlinks arrive from a crawl process:
        c.1 Take a new URL from unvisited_table and hand it to the crawl process, and put the returned URL into visited_table.
        c.2 For each hyperlink link, hash the string HOST(link) modulo n to obtain an integer i.
        c.3 For each hyperlink link and its corresponding integer i:
            c.3.1 If this node is numbered i, perform action b.1 on link.
            c.3.2 Otherwise, send link to node i.
end
The pseudo-code above is the working algorithm of the coordination process, explained in detail as follows:
In step a, the coordination process handles only URLs that belong to this node's region, so it waits for two kinds of events: URLs belonging to this node arriving from other nodes, and crawled URLs with their related links arriving from the crawl processes on this node.
When the event is a URL from another node, the process first checks whether the URL has already been visited (by hashing it and comparing against visited_table, as described in section 1.2.1); if not, it is inserted into unvisited_table for the crawl processes.
When the event is a URL crawled by a crawl process on this node together with its related links (the links parsed out of the page content by the page-processing module), the coordination process first assigns the next URL to the crawl process. It then processes the links: each link's host is hashed, and if the result equals this coordination process's number, action b.1 is performed, that is, the link is discarded if it has already been visited and inserted into unvisited_table otherwise; if the hash result is not this coordination process's number, the link is sent to the corresponding coordination process.
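A minimal sketch of the HOST(url)-to-node mapping is shown below, assuming n coordination processes and using Java's URL class to extract the host; the hash choice and the node count are illustrative, and a production system would use a more uniform hash.

    import java.net.URL;

    public class HostPartitioner {
        private final int numNodes; // n: the number of coordination processes

        public HostPartitioner(int numNodes) {
            this.numNodes = numNodes;
        }

        // Map a URL to the node responsible for its host: hash(HOST(url)) mod n.
        public int nodeFor(String pageUrl) throws Exception {
            String host = new URL(pageUrl).getHost();     // e.g. "www.scie.uestc.edu.cn"
            return Math.floorMod(host.hashCode(), numNodes);
        }

        public static void main(String[] args) throws Exception {
            HostPartitioner partitioner = new HostPartitioner(4); // illustrative: 4 nodes
            System.out.println(
                partitioner.nodeFor("http://www.scie.uestc.edu.cn/main.php?action=viewteacher"));
        }
    }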
1.3.3 Dispatch module
The dispatch module maintains information about every coordination process registered in the system, including its IP address and port number. When the information of any coordination process changes, the dispatch module is responsible for propagating the updated information to the other coordination processes.
The dispatch module is the key to the scalability of the web collection system, in which URLs are assigned to coordination processes as described in the previous section: when one of the coordination processes crashes for some reason, the dispatch module redistributes that process's responsibilities to the other coordination processes.
1.4 Summary
This chapter explained the composition, key problems and design details of the web collection system for electronic products. It began with the concept of a web crawler, then described the series of problems a crawler must consider, such as avoiding duplicate URLs, format conversion and storage. Because a single-machine system can hardly crawl a massive number of pages, the chapter then explained the architecture of a distributed web collection system. In an actual system design, practical trade-offs must of course still be made on top of this material.
References
[1] Pan Xuefeng, Hua Guichun, Liang Bin. Into the Search Engine. Beijing: Publishing House of Electronics Industry, May 2011.
[2] W. Bruce Croft, Donald Metzler, Trevor Strohman. Search Engines: Information Retrieval in Practice. Beijing: China Machine Press, February 2010.
[3] Li Xiaoming, Yan Hongfei, Wang Jimin, et al. Search Engines: Principles, Technology and Systems. Beijing: Science Press, 2012.
[4] https://en.wikipedia.org/wiki/Web_search_engine.
[5] https://en.wikipedia.org/wiki/Hash_function.
[6] https://en.wikipedia.org/wiki/Bloom_filter.
[7] Bloom, Burton H. (1970). "Space/time trade-offs in hash coding with allowable errors". Communications of the ACM 13 (7): 422-426. doi:10.1145/362686.362692.
[8] https://en.wikipedia.org/wiki/Naive_Bayes_classifier.
[9] Https://en.wikipedia.org/wiki/Support_vector_machine.
[10] Cho, J. and Garcia-Molina, H. Estimating frequency of change. ACM Transactions on Internet Technology, Vol. 3, No. 3, August 2003.
[11] https://en.wikipedia.org/wiki/NoSQL
[12] https://en.wikipedia.org/wiki/Rabin_fingerprint