The Anatomy of a Large-Scale Hypertextual Web Search Engine

Source: Internet
Author: User
The original paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", is available at http://infolab.stanford.edu/~backrub/google.html.

This article analyzes Google, a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext. Beyond the problems of scaling traditional search techniques to data of this magnitude, it introduces new technical challenges, including using the additional information present in hypertext to produce better search results. It also looks at how to deal effectively with uncontrolled hypertext collections, where anyone can publish anything.

1. Introduction

Search engine technology has had to adapt continually to keep pace with the growth of the Web. Creating a search engine presents a variety of challenges: fast crawling technology is needed to gather the latest Web documents, and storage space must be used efficiently to store indexes and the documents themselves. The indexing system must handle very large data volumes and answer queries quickly. As the Web grows, these tasks become harder and harder. Although improving hardware performance and falling costs partially offset the difficulty, problems such as disk seek time and operating system robustness remain.

2. System Features

The Google search engine has two important features that help it produce high-precision results. First, it uses the link structure of the Web to compute a quality ranking for each Web page, called PageRank. Second, Google uses anchor text to improve search results.

Computing PageRank: PageRank does not count links to a page equally; it normalizes by the number of links on the linking page. Assume pages T1...Tn point to page A, and let d be a damping factor between 0 and 1, usually set to 0.85. Also define C(A) as the number of links going out of page A. The PageRank of page A is then:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over Web pages, so the PageRanks of all pages sum to one.
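For concreteness, here is a minimal sketch in Python of the iterative fixed-point computation this formula implies. The graph representation, the convergence tolerance, and the iteration cap are illustrative assumptions, not details from the paper.

```python
def pagerank(links, d=0.85, iterations=50, tol=1e-6):
    """Iteratively compute PR(A) = (1-d) + d * sum(PR(T)/C(T)) over
    all pages T linking to A.  `links` maps each page to the list of
    pages it links out to (so C(T) = len(links[T]))."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pr = {p: 1.0 for p in pages}                # arbitrary initial guess
    for _ in range(iterations):
        nxt = {p: 1.0 - d for p in pages}       # the (1-d) term
        for src, targets in links.items():
            if not targets:
                continue
            share = d * pr[src] / len(targets)  # d * PR(T)/C(T)
            for t in targets:
                nxt[t] += share
        converged = max(abs(nxt[p] - pr[p]) for p in pages) < tol
        pr = nxt
        if converged:
            break
    return pr

# Example: A and B link to each other, and both link to C.
print(pagerank({"A": ["B", "C"], "B": ["A", "C"], "C": []}))
```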
Anchor text: the text of a link is associated with the page the link points to. This has two benefits. First, anchor text often provides a more accurate description of a Web page than the page itself. Second, anchor text makes it possible to index documents that cannot be indexed by a text-based search engine, such as images, programs, and databases.

3. Related Work

Not covered in detail here.

4. System Anatomy

First, we discuss the architecture as a whole. Then we describe the important data structures in more depth. Finally, the major applications are examined one by one: crawling, indexing, and searching.

4.1 Google Architecture Overview

Figure 1: Google's overall architecture.

In Google, Web crawling is done by several distributed crawlers. A URL server sends lists of URLs to be fetched to the crawlers. The fetched Web pages are sent to the storage server (Store Server), which compresses them and stores them in the repository. Every Web page has an associated ID number called a docid, which is assigned whenever a new URL is parsed out of a page.

Indexing is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. A hit records the word, its position in the document, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of barrels, creating a partially sorted forward index. The indexer also performs another important function: it parses out all the links in every Web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file, converts relative URLs into absolute URLs and in turn into docids. It puts the anchor text into the forward index, associated with the docid that the anchor points to. It also generates a database of links, which are pairs of docids. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docid, and resorts them by wordid to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of wordids and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a Web server and answers queries using the lexicon built by DumpLexicon together with the inverted index and the PageRanks.

4.2 Major Data Structures

4.2.1 BigFiles

BigFiles are virtual files spanning multiple file systems and addressable by 64-bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles the allocation and deallocation of file descriptors, since the operating system does not provide enough for our needs. BigFiles also supports rudimentary compression options.

4.2.2 Repository

The repository contains the full HTML of every Web page. Each page is compressed using zlib (see RFC 1950). In the repository, the documents are stored one after the other, prefixed by docid, length, and URL, as shown in Figure 2. The repository requires no other data structures in order to be accessed. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from just the repository and a file of crawler errors.

Figure 2: Repository data structure.
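As a rough illustration of the repository layout just described, the sketch below writes and reads records of the form (docid, length, URL, zlib-compressed HTML). The exact field widths and framing are assumptions for the example; the paper does not specify them.

```python
import struct
import zlib

HEADER = struct.Struct("<QII")  # docid, compressed length, URL length

def write_record(f, docid, url, html):
    """Append one record: fixed header, then URL bytes, then the
    zlib-compressed page bytes."""
    url_b = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    f.write(HEADER.pack(docid, len(body), len(url_b)))
    f.write(url_b)
    f.write(body)

def read_records(f):
    """Scan the repository from the start, yielding each record."""
    while True:
        raw = f.read(HEADER.size)
        if len(raw) < HEADER.size:
            return
        docid, body_len, url_len = HEADER.unpack(raw)
        url = f.read(url_len).decode("utf-8")
        html = zlib.decompress(f.read(body_len)).decode("utf-8")
        yield docid, url, html

with open("repo.dat", "wb") as f:
    write_record(f, 1, "http://example.com/", "<html>hello</html>")
with open("repo.dat", "rb") as f:
    for rec in read_records(f):
        print(rec)
```

Note that, as in the paper, a sequential scan of this file plus the lexicon-building machinery is enough to rebuild every other index structure.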
4.2.3 Document Index

The document index keeps information about each document. It is an ISAM (Index Sequential Access Mode) index, ordered by docid. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, the entry also contains a pointer into a variable-width file called docinfo that holds its URL and title; otherwise the pointer points into a list that contains only URLs. This design was motivated by the desire for a compact data structure, and the ability to fetch a record in one disk seek during a search.

Additionally, there is a file used to convert URLs into docids. It contains URL checksums together with the corresponding docids and is sorted by checksum. To find the docid of a particular URL, the URL's checksum is computed and a binary search is performed on this file. URLs can be converted into docids in batch by doing a merge with this file; this is the technique the URLresolver uses. The batch mode of conversion is crucial, because otherwise we would have to perform one disk seek for every link.

4.2.4 Lexicon

The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory at a reasonable cost. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added). It is implemented in two parts: a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions the list of words carries some auxiliary information, a full discussion of which is beyond the scope of this article.

4.2.5 Hit Lists

A hit list corresponds to the occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indexes, so it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization: simple encoding (a triple of integers), compact encoding (a hand-optimized allocation of bits), and Huffman coding. In the end we chose the hand-optimized compact encoding, since it requires far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3. Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag; plain hits include everything else. A plain hit consists of a capitalization bit, the font size, and 12 bits of word position in the document.
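To make the two-byte plain-hit encoding concrete, here is an illustrative bit-packing sketch in Python. Splitting the four non-position bits into one capitalization bit and three bits of font size follows the original paper; the ordering of the fields within the 16 bits is our assumption.

```python
# A plain hit packed into 16 bits: 1 capitalization bit,
# 3 bits of relative font size, 12 bits of word position.
def pack_plain_hit(capitalized, font_size, position):
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

h = pack_plain_hit(True, 3, 917)
print(hex(h), unpack_plain_hit(h))  # (True, 3, 917)
```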
4.2.6 Forward Index

The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordids. If a document contains words that fall into a particular barrel, its docid is recorded into the barrel, followed by a list of wordids with the hit lists corresponding to those words. This scheme requires slightly more storage because of the duplicated docids, but the difference is small because there are many barrels, and it saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing each wordid itself, we store its offset from the minimum wordid of the barrel it falls into. This way, we can use just 24 bits for the wordids in the unsorted barrels, leaving 8 bits for the hit list length field.

4.2.7 Inverted Index

The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordid, the lexicon contains a pointer into the barrel that the word falls into. It points to a doclist of docids together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents. An important issue is the order in which the docids should appear in the doclist. One simple solution is to store them sorted by docid. This allows quick merging of different doclists for multiple-word queries. Another option is to store them sorted by a ranking of the word's occurrence in each document. This makes answering one-word queries trivial and makes it likely that the answers to multiple-word queries are near the start, but merging is much more difficult. It also makes development harder, because a change to the ranking function requires a rebuild of the index. We chose a compromise between these options, keeping two sets of inverted barrels: one set for hit lists that include title or anchor hits, and another set for all hit lists. This way, we check the first set of barrels first, and if there are not enough matches within those barrels, we check the larger set.
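As a small illustration of why docid-ordered doclists merge cheaply for multi-word queries, here is a sketch in Python of intersecting two sorted posting lists in a single linear pass. The plain-list representation is an assumption for the example.

```python
def intersect(doclist_a, doclist_b):
    """Merge two docid-sorted posting lists, returning the docids
    that appear in both (documents containing both query words)."""
    out, i, j = [], 0, 0
    while i < len(doclist_a) and j < len(doclist_b):
        if doclist_a[i] == doclist_b[j]:
            out.append(doclist_a[i])
            i += 1
            j += 1
        elif doclist_a[i] < doclist_b[j]:
            i += 1          # advance the list with the smaller docid
        else:
            j += 1
    return out

print(intersect([2, 5, 9, 14, 30], [5, 9, 10, 30]))  # [5, 9, 30]
```

With a rank-ordered doclist, no such single-pass merge is possible, which is exactly the development cost the compromise above avoids.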
4.3 Crawling the Web

To scale to hundreds of millions of Web pages, Google has a fast distributed crawling system. A single URL server serves lists of URLs to a number of crawlers (we typically ran around three). Both the URL server and the crawlers are implemented in Python. Each crawler typically keeps about 300 connections open at once; only then can pages be fetched fast enough. At peak speeds, the system with four crawlers can crawl over a hundred Web pages per second, a data rate of roughly 600 KB per second. DNS lookup is a major performance bottleneck, so each crawler maintains its own DNS cache and does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to the host, sending the request, and receiving the response. These factors make the crawler a complex component of the system. It uses asynchronous I/O to manage events, and a number of queues to move page fetches from state to state.

4.4 Indexing the Web

Parsing: any parser designed to run on the entire Web must handle a huge array of possible errors, ranging from typos in HTML tags to runs of zero bytes in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and as many other errors as anyone could imagine. For maximum speed, instead of using YACC to generate a context-free grammar parser, we used flex to generate a lexical analyzer outfitted with its own stack. Developing a parser that is this fast and robust took a great deal of work.

Indexing documents into barrels: after each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordid by using an in-memory hash table, the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordids, their occurrences in the current document are translated into hit lists and written into the forward barrels. The main difficulty with parallelizing the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon. Our base lexicon contains 14 million words. That way multiple indexers can run in parallel, and the small log file of extra words can be processed by one final indexer.

Sorting: in order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordid, producing an inverted barrel for title and anchor hits and a separate full-text inverted barrel. Only one barrel is processed at a time, so little temporary storage is needed. We also parallelize the sorting phase: simply by running multiple sorters, we can use as many machines as we have and process different barrels at the same time. Since the barrels do not fit into main memory, the sorter further subdivides them, based on wordid and docid, into baskets that do fit into memory. The sorter then loads each basket into memory, sorts it, and writes its contents into the short inverted barrel and the full-text inverted barrel.

4.5 Searching

The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seem to have made great progress in terms of efficiency. The Google query evaluation process is shown in Figure 4.

4.5.1 The Ranking System

Google maintains much more information about Web documents than typical search engines. Every hit list includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult, and the ranking function is designed so that no single factor can have too much influence. First, consider the simplest case: a single-word query. In order to rank a document for a single-word query, Google looks at that document's hit list for the word. Google considers each hit to be one of several types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list, and then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off, so that beyond a certain point additional occurrences no longer raise the score. The IR score of the document is the dot product of the vector of count-weights with the vector of type-weights. Finally, the IR score is combined with PageRank to give the document its final rank.
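The single-word scoring just described can be sketched as follows in Python. The specific type-weights, the taper function, and the way the IR score is combined with PageRank are illustrative assumptions; the paper does not publish these values.

```python
import math

# Hypothetical type-weights; the real values were never published.
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "small_font": 1.0}

def count_weight(count, cap=5.0):
    """Grows roughly linearly for small counts, then tapers off
    toward `cap`, so extra occurrences stop helping."""
    return cap * (1.0 - math.exp(-count / cap))

def ir_score(hit_types):
    """hit_types: the types of the hits found in this document's
    hit list for the query word."""
    counts = {}
    for t in hit_types:
        counts[t] = counts.get(t, 0) + 1
    # Dot product of count-weights with type-weights.
    return sum(TYPE_WEIGHTS[t] * count_weight(c)
               for t, c in counts.items())

def final_rank(hit_types, pagerank, alpha=0.5):
    # One plausible way to combine IR score and PageRank; the
    # actual combination is not specified in the paper.
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank

print(final_rank(["title", "small_font", "small_font"], pagerank=2.3))
```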
For a multi-word query, the situation is more complicated. Multiple hit lists must be scanned through at once, so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor), but it is classified into ten value "bins", ranging from a phrase match to "not even close" (a small illustrative sketch of this binning appears at the end of this article). Counts are computed not only for every type of hit but for every type and proximity, and every count is converted into a count-weight. We then take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can be displayed with the search results in a special debug mode; these displays have been very helpful in developing the ranking system.

4.5.2 Feedback

The ranking function has many parameters, such as the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. To help with this, the search engine has a user feedback mechanism: a trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then, when we modify the ranking function, we can see the impact of the change on all searches that were previously rated. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.

5. Results and Performance

The quality of the results is quite high; at last check, none of them were broken links. This is largely because the results all have high PageRank (shown as the percentage of red in each result's bar graph).

5.1 Storage Requirements

Aside from search quality, Google is designed to scale cost-effectively as the Web grows, and this includes storage efficiency. Table 1 lists some statistics and Google's storage requirements. Thanks to compression, the total size of the repository is about 53 GB, just over one third of the total size of the Web pages it stores. At current disk prices, this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires about 55 GB, and most queries can be answered using just the short inverted index. With better encoding and compression of the document index, a high-quality Web search engine may fit on a 7 GB drive of a new PC.
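As promised above, here is a minimal sketch in Python of the proximity binning used for multi-word queries: each matched pair of hit positions is assigned to one of ten bins, and counts are accumulated per (type, bin). The bin boundaries are invented for the example; the paper only says there are ten bins ranging from a phrase match to "not even close".

```python
from collections import Counter

# Hypothetical bin edges: distance 1 is a phrase match (bin 0);
# anything beyond the last edge is "not even close" (bin 9).
BIN_EDGES = [1, 2, 3, 5, 8, 13, 21, 34, 55]

def proximity_bin(pos_a, pos_b):
    distance = abs(pos_a - pos_b)
    for b, edge in enumerate(BIN_EDGES):
        if distance <= edge:
            return b
    return 9

def prox_counts(matched_hits):
    """matched_hits: list of (hit_type, pos_word1, pos_word2) for the
    matched-up hits of a two-word query within one document."""
    counts = Counter()
    for hit_type, p1, p2 in matched_hits:
        counts[(hit_type, proximity_bin(p1, p2))] += 1
    return counts

print(prox_counts([("plain", 10, 11), ("plain", 10, 40),
                   ("title", 2, 3)]))
```

These (type, proximity) counts are what the count-weight conversion and the type-prox-weight dot product described above would then consume.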
