Big Data Learning Note 2 • Big Data Research in Internet Search


Design of large-scale Web search: the logical structure of a large-scale search engine


This part is based on the 1998 paper by Google's two founders, Brin and Page [1].

    1. Crawler: crawls the Internet and fetches document content
    2. Indexer: reads the fetched content and records which words appear in which documents; the resulting structure is called the index
    3. Searcher: makes keyword queries possible and ranks the results of a query
    4. Google's distinguishing features are that it describes a target document using the anchor text of the links pointing to it, and that it ranks the importance of documents using the links between them, which is PageRank (a minimal sketch follows this list).
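
The PageRank idea is easy to sketch. Below is a minimal power-iteration version in Python; the toy graph and iteration count are illustrative, and the normalized (1 - d)/n teleport term is one common formulation (the paper suggests a damping factor d of about 0.85):

```python
# Minimal PageRank sketch via power iteration.
# `graph` maps each page to the list of pages it links to.
def pagerank(graph, d=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}                  # uniform start
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}    # teleport term
        for page, outlinks in graph.items():
            if outlinks:
                share = d * rank[page] / len(outlinks)  # split rank over outlinks
                for target in outlinks:
                    new_rank[target] += share
            else:                                       # dangling page: spread evenly
                for p in pages:
                    new_rank[p] += d * rank[page] / n
        rank = new_rank
    return rank

# Toy graph: A links to B, B links to A and C, C links to A.
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))
```
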
Key data structures for Google search

    • Large files are designed as virtual files that can span multiple file systems
    • Each stored page record has three parts:

      1. Sync: a synchronization marker that signals the beginning of a page record
      2. Length: the byte length of the page record
      3. Compressed packet: the compressed page data, containing the document ID (docID), the encoding information (ecode), the URL length, the page length, the URL, and the page content
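
As a rough illustration, such a record could be packed and unpacked as follows; the field widths, byte order, and the sync value here are assumptions for the sketch, not the actual on-disk format (the paper does mention zlib compression):

```python
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef"  # hypothetical sync marker; the real value is not specified

def pack_record(docid, url, page, ecode=0):
    url_b, page_b = url.encode(), page.encode()
    # packet: docID, ecode, URL length, page length, then URL and page bytes
    packet = struct.pack("<QBHI", docid, ecode, len(url_b), len(page_b)) + url_b + page_b
    compressed = zlib.compress(packet)          # zlib trades ratio for speed
    return SYNC + struct.pack("<I", len(compressed)) + compressed

def unpack_record(buf):
    assert buf[:4] == SYNC                      # resynchronization point
    (length,) = struct.unpack_from("<I", buf, 4)
    packet = zlib.decompress(buf[8:8 + length])
    docid, ecode, urllen, pagelen = struct.unpack_from("<QBHI", packet)
    off = struct.calcsize("<QBHI")
    url = packet[off:off + urllen].decode()
    page = packet[off + urllen:off + urllen + pagelen].decode()
    return docid, url, page

print(unpack_record(pack_record(42, "http://example.com/", "<html>hello</html>")))
```
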
Index

    1. A document index, sorted by docID
    2. A URL-to-docID mapping, sorted by URL
    3. Lexicon: a dictionary serving as a lookup table; it is kept in memory and consists of the word list and a hash table of pointers

    • Forward index: stored in barrels, it records the document ID (docID), the word IDs (wordIDs), and the number of times each word occurs in the document; the list of all the hits appears at the end of each entry.
    • Inverted index: it contains information similar to the forward index, but arranged differently: entries are sorted by word. Each wordID is followed by the number of documents that contain the word, and then by pointers to those documents. For each document, the inverted index records how many hits occur in that document, again with the hit list at the end. A toy example of the inversion step follows.
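
A toy in-memory version of the inversion step, with made-up docIDs and wordIDs (real barrels are disk-resident and partitioned by wordID range):

```python
from collections import defaultdict

# Forward index: docID -> {wordID: number of hits in that document}
forward = {
    1: {"w1": 3, "w2": 1},
    2: {"w2": 2, "w3": 5},
}

# Invert it: wordID -> list of (docID, hit count) postings.
inverted = defaultdict(list)
for docid in sorted(forward):                   # visit docIDs in order...
    for wordid, hits in forward[docid].items():
        inverted[wordid].append((docid, hits))  # ...so each posting list stays sorted

for wordid in sorted(inverted):
    # docID-sorted posting lists make multi-term intersection a linear merge
    print(wordid, inverted[wordid])
```
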
Hit

Each occurrence of a keyword in a page is called a hit. Google stores the type and position of every hit.
Hits are divided into fancy hits and plain hits: a fancy hit is a keyword occurrence in the title, the URL, the metadata, or anchor text; everything else is a plain hit.
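
The paper packs each plain hit into two bytes: one capitalization bit, three font-size bits, and twelve position bits. A sketch of that packing (the exact bit order here is an assumption for illustration):

```python
# Pack a plain hit into 16 bits: 1 capitalization bit, 3 font-size bits,
# 12 position bits. Fancy hits use a different layout (not shown).
def pack_plain_hit(capitalized, font_size, position):
    assert 0 <= font_size < 7 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

h = pack_plain_hit(True, 3, 120)
print(hex(h), unpack_plain_hit(h))   # 0xb078 (True, 3, 120)
```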

Crawler


Crawling the Web is a complex task; techniques such as distributed crawling and DNS caching are needed to make it efficient.
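
A minimal sketch of both ideas in Python: a memoized DNS lookup stands in for a DNS cache, and a thread pool stands in for distributed crawling (a real crawler would wire the cache into its resolver and spread fetches across machines; the seed list is made up):

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from urllib.parse import urlparse
from urllib.request import urlopen

@lru_cache(maxsize=10_000)
def resolve(host):
    # DNS cache: each hostname is resolved at most once per process
    return socket.gethostbyname(host)

def fetch(url):
    resolve(urlparse(url).hostname)        # warm the DNS cache for this host
    with urlopen(url, timeout=10) as resp:
        return url, resp.read()

seeds = ["http://example.com/"]            # hypothetical seed URLs
with ThreadPoolExecutor(max_workers=8) as pool:  # parallel fetching
    for url, body in pool.map(fetch, seeds):
        print(url, len(body), "bytes")
```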

Search


Processing steps for a search request:

    1. Parse the query request
    2. Convert each word into a wordID
    3. Seek to the beginning of each word's document list in the barrels and retrieve the list of all documents containing the word
    4. Compute a rank for every document that matches the query
    5. Return the search results with the highest ranking scores (a toy end-to-end sketch follows this list)
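
A toy end-to-end version of these steps, with a two-word lexicon and hand-written posting lists (real scoring combines hit types and PageRank, which this sketch reduces to a raw hit count):

```python
lexicon = {"big": 0, "data": 1}          # word -> wordID
postings = {                             # wordID -> [(docID, hits)], sorted by docID
    0: [(1, 3), (4, 1), (7, 2)],
    1: [(4, 2), (7, 5), (9, 1)],
}

def search(query, k=10):
    wordids = [lexicon[w] for w in query.lower().split() if w in lexicon]
    if not wordids:
        return []
    # Step 3: documents containing every query word
    docs = set.intersection(*(set(d for d, _ in postings[w]) for w in wordids))
    # Step 4: naive score = total hit count across query words
    hits = {w: dict(postings[w]) for w in wordids}
    scored = [(sum(hits[w][d] for w in wordids), d) for d in docs]
    # Step 5: highest-scoring results first
    return sorted(scored, reverse=True)[:k]

print(search("big data"))   # [(7, 7), (3, 4)] as (score, docID)
```
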
How does a search engine process thousands of query requests per second?

In practice, a commercial search engine consists of many clusters [2]. Each cluster is a full-scale search engine that stores all Web pages and can handle any query request. When a user issues a query, a DNS-based load-balancing system assigns it to one cluster, taking into account both the user's physical distance to the cluster and the cluster's available capacity.

These clusters are distributed around the world and may be located in different cities and countries. Each query request is sent as a single HTTP request to exactly one cluster. The capacity arithmetic is straightforward: if the target is 4,000 query requests per second and there are 10 clusters, each cluster needs to handle about 400 query requests per second.

Let's look at what happens inside a cluster when a query request arrives.

    • First, a hardware-based load balancer assigns the query request to one of the cluster's Web servers.

    • Then, each Web server has a search-engine cache. If the query has been searched before and its results are cached, the cache returns them immediately.

Therefore, if the goal is 400 query requests per second and 80% of queries are answered from the cache, only 80 query requests per second actually reach the index servers.
In addition, the index servers are replicated; assuming each index partition has 3 replicas, each index server only needs to process about 27 query requests per second (80 / 3).
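
A minimal LRU sketch of such a result cache (the capacity is an illustrative number):

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache mapping a query string to its result list."""
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query not in self.entries:
            return None                           # miss: fall through to index servers
        self.entries.move_to_end(query)           # mark as recently used
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict least recently used

cache = QueryCache()
cache.put("big data", [7, 4])
print(cache.get("big data"))   # hit: served without touching the index
```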

Google's query-serving architecture [2] works as follows.

    • When a query request arrives, Google's Web server sends it to a set of index servers and obtains a list of search results from them. If access to document information is required, the request is passed on to the document servers, which return the required documents.
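
A sketch of that scatter-gather step, assuming a hypothetical layout where the inverted index is partitioned into shards and each shard scores its own documents (the shard contents and scores are made up):

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = {                       # shard -> [(score, docID)] this shard would return
    0: [(7.0, 101), (3.0, 104)],
    1: [(9.0, 202)],
    2: [],
    3: [(5.0, 401), (1.0, 403)],
}

def query_shard(shard_id, query):
    # Stand-in for an RPC to one replica of an index shard.
    return SHARDS[shard_id]

def scatter_gather(query, k=10):
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:  # scatter
        partials = pool.map(lambda s: query_shard(s, query), SHARDS)
    merged = [hit for part in partials for hit in part]        # gather
    return sorted(merged, reverse=True)[:k]                    # global top k

print(scatter_gather("big data"))  # doc servers would then supply titles and snippets
```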

The search engine also has a module called the spell checker, because query terms often contain spelling errors; if these errors can be corrected, the search results improve. Inside the index server sits the inverted index introduced in the first part.

For example, when a query request containing the two terms T1 and T2 arrives, both terms are sent to the index server. The index server fetches the posting list (inverted list) for T1 and the posting list for T2 from the inverted index. A merge module then intersects the two posting lists and computes a relevance score for each document.

    • Finally, the search engine returns the document IDs of the K documents with the highest scores.
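
The merge itself can be a linear scan over the two docID-sorted posting lists; a sketch, with hit counts standing in for a real relevance score:

```python
def intersect(p1, p2):
    """Linear merge of two posting lists sorted by docID."""
    out, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        d1, d2 = p1[i][0], p2[j][0]
        if d1 == d2:
            out.append((d1, p1[i][1] + p2[j][1]))  # toy combined score
            i += 1
            j += 1
        elif d1 < d2:
            i += 1
        else:
            j += 1
    return out

t1 = [(1, 3), (4, 1), (7, 2)]   # posting list for T1
t2 = [(4, 2), (7, 5), (9, 1)]   # posting list for T2
print(intersect(t1, t2))        # [(4, 3), (7, 7)]
```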

Now the goal is for each index server to handle about 27 query requests per second, which is not a very large number, and there are further ways to optimize performance. For example, a dynamic pruning algorithm can compute the K highest-scoring documents without ranking every document that contains T1 and T2; only the top K are needed, and K is typically 10 (one page of results). When the posting list for each query term is loaded, the engine computes the intersection of the lists, scores each document in the intersection, and sorts those documents by relevance score.
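
A bounded min-heap is the simplest form of this pruning: a document is dropped as soon as it cannot beat the current K-th best, so the full result set is never sorted. (Production engines prune earlier still, skipping documents whose score bound is already too low; this sketch only shows the heap part.)

```python
import heapq

def top_k(scored_docs, k=10):
    heap = []                                # min-heap of the best k seen so far
    for score, doc in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:             # beats the current k-th best
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, reverse=True)        # only k items ever get sorted

docs = [(0.3, 1), (0.9, 2), (0.1, 3), (0.7, 4)]
print(top_k(docs, k=2))   # [(0.9, 2), (0.7, 4)]
```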

References:
[1] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the Seventh International World Wide Web Conference, 1998.
[2] L. A. Barroso, J. Dean, and U. Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
[3] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. SOSP '03, 2003.
[4] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI '04, 2004.

Explore multiple dimensions of a search

Todo
