How Google ranks search results, explained by a Google engineer

Source: Internet
Author: User

Matt Cutts is a software engineer in Google's quality group. His job is to evaluate websites and develop techniques that keep fake or junk websites out of Google's search results.

One of the questions librarians ask most often is: "Which results should appear at the top of the search list, and how does Google choose them?" Here, quality engineer Matt Cutts gives a quick primer on how Google crawls and indexes the web and how it ranks search results. Matt also offers school librarians advice on how to coach students.

Crawling and Indexing

Before you see a page of Google search results, many things have to happen. The first is crawling and indexing the billions of pages on the World Wide Web. This is done by Googlebot, which connects to web servers around the world to fetch documents. The crawler does not actually roam the Internet. Instead, it asks a web server to return a specific page, scans that page for hyperlinks that lead to further pages, and assigns each page a number. Crawling collects a large number of documents, but they cannot yet be searched directly.
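The crawl loop described above can be sketched in a few lines of Python. This is only an illustration, not Googlebot's actual code: the names `LinkExtractor`, `crawl`, and `fetch` are hypothetical, and the "web" here is an in-memory dictionary so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(start_url, fetch, max_pages=100):
    """Fetch pages breadth-first and number them in the order retrieved.

    `fetch(url)` is supplied by the caller (an assumption of this sketch);
    it returns the page's HTML, or None if the page cannot be retrieved.
    """
    doc_ids = {}   # URL -> document number
    pages = {}     # document number -> HTML
    queue = deque([start_url])
    while queue and len(doc_ids) < max_pages:
        url = queue.popleft()
        if url in doc_ids:
            continue                      # already fetched
        html = fetch(url)
        if html is None:
            continue                      # page not available
        doc_id = len(doc_ids) + 1
        doc_ids[url] = doc_id
        pages[doc_id] = html
        extractor = LinkExtractor(url)
        extractor.feed(html)              # scan the page for hyperlinks
        queue.extend(extractor.links)
    return doc_ids, pages


# A tiny made-up web, used in place of real HTTP requests.
fake_web = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": "no links here",
}
doc_ids, pages = crawl("http://example.com/", fake_web.get)
print(doc_ids)
```

A real crawler would also need politeness delays, robots.txt handling, and error recovery, none of which are shown here.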

Without an index, Google's servers would have to read the content of every document each time you searched for something such as "Civil War". The second step is therefore to build an index, which requires "inverting" the crawled data. Instead of scanning every word in every document at query time, we rearrange the data so that, for each word, it lists all the documents that contain it. For example, suppose the word "civil" appears in documents 3, 8, 22, 56, 68, and 92, and the word "war" appears in documents 2, 8, 15, 22, 68, and 77.
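As a minimal sketch of that inversion, the following Python builds a word-to-documents mapping from a few made-up documents. The function name `build_index`, the sample texts, and the simple lowercase tokenizer are all assumptions for illustration; a production index stores far more than document numbers.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to the sorted list of document numbers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Lowercase and split on anything that is not a letter or digit.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Made-up sample documents, keyed by document number.
documents = {
    2: "The war lasted four years",
    3: "Civil engineering basics",
    8: "The American Civil War",
}
index = build_index(documents)
print(index["civil"])  # [3, 8]
print(index["war"])    # [2, 8]
```

With the index in place, answering a query is a dictionary lookup rather than a scan over every document.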

Once the index is built, documents can be ranked and their relevance determined. For example, when someone searches Google for "civil war", two things are needed to produce the results: first, find the pages that contain the user's query words; second, order the matching pages by relevance. Google has developed an interesting technique to speed up the first step: instead of storing the whole index on one computer, it spreads the index across hundreds of computers. Because the work is divided among many machines, query results come back faster.

To picture this more vividly, imagine the index of a book that runs to 30 pages. If one person has to look something up across all of those index pages, each lookup takes at least a few seconds. But what if you hand each page of the index to a different person? Thirty people, each searching a different part of the index, finish much faster than one person working alone. In the same way, Google distributes its data across many computers so that documents can be found more quickly.

How do we find the pages that contain the user's query? Let's return to the "civil war" example above. The word "civil" is in documents 3, 8, 22, 56, 68, and 92, and the word "war" is in documents 2, 8, 15, 22, 68, and 77. Writing the two lists one above the other, we can pick out the documents that contain both words: 8, 22, and 68, as the following table shows.
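The "thirty people, thirty pages" idea can be sketched as follows. This is a toy model under stated assumptions: each shard is a small dictionary holding the index for a subset of documents, threads stand in for separate machines, and the names `search_shard` and `search_all` are hypothetical. How Google actually partitions its index is not described in the text.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, words):
    """Return the documents in one shard that contain every query word."""
    lists = [set(shard.get(word, ())) for word in words]
    return set.intersection(*lists) if lists else set()


def search_all(shards, words):
    """Query every shard at the same time and merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial = list(pool.map(lambda shard: search_shard(shard, words), shards))
    return sorted(set().union(*partial))


# Two shards, each holding the index for a different subset of documents.
shards = [
    {"civil": [3, 8, 22], "war": [2, 8, 15, 22]},
    {"civil": [56, 68, 92], "war": [68, 77]},
]
print(search_all(shards, ["civil", "war"]))  # [8, 22, 68]
```

Each shard only has to examine its own slice of the index, so the total time is governed by the slowest shard rather than by the size of the whole index.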

Word                 Document numbers
Civil                3   8  22  56  68  92
War                  2   8  15  22  68  77
Both words appear    8  22  68

The list of documents that contain a given word is called a "document ID list" (often known as a posting list). Finding the documents that contain both words is called "intersecting the document ID lists".
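One common way to intersect two sorted lists is to walk both with a pair of cursors, which is sketched below using the "civil" and "war" lists from the example. The function name `intersect` is hypothetical, and the text does not say which intersection method Google uses.

```python
def intersect(a, b):
    """Return the values present in both sorted lists, in order."""
    i = j = 0
    both = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            both.append(a[i])   # document contains both words
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1              # advance the list that is behind
        else:
            j += 1
    return both


civil = [3, 8, 22, 56, 68, 92]
war = [2, 8, 15, 22, 68, 77]
print(intersect(civil, war))  # [8, 22, 68]
```

Because both lists are sorted, each entry is visited at most once, so the cost grows with the combined length of the two lists.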

Ranking Search Results

Once we have the pages that contain the user's query, they must be ordered by relevance. Google uses many techniques for this, of which the PageRank algorithm is the best known. PageRank looks at how many links point to a page and at the standing of the sites that provide those links. Under PageRank, a few links from sites such as CNN and the New York Times can be worth more than twice as many links from less well-known sites.
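The idea that a link's value depends on the rank of the page it comes from can be illustrated with a simplified PageRank computed by repeated iteration. This is a textbook-style sketch, not Google's production algorithm: the damping factor of 0.85, the handling of pages without outgoing links, and the tiny made-up link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to its targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Made-up graph: "news" is linked to by five pages and links to "a";
# "b" is linked to by two pages that nobody links to.
links = {
    "p1": ["news"], "p2": ["news"], "p3": ["news"],
    "p4": ["news"], "p5": ["news"],
    "news": ["a"],
    "q1": ["b"], "q2": ["b"],
}
rank = pagerank(links)
print(rank["a"] > rank["b"])  # True
```

In this graph, page "a" has a single incoming link and page "b" has two, yet "a" ends up ranked higher because its one link comes from a page that is itself heavily linked.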

In addition to PageRank, Google uses many other signals. For example, a document in which the words "civil" and "war" appear close together is much more likely to be relevant than one that uses "war" only in the phrase "Revolutionary War". A page titled "Civil War" is also more likely to be relevant than a page titled "19th Century American Clothing". Similarly, a page on which "civil war" appears several times is more likely to be relevant than one on which it appears only once.

Google aims to find pages that are both popular and relevant. If two pages match the query about equally well, we usually choose the one linked to by better-known websites. However, if other signals indicate that a page is more relevant, a page with fewer links or a lower rank may be placed higher. For example, a page devoted entirely to the Civil War is more useful than a page that only mentions the Civil War in passing, even if the dedicated page is on a website that is not very well known. Once we have the list of documents and their scores, we select the highest-scoring, best-matching documents.

Google extracts a few words around the query terms from each matching document as a snippet, and then displays the URLs and snippets on the results page. As you can see, running a search engine requires a lot of computing resources. Each search requires more than 500 computers working together, and the search takes less than half a second.
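The three signals just mentioned (how close the words are, whether they appear in the title, and how often the phrase occurs) can be computed with a short sketch like the one below. The function names and the sample pages are hypothetical, and the text does not say how Google measures or weighs these signals.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_distance(tokens, first, second):
    """Smallest gap in word positions between the two words, or None."""
    best = None
    last_first = last_second = None
    for pos, token in enumerate(tokens):
        if token == first:
            last_first = pos
        if token == second:
            last_second = pos
        if last_first is not None and last_second is not None:
            gap = abs(last_first - last_second)
            if gap and (best is None or gap < best):
                best = gap
    return best


def relevance_signals(title, body, first, second):
    """Collect proximity, title, and frequency signals for a two-word query."""
    body_tokens = tokenize(body)
    title_tokens = tokenize(title)
    phrase_count = sum(
        1 for a, b in zip(body_tokens, body_tokens[1:]) if (a, b) == (first, second)
    )
    return {
        "min_distance": min_distance(body_tokens, first, second),
        "in_title": first in title_tokens and second in title_tokens,
        "phrase_count": phrase_count,
    }


page_a = relevance_signals(
    "The Civil War",
    "The Civil War began in 1861. The civil war ended in 1865.",
    "civil", "war",
)
page_b = relevance_signals(
    "19th Century American Clothing",
    "Civil servants wore plain coats long before the Revolutionary War.",
    "civil", "war",
)
print(page_a)  # {'min_distance': 1, 'in_title': True, 'phrase_count': 2}
print(page_b)  # {'min_distance': 9, 'in_title': False, 'phrase_count': 0}
```

Both pages contain "civil" and "war", so both survive the intersection step, but every signal points to the first page as the better match.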
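The trade-off between popularity and relevance can be illustrated with a weighted sum. Everything in this sketch is an assumption made for illustration: the 0-to-1 scores, the 0.7 and 0.3 weights, the page names, and the function `rank_pages` are invented, and the text does not say how Google combines its signals.

```python
def rank_pages(pages, relevance_weight=0.7, popularity_weight=0.3):
    """Order pages by a weighted sum of relevance and popularity.

    `pages` is a list of (name, relevance, popularity) tuples with scores
    between 0 and 1. The weights are illustrative only.
    """
    def score(page):
        _, relevance, popularity = page
        return relevance_weight * relevance + popularity_weight * popularity

    return [name for name, _, _ in sorted(pages, key=score, reverse=True)]


# Equal relevance: the page on the better-known site comes first.
tie = [
    ("small-site/civil-war", 0.5, 0.4),
    ("famous-site/civil-war", 0.5, 0.8),
]
print(rank_pages(tie))
# ['famous-site/civil-war', 'small-site/civil-war']

# A dedicated page on a small site beats a passing mention on a famous one.
mixed = [
    ("famous-site/passing-mention", 0.2, 0.9),
    ("small-site/dedicated-page", 0.9, 0.3),
]
print(rank_pages(mixed))
# ['small-site/dedicated-page', 'famous-site/passing-mention']
```

With these weights, popularity breaks ties between equally relevant pages, while a large enough difference in relevance outweighs it.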
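The snippet step can be sketched as taking a window of words around the first query word found in the document. The function name `make_snippet`, the window size, and the sample sentence are hypothetical; real snippet selection is considerably more involved.

```python
def make_snippet(text, query_words, window=5):
    """Return a few words on either side of the first query word found."""
    words = text.split()
    wanted = {word.lower() for word in query_words}
    for pos, word in enumerate(words):
        # Ignore surrounding punctuation when comparing.
        if word.strip(".,;:!?\"'()").lower() in wanted:
            start = max(0, pos - window)
            end = min(len(words), pos + window + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return ""   # no query word appears in the text


text = ("Many historians agree that the American Civil War "
        "reshaped the country for generations to come.")
print(make_snippet(text, ["civil", "war"], window=3))
# ... that the American Civil War reshaped the ...
```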
