How Google Finds Your Needle in the Web's Haystack

As we'll see, the trick is to ask the web itself to rank the importance of pages ...

David Austin
Grand Valley State University
David at merganser.math.gvsu.edu

Imagine a library containing billions of documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone. You may feel sure that one of the documents contained in the collection has a piece of information that is vitally important to you, and, being impatient like most of us, you'd like to find it in a matter of seconds. How would you go about doing it?

Posed in this way, the problem seems impossible. Yet this description is not too different from the World Wide Web, a huge, highly disorganized collection of documents in many different formats. Of course, we're all familiar with search engines (perhaps you found this article using one), so we know that there is a solution. This article will describe Google's PageRank algorithm and how it returns pages from the web's collection of billions of documents that match search criteria so well that "google" has become a widely used verb.

Most search engines, including Google, continually run an army of computer programs that retrieve pages from the web, index the words in each document, and store this information in an efficient format. Each time a user asks for a web search using a search phrase, such as "search engine," the search engine determines all the pages on the web that contain the words in the search phrase. (Perhaps additional information, such as the distance between the words "search" and "engine," will be noted as well.) Here is the problem: Google now claims to index billions of pages, and roughly 95% of the text in web pages is composed from a comparatively small vocabulary. This means that, for most searches, there will be a huge number of pages containing the words in the search phrase. What is needed is a means of ranking the importance of the pages that fit the search criteria so that the pages can be sorted with the most important pages at the top of the list.

One way to determine the importance of pages is to use a human-generated ranking. For instance, you may have seen pages that consist mainly of a large number of links to other resources in a particular area of interest. Assuming the person maintaining this page is reliable, the pages referenced are likely to be useful. Of course, the list may quickly fall out of date, and the person maintaining the list may miss some important pages, either unintentionally or as a result of an unstated bias.

Google's PageRank algorithm assesses the importance of web pages without human evaluation of the content. In fact, Google feels that the value of its service is largely in its ability to provide unbiased results to search queries; Google claims, "the heart of our software is PageRank." As we'll see, the trick is to ask the web itself to rank the importance of pages.

How to tell who's important

If you've ever created a web page, you've probably included links to other pages that contain valuable, reliable information. By doing so, you are affirming the importance of the pages you link to. Google's PageRank algorithm stages a monthly popularity contest among all pages on the web to decide which pages are most important. The fundamental idea put forth by PageRank's creators, Sergey Brin and Lawrence Page, is this: the importance of a page is judged by the number of pages linking to it as well as their importance.

We will assign to each web page $P$ a measure of its importance $I(P)$, called the page's PageRank. At various sites, you may find an approximation of a page's PageRank. (For instance, the home page of the American Mathematical Society currently has a PageRank of 8 on a scale of 10. Can you find any pages with a PageRank of 10?) This reported value is only an approximation, since Google declines to publish actual PageRanks in an effort to frustrate those who would manipulate the rankings.

Here's how the PageRank is determined. Suppose that page $P_j$ has $l_j$ links. If one of those links is to page $P_i$, then $P_j$ will pass on $1/l_j$ of its importance to $P_i$. The importance ranking of $P_i$ is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to $P_i$ by $B_i$, then

$$ I(P_i) = \sum_{P_j \in B_i} \frac{I(P_j)}{l_j}. $$

This may remind you of the chicken and the egg: to determine the importance of a page, we first need to know the importance of all the pages linking to it. However, we may recast the problem into one that is more mathematically familiar.

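Let's first create a matrix, called the hyperlink matrix $H = [H_{ij}]$, in which the entry in the $i$th row and $j$th column is

$$ H_{ij} = \begin{cases} 1/l_j & \text{if } P_j \in B_i, \\ 0 & \text{otherwise.} \end{cases} $$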

Notice that H has some special properties. First, its entries are all nonnegative. Also, the sum of the entries in a column is one unless the page corresponding to that column has no links. Matrices in which all the entries are nonnegative and the sum of the entries in every column is one are called stochastic; they will play an important role in our story.
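To make the definition concrete, here is a minimal Python sketch that builds a hyperlink matrix from link lists and checks the column sums. The three-page web and the names `hyperlink_matrix` and `column_sums` are hypothetical illustrations, not taken from the article; exact fractions are used so that the sums can be compared with one exactly.

```python
from fractions import Fraction


def hyperlink_matrix(links, n):
    """Build H as a list of rows: H[i][j] = 1/l_j if page j+1 links to page i+1, else 0."""
    H = [[Fraction(0) for _ in range(n)] for _ in range(n)]
    for j, targets in links.items():
        for i in targets:
            H[i - 1][j - 1] = Fraction(1, len(targets))
    return H


def column_sums(H):
    """Return the sum of the entries in each column of H."""
    n = len(H)
    return [sum(H[i][j] for i in range(n)) for j in range(n)]


# Hypothetical three-page web (an assumption for illustration):
# each key is a page, each value lists the pages it links to.
links = {1: [2, 3], 2: [3], 3: [1]}
H = hyperlink_matrix(links, 3)
sums = column_sums(H)

# The same web, except that page 3 now has no links at all.
links_no_out = {1: [2, 3], 2: [3], 3: []}
sums_no_out = column_sums(hyperlink_matrix(links_no_out, 3))

print(sums)         # every column sums to one
print(sums_no_out)  # the column of the page without links sums to zero
```

In the first web every column sums to one, so that matrix is stochastic; in the second, the column belonging to the page without links sums to zero, which is exactly the exception noted above.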

We will also form a vector $I = [I(P_i)]$ whose components are the PageRanks (that is, the importance rankings) of all the pages. The condition above defining the PageRank may be expressed as

$$ I = HI. $$

In other words, the vector I is an eigenvector of the matrix H with eigenvalue 1. We also call this a stationary vector of H.

Let's look at an example. Shown below is a representation of a small collection (eight) of web pages with links represented by arrows.

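The corresponding matrix is

$$ H = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 1/3 & 0 \\
1/2 & 0 & 1/2 & 1/3 & 0 & 0 & 0 & 0 \\
1/2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/2 & 1/3 & 0 & 0 & 1/3 & 0 \\
0 & 0 & 0 & 1/3 & 1/3 & 0 & 0 & 1/2 \\
0 & 0 & 0 & 0 & 1/3 & 0 & 0 & 1/2 \\
0 & 0 & 0 & 0 & 1/3 & 1 & 1/3 & 0
\end{bmatrix} $$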

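with stationary vector

$$ I = \begin{bmatrix} 0.0600 \\ 0.0675 \\ 0.0300 \\ 0.0675 \\ 0.0975 \\ 0.2025 \\ 0.1800 \\ 0.2950 \end{bmatrix}. $$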
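The condition I = HI can be checked directly for this example. The sketch below is only an illustration under stated assumptions: the figure is not reproduced here, so the link lists in `links` are an assumption chosen to agree with the iterates tabulated later in this article, and the vector is the one that table settles on. Exact fractions avoid any rounding questions.

```python
from fractions import Fraction

# Assumed link lists for the eight-page example (page -> pages it links to).
# The original figure is missing, so these are reconstructed to be consistent
# with the table of iterates in this article.
links = {
    1: [2, 3], 2: [4], 3: [2, 5], 4: [2, 5, 6],
    5: [6, 7, 8], 6: [8], 7: [1, 5, 8], 8: [6, 7],
}

n = 8
H = [[Fraction(0) for _ in range(n)] for _ in range(n)]
for j, targets in links.items():
    for i in targets:
        H[i - 1][j - 1] = Fraction(1, len(targets))

# Candidate stationary vector, entered as exact decimals.
I = [Fraction(s) for s in
     ["0.06", "0.0675", "0.03", "0.0675", "0.0975", "0.2025", "0.18", "0.295"]]

# Multiply H by I and compare with I entry by entry.
HI = [sum(H[i][j] * I[j] for j in range(n)) for i in range(n)]

is_stationary = (HI == I)
total = sum(I)
most_important = max(range(n), key=lambda i: I[i]) + 1  # 1-based page number

print(is_stationary, total, most_important)
```

Under these assumptions the product HI reproduces I exactly, the entries sum to one, and the largest entry belongs to page 8.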

This shows that page 8 wins the popularity contest. Here is the same figure with the web pages shaded in such a way that the pages with higher PageRanks are lighter.

Computing I

There are many ways to find the eigenvectors of a square matrix. However, we are in for a special challenge since the matrix H is a square matrix with one column for each web page indexed by Google. This means that H has billions of columns and rows. However, most of the entries in H are zero; in fact, studies show that web pages have an average of about ten links, meaning that, on average, all but about ten entries in every column are zero. We will choose a method known as the power method for finding the stationary vector I of the matrix H.
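Because almost every entry of H is zero, a product with H never needs the full square array. The following sketch shows one way to exploit this, storing only the link lists; the function name `multiply_by_H` and the three-page web are hypothetical, introduced here purely for illustration.

```python
def multiply_by_H(links, I):
    """Return the product H I using only the link lists.

    The work is proportional to the total number of links, and no n-by-n
    array is ever stored.
    """
    result = [0.0] * len(I)
    for j, targets in links.items():
        if not targets:
            continue  # a page with no links passes nothing on
        share = I[j - 1] / len(targets)  # each link carries 1/l_j of I(P_j)
        for i in targets:
            result[i - 1] += share
    return result


# Hypothetical three-page web (an assumption for illustration).
links = {1: [2, 3], 2: [3], 3: [1]}
step1 = multiply_by_H(links, [1.0, 0.0, 0.0])
step2 = multiply_by_H(links, step1)
print(step1)
print(step2)
```

With about ten links per page on average, one such product costs on the order of ten additions per page rather than one per entry of H.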

How does the power method work? We begin by choosing a vector $I^0$ as a candidate for I and then producing a sequence of vectors $I^k$ by

$$ I^{k+1} = H I^k. $$

The method is founded on the following general principle that we'll soon investigate.

General principle: The sequence $I^k$ will converge to the stationary vector I.

We'll illustrate with the example above.

I^0   I^1   I^2    I^3      I^4      ...   I^60     I^61
1     0     0      0        0.0278   ...   0.06     0.06
0     0.5   0.25   0.1667   0.0833   ...   0.0675   0.0675
0     0.5   0      0        0        ...   0.03     0.03
0     0     0.5    0.25     0.1667   ...   0.0675   0.0675
0     0     0.25   0.1667   0.1111   ...   0.0975   0.0975
0     0     0      0.25     0.1806   ...   0.2025   0.2025
0     0     0      0.0833   0.0972   ...   0.18     0.18
0     0     0      0.0833   0.3333   ...   0.295    0.295
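The iterates in the table can be reproduced with a short script. This is a sketch rather than a definitive implementation: the link lists are an assumption (the figure is not available here) chosen to match the tabulated values, `power_method` is a hypothetical helper name, and ordinary floating-point arithmetic is used, so results agree with the table only up to rounding.

```python
def power_method(H, I0, steps):
    """Return the iterates I^0, I^1, ..., I^steps, where I^(k+1) = H I^k."""
    n = len(I0)
    iterates = [list(I0)]
    for _ in range(steps):
        prev = iterates[-1]
        iterates.append(
            [sum(H[i][j] * prev[j] for j in range(n)) for i in range(n)]
        )
    return iterates


# Assumed link lists for the eight-page example (page -> pages it links to).
links = {
    1: [2, 3], 2: [4], 3: [2, 5], 4: [2, 5, 6],
    5: [6, 7, 8], 6: [8], 7: [1, 5, 8], 8: [6, 7],
}

n = 8
H = [[0.0] * n for _ in range(n)]
for j, targets in links.items():
    for i in targets:
        H[i - 1][j - 1] = 1.0 / len(targets)

# Start with all the importance on page 1, as in the first column of the table.
I0 = [1.0] + [0.0] * (n - 1)
iterates = power_method(H, I0, 61)

for k in (1, 2, 3, 4, 60, 61):
    print(k, [round(x, 4) for x in iterates[k]])
```

Printing a few intermediate steps as well as the last two makes it easy to compare against the columns of the table and to see how little the vector changes between steps 60 and 61.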

It is natural to ask what these numbers mean. Of course, there can be no absolute measure of a page's importance, only relative measures for comparing the importance of two pages through statements such as "Page A is twice as important as Page B." For this reason, we may multiply all the importance rankings by some fixed quantity without affecting the information they tell us. In this way, we will always assume, for reasons to be explained shortly, that the sum of all the popularities is one.

Three important questions

Three questions naturally come to mind: Does the sequence $I^k$ always converge? Is the vector to which it converges independent of the initial vector $I^0$? Do the importance rankings contain the information that we want?

Given the current method, the answer to all three questions is "No!" However, we'll see how to modify our method so that we can answer "yes" to all three.

Let's first look at a very simple example. Consider the following small web consisting of two web pages, one of which links to the other:

with matrix

$$ H = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}. $$

Here is one way in which our algorithm could proceed:

I^0   I^1   I^2   I^3 = I
1     0     0     0
0     1     0     0
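The table above can be reproduced in a few lines. In this sketch the matrix is read off from the description of the two-page web (page 1 links to page 2, and page 2 links nowhere); the variable names are illustrative. Tracking the sum of the entries at each step shows what happens to the total importance.

```python
# Two-page web: page 1 links to page 2, and page 2 has no links,
# so the second column of H is entirely zero.
H = [[0.0, 0.0],
     [1.0, 0.0]]

I = [1.0, 0.0]  # start with all the importance on page 1
history = [I]
for _ in range(3):
    I = [sum(H[i][j] * I[j] for j in range(2)) for i in range(2)]
    history.append(I)

# Total importance present in the web at each step.
totals = [sum(v) for v in history]

print(history)
print(totals)
```

The total stays at one for a single step and then drops to zero, so the importance has left the web entirely rather than settling on either page.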

In this case, the importance rating of both pages is zero, which tells us nothing about the relative importance of these pages. The problem is that $P_2$ has no links. Consequently, it takes some of the importance from page $P_1$ in each iterative step but does not pass it on to any other page. This has the effect of draining all the importance from the web. Pages with no links are called dangling nodes, and there are, of course, many of them in the real web we want to study. We'll see how to deal with them in a minute, but first let's consider a new way of thinking about the matrix H and stationary vector I.

A probabilistic interpretation of H

Imagine that we surf the web at random; that is, when we find ourselves on a web page, we randomly follow one of its links to another page after one second. For instance, if we are on page $P_j$ with $l_j$ links, one of which takes us to page $P_i$, the probability that we next end up on page $P_i$ is then $1/l_j$.

As we surf randomly, we'll denote by
