Mining of Massive Datasets: Link Analysis


5.1 PageRank

5.1.1 Early Search Engines and Term Spam

As people began to use search engines to find their way around the Web, unethical people saw the opportunity to fool search engines into leading people to their page.

Techniques for fooling search engines into believing your page is about something it is not are called term spam.

The ability of term spammers to operate so easily rendered early search engines almost useless. To combat term spam, Google introduced two innovations:

1. PageRank was used to simulate where random Web surfers would end up. Pages that would have a large number of surfers were considered more "important" than pages that would rarely be visited. Google prefers important pages to unimportant pages when deciding which pages to show first in response to a search query.

2. The content of a page was judged not only by the terms appearing on that page, but also by terms used in or near the links to that page.

It is reasonable to ask why simulation of random surfers should allow us to approximate the intuitive notion of the "importance" of pages. There are two related motivations that inspired this approach.

• Users of the web "vote with their feet." They tend to place links to pages they think are good or useful pages to look at, rather than bad or useless pages.
• The behavior of a random surfer indicates which pages users of the web are likely to visit. Users are more likely to visit useful pages than useless pages.

 

5.1.2 Definition of PageRank

Think of the Web as a directed graph, where pages are the nodes, and there is an arc from page P1 to page P2 if there are one or more links from P1 to P2.

Here is an example of a tiny version of the Web, with only four pages:

Page A has links to each of the other three pages;

Page B has links to A and D only;

Page C has a link only to A;

Page D has links to B and C only.

In general, we can define the transition matrix of the Web to describe what happens to random surfers after one step. This matrix M has n rows and columns if there are n pages. The element m_ij in row i and column j has value 1/k if page j has k arcs out and one of them is to page i; otherwise, m_ij = 0.

The transition matrix M for the example above is shown below. It is easy to see that A links to three pages, so the probability of moving from A to each of them is 1/3; the same reasoning applies to the other columns.

      A     B     C     D
A     0    1/2    1     0
B    1/3    0     0    1/2
C    1/3    0     0    1/2
D    1/3   1/2    0     0
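As a quick illustration (a sketch, not from the book), this transition matrix can be built mechanically from the adjacency lists of the four pages; the dictionary encoding and numpy are just one convenient choice:

    import numpy as np

    # Out-links of the four-page example: A -> B,C,D;  B -> A,D;  C -> A;  D -> B,C
    links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
    pages = sorted(links)                      # ['A', 'B', 'C', 'D']
    idx = {p: i for i, p in enumerate(pages)}  # page name -> row/column index

    n = len(pages)
    M = np.zeros((n, n))
    for src, outs in links.items():
        # column src spreads probability 1/deg(src) over src's successors
        for dst in outs:
            M[idx[dst], idx[src]] = 1.0 / len(outs)

    print(M)   # reproduces the matrix above; e.g. row A, column C is 1, since C's only link is to A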

So how do we compute PageRank from this transition matrix?

PageRank simulates random browsing of the Web: pages with a higher probability of being visited are considered more important, that is, they should have a higher PageRank.

The matrix M above gives only the one-step transition probabilities. If we want to know the probability that each page is being visited after n steps of random surfing, how do we calculate it?

If M is the transition matrix of the Web and v0 is the initial (uniform) distribution, then after one step the distribution of the surfer is v1 = M v0, and after n steps it is vn = M^n v0, the probability of the surfer being at each page after n steps of random surfing. The limiting distribution of this process is the PageRank of each page. This is a Markov process.

     v0            v1            v2            v3              limit
    1/4           9/24          15/48         11/32             3/9
    1/4   M*v0    5/24   M*v1   11/48   M*v2   7/32             2/9
    1/4  ----->   5/24  ----->  11/48  ----->  7/32   ...       2/9
    1/4           5/24          11/48          7/32             2/9

To see why multiplying a distribution vector v by M gives the distribution x = Mv at the next step, we reason as follows.

The probability x_i that a random surfer will be at node i at the next step is sum_j (m_ij v_j). Here, m_ij is the probability that a surfer at node j will move to node i at the next step (often 0 because there is no link from j to i), and v_j is the probability that the surfer was at node j at the previous step.
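Under those definitions, the whole computation is just repeated matrix-vector multiplication. A minimal sketch in Python/numpy (assuming the matrix above), which reproduces the limiting distribution (3/9, 2/9, 2/9, 2/9) shown in the table:

    import numpy as np

    M = np.array([[0,   1/2, 1, 0  ],
                  [1/3, 0,   0, 1/2],
                  [1/3, 0,   0, 1/2],
                  [1/3, 1/2, 0, 0  ]])

    v = np.full(4, 1/4)                # v0: surfer equally likely to start at any page
    for _ in range(100):
        v_next = M @ v                 # one step of the Markov process
        if np.allclose(v_next, v, atol=1e-12):
            break                      # distribution has (approximately) converged
        v = v_next

    print(v)                           # approx. [1/3, 2/9, 2/9, 2/9]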

 

As we shall see, computing PageRank by simulating random surfers is a time-consuming process. One might think that simply counting the number of in-links for each page would be a good approximation to where random surfers would wind up.

Simplified PageRank doesn't work

If that is all we did, then the hypothetical shirt-seller could simply create a "spam farm" of a million pages, each of which linked to his shirt page. Then the shirt page would look very important indeed, and a search engine would be fooled.

 

5.1.3 Structure of the Web

The previous section gives the basic PageRank algorithm, but the structure of the real Web is not so simple, and applying this algorithm directly can cause problems. Let's take a look at what the real Web structure is like.

Figure 5.2: the "bowtie" picture of the web

The bowtie diagram above shows the actual structure of the Web, which consists of the following parts:

1. A large strongly connected component (SCC).
2. The in-component, consisting of pages that could reach the SCC by following links, but were not reachable from the SCC.
3. The out-component, consisting of pages reachable from the SCC but unable to reach the SCC.
4. Tendrils, which are of two types. Some tendrils consist of pages reachable from the in-component but not able to reach the in-component. The other tendrils can reach the out-component, but are not reachable from the out-component.
5. Tubes, which are pages reachable from the in-component and able to reach the out-component, but unable to reach the SCC or be reached from the SCC.
6. Isolated components that are unreachable from the large components (the SCC, in-, and out-components) and unable to reach those components.

Several of these structures violate the assumptions needed for the Markov process iteration to converge to a limit.

As a result, the PageRank computation is usually modified to prevent the following two problems.

The first is the dead end, a page that has no links out.

The second problem is groups of pages that all have out-links but never link to any page outside the group. These structures are called spider traps.

 

5.1.4 Avoiding Dead Ends

Recall that a page with no link out is called a dead end.

      A     B     C     D
A     0    1/2    0     0
B    1/3    0     0    1/2
C    1/3    0     0    1/2
D    1/3   1/2    0     0

As shown in the matrix above, C is a dead end (its column is all zeros). If we run the Markov process iteration on this matrix, the PageRank of every page drifts to 0, because whenever the surfer reaches C there is no next step, and probability leaks out of the system.

A simple solution is to delete the dead-end nodes and their incident arcs (recursively, since removing one dead end can create new ones), and then compute the PageRank of the remaining nodes.

Finally, the PageRank of each dead-end node is computed from the PageRanks of its predecessors; for example, for C:

PR(C) = 1/3 * PR(A) + 1/2 * PR(D)
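A minimal sketch of this two-phase procedure (an illustration under the assumptions above, not the book's code): recursively drop dead ends, run the plain iteration on what remains, then restore the dead ends in reverse order of deletion, using out-degrees from the full graph.

    import numpy as np

    # Dead-end example: C has no out-links
    links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": [], "D": ["B", "C"]}

    # Phase 1: recursively remove dead ends (removing one may create new ones)
    graph = {p: list(outs) for p, outs in links.items()}
    removed = []
    while True:
        dead = [p for p, outs in graph.items() if not outs]
        if not dead:
            break
        removed.extend(dead)
        for d in dead:
            del graph[d]
        for p in graph:
            graph[p] = [q for q in graph[p] if q in graph]

    # Phase 2: ordinary PageRank iteration on the reduced graph
    pages = sorted(graph)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    M = np.zeros((n, n))
    for src, outs in graph.items():
        for dst in outs:
            M[idx[dst], idx[src]] = 1.0 / len(outs)
    v = np.full(n, 1.0 / n)
    for _ in range(100):
        v = M @ v
    pr = {p: v[idx[p]] for p in pages}

    # Phase 3: restore dead ends last-deleted-first, using out-degrees of the
    # FULL graph, e.g. PR(C) = 1/3 * PR(A) + 1/2 * PR(D)
    for d in reversed(removed):
        pr[d] = sum(pr[p] / len(outs) for p, outs in links.items() if d in outs)

    print(pr)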

Of course, you can also use the taxation method described in the next section to solve this problem.

 

5.1.5 Spider Traps and Taxation

As we mentioned, a spider trap is a set of nodes with no dead ends but no arcs out of the set. These structures can appear intentionally or unintentionally on the Web, and they cause the PageRank calculation to place all the PageRank within the spider trap.

This situation is different from a dead end: after entering a spider trap, the surfer always has a next step, but can never leave.

      A     B     C     D
A     0    1/2    0     0
B    1/3    0     0    1/2
C    1/3    0     1    1/2
D    1/3   1/2    0     0

This gives a simplified version of a spider trap: node C has a self-loop, so once the surfer reaches C, it can never leave. This is an abstraction of the spider traps found on the real Web.

Note that in general spider traps can have many nodes; there are spider traps with millions of nodes that spammers construct intentionally.

If we use the Markov process iteration to calculate PageRank on this graph, the spider trap gets all the PageRank, and the PageRank of the other nodes goes to 0.

In this way the spammer achieves his goal: the spider trap he intentionally constructed obtains the highest PageRank.

The solution is to introduce randomness, much as random restarts are used to escape local optima: a random jump lets the surfer escape the trap.

So the preceding Markov process is modified as follows, adding teleportation of the random surfer with probability 1 − β:

v' = βMv + (1 − β)e/n

where β is a chosen constant, usually in the range 0.8 to 0.9, e is a vector of all 1's with the appropriate number of components, and n is the number of nodes in the Web graph.

With probability β, the random surfer decides to follow an out-link from their present page.

With probability 1 − β, the surfer jumps to a random page; equivalently, a new random surfer is introduced at a random page.

With teleportation added, the spider-trap node C still gets more than its share of the PageRank, but the effect is limited, and the other nodes also receive some PageRank.
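A minimal sketch of this taxed iteration on the spider-trap example (assuming numpy and β = 0.8; any value in the usual 0.8–0.9 range behaves similarly):

    import numpy as np

    # Transition matrix of the spider-trap example: C links only to itself
    M = np.array([[0,   1/2, 0, 0  ],
                  [1/3, 0,   0, 1/2],
                  [1/3, 0,   1, 1/2],
                  [1/3, 1/2, 0, 0  ]])

    beta = 0.8                              # assumed damping factor
    n = M.shape[0]
    v = np.full(n, 1 / n)
    for _ in range(100):
        v_next = beta * (M @ v) + (1 - beta) / n    # v' = beta*M*v + (1 - beta)*e/n
        if np.allclose(v_next, v, atol=1e-12):
            break
        v = v_next

    print(v)   # C still gets the largest share, but A, B and D keep nonzero PageRank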

 

5.2 Efficient Computation of PageRank

To compute the PageRank for a large graph representing the Web, we have to perform a matrix-vector multiplication on the order of 50 times, until the vector is close to unchanged at one iteration.

This involves a sparse representation of the transition matrix and a MapReduce formulation of the matrix-vector multiplication.
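Ignoring the distributed MapReduce part, the sparse representation stores, for each source page, its out-degree and the list of its successors. One taxed iteration step over such a representation might look like this (a sketch, reusing the toy four-page graph; a real Web graph would of course not fit in a single dictionary):

    # Sparse, column-oriented representation of M: for each source page,
    # store its out-degree and the list of pages it links to.
    sparse_M = {
        "A": (3, ["B", "C", "D"]),
        "B": (2, ["A", "D"]),
        "C": (1, ["A"]),
        "D": (2, ["B", "C"]),
    }

    def iterate(v, sparse_M, beta=0.85):
        # One step v' = beta*M*v + (1 - beta)*e/n without ever materializing M
        n = len(v)
        v_next = {p: (1 - beta) / n for p in v}
        for src, (deg, dests) in sparse_M.items():
            share = beta * v[src] / deg          # each successor gets an equal share
            for dst in dests:
                v_next[dst] += share
        return v_next

    v = {p: 1 / len(sparse_M) for p in sparse_M}
    for _ in range(50):
        v = iterate(v, sparse_M)
    print(v)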

 

5.3 Topic-Sensitive PageRank

5.3.1 Motivation for Topic-Sensitive PageRank

Different people have different interests, and sometimes distinct interests are expressed using the same term in a query. The canonical example is the search query jaguar, which might refer to the animal, the automobile, a version of the Mac operating system, or even an ancient game console.

However, there are far too many users to compute and store a separate preference vector for each one, so:

The topic-sensitive PageRank approach creates one vector for each of some small number of topics, biasing the PageRank to favor pages of that topic.

Then we only need to classify each user under one of these topics to approximate the user's interests.

 

5.3.2 Biased Random Walks

Suppose we have identified some pages that represent a topic such as "sports." To create a topic-sensitive PageRank for sports, we can arrange that the random surfers are introduced only at a random sports page, rather than at a random page of any kind.

In fact, the only difference is that the surfer's random jump is biased. In the original model, the jump destination is uniformly random over all pages; in the biased model, the surfer jumps only (or with high probability) to pages of a specific topic, such as sports.

The mathematical formulation of the iteration is as follows.

Suppose S is the set of pages we have identified as belonging to the chosen topic (the teleport set).

Let e_S be a vector that has 1 in the components corresponding to pages in S and 0 in the other components.

v' = βMv + (1 − β)e_S/|S|
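A minimal sketch of the biased iteration (assuming numpy, the four-page matrix from Section 5.1.2, β = 0.8, and a hypothetical teleport set S = {B, D}):

    import numpy as np

    # Transition matrix of the four-page example from Section 5.1.2
    M = np.array([[0,   1/2, 1, 0  ],
                  [1/3, 0,   0, 1/2],
                  [1/3, 0,   0, 1/2],
                  [1/3, 1/2, 0, 0  ]])

    beta = 0.8
    pages = ["A", "B", "C", "D"]
    S = {"B", "D"}                                   # assumed teleport set for the topic
    e_S = np.array([1.0 if p in S else 0.0 for p in pages])

    v = np.full(len(pages), 1 / len(pages))
    for _ in range(100):
        v_next = beta * (M @ v) + (1 - beta) * e_S / len(S)   # v' = beta*M*v + (1-beta)*e_S/|S|
        if np.allclose(v_next, v, atol=1e-12):
            break
        v = v_next

    print(dict(zip(pages, v)))   # pages in (and linked from) S get boosted PageRank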
