Link analysis algorithm: topic-sensitive PageRank

Source: Internet
Author: User


As mentioned in the previous discussion, PageRank ignores topic relevance, which lowers the topical relevance of the results, and the same ranking can suit different users very differently. For example, when searching for "Apple", a gadget enthusiast may want to see the iPhone, a fruit grower may want apple price trends and planting techniques, and a child may be looking for a simple drawing of an apple. Ideally, a dedicated set of vectors would be maintained for each user, but this is clearly infeasible with a massive user base. Search engines therefore generally choose a compromise called topic-sensitive PageRank (also translated as subject-sensitive PageRank). The approach is to predefine several topic categories, such as sports, entertainment, and technology, maintain a separate vector for each topic, and then estimate the user's topic preference and rank the results according to it.

Topic-sensitive PageRank is an improved version of the PageRank algorithm that has been used by Google in its personalized search service.

1. Basic Ideas

A set of topic-specific PageRank vectors is computed offline, so that each page has a score for each topic. The algorithm therefore has two phases: offline computation of the topic-specific PageRank vectors, and online determination of the topic at query time (i.e., the online similarity calculation).

2. Topic-sensitive PageRank calculation process

(1) Determine the topic categories

Topic-sensitive PageRank follows the ODP website (www.dmoz.org) and defines 16 top-level topic categories, including sports, business, technology, and more. The ODP (Open Directory Project) is a manually curated, multi-level web directory and navigation site (see Figure 1). Beneath its 16 top-level categories lies a finer-grained classification structure, and under each lowest-level directory, editors have manually collected a selection of high-quality web addresses that fit the directory's topic for Internet users to browse. Topic-sensitive PageRank uses the 16 top-level ODP categories as its predefined topic types.

Figure 1 ODP Home

(2) Assign pages to topics

This step assigns each page to its most appropriate category. Many algorithms can be used for this, such as a TF-IDF-based classifier or clustering. The end result of this step is that each page is assigned to one of the topics.
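As a rough illustration of the TF-IDF option, the sketch below assigns a page to the topic whose centroid is closest by cosine similarity. The seed texts, the function names, and the nearest-centroid rule are all assumptions made for this example; the original text does not say which classifier is actually used.

```python
import math
from collections import Counter

# Hypothetical seed pages per topic (in practice these might come from a directory such as ODP).
SEED_PAGES = {
    "Sports": ["basketball game score team", "football team match goal"],
    "Computers": ["python code compiler bug", "linux kernel code driver"],
}


def build_idf(token_lists):
    """Inverse document frequency for every term seen in the seed pages."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """Sparse TF-IDF vector (term -> weight); terms unknown to idf are ignored."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(seed_pages):
    """Average the TF-IDF vectors of each topic's seed pages into one centroid."""
    all_tokens = [text.split() for texts in seed_pages.values() for text in texts]
    idf = build_idf(all_tokens)
    centroids = {}
    for topic, texts in seed_pages.items():
        centroid = Counter()
        for text in texts:
            for term, weight in tf_idf(text.split(), idf).items():
                centroid[term] += weight / len(texts)
        centroids[topic] = dict(centroid)
    return idf, centroids


def classify_page(text, idf, centroids):
    """Return the topic whose centroid is most similar to the page text."""
    vector = tf_idf(text.split(), idf)
    return max(centroids, key=lambda topic: cosine(vector, centroids[topic]))


idf, centroids = train_centroids(SEED_PAGES)
print(classify_page("the team won the basketball match", idf, centroids))
```

A real system would need proper tokenization and far more training data; the point here is only that each page ends up with exactly one topic label.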

(3) Compute the per-topic vectors

The vector iteration formula in PageRank is:

r = q * P * r + (1 - q) * e / n

where P is the link transition matrix, q is the damping coefficient, n is the number of pages, and e is the vector whose elements are all 1.

In topic-sensitive PageRank, the vector iteration formula is:

r = q * P * r + (1 - q) * s / |s|

The difference is that the vector e becomes s, and n becomes |s|.

Here s is defined as follows: for a given topic, if page k belongs to this topic, then the k-th element of s is 1; otherwise it is 0. Note that each topic has a different s, and |s| denotes the number of 1s in s.

Suppose there are four pages A, B, C, and D, where A is classified under Arts, B under Computers, C under Computers, and D under Sports. Then for the topic Computers, s is:

s = (0, 1, 1, 0)

Suppose we set the damping coefficient q = 0.8. Since |s| = 2, the iteration formula is:

r = 0.8 * P * r + 0.2 * s / 2 = 0.8 * P * r + (0, 0.1, 0.1, 0)
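The construction of s and of the constant term (1 - q) * s / |s| for this four-page example can be sketched as follows. The helper names `topic_indicator` and `teleport_term` are hypothetical and chosen only for illustration.

```python
def topic_indicator(pages, page_topics, topic):
    """Build s: 1 where the page is assigned to `topic`, 0 elsewhere."""
    return [1 if page_topics[page] == topic else 0 for page in pages]


def teleport_term(s, q):
    """Return (1 - q) * s / |s|, the constant term of the iteration."""
    size = sum(s)  # |s|, the number of 1s in s
    if size == 0:
        raise ValueError("no page is assigned to this topic")
    return [(1.0 - q) * value / size for value in s]


pages = ["A", "B", "C", "D"]
page_topics = {"A": "Arts", "B": "Computers", "C": "Computers", "D": "Sports"}

s = topic_indicator(pages, page_topics, "Computers")
print(s)  # [0, 1, 1, 0]
print([round(x, 3) for x in teleport_term(s, q=0.8)])  # [0.0, 0.1, 0.1, 0.0]
```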

The vector obtained at convergence is the rank for the topic Computers. If you carry out the calculation, you will find that the weights of pages B and C under this topic are higher than in the ordinary, non-topic-sensitive rank. This means that for a user inclined toward the Computers topic (such as a programmer), B and C are more important in the results presented, so they may be ranked higher.
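The effect can be checked numerically with a small power iteration. The link graph below is hypothetical, since the text does not specify the links among A, B, C, and D; with a different graph the exact numbers would change. The function name `pagerank` and its parameters are likewise assumptions for this sketch.

```python
def pagerank(pages, out_links, teleport, q=0.8, iterations=100):
    """Iterate r = q * P * r + (1 - q) * teleport for a fixed number of rounds.

    `teleport` is e / n for ordinary PageRank and s / |s| for a topic.
    Every page is assumed to have at least one out-link.
    """
    index = {page: i for i, page in enumerate(pages)}
    rank = [1.0 / len(pages)] * len(pages)
    for _ in range(iterations):
        new_rank = [(1.0 - q) * t for t in teleport]
        for page, targets in out_links.items():
            # Each page splits its damped rank evenly among its out-links.
            share = q * rank[index[page]] / len(targets)
            for target in targets:
                new_rank[index[target]] += share
        rank = new_rank
    return rank


pages = ["A", "B", "C", "D"]
# Hypothetical links, chosen only so that the example can run.
out_links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}

uniform = pagerank(pages, out_links, teleport=[0.25, 0.25, 0.25, 0.25])
computers = pagerank(pages, out_links, teleport=[0.0, 0.5, 0.5, 0.0])

print([round(x, 3) for x in uniform])    # [0.321, 0.226, 0.226, 0.226]
print([round(x, 3) for x in computers])  # [0.306, 0.255, 0.255, 0.184]
```

On this graph, B and C rise from about 0.226 to about 0.255 when the teleport vector is restricted to the Computers pages, while D, which belongs to another topic, drops.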

(4) Online similarity calculation

The final step is to determine the user's topic preference when a search is submitted, so that the appropriate rank vector can be selected. There are two main methods:

One is to list all topics and let users choose the ones that interest them, which is often done at registration on some social Q&A sites;

The other is to use a "user query classifier" to classify the query; that is, the search engine uses some means (such as cookie tracking) to track user behavior and analyzes the data to infer the user's preference.

As shown in Figure 2, assume the user enters the query "Jordan". The probability that the query term "Jordan" belongs to the sports category is 0.6, to the entertainment category 0.1, and to the business category 0.3.

Figure 2 Online similarity calculation

At the same time, the search system reads the index, finds all pages containing the query term "Jordan", and retrieves their precomputed PageRank values for each topic. In the example of Figure 2, suppose page A's topic PageRank values are 0.2 for sports, 0.3 for entertainment, and 0.1 for business.

Given the category vector of the user query and the topic PageRank vector of a web page, the similarity between the page and the query can be computed as the inner product of the two vectors. In the example of Figure 2, the similarity between page A and the user query "Jordan" is:

Sim("Jordan", A) = 0.6*0.2 + 0.1*0.3 + 0.3*0.1 = 0.18

A similarity to the user query is computed in this way for every page containing the keyword "Jordan". The pages are then sorted by similarity from high to low and returned to the user as the search results.
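The inner product and the final sort can be sketched in a few lines. Page A uses the values from the example; pages B and C and their scores are hypothetical, added only so that there is something to sort.

```python
def similarity(query_topics, page_topic_ranks):
    """Inner product of the query's topic probabilities and a page's topic PageRank values."""
    return sum(
        probability * page_topic_ranks.get(topic, 0.0)
        for topic, probability in query_topics.items()
    )


# Topic probabilities for the query "Jordan", as in the example above.
query_topics = {"sports": 0.6, "entertainment": 0.1, "business": 0.3}

page_topic_ranks = {
    "A": {"sports": 0.2, "entertainment": 0.3, "business": 0.1},  # from the example
    "B": {"sports": 0.1, "entertainment": 0.5, "business": 0.2},  # hypothetical
    "C": {"sports": 0.4, "entertainment": 0.1, "business": 0.1},  # hypothetical
}

scores = {page: similarity(query_topics, ranks) for page, ranks in page_topic_ranks.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(round(scores["A"], 2))  # 0.18
print(ranked)                 # ['C', 'A', 'B']
```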

3. Using topic-sensitive PageRank to build personalized search

The content above introduces the basic idea and computation flow of topic-sensitive PageRank. Given its underlying mechanism, the algorithm is well suited as a technical solution for personalized search.

In the example shown in Figure 2, the similarity calculation uses only the query term "Jordan" that the user has just entered. The approach can be extended to use not only the current query term but also personalized information such as the user's past search history. For example, if the user previously searched for "Nike", it can be inferred that the query "Jordan" is about buying sportswear; if the previous search was "Yao", the user likely wants sports news. In this way, the user's personalized information and the current query can be fused in the search system to achieve personalized search and provide more accurate results.
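The text does not say how the history and the current query are fused. One simple possibility, shown here purely as an assumption, is a linear blend of two topic distributions: one from the current query and one inferred from the search history. The function name, the `history_weight` parameter, and the history values are all hypothetical.

```python
def blend_topic_distributions(query_topics, history_topics, history_weight=0.5):
    """Linearly mix two topic distributions and renormalize the result.

    history_weight = 0 uses only the current query; 1 uses only the history.
    """
    if not 0.0 <= history_weight <= 1.0:
        raise ValueError("history_weight must be between 0 and 1")
    topics = set(query_topics) | set(history_topics)
    blended = {
        topic: (1.0 - history_weight) * query_topics.get(topic, 0.0)
        + history_weight * history_topics.get(topic, 0.0)
        for topic in topics
    }
    total = sum(blended.values())
    if total == 0:
        raise ValueError("both distributions are empty")
    return {topic: value / total for topic, value in blended.items()}


# Query "Jordan" on its own, as in the earlier example.
query_topics = {"sports": 0.6, "entertainment": 0.1, "business": 0.3}
# Hypothetical distribution inferred from earlier searches such as "Nike".
history_topics = {"business": 0.7, "sports": 0.3}

blended = blend_topic_distributions(query_topics, history_topics, history_weight=0.5)
print({topic: round(value, 2) for topic, value in sorted(blended.items())})
```

With these made-up numbers the business category overtakes sports, matching the intuition that a user who searched for "Nike" may be shopping. The blended distribution can then take the place of the query's category vector in the similarity calculation.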

4. The difference between topic-sensitive PageRank and PageRank

The PageRank algorithm basically follows the "random walk model" mentioned in the previous section: when a user browsing a web page wants to jump to another page, the user randomly selects one of the links on the current page. Topic-sensitive PageRank changes this conceptual model by introducing a more realistic assumption. In general, users are interested in certain areas, and the page being browsed is itself related to some topic (such as a sports report or entertainment news). So when a user who has read the current page wants to jump, the user is more inclined to click links whose topic is similar to that of the current page. Topic-sensitive PageRank is thus a model that combines user interests, page topics, and the topical similarity between linked pages and the current page. This is clearly closer to how real users browse.

PageRank is a global measure of page importance: each page is given a single PageRank score based on the link structure. Topic-sensitive PageRank differs on this point. The algorithm introduces 16 topic types, and each web page has a PageRank score for each topic type, so every page is given 16 topic-specific PageRank scores.

The two algorithms also differ considerably in how they handle a user query. PageRank is query-independent and can only serve as one factor in the similarity calculation. Topic-sensitive PageRank is query-dependent and can be used on its own as the similarity formula. In addition, after a query is received, topic-sensitive PageRank needs a classifier to compute the query's degree of membership in each of the 16 predefined topics, and it uses this information in the ranking formula for the similarity calculation.
