Link Analysis Algorithm: The HITS Algorithm


The HITS (Hyperlink-Induced Topic Search) algorithm was first proposed by Jon Kleinberg of Cornell University in 1997, as part of the "CLEVER" research project at IBM's Almaden Research Center.

The HITS algorithm is a very basic and important algorithm in link analysis. It has been used as the link analysis algorithm of the Teoma search engine (www.teoma.com).


1. Hub Pages and Authority Pages

Hub pages and Authority pages are the two basic definitions of the HITS algorithm.

The so-called "Authority" Page refers to high-quality webpages related to a certain field or topic, such as the search engine field. Google and Baidu homepage are high-quality webpages in this field, such as video fields, youku and Tudou are high-quality webpages in this field.

The so-called "Hub" Page refers to webpages that contain links to high-quality "Authority" pages. For example, the hao123 homepage can be considered as a typical high-quality "Hub" webpage.

Figure 1 shows an example of a "Hub" page, maintained by the computational linguistics research group at Stanford University. The page collects high-quality resources related to natural language processing, including well-known open-source software packages and corpora, and points to those resource pages through links. It can therefore be considered a "Hub" page in the field of natural language processing, and most of the resource pages it points to are high-quality "Authority" pages.


Figure 1 A Hub page in the field of natural language processing

 

The purpose of the HITS algorithm is to find, among a massive number of webpages and through technical means, the high-quality "Authority" pages and "Hub" pages relevant to the topic of a user's query, especially the "Authority" pages, because these pages represent high-quality content for the query; the search engine returns them to the user as search results.

2. Basic Idea of the Algorithm: Mutual Reinforcement

Basic Assumption 1: A good "Authority" page is pointed to by many good "Hub" pages;

Basic Assumption 2: A good "Hub" page points to many good "Authority" pages.

3. The HITS Algorithm

The algorithm: using the two basic assumptions above and the principle of mutual reinforcement, perform multiple rounds of iterative computation. Each iteration updates the two weights of every page, until the weights stabilize and no longer change significantly.

Steps:

3.1 Collect the Root Set

1) Submit the query q to a keyword-based search system and take the first n webpages (for example, n = 200) from the returned results as the root set, denoted root. The root set meets the following requirements:

1) The number of webpages in root is small.

2) The webpages in root are relevant to the query q.

3) The webpages in root contain many authoritative pages.


This set forms a directed graph structure, with webpages as nodes and hyperlinks as directed edges.
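
As a minimal sketch of this step in Python, the following fragment collects the root set; search(q, n) is a hypothetical stand-in for a real keyword-based search engine:

    def search(q, n):
        """Hypothetical: return the URLs of the top-n results for query q."""
        raise NotImplementedError("plug in a real search backend")

    def collect_root_set(q, n=200):
        # The root set: the first n results returned for the query.
        return set(search(q, n))

    # Viewed as a directed graph: nodes are webpages, and a hyperlink
    # from page i to page j is a directed edge i -> j.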

3.2 Expand to the Base Set

Starting from the root set, the HITS algorithm expands the webpage set to obtain the base set (see Figure 2). The principle of expansion is that any webpage with a direct link relationship to a root-set page is added to base: pages that root-set pages link to, as well as pages that link to root-set pages. The HITS algorithm then searches for the "Hub" pages and "Authority" pages within this expanded webpage set, as sketched below.
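
A sketch of the expansion step; get_out_links(page) and get_in_links(page) are hypothetical helpers (in practice they would be backed by a crawler's link database):

    def get_out_links(page):
        """Hypothetical: pages that `page` links to."""
        raise NotImplementedError

    def get_in_links(page):
        """Hypothetical: pages that link to `page`."""
        raise NotImplementedError

    def expand_to_base(root):
        # base = root plus every page with a direct link relationship
        # to a root-set page, in either direction.
        base = set(root)
        for page in root:
            base |= set(get_out_links(page))  # pages the root page points to
            base |= set(get_in_links(page))   # pages pointing to the root page
        return base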


 

Figure 2 Root set and expanded set

 

3.3 Calculate the Hub and Authority Values of All Pages in the Base Set

1) Let a(i) and h(i) denote, respectively, the Authority value and the Hub value of webpage node i.

2) For the pages in the base set, we do not initially know which are good "Hub" or "Authority" pages; every page has the potential to be either. Therefore, we assign each page two weights, recording respectively how likely the page is to be a good Hub page or a good Authority page. Initially, before any further information is available, the two weights of every page are set to the same value, say 1, that is: a(i) = 1 and h(i) = 1 for every page i.
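
In code, this initialization is simply (a sketch, with base from the expansion step above):

    # Every page is initially equally likely to be a good Hub or a good
    # Authority, so both weights start at 1.
    a = {page: 1.0 for page in base}  # Authority value a(i)
    h = {page: 1.0 for page in base}  # Hub value h(i)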

3) In each iteration, update the Hub and Authority weights:

In this iteration, the Authority value of webpage i is the sum of the Hub values of all pages that point to i:

    a(i) = Σ_{j: j→i} h(j)

The Hub value of webpage i is the sum of the Authority values of all pages that i points to:

    h(i) = Σ_{j: i→j} a(j)

4) Normalize a(i) and h(i):

Divide the Authority value of every webpage by the highest Authority value to normalize it:

    a(i) = a(i) / max_j a(j)

Divide the Hub value of every webpage by the highest Hub value to normalize it:

    h(i) = h(i) / max_j h(j)
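
A sketch of one round of steps 3) and 4), where in_links and out_links map each page in base to the base-set pages that link to it and that it links to (both assumed to have been built during the expansion step):

    def one_iteration(base, in_links, out_links, a, h):
        # Step 3: mutual reinforcement. An Authority value sums the Hub
        # values of pages pointing in; a Hub value sums the (freshly
        # updated) Authority values of the pages pointed to.
        new_a = {i: sum(h[j] for j in in_links.get(i, ())) for i in base}
        new_h = {i: sum(new_a[j] for j in out_links.get(i, ())) for i in base}
        # Step 4: normalize by the largest value so weights stay bounded.
        max_a = max(new_a.values()) or 1.0
        max_h = max(new_h.values()) or 1.0
        return ({i: v / max_a for i, v in new_a.items()},
                {i: v / max_h for i, v in new_h.items()})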


5) Repeat steps 3) and 4): compare the weights from the previous iteration with those from the current iteration. If, overall, the weights no longer change significantly, the system has reached a stable state and the computation can end; that is, a(i) and h(i) have converged.

Algorithm Description:
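
Putting the steps together, here is a minimal self-contained sketch of the whole procedure on a toy link graph (the graph, iteration cap, and tolerance are illustrative assumptions):

    def hits(out_links, max_iter=100, tol=1e-6):
        # Collect every node appearing as a source or a target.
        base = set(out_links) | {j for ts in out_links.values() for j in ts}
        in_links = {i: set() for i in base}
        for i, targets in out_links.items():
            for j in targets:
                in_links[j].add(i)
        a = {i: 1.0 for i in base}  # Authority values
        h = {i: 1.0 for i in base}  # Hub values
        for _ in range(max_iter):
            new_a = {i: sum(h[j] for j in in_links[i]) for i in base}
            new_h = {i: sum(new_a[j] for j in out_links.get(i, ()))
                     for i in base}
            max_a = max(new_a.values()) or 1.0  # avoid division by zero
            max_h = max(new_h.values()) or 1.0
            new_a = {i: v / max_a for i, v in new_a.items()}
            new_h = {i: v / max_h for i, v in new_h.items()}
            done = all(abs(new_a[i] - a[i]) < tol and abs(new_h[i] - h[i]) < tol
                       for i in base)
            a, h = new_a, new_h
            if done:  # stable state reached: a(i), h(i) have converged
                break
        return a, h

    # Toy usage: one obvious hub pointing at two authorities.
    authority, hub = hits({"hub": ["auth1", "auth2"], "auth2": ["auth1"]})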

 

Figure 3 shows how the Hub and Authority weights of a page are updated during the iterative computation. Suppose a(i) denotes the Authority value of webpage i and h(i) denotes the Hub value of webpage i. In the example in Figure 3, three webpages in the expanded webpage set have links pointing to page 1, and page 1 has three links pointing to other pages. In this round of iteration, the Authority value of page 1 is the sum of the Hub values of all pages pointing to page 1; similarly, the Hub value of page 1 is the sum of the Authority values of the pages that page 1 points to.


Figure 3 Calculation of Hub and Authority weights

 

3.4 Output the Ranked Results

Pages are sorted from high to low by Authority value, and the several pages with the highest values are output as the search results in response to the user's query, as in the sketch below.
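
In code, this final step is just a sort (a sketch; the cutoff k is an illustrative parameter, and authority is the result of the hits sketch above):

    def top_authorities(authority, k=10):
        # Rank pages by Authority value, highest first, and keep the top k.
        return sorted(authority, key=authority.get, reverse=True)[:k]

    results = top_authorities(authority, k=2)  # ["auth1", "auth2"] for the toy graph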

4. Problems with the HITS algorithm

The HITS algorithm is a good algorithm: it is used not only in the search engine field but has also been borrowed by many other computing fields, such as natural language processing and social network analysis, with good results. Nevertheless, the original version of the HITS algorithm still has some problems, and many subsequent link analysis methods based on HITS have been proposed to improve on these problems.

In summary, the HITS algorithm is insufficient in the following aspects:

1. Low computing efficiency

Because HITS is a query-dependent algorithm, it must be computed in real time after the user's query is received, and the algorithm itself needs many rounds of iteration to reach the final result. This makes its computational efficiency low, a point that must be considered carefully in practical applications.

2. Topic drift

If the expanded webpage set contains pages that are unrelated to the query topic, and there are many links among those pages, the HITS algorithm is likely to give these irrelevant pages a high ranking, causing topic drift in the search results. This phenomenon is known as the "Tightly-Knit Community" (TKC) effect.

3. Results easily manipulated by link spammers

HITS is easily manipulated by link spammers. For example, a spammer can create a webpage and fill it with links to high-quality or well-known websites, making it a good Hub page, and then add a link from that page to a cheating webpage, thereby raising the Authority score of the cheating webpage.

4. Unstable structure

Structural instability means that, within the original expanded webpage set, adding or deleting individual webpages or changing a few link relationships can greatly change the ranking results of the HITS algorithm.

5. Comparison Between the HITS and PageRank Algorithms

The HITS and PageRank algorithms are the two most basic and important link analysis algorithms in search engines. From the introductions of the two algorithms above, it is clear that they differ greatly in basic concept model, computational approach, and technical implementation details. Their differences are described one by one below.

1. The HITS algorithm is closely tied to the query entered by the user, while PageRank is independent of the query. Therefore, the HITS algorithm can serve on its own as a measure for ranking search results by relevance, whereas PageRank must be combined with content-based similarity computation to evaluate the relevance of a webpage;

2. Because the HITS algorithm is closely tied to the user's query, it must be computed in real time after the query is received, which makes it computationally inefficient. PageRank can be computed offline after crawling, with the precomputed results used directly at query time, so its computational efficiency is high;

3. The HITS algorithm computes over a small number of objects: only the link relationships among the webpages in the expanded set need to be considered. PageRank, by contrast, is a global algorithm that processes all webpage nodes on the Internet;

4. Given the computational efficiency of the two algorithms and the size of the object sets they process, PageRank is more suitable for deployment on the server side, while the HITS algorithm is more suitable for deployment on the client side;

5. The HITS algorithm suffers from topic generalization, so it is better suited to handling specific, narrow user queries, while PageRank has the advantage in handling broad user queries;

6. The HITS algorithm computes two scores for each page, while PageRank needs to compute only one. In the search engine field, more attention is paid to the Authority value computed by HITS, but in many other fields that apply the HITS algorithm, the Hub score also plays an important role;

7. From the perspective of link anti-spam, PageRank's mechanism is superior, while the HITS algorithm is more vulnerable to link spam;

8. The HITS algorithm's structure is unstable: small changes to the link relationships in the expanded webpage set can greatly affect the final ranking. PageRank is more stable than HITS, the root cause being the random jump (teleport) step in the PageRank computation.

