Link analysis algorithm: HITS algorithm
The HITS (Hyperlink-Induced Topic Search) algorithm was first proposed in 1997 by Dr. Jon Kleinberg of Cornell University, as part of a research project called "CLEVER" at the IBM Almaden Research Center.
HITS is a basic and important algorithm in link analysis, and it has been used in practice as the link analysis algorithm of the Teoma search engine (www.teoma.com).
1. Hub Page and Authority page
The Hub page and the Authority page are the two most basic concepts in the HITS algorithm.
An "Authority" page is a high-quality web page related to a particular field or topic. For example, in the search engine field the Google and Baidu home pages are high-quality pages, and in the video field the Youku and Tudou home pages are.
A "Hub" page is a page that contains many links to high-quality "Authority" pages. For example, the hao123 home page can be considered a typical high-quality "Hub" page.
Figure 1 shows an example of a "Hub" page: a page maintained by the Stanford University Computational Linguistics Research Group. It collects high-quality resources related to statistical natural language processing, including some well-known open source packages and corpora, and links to the pages of these resources. This page can be considered a "Hub" page in the field of natural language processing, and most of the resource pages it points to are high-quality "Authority" pages.
Figure 1 Hub page in the natural language processing field
The purpose of the HITS algorithm is to use link analysis to find, among a huge number of web pages, the high-quality "Authority" pages and "Hub" pages that are relevant to the topic of the user's query. The "Authority" pages matter most, because they represent high-quality content that can satisfy the user's query. The search engine returns them to the user as search results.
2. Basic idea of the algorithm: a mutually reinforcing relationship
Basic hypothesis 1: A good "Authority" page is pointed to by many good "Hub" pages;
Basic hypothesis 2: A good "Hub" page points to many good "Authority" pages.
3. HITS algorithm
Specific algorithm: using the two basic hypotheses above and the mutually reinforcing relationship between them, the algorithm performs multiple rounds of iterative calculation. Each round updates the two weights of every page, until the weights are stable and no longer change significantly.
Steps:
3.1 Root Set
Submit the query Q to a keyword-based search system and take the first n pages of the returned results (for example, n=200) as the root set, denoted Root. Root satisfies:
1) The number of pages in Root is relatively small
2) The pages in Root are related to the query Q
3) The pages in Root include many authoritative (Authority) pages
This set forms a directed graph, in which the pages are the nodes and the hyperlinks between them are the edges.
3.2 Expansion set Base
Starting from the root set Root, the HITS algorithm expands the collection of web pages (see Figure 2). The expansion principle is: any web page that has a direct link relationship with a page in the root set is added to the expansion set Base, regardless of whether it links to a page in the root set or a page in the root set links to it. The HITS algorithm then looks for good "Hub" pages and good "Authority" pages within this expansion set.
Figure 2 Root set and expansion set
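The expansion principle above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the link graph is assumed to be available as a dictionary mapping each page to the set of pages it links to, and the function name `expand_root_set` and the page names are hypothetical.

```python
from typing import Dict, Set


def expand_root_set(root: Set[str], out_links: Dict[str, Set[str]]) -> Set[str]:
    """Return the root set plus every page directly linked to or from it."""
    base = set(root)
    for src, targets in out_links.items():
        if src in root:
            # Pages that a root page links to.
            base.update(targets)
        elif targets & root:
            # Pages that link to at least one root page.
            base.add(src)
    return base


# Hypothetical link graph: page -> pages it links to.
links = {
    "r1": {"x", "r2"},
    "r2": set(),
    "y": {"r1"},
    "z": {"w"},
}
base = expand_root_set({"r1", "r2"}, links)
print(sorted(base))  # ['r1', 'r2', 'x', 'y']
```

Pages "z" and "w" stay outside the result because neither has a direct link relationship with a root page.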
3.3 Compute the Hub value and Authority value of every page in the expansion set Base
1) Let a(i) and h(i) denote the Authority value and the Hub value of web page i, respectively.
2) For the expansion set Base, we do not know in advance which pages are good "Hub" pages or good "Authority" pages; every page is a potential candidate. Each page is therefore given two weights, which record separately how likely the page is to be a good Hub page and a good Authority page. Initially, before any further information is available, the two weights of every page are the same and can both be set to 1, namely:
$$a(i) = 1, \qquad h(i) = 1$$
3) In each iteration, calculate the Hub weight and Authority weight of every page:
The Authority weight of page i in this iteration is the sum of the Hub weights of all pages that point to page i (j → i means that page j links to page i):
$$a(i) = \sum_{j \to i} h(j)$$
The Hub weight of page i is the sum of the Authority weights of all pages that page i points to:
$$h(i) = \sum_{i \to j} a(j)$$
4) Normalize a(i) and h(i):
Divide the Authority weight of every page by the largest Authority weight:
$$a(i) = \frac{a(i)}{\max_{j} a(j)}$$
Divide the Hub weight of every page by the largest Hub weight:
$$h(i) = \frac{h(i)}{\max_{j} h(j)}$$
5) Repeat steps 3) and 4): compare the weights from the previous iteration with the weights after this round. If the weights as a whole show no significant change, the system has reached a stable state and the calculation can end, that is, a(i) and h(i) have converged.
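Steps 1) to 5) can be put together in a short Python sketch. It is a minimal illustration rather than a reference implementation, and several details are assumptions not fixed by the text: the graph is a dictionary mapping each page to the set of pages it links to, the Hub update uses the Authority weights just computed in the same round, and the names `hits`, `max_iter` and `tol` are hypothetical.

```python
from typing import Dict, Set, Tuple


def hits(out_links: Dict[str, Set[str]], max_iter: int = 100,
         tol: float = 1e-8) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (authority, hub) weights for every page in the graph."""
    # Collect every page that appears as a link source or a link target.
    pages = set(out_links)
    for targets in out_links.values():
        pages |= targets

    # Step 2: all weights start at 1.
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(max_iter):
        # Step 3: authority = sum of the hub weights of pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Step 3: hub = sum of the authority weights of pages linked to
        # (assumption: the freshly updated authority weights are used).
        new_hub = {p: sum(new_auth[dst] for dst in out_links.get(p, ()))
                   for p in pages}

        # Step 4: divide by the largest weight (guard against all zeros).
        a_max = max(new_auth.values()) or 1.0
        h_max = max(new_hub.values()) or 1.0
        new_auth = {p: v / a_max for p, v in new_auth.items()}
        new_hub = {p: v / h_max for p, v in new_hub.items()}

        # Step 5: stop once the total change falls below the tolerance.
        delta = sum(abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p])
                    for p in pages)
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub


# Hypothetical graph: two hubs pointing at three candidate authorities.
links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2", "a3"}}
auth, hub = hits(links)
for page in sorted(auth, key=auth.get, reverse=True):
    print(page, round(auth[page], 3), round(hub[page], 3))
```

In this toy graph "a1" and "a2" are linked from both hubs and share the top Authority weight of 1.0, while "a3", linked from only one hub, scores lower; "h2" is the strongest hub because it points to all three.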
Algorithm Description:
Figure 3 shows how the Hub weight and Authority weight of a page are updated during the iterative calculation. Let a(i) denote the Authority weight of page i and h(i) its Hub weight. In the example in Figure 3, the expansion set contains 3 pages with links pointing to page 1, and page 1 has 3 links pointing to other pages. The Authority weight of page 1 in this iteration is then the sum of the Hub weights of all pages that point to page 1, and, similarly, the Hub weight of page 1 is the sum of the Authority weights of the pages it points to.
Figure 3 Hub and authority weights calculation
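As a small worked instance of this update, suppose (purely as an assumption for illustration, since the page numbers and weights below are not taken from the figure) that pages 2, 3 and 4 point to page 1 with Hub weights 0.5, 1 and 0.25, and that page 1 points to pages 5, 6 and 7 with Authority weights 1, 0.5 and 0.5. Before normalization, the new weights of page 1 would be:

```latex
\begin{aligned}
a(1) &= h(2) + h(3) + h(4) = 0.5 + 1 + 0.25 = 1.75 \\
h(1) &= a(5) + a(6) + a(7) = 1 + 0.5 + 0.5 = 2
\end{aligned}
```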
3.4 Output Sort Results
Sort the pages by Authority weight from high to low, and return the highest-weighted pages as the search results for the user's query.
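The output step amounts to a sort followed by a cut-off. A minimal sketch, assuming the Authority weights are held in a dictionary; the function name `top_authorities`, the parameter `k` and the example scores are hypothetical:

```python
from typing import Dict, List, Tuple


def top_authorities(auth: Dict[str, float], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k pages with the highest Authority weight."""
    # Sort by weight, highest first; break ties by page name so the
    # output order is deterministic.
    ranked = sorted(auth.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


scores = {"a1": 1.0, "a2": 0.8, "a3": 0.56, "h1": 0.0}
print(top_authorities(scores, k=2))  # [('a1', 1.0), ('a2', 0.8)]
```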
4. Problems with the HITS algorithm
On the whole, HITS is a very good algorithm. It is used not only in the search engine field but has also been borrowed by many other areas of computing, such as "natural language processing" and "social analysis", with very good results. However, the original version of the HITS algorithm still has some problems, and many later link analysis methods based on HITS are in fact improvements aimed at these problems.
In summary, the HITS algorithm has shortcomings mainly in the following areas:
1. Low computational efficiency
Because HITS is a query-dependent algorithm, it must be computed in real time after the user's query is received, and the algorithm itself needs many rounds of iterative calculation to obtain the final result. This makes its computational efficiency low, which must be carefully considered in practical applications.
2. Topic drift
If the expansion set contains some pages unrelated to the query topic, and these pages are densely linked to one another, the HITS algorithm is likely to give these unrelated pages a high ranking, causing the topic of the search results to drift. This phenomenon is called the "tightly-knit community effect".
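The effect can be reproduced on a toy graph. The sketch below is an illustration under assumptions, not a measurement on real data: the page names are hypothetical, weights are normalized by the largest value, and a fixed number of rounds is used instead of a convergence test. An on-topic page receives 5 links from hubs that link to nothing else, while 3 off-topic hubs all link to the same 3 off-topic pages.

```python
from typing import Dict, Set


def authority_scores(out_links: Dict[str, Set[str]], rounds: int = 50) -> Dict[str, float]:
    """Run a fixed number of HITS rounds and return the Authority weights."""
    pages = set(out_links).union(*out_links.values())
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(rounds):
        # Authority = sum of the hub weights of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub = sum of the authority weights of pages linked to.
        hub = {p: sum(auth[d] for d in out_links.get(p, ())) for p in pages}
        # Normalize both weight sets by their largest value.
        a_max, h_max = max(auth.values()), max(hub.values())
        auth = {p: v / a_max for p, v in auth.items()}
        hub = {p: v / h_max for p, v in hub.items()}
    return auth


# On-topic page: 5 in-links, from hubs that link to nothing else.
links = {f"on_hub{i}": {"on_topic"} for i in range(5)}
# Off-topic community: 3 hubs that all link to the same 3 pages.
links.update({f"off_hub{i}": {"off_a", "off_b", "off_c"} for i in range(3)})

scores = authority_scores(links)
print(scores["off_a"], scores["on_topic"])
```

Although "on_topic" has more in-links (5) than any off-topic page (3 each), the densely interlinked off-topic group reinforces itself faster in every round, so the off-topic pages end with Authority weight 1.0 while the weight of "on_topic" shrinks toward 0.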
3. Easily manipulated by cheaters
By its mechanism, HITS is easy for cheaters to manipulate. For example, a cheater can build a web page and add to it many links pointing to high-quality or well-known websites, which makes it a good Hub page. The cheater then links this page to the cheat page, which raises the Authority score of the cheat page.
4. Structural instability
Structural instability means that if a few pages are added to or deleted from the original expansion set, or a few link relationships are changed, the ranking produced by the HITS algorithm can change greatly.
5. Comparison between the HITS algorithm and the PageRank algorithm
The HITS algorithm and the PageRank algorithm can be said to be the two most basic and important algorithms in search engine link analysis. As the introductions to the two algorithms show, they differ considerably in their basic conceptual model, in their calculation approach, and in the details of their technical implementation. The differences between the two are described one by one below.
1. The HITS algorithm is closely related to the query entered by the user, while PageRank is independent of the query. Therefore, HITS can be used on its own as a measure of relevance, while PageRank must be combined with a content similarity calculation to evaluate the relevance of a web page.
2. Because HITS is closely tied to the user query, it must be computed in real time after the query is received, so its computational efficiency is low. PageRank can be computed offline once crawling is finished and its results used directly online, so its computational efficiency is higher.
3. The number of objects HITS computes over is small: only the link relationships between the pages in the expansion set need to be calculated. PageRank is a global algorithm that processes all web page nodes on the Internet.
4. Comparing the two in terms of computational efficiency and the size of the object set, PageRank is more suitable for deployment on the server side, while HITS is more suitable for deployment on the client side.
5. The HITS algorithm has the topic generalization (topic drift) problem, so it is better suited to specific user queries, while PageRank has the advantage in handling broad user queries.
6. HITS needs to calculate two scores for each page, while PageRank needs to calculate only one. In the search engine field, more attention is paid to the Authority weight calculated by HITS, but in many other fields where HITS is applied, the Hub score also plays a very important role.
7. From the perspective of resisting link cheating, PageRank is better than HITS in its mechanism, and HITS is more susceptible to link cheating.
8. The HITS algorithm is structurally unstable: a small change to the link relationships within the expansion set can have a large impact on the final ranking. PageRank is more stable than HITS, and the root cause is the "remote jump" (the random jump, also called teleportation) in the PageRank calculation.
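To make the "remote jump" concrete, here is a minimal PageRank sketch. It follows a common textbook formulation and is an assumption, not something specified in this article: the damping value 0.85, the even redistribution of rank from pages with no out-links, the fixed number of rounds, and the names `pagerank` and `damping` are all illustrative. The toy graph (hypothetical page names) has one on-topic page with 5 in-links and a small group of 3 hubs that all link to the same 3 off-topic pages.

```python
from typing import Dict, Set


def pagerank(out_links: Dict[str, Set[str]], damping: float = 0.85,
             rounds: int = 50) -> Dict[str, float]:
    """Return a PageRank score for every page; the scores sum to 1."""
    pages = set(out_links).union(*out_links.values())
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)
    for _ in range(rounds):
        # Remote jump: every page receives the same small share,
        # no matter how the links are arranged.
        new_rank = dict.fromkeys(pages, (1.0 - damping) / n)
        for p in pages:
            targets = out_links.get(p, set())
            if targets:
                # Split this page's rank evenly over its out-links.
                share = damping * rank[p] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Page without out-links: spread its rank over all pages.
                for dst in pages:
                    new_rank[dst] += damping * rank[p] / n
        rank = new_rank
    return rank


links = {f"on_hub{i}": {"on_topic"} for i in range(5)}
links.update({f"off_hub{i}": {"off_a", "off_b", "off_c"} for i in range(3)})

rank = pagerank(links)
print(round(rank["on_topic"], 3), round(rank["off_a"], 3))
```

Because of the jump term, every page keeps a nonzero score, and in this toy graph "on_topic" ranks above the off-topic pages. This is one way to see why a small, densely linked group has less influence on PageRank than on HITS.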