HITS algorithm (Hypertext Induced Topic Selection)
HITS is another basic and important algorithm in link analysis. It has been adopted as the link analysis algorithm of the Teoma search engine (www.teoma.com).
Hub page and authority page
Hub pages and authority pages are the two basic concepts of the HITS algorithm. An authority page is a high-quality webpage related to a particular domain or topic. For example, in the search engine domain, the Google and Baidu homepages are high-quality pages; in the video domain, the Youku and Tudou homepages are. A hub page is a page that contains many links pointing to high-quality authority pages. For example, the hao123 homepage can be regarded as a typical high-quality hub page.
Figure 6-11 shows an example of a hub page, maintained by the computational linguistics research group at Stanford University. The page collects high-quality resources related to natural language processing, including well-known open-source software packages and corpora, and links to each of these resource pages. It can be regarded as a hub page for the field of natural language processing. Correspondingly, most of the resource pages it points to are high-quality authority pages.
The goal of the HITS algorithm is to find, among a massive number of webpages, the high-quality authority pages and hub pages related to the topic of a user's query. The authority pages matter most: they carry the high-quality content that can satisfy the query, so the search engine returns them to the user as search results.
Mutual enhancement
Many algorithms are built on assumptions, and the HITS algorithm is no exception. It relies on two basic hypotheses:
· Basic Hypothesis 1: A good authority page will be pointed to by many good hub pages.
· Basic Hypothesis 2: A good hub page points to many good authority pages.
So far, both the definitions of hub and authority pages and the two basic hypotheses rely on vague terms such as "high quality" or "good". What exactly is a "good" hub page? What is a "good" authority page? The two basic hypotheses themselves supply the definition.
Basic Hypothesis 1 describes what a "good" authority page is: a page pointed to by many good hub pages. Both modifiers matter. "Many" means that the more hub pages point to the page, the better. "Good" means that the higher the quality of the hub pages pointing to the page, the better the page. The hypothesis therefore combines the quantity and the quality of all hub pages that point to the page.
Basic Hypothesis 2 describes what a "good" hub page is: a page that points to many good authority pages. Again, both modifiers matter. "Many" means that the more authority pages the page points to, the better. "Good" means that the higher the quality of those authority pages, the better the hub page. This likewise takes into account the quantity and the quality of all pages that the page links to.
From these two hypotheses we can derive the mutual enhancement relationship between hub pages and authority pages: the higher the hub quality of a page, the higher the authority of the pages it links to; conversely, the higher the authority of a page, the higher the hub quality of the pages that link to it. By iterating this mutual enhancement relationship repeatedly, we can find out which pages are high-quality hub pages and which are high-quality authority pages.
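The mutual enhancement relationship is usually written as a pair of update rules. The block below is one common way to formalize it, not a formula quoted from the text. The notation is an assumption made here: a(i) and h(i) denote the authority and hub weights of page i, and j → i means that page j contains a link to page i.

```latex
a(i) = \sum_{j \,:\, j \to i} h(j),
\qquad
h(i) = \sum_{j \,:\, i \to j} a(j)
```

The first sum grows with both the number of pages linking to i and their hub weights, which is the "many" and "good" of Basic Hypothesis 1; the second sum does the same for Basic Hypothesis 2.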
HITS algorithm
A significant difference between the HITS algorithm and the PageRank algorithm is that HITS is closely tied to the query entered by the user, while PageRank is a global algorithm that is independent of the query. All of the HITS computation steps described below start only after a user query has been received; in other words, HITS is a query-dependent link analysis algorithm.
After receiving a user query, the HITS algorithm submits it to an existing search engine (or a self-built retrieval system) and takes the top-ranked pages from the returned results, obtaining an initial set of pages that are highly relevant to the query. This set is called the root set.
Starting from the root set, the HITS algorithm expands the page set. The expansion rule is that every page directly linked with a page in the root set is added: whether a page links to a root set page or is linked to by a root set page, it is included in the expanded set. The HITS algorithm then searches this expanded set for good hub pages and good authority pages.
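The expansion step can be sketched in a few lines. This is a minimal illustration, assuming the link graph is already available in memory as a dictionary; the function name `expand_root_set`, the `out_links` structure, and the page names are all hypothetical and do not come from the text.

```python
def expand_root_set(root_set, out_links):
    """Return the expanded page set for a given root set.

    out_links maps each page to the set of pages it links to
    (a hypothetical in-memory link graph).
    """
    expanded = set(root_set)
    for page, targets in out_links.items():
        if page in root_set:
            # Pages that a root set page points to.
            expanded.update(targets)
        if any(t in root_set for t in targets):
            # Pages that point to a root set page.
            expanded.add(page)
    return expanded


# Toy link graph; page names are made up for illustration.
out_links = {
    "A": {"R1"},        # A links to a root set page
    "R1": {"B", "R2"},  # a root set page linking out
    "R2": set(),
    "C": {"D"},         # C and D have no direct link with the root set
    "D": set(),
}
expanded = expand_root_set({"R1", "R2"}, out_links)
print(sorted(expanded))  # ['A', 'B', 'R1', 'R2']
```

Pages C and D are left out because neither direction of the rule applies to them.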
For the expanded set, we do not know in advance which pages are good hub pages or good authority pages; every page is a candidate for both. Therefore, two weights are kept for each page, recording how likely the page is to be a good hub page and a good authority page respectively. Initially, before any further information is available, the two weights are the same for every page and can be set to 1.
Then the two basic hypotheses and the mutual enhancement principle described above are applied in multiple rounds of iterative computation. Each round updates the two weights of every page, and the process continues until the weights stabilize and no longer change significantly.
Figure 6-14 shows how the hub and authority weights of a page are updated during the iterative computation. Let a(i) denote the authority weight of page i and h(i) denote the hub weight of page i. In the example shown in Figure 6-14, three pages in the expanded set have links pointing to page 1, and page 1 has three links pointing to other pages. In this iteration, the authority weight of page 1 is the sum of the hub weights of all pages pointing to page 1. Similarly, the hub weight of page 1 is the sum of the authority weights of the pages it points to.
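The update for page 1 can be traced with concrete numbers. The sketch below mirrors the setup described above (three pages linking in, three links going out); the page IDs 2 to 7 are hypothetical, chosen only to give the neighbors names, and every weight starts at the initial value of 1.

```python
# Hypothetical page IDs mirroring the setup described above:
# pages 2, 3, 4 link to page 1; page 1 links to pages 5, 6, 7.
in_links = {1: [2, 3, 4]}
out_links = {1: [5, 6, 7]}

# Initial weights: every page starts with hub = authority = 1.
pages = range(1, 8)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

# a(1) = h(2) + h(3) + h(4)
new_auth_1 = sum(hub[q] for q in in_links[1])
# h(1) = a(5) + a(6) + a(7)
new_hub_1 = sum(auth[q] for q in out_links[1])

print(new_auth_1, new_hub_1)  # 3.0 3.0
```

Because all weights are still 1 in the first round, each sum simply counts links; the quality of the neighbors only starts to matter in later rounds.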
The other pages in the expanded set update their two weights in the same way. Once the weights of every page have been updated, one round of iteration is complete. The HITS algorithm then compares the weights from the previous round with those from the current round. If the weights have not changed significantly overall, the system has reached a stable state and the computation can end. The pages are then sorted by authority weight from high to low, and the pages with the highest weights are output as the search results for the user's query. If the weights from the two rounds differ significantly, the next round of iteration begins, and this continues until the weights of the whole system are stable.
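Putting the steps together, the whole iteration might look like the sketch below. It is a minimal illustration under stated assumptions, not the book's implementation: the function name `hits`, the tolerance `tol`, and the round limit `max_rounds` are hypothetical. The text does not mention normalization; the sketch rescales both weight vectors to unit length after each round, as Kleinberg's original formulation does, because raw sums would otherwise keep growing and never settle. The hub update here uses the authority weights just computed in the same round.

```python
import math


def hits(out_links, tol=1e-8, max_rounds=100):
    """Iterate hub/authority updates until the weights stabilize.

    out_links maps each page in the expanded set to the pages it links to.
    tol and max_rounds are illustrative defaults, not values from the text.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(max_rounds):
        # Authority weight: sum of hub weights of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub weight: sum of the authority weights just computed
        # for the pages linked to.
        new_hub = {
            p: sum(new_auth[d] for d in out_links.get(p, ()))
            for p in pages
        }
        # Assumption: rescale to unit length so weights stay comparable.
        for vec in (new_auth, new_hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm

        # Total change between the previous round and this one.
        change = sum(
            abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p])
            for p in pages
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Toy graph: h1 links to both authorities, h2 only to a1.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
ranking = sorted(auth, key=auth.get, reverse=True)
print(ranking[:2])  # ['a1', 'a2']
```

In the toy graph, a1 outranks a2 because it is pointed to by both hubs, and h1 ends up as the stronger hub because it points to both authorities.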
HITS algorithm problems
The HITS algorithm is a good algorithm. It is used not only in search engines; natural language processing, social analysis, and many other areas of computing have also borrowed it with good results. Even so, the original version of HITS has several problems, and many later link analysis methods based on HITS were proposed precisely to address them.
In summary, the HITS algorithm falls short in the following respects.
Low computing efficiency
Because HITS is a query-dependent algorithm, it has to be computed in real time after the user's query arrives, and the algorithm itself needs many rounds of iteration to reach its final result. This makes it computationally inefficient, which must be weighed carefully in real applications.
Topic drift
If the expanded set contains pages unrelated to the query topic, and these pages link to one another heavily, the HITS algorithm is likely to rank the irrelevant pages highly, causing the search results to drift away from the topic. This phenomenon is called the tightly-knit community effect.
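The effect is easy to reproduce on a toy graph. The sketch below is deliberately simplified and makes several assumptions: the two communities are not connected to each other, the page names are made up, and the helper `authority_scores` is a hypothetical plain HITS iteration with unit-length normalization run for a fixed number of rounds.

```python
import math


def authority_scores(out_links, rounds=50):
    """Plain HITS iteration; normalization is an assumed detail."""
    pages = set(out_links) | {d for ts in out_links.values() for d in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        hub = {p: sum(auth[d] for d in out_links.get(p, ())) for p in pages}
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return auth


# On-topic community: 2 hubs, each linking to the same 2 authorities.
on_topic = {f"on_hub{i}": ["on_auth0", "on_auth1"] for i in range(2)}
# Off-topic but more densely linked community: 3 hubs x 3 authorities.
off_topic = {
    f"off_hub{i}": ["off_auth0", "off_auth1", "off_auth2"] for i in range(3)
}

auth = authority_scores({**on_topic, **off_topic})
top3 = sorted(auth, key=auth.get, reverse=True)[:3]
print(sorted(top3))  # ['off_auth0', 'off_auth1', 'off_auth2']
```

Each round multiplies the weights of the denser community by a larger factor than those of the on-topic community, so after normalization the on-topic authority weights shrink toward zero even though those pages are the relevant ones.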
Result manipulation by attackers
The HITS algorithm is easily manipulated by cheaters. For example, a cheater can create a page filled with links to high-quality pages or famous websites, which makes it a good hub page. The cheater then adds a link from this page to the cheating page, which raises the authority score of the cheating page.
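The manipulation described above can be reproduced on a small graph. This is a sketch under assumptions: the page names (`spam_hub`, `cheat`, and so on) are invented, and `hits_scores` is a hypothetical fixed-round HITS iteration with unit-length normalization. It compares the cheating page's authority weight before and after the spam hub links to it.

```python
import math


def hits_scores(out_links, rounds=50):
    """Return (hub, auth) after a fixed number of HITS rounds."""
    pages = set(out_links) | {d for ts in out_links.values() for d in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        hub = {p: sum(auth[d] for d in out_links.get(p, ())) for p in pages}
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return hub, auth


# Two legitimate hubs and a spam hub all link to the well-known pages.
graph = {
    "hub1": ["good1", "good2"],
    "hub2": ["good1", "good2"],
    "spam_hub": ["good1", "good2"],
    "cheat": [],  # the cheating page, initially with no in-links
}
_, auth_before = hits_scores(graph)

# The cheater adds one more link from the spam hub to the cheating page.
graph["spam_hub"] = ["good1", "good2", "cheat"]
hub_after, auth_after = hits_scores(graph)

print(auth_before["cheat"], auth_after["cheat"] > 0)  # 0.0 True
```

Linking to the well-known pages costs the cheater nothing, yet it gives the spam page a hub weight at least as high as the legitimate hubs, and that weight flows directly into the cheating page's authority.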
Unstable structure
Structural instability means that if a few pages are added to or removed from the original expanded set, or a few link relationships are changed, the ranking produced by the HITS algorithm can change greatly.
Comparison between HITS algorithm and PageRank algorithm
The HITS algorithm and the PageRank algorithm are the two most basic and important algorithms for link analysis in search engines. From the introductions to the two algorithms above, we can see that they differ considerably in their basic conceptual models, computational approach, and technical implementation details. The differences are described one by one below.
· The HITS algorithm is closely tied to the query entered by the user, while PageRank is independent of the query. Therefore, HITS can be used on its own as a relevance scoring criterion, whereas PageRank can be used to evaluate page relevance only in combination with content similarity computation.
· Because the HITS algorithm is closely tied to the user query, it must be computed in real time after the query is received, which makes it computationally inefficient. PageRank can be computed offline once crawling is finished and its results used directly online, which makes it computationally efficient.
· The HITS algorithm works on a small number of objects: it only needs to compute over the links between pages in the expanded set. PageRank is a global algorithm that processes every page node on the Internet.
· Considering both computing efficiency and the size of the set being processed, PageRank is better suited to deployment on the server side, while the HITS algorithm is better suited to deployment on the client side.
· The HITS algorithm suffers from topic generalization, so it is better suited to specific user queries. The PageRank algorithm has the advantage when handling broad user queries.
· The HITS algorithm computes two scores for each page, while the PageRank algorithm computes only one. In the search engine field, more emphasis is placed on the authority weights computed by HITS; in many other fields that use HITS, however, the hub score also plays an important role.
· From the perspective of resisting link cheating, the mechanism of PageRank is better than that of HITS, which is more vulnerable to link cheating.
· The structure of the HITS algorithm is unstable: a small change to the link relationships in the expanded set greatly affects the final ranking. The PageRank algorithm is stable by comparison, and the root cause is the remote jump (random teleportation) used in the PageRank computation.
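For contrast with the HITS iteration, the remote jump mentioned in the last point can be sketched as follows. This is a minimal PageRank power iteration under stated assumptions: the function name `pagerank` and the toy graph are hypothetical, `damping=0.85` is the commonly quoted value rather than one given in the text, and every link target is assumed to also appear as a key of `out_links`.

```python
def pagerank(out_links, damping=0.85, rounds=50):
    """Power iteration with a uniform random jump (the "remote jump").

    damping=0.85 is the commonly quoted value, assumed for illustration.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(rounds):
        # Every page first receives its share of the random jump.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                for dst in pages:
                    new_rank[dst] += damping * rank[src] / n
        rank = new_rank
    return rank


# Toy graph; page "d" has no in-links at all.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
rank = pagerank(links)
print(round(sum(rank.values()), 6))  # 1.0
```

Because of the jump term, every page keeps a baseline score of (1 - damping) / n regardless of the link structure, as page "d" does here. No score can collapse to zero the way HITS weights can, which is one intuition for the stability difference noted above.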
-- This text is excerpted from "This Is a Search Engine: Detailed Explanation of Core Technologies"
Book details: http://blog.csdn.net/broadview2006/article/details/7179396