This article is excerpted from Chapter 6 of "This Is a Search Engine: Core Technology Details".
The HITS algorithm is another basic and important algorithm in link analysis. It has been adopted as the link analysis algorithm of the Teoma search engine (www.teoma.com).
6.4.1 Hub pages and authority pages
Hub pages and authority pages are the two basic definitions of the HITS algorithm. An "authority" page is a high-quality webpage related to a certain field or topic. In the search engine field, for example, the Google and Baidu homepages are high-quality pages; in the video field, Youku and Tudou are. A "Hub" page is a webpage that contains many links to high-quality "authority" pages. For example, the hao123 homepage can be considered a typical high-quality "Hub" page.
Figure 6-11 shows an example of a "Hub" page, which is maintained by the computational linguistics research group at Stanford University. This page collects high-quality resources related to natural language processing, including some well-known open-source software packages and corpora, and links to these resource pages. It can be considered a "Hub" page in the field of natural language processing, and most of the resource pages it points to are high-quality "authority" pages.
Figure 6-11 hub page in the natural language processing field
The purpose of the HITS algorithm is to find, among a massive number of webpages, the high-quality "authority" pages and "Hub" pages related to the topic of a user query. The "authority" pages matter most, because they represent the high-quality content that satisfies the query, and the search engine returns them to the user as search results.
6.4.2 Mutual enhancement
Many algorithms are built on assumptions, and the HITS algorithm is no exception. The HITS algorithm implies and utilizes two basic assumptions:
Basic Assumption 1: A good "authority" page is pointed to by many good "Hub" pages;
Basic Assumption 2: A good "Hub" page points to many good "authority" pages.
So far, both the definitions of "Hub" and "authority" pages and the two basic assumptions rely on a fuzzy description: "high quality" or "good". So what is a "good" hub page, and what is a "good" authority page? The two basic assumptions themselves supply the definition of "good".
Basic Assumption 1 describes what a "good" authority page is: a page pointed to by many good hub pages. The two modifiers "many" and "good" are both important. "Many" means that the more hub pages point to the page, the better; "good" means that the higher the quality of the hub pages pointing to it, the better the page. The assumption therefore combines the quantity and the quality of all hub pages pointing to the page.
Basic Assumption 2 describes what a "good" hub page is: a page that points to many good authority pages. Again, the modifiers "many" and "good" are both important. "Many" means that the more authority pages the page points to, the better; "good" means that the higher the quality of those authority pages, the better the hub page. The assumption therefore combines the quantity and the quality of all pages that the page links to.
From these two basic assumptions we can deduce the mutual enhancement relationship between hub pages and authority pages (see Figure 6-12): the higher the hub quality of a webpage, the higher the authority of the pages it links to; conversely, the higher the authority of a webpage, the higher the hub quality of the pages that link to it. By iterating this mutual enhancement relationship repeatedly, we can find out which pages are high-quality hub pages and which are high-quality authority pages.
Figure 6-12 Mutual enhancement
6.4.3 HITS algorithm
A major difference between the HITS algorithm and the PageRank algorithm is that HITS is closely tied to the query entered by the user, while PageRank is a global algorithm independent of the query. All of the HITS calculation steps below are carried out after a user query is received; in other words, HITS is a query-dependent link analysis algorithm.
After receiving a user query, the HITS algorithm submits the query to an existing search engine (or a self-built retrieval system) and takes the top-ranked webpages from the returned results, obtaining an initial set of webpages highly related to the query. This set is called the root set.
Starting from the root set, the HITS algorithm expands the webpage set (see Figure 6-13). The expansion rule is: every webpage that has a direct link relationship with a page in the root set is added to the expanded webpage set, whether it links to a root-set page or is linked to by one. The HITS algorithm then searches for good "Hub" pages and good "authority" pages within the expanded webpage set.
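The expansion rule can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name expand_root_set, the representation of links as (source, target) pairs, and the page names are all assumptions made for the example.

```python
def expand_root_set(root_set, links):
    """Return the expanded webpage set for a root set.

    links is an iterable of (source, target) pairs, one per hyperlink
    (a hypothetical representation chosen for this sketch).
    """
    root = set(root_set)
    expanded = set(root)
    for src, dst in links:
        if dst in root:
            expanded.add(src)  # src links to a page in the root set
        if src in root:
            expanded.add(dst)  # a page in the root set links to dst
    return expanded


# Hypothetical link graph: r1 and r2 are the root-set pages.
links = [("a", "r1"), ("r1", "b"), ("r2", "c"), ("x", "y")]
expanded = expand_root_set({"r1", "r2"}, links)
print(sorted(expanded))
```

Pages x and y have no direct link relationship with the root set, so they stay outside the expanded set.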
Figure 6-13 Root set and expanded set
Within the expanded webpage set, we do not know in advance which pages are good "Hub" pages or good "authority" pages; every page is a potential candidate. Therefore two weights are kept for each page, one indicating how good a hub it is and one indicating how good an authority it is. Initially, before any further information is available, the two weights of every page are the same and can be set to 1.
Then the two basic assumptions and the mutual enhancement principle are applied over multiple rounds of iterative computation. Each round updates the two weights of every page, and the process continues until the weights stabilize and no longer change significantly.
Figure 6-14 shows how the hub and authority weights of a page are updated during the iterative computation. Let a(i) denote the authority weight of page i and h(i) its hub weight. In the example in Figure 6-14, three pages in the expanded webpage set link to page 1, and page 1 links to three other pages. In this round of iteration, the authority weight of page 1 is the sum of the hub weights of all pages pointing to page 1. Similarly, the hub weight of page 1 is the sum of the authority weights of the pages it points to.
Figure 6-14 Calculation of hub and authority weights
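The update of page 1 can be written out as a short Python sketch. The page numbers 2 to 7 are hypothetical stand-ins for the neighbors of page 1 (the figure's own labels are not reproduced here), and all weights start at the initial value of 1 described above.

```python
# Hypothetical neighbors of page 1: pages 2, 3, 4 link to it,
# and it links to pages 5, 6, 7.
in_links = {1: [2, 3, 4]}
out_links = {1: [5, 6, 7]}

# Initial weights: every page starts with hub = authority = 1.
hub = {page: 1.0 for page in range(1, 8)}
auth = {page: 1.0 for page in range(1, 8)}

# a(1) = sum of h(q) over all pages q that point to page 1
new_auth_1 = sum(hub[q] for q in in_links[1])
# h(1) = sum of a(q) over all pages q that page 1 points to
new_hub_1 = sum(auth[q] for q in out_links[1])

print(new_auth_1, new_hub_1)  # 3.0 3.0
```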
The other pages in the expanded webpage set update their two weights in the same way. Once the weights of every page have been updated, one round of iteration is complete. At that point the HITS algorithm compares the weights from the previous round with those from the current round. If the weights have not changed significantly overall, the system has reached a stable state and the computation can stop. The pages are then sorted from high to low by authority weight, and the several pages with the highest weights are returned as the search results for the user query. If the weights of the two rounds still differ significantly, the next round of iteration begins, and this continues until the weights of the whole system are stable.
6.4.4 Problems with the HITS algorithm
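The whole iteration can be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function name hits and the parameters max_iter and tol are invented for the example, and the weights are rescaled to unit length after every round. The text above does not mention rescaling; it is added here because without it the raw sums keep growing and the weights never settle. The hub update uses the authority weights just computed in the same round, which is one common ordering.

```python
import math


def hits(links, max_iter=100, tol=1e-8):
    """Iterate hub/authority updates over (source, target) link pairs."""
    pages = sorted({p for edge in links for p in edge})
    auth = {p: 1.0 for p in pages}  # initial weights are all 1
    hub = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        # Authority weight: sum of hub weights of pages pointing to the page.
        new_auth = {p: 0.0 for p in pages}
        for src, dst in links:
            new_auth[dst] += hub[src]
        # Hub weight: sum of authority weights of the pages it points to.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in links:
            new_hub[src] += new_auth[dst]
        # Assumption of this sketch: rescale each weight vector to length 1.
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
        # Total change between the previous round and this round.
        change = sum(abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p])
                     for p in pages)
        auth, hub = new_auth, new_hub
        if change < tol:  # weights are stable, stop iterating
            break
    return auth, hub


# Hypothetical expanded set: three hub-like pages and two authority-like pages.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
auth, hub = hits(links)
ranked = sorted(auth, key=auth.get, reverse=True)
print(ranked[:2])
```

In this toy graph a1 is pointed to by all three hubs and a2 by two of them, so a1 ranks first by authority weight; h1 and h2 point to both authorities and therefore end with a higher hub weight than h3.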
HITS is an effective algorithm. It is used not only in search engines but has also been borrowed by many other areas of computing, such as natural language processing and social analysis, with good results. Even so, the original version of the HITS algorithm has some problems, and many later link analysis methods based on HITS were proposed precisely to address them.
In summary, the HITS algorithm has the following shortcomings:
1. Low computing efficiency
Because HITS is a query-dependent algorithm, it must be computed in real time after the user's query is received, and the algorithm itself needs many rounds of iteration to reach a final result. This leads to low computing efficiency, which must be carefully considered in practical applications.
2. Topic drift
If the expanded webpage set contains some pages unrelated to the query topic, and these pages are densely linked to one another, the HITS algorithm is likely to give these irrelevant pages a high ranking, causing the search results to drift off topic. This phenomenon is called the "tightly-knit community effect".
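A toy experiment makes the effect visible. In the sketch below (an illustrative construction with hypothetical page names, using a compact HITS loop that rescales the weights each round, which is an assumption of the sketch), the on-topic page topic_auth has five in-links, more than any other page. A small off-topic group in which three pages each link to the same three targets still takes over the authority ranking, because its dense links reinforce each other in every round.

```python
import math


def hits_authority(links, iterations=50):
    """Compact HITS loop returning authority weights (sketch only)."""
    pages = {p for edge in links for p in edge}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        auth = dict.fromkeys(pages, 0.0)
        for src, dst in links:
            auth[dst] += hub[src]
        hub = dict.fromkeys(pages, 0.0)
        for src, dst in links:
            hub[src] += auth[dst]
        for scores in (auth, hub):  # rescale to unit length each round
            norm = math.sqrt(sum(v * v for v in scores.values()))
            for p in scores:
                scores[p] /= norm
    return auth


# On-topic part: five pages link to a single relevant page.
on_topic = [(f"topic_hub{i}", "topic_auth") for i in range(5)]
# Off-topic tightly-knit community: 3 pages all link to the same 3 pages.
community = [(f"clique_hub{i}", f"clique_auth{j}")
             for i in range(3) for j in range(3)]

auth = hits_authority(on_topic + community)
top = max(auth, key=auth.get)
print(top, auth["topic_auth"])
```

Although topic_auth has the most in-links, its authority weight shrinks toward zero while the community pages share the top of the ranking.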
3. Results are easily manipulated by cheaters
HITS is easily manipulated by cheaters. For example, a cheater can create a webpage whose content consists of many links to high-quality or well-known websites, which makes it a good hub page. The cheater then adds a link from this page to the cheating page, which raises the authority score of the cheating page.
4. Unstable structure
Structural instability means that if individual pages are added to or removed from the original expanded webpage set, or a few link relationships are changed, the ranking produced by the HITS algorithm can change substantially.
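The instability can be reproduced on a toy graph. The sketch below uses hypothetical page names and a compact HITS loop that rescales the weights each round (an assumption of the sketch). Before the change, pages b1 and b2 lead the authority ranking and page a scores close to zero; after adding just two pages that each link to a, the order is reversed.

```python
import math


def hits_authority(links, iterations=100):
    """Compact HITS loop returning authority weights (sketch only)."""
    pages = {p for edge in links for p in edge}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        auth = dict.fromkeys(pages, 0.0)
        for src, dst in links:
            auth[dst] += hub[src]
        hub = dict.fromkeys(pages, 0.0)
        for src, dst in links:
            hub[src] += auth[dst]
        for scores in (auth, hub):  # rescale to unit length each round
            norm = math.sqrt(sum(v * v for v in scores.values()))
            for p in scores:
                scores[p] /= norm
    return auth


before = [("h1", "b1"), ("h1", "b2"), ("h2", "b1"), ("h2", "b2"),
          ("g1", "a"), ("g2", "a"), ("g3", "a")]
# Small change: two new pages, each with a single link to page a.
after = before + [("g4", "a"), ("g5", "a")]

auth_before = hits_authority(before)
auth_after = hits_authority(after)
print(max(auth_before, key=auth_before.get), max(auth_after, key=auth_after.get))
```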
6.4.5 Comparison between the HITS algorithm and the PageRank algorithm
The HITS algorithm and the PageRank algorithm are the two most basic and most important link analysis algorithms for search engines. From the introductions above, we can see that they differ considerably in their conceptual models, computational approach, and implementation details. The differences are described one by one below.
1. The HITS algorithm is closely tied to the query entered by the user, while PageRank is independent of the query. HITS can therefore be used on its own as a similarity evaluation criterion, whereas PageRank must be combined with content similarity calculation before it can be used to evaluate webpage relevance;
2. Because the HITS algorithm is closely tied to user queries, it must be computed in real time after a query is received, resulting in low computing efficiency. PageRank can be computed offline once crawling is complete, and the results are used directly online, resulting in high computing efficiency;
3. The HITS algorithm works on a small number of objects: only the links between pages in the expanded webpage set need to be processed. PageRank is a global algorithm that processes all page nodes on the Internet;
4. Considering both computing efficiency and the size of the set being processed, PageRank is better suited to server-side deployment, and the HITS algorithm is better suited to client-side deployment;
5. The HITS algorithm suffers from topic generalization, so it is better suited to specific user queries, while PageRank has the advantage when handling broad user queries;
6. The HITS algorithm computes two scores for each page, while PageRank computes only one. In the search engine field the authority score computed by HITS receives more attention, but in many other fields that use HITS the hub score also plays an important role;
7. From the perspective of resisting link cheating, the PageRank mechanism is better than that of HITS, which is more vulnerable to link cheating;
8. The structure of the HITS algorithm is unstable: a small change to the link relationships in the expanded webpage set can greatly affect the final ranking. PageRank is more stable than HITS, and the root cause is the "remote jump" (random jump) in the PageRank computation.
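To make the "remote jump" in point 8 concrete, here is a minimal PageRank sketch. It is an illustration under stated assumptions, not a production implementation: the function name pagerank, the damping value 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Query-independent PageRank over (source, target) link pairs."""
    pages = sorted({p for edge in links for p in edge})
    n = len(pages)
    out_links = {p: [] for p in pages}
    for src, dst in links:
        out_links[src].append(dst)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Remote jump: every page receives a fixed share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = out_links[p] or pages  # no out-links: spread evenly
            share = damping * rank[p] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical link graph.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
rank = pagerank(links)
print(rank)
```

Because of the jump term, every page keeps a score of at least (1 - damping) / n, so no group of pages can absorb all of the weight. PageRank also yields a single score per page that is computed once for the whole graph, independent of any query.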