Research Topic 2 of Search Engine Algorithms: Analysis of the HITS Algorithm and Derived Algorithms

Source: Internet
Author: User

A link-analysis algorithm measures the importance of a web page from its inbound links (hyperlinks pointing to the page) and its outbound links (hyperlinks from the page to other pages). The intuition behind HITS is mutual reinforcement: if a page is a valuable pointer, the pages it points to are likely to be important, and if an important page is pointed to by another page, that pointing page is itself a valuable pointer. How well a page points to other pages is measured by its hub value, and how strongly it is pointed to is measured by its authority value.

The HITS algorithm usually works within a limited scope, typically the pages related to one topic. For example, if a web page about program development points to another web page about program development, the target page is likely to be important, whereas a link from that page to a shopping page does not necessarily say much about the shopping page.

Once the scope is limited, a matrix is built from the outbound and inbound links of the web pages in it. The authority and hub vectors are then updated repeatedly through matrix iteration until they converge, that is, until the change between iterations falls below a predefined threshold. The HITS algorithm can also be extended to other similar ranking systems.
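The iteration described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names `hits` and `normalize`, the dictionary-based graph format, the L2 normalization, and the default threshold `tol` are all assumptions made for this example.

```python
import math

def normalize(scores):
    """Scale a score dict to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(links, max_iter=100, tol=1e-8):
    """links maps each page to the list of pages it points to.

    Returns (authority, hub) dicts. Iteration stops when the largest
    change in any score falls below tol, or after max_iter rounds.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        # Authority update: sum the hub values of all pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                new_auth[q] += hub[p]
        # Hub update: sum the authority values of all pages linked to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        new_auth, new_hub = normalize(new_auth), normalize(new_hub)
        delta = max(
            max(abs(new_auth[p] - auth[p]), abs(new_hub[p] - hub[p]))
            for p in pages
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break
    return auth, hub

# Hypothetical four-page graph: a and b are hubs, c and d are authorities.
example_links = {"a": ["c"], "b": ["c", "d"], "c": [], "d": []}
example_auth, example_hub = hits(example_links)
```

In this toy graph, page c is pointed to by both hubs and ends up with a higher authority value than d, while b, which points to both authorities, ends up with a higher hub value than a.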

HITS variants

Most of the problems encountered by the HITS algorithm arise because HITS is a purely link-based analysis algorithm that does not consider text content. After Kleinberg proposed the HITS algorithm, many researchers improved on it and proposed variants, mainly including:

Improvement on HITS by Monika R. Henzinger and Krishna Bharat

In [7], Monika R. Henzinger and Krishna Bharat improved HITS with respect to the second of the problems encountered by HITS mentioned above. Suppose K web pages on host A point to a document D on host B. The K documents on host A then contribute a total of 1 to the authority of D: each document contributes 1/K instead of 1 as in the original HITS. The hub value is treated similarly. Suppose a document T on host A points to M documents on host B. The M documents on host B then contribute a total of 1 to the hub value of T, so each document contributes 1/M.
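The 1/K and 1/M weights can be computed per link as in the following sketch. The helper names `host_of` and `host_normalized_weights`, the use of the URL's network location as the "host", and the example URLs are assumptions for illustration only; [7] may define hosts differently.

```python
from collections import defaultdict
from urllib.parse import urlparse

def host_of(url):
    """Assumption: the host of a page is the network location of its URL."""
    return urlparse(url).netloc

def host_normalized_weights(edges):
    """edges is a list of (source_url, target_url) links.

    Returns two dicts keyed by edge:
      auth_w[(s, d)] = 1/K, K = pages on s's host that point to d
      hub_w[(s, d)]  = 1/M, M = pages on d's host that s points to
    """
    sources_per_host = defaultdict(set)  # (source host, target) -> sources
    targets_per_host = defaultdict(set)  # (source, target host) -> targets
    for src, dst in edges:
        sources_per_host[(host_of(src), dst)].add(src)
        targets_per_host[(src, host_of(dst))].add(dst)
    auth_w = {
        (s, d): 1.0 / len(sources_per_host[(host_of(s), d)]) for s, d in edges
    }
    hub_w = {
        (s, d): 1.0 / len(targets_per_host[(s, host_of(d))]) for s, d in edges
    }
    return auth_w, hub_w

# Hypothetical links: three pages on one host point to the same document.
example_edges = [
    ("http://a.example/p1", "http://b.example/d"),
    ("http://a.example/p2", "http://b.example/d"),
    ("http://a.example/p3", "http://b.example/d"),
    ("http://a.example/p1", "http://b.example/e"),
]
example_auth_w, example_hub_w = host_normalized_weights(example_edges)
```

In the authority update each link then contributes its `auth_w` value times the source's hub value, and in the hub update each link contributes its `hub_w` value times the target's authority value, so a single host can no longer dominate the score of a document by linking to it many times.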

ARC algorithm

The CLEVER project group at the IBM Almaden Research Center proposed the ARC (Automatic Resource Compilation) algorithm to improve the original HITS. When the initial values of the link matrix for the web page set are assigned, the text around each link is used so that different links receive different weights.

The differences between the ARC algorithm and HITS are as follows:

1. When the root set S is expanded to T, HITS only adds web pages at link distance 1 from the root set, that is, the pages directly adjacent to S. In ARC, the expansion distance is increased to 2, and the expanded web page set is called the augmented set.

2. In the HITS algorithm, the matrix entry for every link is set to 1. In practice, links differ in importance. The ARC algorithm uses the text around a link to determine its importance. Consider a link p -> q. In page p the link appears as: text 1, anchor text, text 2. If the query term t occurs n(t) times in text 1, the anchor text, and text 2 together, then w(p, q) = 1 + n(t). In the experiments of [10], the length of text 1 and of text 2 is set to 50 bytes. Construct the matrix W: if there is a link i -> j, then W[i, j] = w(i, j), otherwise W[i, j] = 0. Set the hub vector h to all ones, let Z be the transpose of W, and perform the following three operations iteratively:

(1) a = Z h (2) h = W a (3) Normalize a and h

3. The goal of the ARC algorithm is to find the 15 most important web pages, so only the relative order of the top 15 values of a and h has to remain stable; full convergence of a and h is not required. The number of iterations in step 2 is therefore small, and [10] reports that 5 iterations are enough. As a result, the ARC algorithm is computationally efficient, and its overhead lies mainly in expanding the root set.
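The link weighting and the short iteration from the list above can be sketched as follows. The function names `arc_weight` and `arc_iterate` are hypothetical, the window is measured in characters rather than bytes, terms are matched as whole lowercase words, and L2 normalization is used; these are all simplifying assumptions, not details taken from [10].

```python
import math
import re

def arc_weight(text_before, anchor, text_after, query_terms, window=50):
    """w(p, q) = 1 + n(t): count query terms near and inside the anchor.

    Assumption: the window is `window` characters on each side, and a
    term counts once per whole-word, case-insensitive match.
    """
    before = text_before[-window:] if window > 0 else ""
    context = " ".join([before, anchor, text_after[:window]]).lower()
    words = re.findall(r"\w+", context)
    return 1 + sum(words.count(term.lower()) for term in query_terms)

def arc_iterate(W, iterations=5):
    """W[i][j] = w(i, j) if page i links to page j, else 0.

    Runs a fixed, small number of rounds instead of iterating to
    full convergence.
    """
    n = len(W)
    h = [1.0] * n
    a = [0.0] * n
    for _ in range(iterations):
        # a = Z h, with Z the transpose of W: sum over pages i linking to j.
        a = [sum(W[i][j] * h[i] for i in range(n)) for j in range(n)]
        # h = W a: sum over pages j that page i links to.
        h = [sum(W[i][j] * a[j] for j in range(n)) for i in range(n)]
        a_norm = math.sqrt(sum(x * x for x in a)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in h)) or 1.0
        a = [x / a_norm for x in a]
        h = [x / h_norm for x in h]
    return a, h

# Hypothetical link whose surrounding text mentions the query term.
example_w = arc_weight(
    "best python tutorials at", "Python docs", "for python beginners", ["python"]
)
```

Because only the ranking of the top pages matters, a caller would typically sort the returned `a` and `h` lists and keep the highest entries rather than inspect the raw values.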

Hub-Averaging-Kleinberg algorithm

In [11], Allan Borodin pointed out the following phenomenon. Suppose there are m + 1 hub pages and m + 1 authority pages. Each of the first m hubs points only to the first authority, and the (m + 1)-th hub points to all m + 1 authorities. According to the HITS algorithm, the first authority is clearly the most important and has the highest authority value, which is what we want. However, HITS also gives the (m + 1)-th hub the highest hub value. In fact, the (m + 1)-th hub points not only to the first authority, which has a high authority value, but also to other pages with low authority values, so its hub value should not be higher than that of the first m hubs. Allan Borodin therefore modifies the O operation of HITS:

O operation: h(v) = (1/n) * (sum of a(u) over all links (v, u)), where n is the number of links (v, u) leaving v

After this adjustment, a web page that points only to pages with high authority values gets a higher hub value than a web page that points to pages with both high and low authority values. This algorithm is called the Hub-Averaging-Kleinberg algorithm.
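The effect of averaging can be checked on the m + 1 example described above. The sketch below is an illustration under stated assumptions: the names `run_hits` and `borodin_example`, the page labels h1, a1 and so on, the L2 normalization, and the fixed iteration count are all hypothetical choices made for this example.

```python
import math

def _normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def run_hits(links, hub_averaging=False, iterations=20):
    """Plain HITS, or the hub-averaging variant when hub_averaging=True."""
    pages = sorted(set(links) | {u for outs in links.values() for u in outs})
    hub = {p: 1.0 for p in pages}
    auth = {p: 0.0 for p in pages}
    for _ in range(iterations):
        # I operation: authority = sum of hub values of in-linking pages.
        auth = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                auth[u] += hub[v]
        auth = _normalize(auth)
        # O operation: sum (plain HITS) or average (hub averaging).
        hub = {}
        for p in pages:
            outs = links.get(p, [])
            total = sum(auth[u] for u in outs)
            hub[p] = total / len(outs) if (hub_averaging and outs) else total
        hub = _normalize(hub)
    return auth, hub

def borodin_example(m):
    """First m hubs point only to a1; hub m+1 points to a1..a(m+1)."""
    links = {f"h{i}": ["a1"] for i in range(1, m + 1)}
    links[f"h{m + 1}"] = [f"a{j}" for j in range(1, m + 2)]
    return links

example = borodin_example(3)
_, plain_hub = run_hits(example)
_, averaged_hub = run_hits(example, hub_averaging=True)
```

With m = 3, plain HITS ranks h4 above h1 as a hub, while the averaging variant ranks h4 below h1, which matches the behaviour the modification is meant to produce.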

Threshold (Threshold-Kleinberg) algorithms

In [11], Allan Borodin also proposed three threshold-based algorithms: the hub threshold algorithm, the authority threshold algorithm, and the full threshold algorithm, which combines the two.

When computing the authority of a web page p, the hub threshold algorithm does not count the hub values of all pages pointing to p. It counts only the contributions of pages whose hub value exceeds the average value.

The authority threshold algorithm is similar. When computing the hub value of p, it does not count the authority of every page that p points to. It counts only the contributions of the top K authority pages. This rests on the premise that the goal of the algorithm is to find the K most important authority pages.

The algorithm that applies both the hub threshold and the authority threshold is the full threshold algorithm.
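A single update step of the two thresholds might look like the sketch below. The text leaves some details open, so the code makes labeled assumptions: the average is taken over the hubs that point to p, a hub counts if its value is at least that average (so that equal hubs are not all discarded), and "top K" means the K pages with the largest current authority values overall. The function names are hypothetical.

```python
def hub_threshold_authority(p, in_links, hub):
    """Authority of p using only sufficiently strong hubs.

    Assumption: the average is over the hubs pointing to p, and a hub
    contributes when its value is at least that average.
    """
    sources = in_links.get(p, [])
    if not sources:
        return 0.0
    average = sum(hub[v] for v in sources) / len(sources)
    return sum(hub[v] for v in sources if hub[v] >= average)

def authority_threshold_hub(p, out_links, auth, k):
    """Hub value of p using only the current top-k authorities.

    Assumption: top k is taken over all pages in auth, not only over
    the pages that p points to.
    """
    top_k = set(sorted(auth, key=auth.get, reverse=True)[:k])
    return sum(auth[u] for u in out_links.get(p, []) if u in top_k)

# Hypothetical scores from a previous iteration.
example_hub = {"x": 0.8, "y": 0.4, "z": 0.3}
example_auth = {"u1": 0.7, "u2": 0.2, "u3": 0.6}
example_authority_of_p = hub_threshold_authority(
    "p", {"p": ["x", "y", "z"]}, example_hub
)
example_hub_of_p = authority_threshold_hub(
    "p", {"p": ["u1", "u2", "u3"]}, example_auth, k=2
)
```

The full threshold algorithm would use `hub_threshold_authority` for every authority update and `authority_threshold_hub` for every hub update within the same iteration, followed by the usual normalization.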
