Link analysis algorithm: Hilltop algorithm
The Hilltop algorithm was developed by Krishna Bharat in 2000 and patented in 2001. Many people assume that Hilltop was developed at Google; in fact, Krishna Bharat joined Google as a core engineer only later and then licensed the algorithm to Google.
Google recognized that, alongside the PageRank algorithm, Hilltop could bring a very important capability to its search rankings. Hilltop is now integrated with the older PageRank (PR) algorithm, and the two work together. By most observations, the algorithm has advanced considerably since it was first designed in 2000. It is also widely regarded as one of the most important algorithms involved in the "Florida" update of November 16, 2003.
1. Basic idea of Hilltop algorithm
Hilltop combines the basic ideas of two algorithms, HITS and PageRank:
On the one hand, Hilltop is a link analysis algorithm that depends on the user's query. It borrows from HITS the idea of selecting a high-quality, query-related subset of pages: links between topically related pages count for more in the weight calculation than links between unrelated pages. Hilltop therefore conforms to the subset propagation model and is a concrete instance of it.
On the other hand, when propagating weights, Hilltop also adopts the guiding idea of PageRank: the ranking weight of a search result is determined by the quantity and quality of the page's inbound links.
2. Some basic definitions of the hilltop algorithm
Non-subordinate Organization page:
The non-subordinate organization page (non-affiliated page) is a very important definition in the Hilltop algorithm. To understand it, first consider what a "subordinate organization website" is: two websites are subordinate organization websites if they belong to the same organization or their owners are closely related. Specifically, two sites that satisfy either of the following rules are considered subordinate sites:
Condition 1: The first three octets of the hosts' IP addresses are the same. For example, two sites with the IP addresses 159.226.138.127 and 159.226.138.234 are considered subordinate sites.
Condition 2: The primary domain name within the sites' domain names is the same. For example, www.ibm.com and www.ibm.com.cn are considered subordinate organization websites.
Two pages are "non-subordinate organization pages" if they do not belong to subordinate sites. Figure 6-22 illustrates the relationship: page 2 and page 3 both belong to IBM's websites, so they are "subordinate organization pages", while page 1 and page 5, and page 3 and page 6, are "non-subordinate organization pages". Note that "non-subordinate organization page" describes a relationship between pages; it is meaningless to call a single page subordinate or non-subordinate.
Figure 6-22 "subordinate organization pages" and "non-subordinate organization pages"
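The two conditions for subordinate sites can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Site` record, the `GENERIC_TOKENS` list, and the rule "the primary name is the rightmost non-generic hostname token" are hypothetical choices made here, and the IP addresses in the usage lines are made up.

```python
from dataclasses import dataclass

# Assumption: a tiny illustrative list of generic hostname tokens; a real
# system would need a far more complete list.
GENERIC_TOKENS = {"www", "com", "org", "net", "edu", "gov", "co", "cn", "uk"}


@dataclass(frozen=True)
class Site:
    host: str  # e.g. "www.ibm.com"
    ip: str    # dotted IPv4 address


def primary_name(host: str) -> str:
    """Return the rightmost hostname token that is not a generic token."""
    for token in reversed(host.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return host.lower()


def same_subnet(ip_a: str, ip_b: str) -> bool:
    """Condition 1: the first three octets of the IP addresses match."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]


def are_subordinate(a: Site, b: Site) -> bool:
    """Two sites are subordinate if either condition holds."""
    return same_subnet(a.ip, b.ip) or primary_name(a.host) == primary_name(b.host)


# Usage with made-up IP addresses: condition 2 applies (same primary name "ibm").
ibm_us = Site("www.ibm.com", "192.0.2.10")
ibm_cn = Site("www.ibm.com.cn", "198.51.100.20")
print(are_subordinate(ibm_us, ibm_cn))  # True
```

Two pages are then "non-subordinate organization pages" exactly when `are_subordinate` returns False for their sites.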
Expert page:
The "expert page" (Expert Sources) is another important definition in the Hilltop algorithm. An "expert page" is a high-quality page about a particular topic that meets the following requirements: the pages it links to are mutually "non-subordinate organization pages", and most of the pages it points to are topically similar to the expert page itself.
Target Page Collection:
The Hilltop algorithm divides the pages of the Internet into two subsets. The more important one is the subset made up of expert pages; all remaining pages form the other subset, which is called the target page set (Target Web Servers).
3. Hilltop algorithm
Figure 6-23 illustrates the overall flow of the hilltop algorithm.
1) Build the expert page index: first, the "expert pages" subset is filtered out of the huge number of Internet pages by a set of rules, and this subset is indexed separately.
2) User query: when Hilltop receives a query from a user:
First, based on the topic of the query, it finds the most relevant "expert pages" in the "expert pages" subset and calculates a relevance score for each of them.
Then it ranks the target pages according to the links between the target pages and these expert pages. The basic idea follows the link quantity and link quality assumptions of PageRank: the scores of the expert pages are passed to the target pages through the links, and the score a target page receives is used as its relevance ranking score for the query.
Finally, the system combines the relevant expert pages and the high-scoring target pages and returns them to the user as search results.
Figure 6-23 Hilltop Algorithm flow
If Hilltop cannot find a large enough set of expert pages in the process above, it returns an empty result. Hilltop thus emphasizes the precision and accuracy of search results and does not consider whether there are enough results, or whether most user queries have results at all, so many queries return nothing. This means that Hilltop can be combined with another ranking algorithm to improve ranking accuracy, but it is not intended to be used as a standalone web ranking algorithm.
4. Hilltop algorithm Flow
As can be seen from the overall process described above, the hilltop algorithm mainly consists of two steps: Expert page search and target page sorting.
Step One: Expert page search
The Hilltop algorithm selects about 2.5 million pages from 140 million web pages as the "expert pages" collection. The selection criteria are relatively loose: a page that meets both of the following conditions can enter the "expert pages" collection:
Condition 1: The page contains at least k outgoing links, where k is a manually specified value;
Condition 2: The k linked pages are all mutually "non-subordinate organization pages";
On this basis, stricter screening criteria can of course be set, for example requiring that most of the pages an "expert page" links to cover a topic that is the same as, or close to, the topic of the expert page itself.
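The two selection conditions can be sketched as a simple filter. The sketch below is one possible reading, not the original implementation: it assumes that the subordinate relationship is treated as transitive, so counting one representative per group of related hosts gives the number of mutually non-subordinate targets. The names `is_expert_page` and `same_second_level_token` are hypothetical, and the toy predicate stands in for a real subordinate-site test.

```python
from typing import Callable, List


def count_independent_targets(hosts: List[str],
                              subordinate: Callable[[str, str], bool]) -> int:
    """Count targets that are mutually non-subordinate (one per related group)."""
    representatives: List[str] = []
    for host in hosts:
        if not any(subordinate(host, rep) for rep in representatives):
            representatives.append(host)
    return len(representatives)


def is_expert_page(out_link_hosts: List[str], k: int,
                   subordinate: Callable[[str, str], bool]) -> bool:
    """Condition 1: at least k out-links. Condition 2: k of them are mutually
    non-subordinate."""
    return (len(out_link_hosts) >= k
            and count_independent_targets(out_link_hosts, subordinate) >= k)


def same_second_level_token(a: str, b: str) -> bool:
    """Toy stand-in for a subordinate-site test (assumption for this demo)."""
    return a.split(".")[-2] == b.split(".")[-2]


links = ["www.ibm.com", "research.ibm.com", "www.china.org", "www.obama.org"]
print(is_expert_page(links, 3, same_second_level_token))  # True
print(is_expert_page(links, 4, same_second_level_token))  # False
```

With k = 4 the page fails because the two ibm.com hosts count as a single independent target.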
Once the "expert pages" have been selected by the criteria above, they can be indexed separately. The index covers only the "key fragments" (key phrases) of each page. In the Hilltop algorithm, "key fragments" comprise three kinds of information in a web page: the page title, H1 tag text, and URL anchor text.
A "key fragment" of a web page can dominate (qualify) all the links within a certain region. The "dominance" relationship expresses a scope: different "key fragments" dominate links in different regions. Specifically:
The page title dominates all the links that appear in the page,
The H1 tag dominates all links within <H1> and </H1>,
URL anchor text dominates only its own link.
Figure 6-24 gives an example of how "key fragments" dominate links. In the page titled "Obama visits China", the title dominates all the links appearing on the page, the H1 tag dominates only the 2 links that appear within its scope, and the anchor text "Chinese leaders" dominates only its own link. This dominance relationship is used in the second stage, when the "expert page" score is passed to the "target pages".
Figure 6-24 "key fragment" link dominance relationship
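The dominance relationship can be represented as a small data structure in which every "key fragment" carries the list of links inside its scope. The following is a minimal sketch; the `KeyFragment` record, the helper names, and the page contents in the usage lines (heading text, URLs u1 to u3) are hypothetical and do not reproduce the figure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KeyFragment:
    kind: str         # "title", "h1", or "anchor"
    text: str         # the fragment's words
    links: List[str]  # URLs this fragment dominates


def build_fragments(title: str,
                    h1_sections: List[Tuple[str, List[str]]],
                    anchors: List[Tuple[str, str]]) -> List[KeyFragment]:
    """title dominates every link, each H1 its own links, each anchor one link."""
    all_links = [url for _, url in anchors]
    fragments = [KeyFragment("title", title, all_links)]
    fragments += [KeyFragment("h1", text, list(urls)) for text, urls in h1_sections]
    fragments += [KeyFragment("anchor", text, [url]) for text, url in anchors]
    return fragments


def dominating(fragments: List[KeyFragment], url: str) -> List[KeyFragment]:
    """Return the key fragments whose scope includes the given link."""
    return [f for f in fragments if url in f.links]


# Hypothetical page: three links, the heading covers the first two.
page = build_fragments(
    title="Page title",
    h1_sections=[("Section heading", ["u1", "u2"])],
    anchors=[("first", "u1"), ("second", "u2"), ("third", "u3")],
)
print([f.kind for f in dominating(page, "u1")])  # ['title', 'h1', 'anchor']
print([f.kind for f in dominating(page, "u3")])  # ['title', 'anchor']
```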
Suppose the system receives a user query Q that contains several words. How does Hilltop score an "expert page"? The score is based mainly on the following three types of information:
1) The number of query words contained in a "key fragment": the more query words it contains, the higher the score; a "key fragment" that contains no query words receives no score;
2) The type of the "key fragment": the page title has the highest weight, the H1 tag comes second, and the link anchor text comes last;
3) The mismatch rate between the user query and the "key fragment", that is, the number of words in the "key fragment" that are not query words divided by the total number of words in the "key fragment". The smaller this ratio the better; the larger it is, the more the score is attenuated;
Hilltop combines these three factors into a scoring function that measures how relevant each "expert page" is to the user query, and selects the "expert pages" whose relevance scores are high enough for the next step, the relevance calculation for the "target pages".
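The text does not give the exact scoring function, so the sketch below only shows one plausible way to combine the three factors. Everything specific in it is an assumption: the weights in `TYPE_WEIGHT`, the multiplicative form (type weight × matched query words × (1 − mismatch rate)), and summing over fragments to get the page score.

```python
from typing import Iterable, List, Tuple

# Assumption: illustrative weights only; the text says just title > H1 > anchor.
TYPE_WEIGHT = {"title": 3.0, "h1": 2.0, "anchor": 1.0}


def fragment_score(kind: str, text: str, query_words: Iterable[str]) -> float:
    """Score one key fragment against the query (hypothetical formula)."""
    words = text.lower().split()
    query = {w.lower() for w in query_words}
    matched = len(query & set(words))      # factor 1: query words contained
    if matched == 0 or not words:
        return 0.0                         # no query word: no score
    # factor 3: share of fragment words that are not query words
    mismatch_rate = sum(1 for w in words if w not in query) / len(words)
    # factor 2 enters through the type weight
    return TYPE_WEIGHT[kind] * matched * (1.0 - mismatch_rate)


def expert_score(fragments: List[Tuple[str, str]],
                 query_words: Iterable[str]) -> float:
    """Sum the fragment scores of one expert page (assumed aggregation)."""
    return sum(fragment_score(kind, text, query_words) for kind, text in fragments)


fragments = [("title", "Obama Visits China"), ("anchor", "Obama"), ("anchor", "China")]
print(expert_score(fragments, ["Obama"]))
```

In this example the title matches one query word but two of its three words are off-query, so its contribution is attenuated; the anchor "Obama" matches fully, and the anchor "China" contributes nothing.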
Step Two: target page sorting
The Hilltop algorithm rests on a basic hypothesis: a "target page" is a high-quality search result for a user's query if and only if high-quality "expert pages" link to it. This hypothesis does not always hold, however; for example, the "target page" an "expert page" links to may not be closely related to the user's query. The Hilltop algorithm therefore has to screen the "expert pages" carefully at this stage to ensure that the target pages selected are closely related to the query.
At this stage, Hilltop passes the scores of the "expert pages" to the "target pages" along the links between them. Before the scores are passed, the links have to be filtered: a "target page" can receive scores from "expert pages" only if the following two conditions are met:
Condition 1: At least two "expert pages" link to the "target page", and these expert pages must not be "subordinate organization pages" of each other, i.e. they must not come from the same site or related sites. If they are "subordinate organization pages", only one link is kept and the link with the lower weight is discarded;
Condition 2: The "expert page" and the "target page" themselves must not be "subordinate organization pages" of each other;
Step one has produced, for the given user query, the relevant "expert pages" and their relevance scores. How is the relevance score of a "target page" computed from them? Condition 1 above says that a "target page" that receives scores must be linked to by several "expert pages", so the total score of a "target page" is the sum of the scores passed to it by each "expert page" that links to it. The score that one "expert page" passes to the "target page" is calculated as follows:
A. Find the set S of "key fragments" in the "expert page" that dominate the target page;
B. Count the number T of "key fragments" in S that contain the user's query words; the larger T is, the larger the weight transferred;
C. The score that the "expert page" passes to the "target page" is E*T, where E is the relevance score of the expert page computed in the first phase and T is the count computed in step B.
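Steps A to C can be sketched as a single function. This is a minimal illustration, assuming that a "key fragment" counts toward T when it contains at least one query word as a whole word; the function name and the fragment list in the usage lines are hypothetical.

```python
from typing import Iterable, List


def transferred_score(expert_score: float,
                      dominating_fragments: List[str],
                      query_words: Iterable[str]) -> float:
    """Return E * T for one expert page and one target page.

    expert_score         -- E, the expert page's relevance score from step one
    dominating_fragments -- S, texts of the key fragments dominating the target
    query_words          -- the words of the user query
    """
    query = {w.lower() for w in query_words}
    # T: number of fragments in S that contain at least one query word
    t = sum(1 for text in dominating_fragments
            if query & set(text.lower().split()))
    return expert_score * t


# Hypothetical fragment set S for one target page.
s = ["Obama Visits China", "China", "Chinese leader"]
print(transferred_score(2.0, s, ["China"]))  # 4.0: two fragments contain "china"
print(transferred_score(2.0, s, ["Obama"]))  # 2.0: only the title contains "obama"
```

Note that whole-word matching means "Chinese leader" does not count for the query "China"; a real system might apply stemming, which is outside this sketch.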
We illustrate this with the concrete example in Figure 6-25. Suppose the "expert pages" collection contains a page P titled "Obama Visits China", which also contains one <H1> tag text and a separate link anchor text. The page contains three outgoing links: two point to the page www.china.org in the target page collection, and the third points to the page www.obama.org. The anchor texts of the three links are "Obama", "China" and "Chinese leader" respectively.
Figure 6-25 Hilltop algorithm score transfer
As can be seen from the link relationships on page P, the set of "key fragments" that can dominate the target page www.china.org includes: {Chinese leader, China,
Next we analyze how the "expert page" P passes its score to the "target pages" it links to when a query arrives. Suppose the system receives the query "Obama". It first finds the "expert pages" and scores them as described in the previous section; page P is one of these "expert pages" and obtains a corresponding relevance score E. We focus here on the score propagation step.
For the query "Obama", the set of "key fragments" in page P that contain this query word is: {Obama,
For user queries that contain multiple query words, each query word is evaluated separately as above, and the scores passed for the individual query words are summed.
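The two filtering conditions of Step Two and the summation over expert pages can be sketched together. The code is an illustration under assumptions: the subordinate relationship is treated as transitive, a target page that fails condition 1 simply scores 0, and the function names and host names are hypothetical.

```python
from typing import Callable, List, Tuple


def target_score(target_host: str,
                 contributions: List[Tuple[str, float]],
                 subordinate: Callable[[str, str], bool]) -> float:
    """Total score of one target page.

    contributions -- (expert_host, transferred_score) pairs for this target
    """
    # Condition 2: ignore experts that are subordinate to the target itself.
    usable = [(h, s) for h, s in contributions if not subordinate(h, target_host)]
    # Among mutually subordinate experts keep only the highest contribution.
    kept: List[Tuple[str, float]] = []
    for host, score in sorted(usable, key=lambda c: c[1], reverse=True):
        if not any(subordinate(host, kept_host) for kept_host, _ in kept):
            kept.append((host, score))
    # Condition 1: at least two mutually non-subordinate experts are required.
    if len(kept) < 2:
        return 0.0
    return sum(score for _, score in kept)


def same_second_level_token(a: str, b: str) -> bool:
    """Toy stand-in for a subordinate-site test (assumption for this demo)."""
    return a.split(".")[-2] == b.split(".")[-2]


experts = [("news.a.com", 4.0), ("blog.a.com", 1.0), ("www.b.org", 2.0)]
print(target_score("www.t.net", experts, same_second_level_token))  # 6.0
```

Here blog.a.com is dropped in favor of the higher-scoring news.a.com, so the target receives 4.0 + 2.0.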
5. Shortcomings of Hilltop in practice
Finding and identifying expert pages plays a key role in the algorithm: the quality of the expert pages determines its accuracy, yet the quality and fairness of the expert pages are hard to guarantee. Hilltop also ignores the influence of the large majority of non-expert pages.
In Hilltop's prototype system, the expert pages accounted for only 1.79% of all pages, so they cannot fully reflect the opinion of the Web as a whole.
When the Hilltop algorithm cannot obtain a large enough subset of expert pages (fewer than two expert pages), it returns an empty result. Hilltop is therefore suited to refining the ranking of queries it can answer, but it cannot cover all queries. This means that Hilltop is better combined with another page ranking algorithm to improve accuracy than used as a standalone page ranking algorithm.
Hilltop has a computational efficiency problem similar to that of the HITS algorithm: selecting the subset of expert pages that match the query topic is also done online, which, as with HITS, affects query response time. As the "expert pages" collection grows, the algorithm scales poorly.
Reference: "This is the search engine: Core Technology detailed" sixth chapter