Link analysis algorithm: the SALSA algorithm
SALSA was designed to combine the main features of PageRank and HITS: like HITS it is query-dependent, and like PageRank it adopts the "random walk model". It therefore fuses the basic ideas of both algorithms. In practice, a large body of experimental data shows that SALSA's search results are better than those of the two earlier algorithms, which makes it one of the most effective link analysis algorithms.
Overall, the SALSA computation can be divided into two major stages. The first stage determines the set of pages to compute over and is basically the same as in HITS. The second stage is link relationship propagation, which adopts the "random walk model".
1. Determining the set of pages to compute over
PageRank computes over all pages on the Internet. SALSA differs: at this stage it follows roughly the same idea as HITS, first obtaining the "expanded page set" and then converting the link relationships between those pages into a bipartite graph.
Expanded page set
After receiving a user query, SALSA uses an existing search engine or retrieval system to obtain a batch of pages that are highly relevant to the query; these form the "root set". Pages that have a direct link relationship with pages in the root set are then added, giving the "expanded page set" (see Figure 6.4.3-1). The final ranking of search results is computed over the expanded page set using a link analysis method.
Conversion to an undirected bipartite graph
After obtaining the "expanded page set", SALSA converts it into a bipartite graph based on the link relationships among its pages. The pages are divided into two subsets, the hub subset and the authority subset. Which subset a page node belongs to is decided by the following rules:
If a page has out-links pointing to other nodes in the expanded page set, it is placed in the hub subset;
If a page has in-links from other nodes in the expanded page set, it is placed in the authority subset.
It follows from these rules that a page with both out-links and in-links is placed in both subsets. The links from pages in the hub subset to pages in the authority subset become the edges of the bipartite graph, and in this way the expanded page set is converted into a bipartite graph.
Figure 6-15 and Figure 6-16 give an example of this conversion. Assume the "expanded page set" shown in Figure 6-15, which consists of 6 pages and their link relationships; for ease of description, each page is given a unique number. Figure 6-16 is the result of converting the page set of Figure 6-15 into a bipartite graph. Take page 6 as an example: because it has links pointing to page nodes 3 and 5, it is placed in the hub subset, and because page nodes 1, 3, and 10 have links pointing to it, it is also placed in the authority subset. The two out-links of page node 6 are retained as edges of the bipartite graph.
Figure 6-15 Example of an expanded Web page collection
Note, however, that after the conversion to a bipartite graph the originally directed edges no longer retain their direction, whereas HITS keeps them as directed edges. In this respect SALSA differs slightly from HITS.
Figure 6-16 Bipartite graph
Up to this point, apart from the fact that SALSA converts the expanded page set into an undirected bipartite graph while HITS uses a directed one, the steps and processing of SALSA and HITS are exactly the same. SALSA is therefore guaranteed to be a query-dependent link analysis algorithm.
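The conversion described in this section can be sketched in a few lines of Python. The edge list below is an assumption: it is reconstructed from the prose description of Figures 6-15 and 6-16 (the figures themselves are not reproduced here), and the function name `to_bipartite` is made up for illustration.

```python
# Directed links (source, target), reconstructed from the text's description of
# Figure 6-15. This edge list is an assumption, not a copy of the figure.
EDGES = [(2, 1), (1, 3), (1, 6), (3, 6), (6, 3), (6, 5), (10, 6)]


def to_bipartite(edges):
    """Split pages into hub and authority subsets and build undirected edges."""
    hubs = sorted({src for src, _ in edges})         # pages with out-links
    authorities = sorted({dst for _, dst in edges})  # pages with in-links
    # Each link becomes an undirected edge between the hub-side copy of the
    # source page and the authority-side copy of the target page.
    bipartite_edges = sorted({(("hub", s), ("auth", d)) for s, d in edges})
    return hubs, authorities, bipartite_edges


hubs, authorities, bipartite_edges = to_bipartite(EDGES)
print("hub subset:      ", hubs)
print("authority subset:", authorities)
print("number of edges: ", len(bipartite_edges))
```

With this edge list, page 6 appears in both subsets, which matches the example in the text.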
2. Link Relationship Propagation
In the link relationship propagation stage, SALSA abandons the HITS assumption that hub nodes and authority nodes mutually reinforce each other, and instead adopts PageRank's "random walk model".
Conceptual model of link relationship propagation
As shown in Figure 6-16, suppose a surfer starts from a randomly selected node in one of the subsets (for ease of explanation the figure starts from node 1 in the hub subset; actual computations often start from the authority subset). If the node has several edges, the surfer picks one of them with equal probability and jumps from the hub subset to a node in the authority subset; in the figure this is the move from node 1 to node 3. The surfer then jumps back from the authority subset to the hub subset, here from node 3 to node 6. Moving back and forth between the two subsets in this way forms SALSA's own link relationship propagation model.
Although this propagation model looks different from PageRank's, the two are in fact the same. The key point is that when the walk jumps from one node to another and several links are available, it chooses one at random; in other words, during weight propagation a node's weight is divided evenly among all of its links. HITS is different: it uses a weight broadcast model, in which a node's full weight is passed to every node it links to, rather than being divided by the number of links.
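The difference between dividing a weight among links and broadcasting it can be illustrated with a single hub-to-authority propagation step. This is only a sketch under stated assumptions: the edge list is reconstructed from the text's description of the example, every hub starts with weight 1.0, and the function `propagate` and its `split` flag are hypothetical names introduced here.

```python
from collections import defaultdict

# Directed links (source, target); an assumption reconstructed from the text.
EDGES = [(2, 1), (1, 3), (1, 6), (3, 6), (6, 3), (6, 5), (10, 6)]


def propagate(hub_weight, edges, split):
    """Pass hub weights along links for one step.

    split=True  divides each hub's weight evenly among its out-links
                (the random-walk style used by PageRank and SALSA).
    split=False passes the full weight along every out-link
                (the broadcast style used by HITS).
    """
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    received = defaultdict(float)
    for src, dst in edges:
        share = hub_weight[src] / out_degree[src] if split else hub_weight[src]
        received[dst] += share
    return dict(received)


hub_weight = {h: 1.0 for h in (1, 2, 3, 6, 10)}
salsa_step = propagate(hub_weight, EDGES, split=True)
hits_step = propagate(hub_weight, EDGES, split=False)
print("divided:  ", salsa_step)
print("broadcast:", hits_step)
```

In the divided mode the total weight is conserved (5.0 in, 5.0 out), whereas in the broadcast mode the total grows with the number of links, which is why HITS normalises its scores after each round.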
SALSA's weight propagation model differs from that of HITS. HITS focuses on the relationship between hub nodes and authority nodes, whereas SALSA in effect focuses on hub-to-hub and authority-to-authority relationships, with the nodes of the other subset acting only as intermediaries. The propagation model above can therefore be transformed into two similar sub-models: the hub node graph and the authority node graph.
Authority node graph
Figure 6-17 shows the authority node graph obtained from the bipartite graph of Figure 6-16. The hub node graph is obtained by a similar conversion, so we use the authority node graph as the example of how to convert a bipartite graph into a node graph.
Figure 6-17 Authority node graph
Note that the probability of moving from a node i to a node j in the authority subset differs from the probability of moving from node j to node i; the transition is asymmetric. The resulting authority node graph is therefore a directed graph, whose edge directions capture this difference in transition probabilities.
The nodes of the authority node graph in Figure 6-17 are the nodes of the authority subset of the bipartite graph. The key questions are how to establish the edges between nodes and how to calculate the transition probabilities between them.
Establishing edges in the node graph
In the authority node graph, node 3 has an edge pointing to node 5 because, in the bipartite graph, node 3 can reach node 5 by way of node 6 in the hub subset; an edge is therefore established between the two.
Note that in the bipartite graph, a walk that leaves a node of the authority subset for the hub subset can always return to the same node, so every node necessarily has a directed edge to itself. Node 1 can only return to itself, through transit node 2, so it has only a self-loop and no edges to other nodes. The example authority node graph therefore consists of two connected subgraphs: one containing only node 1, and the other consisting of the remaining nodes.
Transition probabilities between nodes
Why is the transition probability from node 3 to node 5 equal to 0.25 in the authority node graph? Because SALSA's weight propagation follows the "random walk model" described earlier. In the bipartite graph of Figure 6-16, node 3 has two edges to choose from when jumping to the hub subset, so each edge is chosen with probability 1/2, and one of them leads to node 6. When jumping back from node 6 to the authority subset, node 6 also has two edges to choose from, each again with probability 1/2. Starting from node 3, the probability of reaching node 5 by way of node 6 is therefore the product of the two edge probabilities, namely 1/4.
The weight of a directed edge pointing back to the same node is calculated in a similar way. Taking node 3 again as the example, its self-loop represents the probability of leaving node 3 in the authority subset and returning to node 3 through a node of the hub subset. As the bipartite graph of Figure 6-16 shows, there are two paths that do this: one goes from node 3 through node 1 and back, the other goes from node 3 through node 6 and back. The probability of each path is calculated as described above. Since each path has probability 0.25, the probability of node 3 returning to itself is the sum of the two, namely 0.5. The transition probabilities of the other edges in the graph are calculated in the same way.
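The two-step transition probabilities worked out above can be checked with a short sketch. The edge list is an assumption reconstructed from the text's description of the example graph, and `authority_transitions` is a hypothetical helper name. Exact fractions are used so the results can be compared with the 1/4 and 1/2 in the text.

```python
from collections import defaultdict
from fractions import Fraction

# Directed links (source, target); an assumption reconstructed from the text.
EDGES = [(2, 1), (1, 3), (1, 6), (3, 6), (6, 3), (6, 5), (10, 6)]


def authority_transitions(edges):
    """Return {(a, b): probability} for authority -> hub -> authority moves."""
    in_links = defaultdict(list)   # authority node -> hubs pointing to it
    out_links = defaultdict(list)  # hub node -> authorities it points to
    for src, dst in edges:
        in_links[dst].append(src)
        out_links[src].append(dst)

    prob = defaultdict(Fraction)   # missing keys start at Fraction(0)
    for a, hubs in in_links.items():
        for h in hubs:                       # step 1: pick one in-link of a
            for b in out_links[h]:           # step 2: pick one out-link of h
                prob[(a, b)] += Fraction(1, len(hubs)) * Fraction(1, len(out_links[h]))
    return dict(prob)


P = authority_transitions(EDGES)
print("3 -> 5:", P[(3, 5)])
print("3 -> 3:", P[(3, 3)])
print("5 -> 3:", P[(5, 3)])
```

Under this edge list the sketch gives 1/4 for the move from node 3 to node 5 and 1/2 for node 3's self-loop, as in the text. The reverse move from node 5 to node 3 comes out as 1/2, which illustrates the asymmetry that makes the authority node graph directed.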
Once the authority node graph has been built, the random walk model can be applied to it to compute the authority weight of each node. In the actual computation, SALSA turns the ranking problem into the problem of finding the principal eigenvector of the authority node matrix: its components are the authority scores of the nodes, and sorting the nodes by authority score from high to low gives the final search results.
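One simple way to see what the random walk on the authority node graph converges to is to apply the transition probabilities repeatedly. The sketch below is an illustration rather than the method SALSA uses in practice (Section 3 describes a direct formula). It assumes the edge list reconstructed from the text, a uniform starting distribution over the authority nodes, and a fixed number of steps; the function names are hypothetical. Because the example graph has two connected subgraphs, the result depends on the starting distribution, so the uniform start is a real assumption.

```python
from collections import defaultdict

# Directed links (source, target); an assumption reconstructed from the text.
EDGES = [(2, 1), (1, 3), (1, 6), (3, 6), (6, 3), (6, 5), (10, 6)]


def authority_matrix(edges):
    """Build the authority-to-authority transition probabilities."""
    in_links, out_links = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        in_links[dst].append(src)
        out_links[src].append(dst)
    nodes = sorted(in_links)
    matrix = {a: {b: 0.0 for b in nodes} for a in nodes}
    for a in nodes:
        for h in in_links[a]:
            for b in out_links[h]:
                matrix[a][b] += 1.0 / (len(in_links[a]) * len(out_links[h]))
    return nodes, matrix


def random_walk_scores(edges, steps=200):
    """Repeatedly apply the transition probabilities to a uniform start."""
    nodes, matrix = authority_matrix(edges)
    score = {a: 1.0 / len(nodes) for a in nodes}
    for _ in range(steps):
        score = {b: sum(score[a] * matrix[a][b] for a in nodes) for b in nodes}
    return score


scores = random_walk_scores(EDGES)
for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 4))
```

With these assumptions node 6 ends up with the highest score and node 5 with the lowest, while nodes 1 and 3 tie.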
3. Authority weight calculation
Figure 6-18 Formula for calculating SALSA node weights
Mathematical derivation yields a formula for the SALSA authority weight that is equivalent to the principal eigenvector of the matrix. Figure 6-18 shows how the authority weight of a page node is calculated in SALSA. As the formula in the upper right corner of the figure shows, the authority weight of a page i depends on four factors:
The total number of nodes in the authority subset, |A|. This factor is the same for every node in the authority subset, so it has no effect on the final ordering; it only ensures that the weights fall between 0 and 1, so that they can be expressed as probabilities;
The number of nodes in the connected subgraph containing page i, |Aj|. The more nodes the connected subgraph contains, the greater the authority weight of the page;
The total number of in-links in the connected subgraph containing page i, |Ej|. Because this total appears in the denominator, the fewer in-links the connected subgraph contains in total, the greater the authority weight of the page;
The number of in-links of page i, |Bi|. The more in-links the node has, the greater its authority weight. This is the only one of the four factors that depends on the node itself, so the SALSA weight is proportional to the node's number of in-links.
As noted earlier, the authority node graph in Figure 6-17 consists of two connected subgraphs: one contains only node 1, and the other consists of the three nodes 3, 5, and 6. The two connected subgraphs are also circled separately in Figure 6-18.
Taking node 3 as an example, the values of its four factors are:
The authority subset contains 4 nodes;
The connected subgraph containing node 3 has 3 nodes;
The connected subgraph containing node 3 has 6 in-links in total;
Node 3 has 2 in-links.
The authority weight of node 3 is therefore (3/4) * (2/6) = 0.25. The weights of the other nodes are calculated in the same way. SALSA outputs the nodes in order of authority weight from high to low, and this is the search result.
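The worked example for node 3 is consistent with the weight of page i being (|Aj| / |A|) * (|Bi| / |Ej|). The sketch below applies that expression to every authority node. It rests on assumptions: the formula is read off the worked example rather than copied from Figure 6-18, the edge list is reconstructed from the text, and `salsa_authority_weights` is a hypothetical name. Two authority nodes are treated as connected when some hub links to both of them.

```python
from collections import defaultdict

# Directed links (source, target); an assumption reconstructed from the text.
EDGES = [(2, 1), (1, 3), (1, 6), (3, 6), (6, 3), (6, 5), (10, 6)]


def salsa_authority_weights(edges):
    """Weight of node i = (|Aj| / |A|) * (|Bi| / |Ej|), with no iteration."""
    in_degree = defaultdict(int)        # |Bi| for each authority node
    targets_of_hub = defaultdict(set)
    for src, dst in edges:
        in_degree[dst] += 1
        targets_of_hub[src].add(dst)

    # Authority nodes that share a hub are adjacent in the authority node graph.
    adjacent = defaultdict(set)
    for targets in targets_of_hub.values():
        for a in targets:
            adjacent[a] |= targets

    weights, seen = {}, set()
    for start in in_degree:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                    # depth-first search for one subgraph
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacent[node] - component)
        seen |= component
        total_in_links = sum(in_degree[a] for a in component)   # |Ej|
        for a in component:
            weights[a] = (len(component) / len(in_degree)) * (in_degree[a] / total_in_links)
    return weights


weights = salsa_authority_weights(EDGES)
for node in sorted(weights, key=weights.get, reverse=True):
    print(node, round(weights[node], 4))
```

For node 3 this gives 0.25, the value derived in the text. Under the same assumptions nodes 1, 5, and 6 receive 0.25, 0.125, and 0.375, and the four weights sum to 1.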
The weight formula implies the following: if all nodes of the authority subset form a single connected graph, then for any two nodes three of the four factors are always identical, and only the node's number of in-links differs. In that case SALSA degenerates into an algorithm that ranks nodes purely by their number of in-links.
The way SALSA calculates authority scores shows that it does not need the iterative computation that HITS requires, so it is computationally faster than HITS. In addition, SALSA solves the topic drift problem that affects HITS results, so its search quality is also better than that of HITS. SALSA is one of the most effective link analysis algorithms at present.
Reference documents:
"This is the search engine: The core technology detailed"