At present, search engine cheating takes many forms, and new variants appear constantly. As the adversary, the search engine must keep adjusting its technical approach and developing targeted anti-cheating solutions. Anyone surveying these solutions will therefore encounter a great many techniques, and it is not easy to form a clear overall picture.
However, a closer analysis of the majority of anti-cheating techniques shows that their overall design still follows recognizable patterns. At the most basic level, anti-cheating methods can be broadly divided into three kinds: the "trust propagation model", the "non-trust propagation model", and the "anomaly discovery model". The first two can be further abstracted into the "subset propagation model" introduced in the chapter on link analysis; to simplify the explanation, that abstraction is not covered here, and the two sub-models are presented directly instead. Relating concrete algorithms to these models helps build a clear, macro-level view of anti-cheating algorithms and the connections between them.
8.5.1 Trust Propagation Model
Figure 8-6 shows the "trust propagation model". Its basic idea is as follows: out of the vast mass of web data, some technical or fully/partially manual means is used to screen out a set of completely trustworthy pages, i.e. pages that are certainly not cheating (this set can be understood as a whitelist). The algorithm takes these whitelist pages as its starting point and assigns their nodes high trust scores; whether any other page is cheating is then judged from its link relationships to the whitelist nodes. The whitelist nodes spread their trust scores outward along links, and if a node ultimately receives a trust score above a certain threshold it is considered legitimate, while a page below that threshold is considered a cheat page.
Figure 8-6 Trust Propagation model
Many algorithms follow the overall process and framework described above; their differences usually lie in two aspects:
A. How the initial set of trusted pages is obtained; different methods may select it differently.
B. How trust is propagated; different approaches may differ in subtle ways.
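The propagation step can be sketched in a few lines of Python. This is an illustrative toy under simple assumptions, not an algorithm from the text: the graph, the damping factor, and the threshold are all made-up values, and real systems differ in precisely the two aspects listed above (how seeds are chosen and how trust is spread).

```python
# Toy trust-propagation pass over a tiny link graph.
# All names and constants below are illustrative assumptions.
graph = {  # page -> pages it links to
    "whitelist_hub": ["a", "b"],
    "a": ["c"],
    "b": ["c", "spammy"],
    "c": [],
    "spammy": [],
}
seeds = {"whitelist_hub"}  # manually vetted, fully trusted pages
DAMPING = 0.85             # fraction of trust passed along links
ITERATIONS = 20
THRESHOLD = 0.05           # pages ending below this are flagged

trust = {p: (1.0 if p in seeds else 0.0) for p in graph}
for _ in range(ITERATIONS):
    # seeds keep a baseline of trust; everyone else starts each round at 0
    nxt = {p: (1.0 - DAMPING) * (1.0 if p in seeds else 0.0) for p in graph}
    for page, links in graph.items():
        if links:
            share = DAMPING * trust[page] / len(links)  # split among outlinks
            for target in links:
                nxt[target] += share
    trust = nxt

cheating = [p for p in graph if trust[p] < THRESHOLD]
```

With these toy values, only the page that is reachable solely via a single weak link ends up below the threshold, matching the intuition that trust decays with distance from the whitelist.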
8.5.2 Non-Trust Propagation Model
Figure 8-7 Non-trust propagation model
Figure 8-7 shows the overall framework of the "non-trust propagation model". In its broad technical framework it resembles the "trust propagation model"; the biggest difference is that the initial subset of pages is not a set of trustworthy page nodes but a collection of untrustworthy pages (which can be understood as a blacklist). The blacklist page nodes are assigned distrust scores, and this distrust is propagated outward along link relationships; if a page node's final distrust score exceeds a set threshold, it is considered a cheat page.
Similarly, many algorithms fit into this model framework; they differ in implementation details, but the overall idea is basically the same.
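One common concrete choice, used by Anti-TrustRank-style algorithms, is to propagate distrust backwards along links, so that pages linking to known bad pages inherit some of their distrust. The sketch below assumes that choice; the graph, decay factor, and threshold are illustrative, not values from the text.

```python
# Toy distrust propagation from a blacklist of known cheat pages.
# Distrust flows backwards: a page linking to bad pages inherits distrust.
graph = {  # page -> pages it links to
    "good": ["hub"],
    "hub": ["good", "bad_seed"],
    "linker": ["bad_seed"],
    "bad_seed": [],
}
blacklist = {"bad_seed"}
DECAY = 0.5       # fraction of a target's distrust inherited by a linker
THRESHOLD = 0.2   # pages ending above this are flagged
ITERATIONS = 10

distrust = {p: (1.0 if p in blacklist else 0.0) for p in graph}
for _ in range(ITERATIONS):
    # blacklist pages always keep full distrust
    nxt = {p: (1.0 if p in blacklist else 0.0) for p in graph}
    for page, links in graph.items():
        for target in links:
            # inherit a decayed, averaged share of each link target's distrust
            nxt[page] += DECAY * distrust[target] / len(links)
    distrust = nxt

cheating = {p for p, score in distrust.items() if score > THRESHOLD}
```

Note the symmetry with the trust model: the same propagation loop, but seeded with a blacklist and flagging pages whose score ends up *above* the threshold rather than below it.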
8.5.3 Anomaly Discovery Model
The anomaly discovery model is also a highly abstract algorithmic framework. Its basic hypothesis is that cheating web pages must exhibit features that distinguish them from normal pages; these features may lie in the content or in the link relationships. Designing a concrete algorithm often means finding a collection of cheat pages, analyzing their anomalous features, and then using those features to identify further cheat pages.
Specifically, this framework can be subdivided into two sub-models, which take different perspectives on how to judge an anomaly. The more intuitive perspective builds the algorithm directly from features unique to cheat pages (see Figure 8-8). The other perspective treats abnormal pages as cheat pages: statistics and other means are used to characterize normal pages, and a page lacking those normal characteristics is considered a cheat page (see Figure 8-9). Figures 8-8 and 8-9 reflect these two different ideas.
Figure 8-8 Anomaly Discovery Model I
Figure 8-9 Anomaly Discovery Model II
Despite the variety of anti-cheating algorithms, whatever concrete algorithm is adopted, some basic assumptions lie underneath. The assumptions most often relied on by anti-cheating algorithms are:
- Although cheat pages like to link to high-quality pages, few high-quality pages link to cheating sites;
- Cheat pages tend to link to one another.
The basic ideas of many algorithms are constructed from these assumptions.