Analyzing I-Match, a page-deduplication algorithm used by search engines

Source: Internet
Author: User
Keywords: Algorithm


The Internet contains a large number of duplicate pages: according to statistics, near-duplicate pages account for roughly 29% of all pages, and exact duplicates for about 22%. These duplicates consume a great deal of search-engine resources, so page deduplication is an important part of a search engine's pipeline. Today we analyze one such deduplication method: the I-Match algorithm.

I-Match is based on statistics over a large text collection. All words appearing in the collection are ranked by IDF (inverse document frequency) from high to low; the highest-scoring and lowest-scoring words are removed, and the remaining words form a global feature dictionary. This step discards uninformative terms and keeps the important ones. Below is a diagram of the I-Match process:

  

I-match process Diagram
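The dictionary-building step described above can be sketched as follows. This is a minimal illustration, not the production algorithm: the `drop_fraction` parameter, the unsmoothed IDF formula, and whitespace tokenization are all simplifying assumptions.

```python
import math
from collections import Counter

def build_feature_dictionary(docs, drop_fraction=0.2):
    """Rank every term in the collection by IDF and drop the extremes.

    Terms with the highest IDF (very rare, often noise) and the lowest
    IDF (ubiquitous stop words) are removed; the middle band becomes
    the global feature dictionary.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    # Unsmoothed IDF = log(N / n_k), then sort terms high to low.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    ranked = sorted(idf, key=idf.get, reverse=True)
    # Trim the same fraction from both ends of the ranking.
    k = int(len(ranked) * drop_fraction)
    return set(ranked[k:len(ranked) - k]) if k else set(ranked)
```

With three toy documents, a word like "the" (which appears everywhere, IDF 0) falls off the bottom of the ranking, while mid-frequency words such as "cat" survive into the dictionary.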

Once the global feature dictionary is built, each page to be deduplicated is scanned to extract all of the words appearing on it, and those words are filtered against the feature dictionary: words present in the dictionary are kept to represent the document's main content, and words absent from it are discarded. The surviving feature words are then hashed, and the resulting hash value is the document's text fingerprint.

After all documents have been processed, checking whether two documents are duplicates only requires comparing their text fingerprints: if the fingerprints match, the two documents are duplicates. This comparison is intuitive, efficient, and quite effective.

When producing "pseudo-original" content, we SEOs often swap words and paragraphs around to trick the search engine into treating the article as original. But I-Match is insensitive to word order: if two articles contain the same words and differ only in their arrangement, the I-Match algorithm still judges them to be duplicate articles.
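The fingerprinting step, and its insensitivity to word order, can be demonstrated with a short sketch. The feature dictionary and sample pages here are hypothetical, and SHA-1 stands in for whatever hash function a real engine uses.

```python
import hashlib

def i_match_fingerprint(text, feature_dict):
    """Filter the page's words through the feature dictionary, then
    hash the surviving words as an unordered set into one fingerprint."""
    # Sorting the set of kept words removes any word-order effect.
    terms = sorted(set(text.split()) & feature_dict)
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()

# Hypothetical feature dictionary and pages, for illustration only.
feature_dict = {"algorithm", "duplicate", "engine", "page", "search"}
a = "the search engine algorithm finds each duplicate page"
b = "duplicate page algorithm each search engine finds the"  # same words, shuffled
```

Because the feature words are collected into a set and sorted before hashing, `a` and `b` produce identical fingerprints despite the reordering, whereas a page with a different set of feature words produces a different fingerprint.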

But the algorithm still has serious problems. 1. It is prone to false positives, especially on short texts: a short text contains few words to begin with, and after filtering through the feature dictionary only a handful of feature words remain, so two genuinely different documents are easily mistaken for duplicates. 2. It is unstable and sensitive to document modification. If you make a small change to document A to produce document B, the algorithm is likely to decide the two documents are not duplicates. For example, suppose we add a single word h to document A to produce document B, so the two articles differ only in that one word. If h is not in the feature dictionary, the two documents have identical feature sets and are judged duplicates; but if h does appear in the feature dictionary, document B has one more feature than document A, and the algorithm will likely decide the two documents are not duplicates. This is the biggest problem with I-Match.

To address these problems, improved versions of I-Match have been proposed. The original algorithm's sensitivity to document changes stems mainly from its over-reliance on a single feature dictionary, so the improved I-Match reduces that dependence: multiple feature dictionaries are used, and documents are considered duplicates as long as most dictionaries roughly agree, ignoring small differences.

The modified I-Match works as follows. As in the original algorithm, a feature dictionary is built; to distinguish it from the others, call it the main feature dictionary. Several smaller auxiliary feature dictionaries are then derived from it: randomly delete a few entries from the main feature dictionary to obtain a new dictionary, called an auxiliary feature dictionary, and repeat this several times to obtain several of them. Because only a few entries are removed, each auxiliary dictionary shares most of its content with the main one. When two documents are compared, fingerprints computed from the main dictionary and the auxiliary dictionaries are compared together; as long as enough of the per-dictionary fingerprints agree, small differences are ignored and the documents are judged duplicates. The following diagram illustrates the improved I-Match:

  

I-match algorithm Improvement

In the illustration above there are two auxiliary feature dictionaries: dropping features 5 and 6 from the main feature dictionary forms auxiliary dictionary 1, and dropping features 2 and 3 forms auxiliary dictionary 2. A text fingerprint is computed from each of the three dictionaries. If two documents share two identical fingerprints, they can be judged duplicates.
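The improved scheme can be sketched as below. The parameter choices (two auxiliary dictionaries, two entries dropped from each, two matching fingerprints required, a fixed random seed) are illustrative assumptions taken from the diagram, not prescribed values.

```python
import hashlib
import random

def make_dictionaries(main_dict, n_aux=2, n_drop=2, seed=0):
    """Derive auxiliary dictionaries by randomly deleting a few
    entries from the main feature dictionary."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    dicts = [set(main_dict)]
    for _ in range(n_aux):
        aux = set(main_dict)
        for term in rng.sample(sorted(main_dict), n_drop):
            aux.discard(term)
        dicts.append(aux)
    return dicts

def fingerprints(text, dicts):
    """One fingerprint per dictionary: filter, sort, hash."""
    words = set(text.split())
    return [hashlib.sha1(" ".join(sorted(words & d)).encode("utf-8")).hexdigest()
            for d in dicts]

def is_duplicate(fps_a, fps_b, min_matches=2):
    """The diagram's rule: duplicates if at least two of the
    per-dictionary fingerprints agree."""
    return sum(fa == fb for fa, fb in zip(fps_a, fps_b)) >= min_matches
```

A small edit that adds a word dropped by one auxiliary dictionary can now change at most some of the fingerprints, so the remaining matches still flag the pair as duplicates, which is exactly the stability gain described above.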

The improved I-Match algorithm greatly improves deduplication accuracy and increases the stability of the algorithm.

Takeaway for SEO: the traditional pseudo-original approach of taking an article, making a few small edits, and reordering the middle paragraphs is meaningless to a search engine, which can still determine that the two articles are duplicates. To build content, either write original articles or change an existing article substantially enough that its set of feature words changes.

Word Explanation:

IDF (inverse document frequency): a measure of a word's general importance. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and taking the logarithm of the quotient:

IDF(k) = log(N / n_k)

where N is the total number of documents and n_k is the number of documents containing term k.

This article is contributed by http://www.youzu.com; if you reprint it, please keep the link. Thank you!
