Webpage De-duplication: Search Engine Duplicate Webpage Detection Algorithms

Source: Internet
Author: User
Tags: MD5

During the Spring Festival I read some books on the basic principles of search engines. Below I write down the algorithms used to detect copied (duplicate) webpages.

Keywords: search engine, duplicate webpage, algorithm, information fingerprint, keyword

Search engines generally use the following idea to detect copied webpages: compute a set of information fingerprints for each webpage. If two webpages share a certain number of identical information fingerprints, the two webpages are considered highly overlapping, that is, one is a copy of the other.

Many search engines use different methods to determine content duplication, mainly because of the following two differences:

1. the algorithm used to compute the information fingerprints;
2. the parameters used to judge how similar two sets of information fingerprints must be.

Before describing specific algorithms, let's clarify two points:
1. What is an information fingerprint? An information fingerprint is obtained by extracting certain features from the body of a webpage, such as keywords, words, sentences, or paragraphs together with their weights on the page, and hashing them (for example with MD5) into a string. An information fingerprint is like a person's fingerprint: as long as the content differs, the fingerprint differs (see the sketch after this list).
2. The features are not extracted from the whole webpage. The parts common across a site, such as the navigation bar, logo, and copyright notice (collectively called the "noise" of the webpage), are filtered out first, and only the remaining text is used.
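As a minimal sketch of point 1 (in Python with the standard-library hashlib; the article names no particular implementation), hashing an extracted feature yields a fixed-length fingerprint string, and any change in the content yields a different fingerprint:

import hashlib

def fingerprint(feature: str) -> str:
    """MD5-hash one extracted feature (a keyword, sentence, or
    paragraph from the noise-filtered page body) into a fingerprint."""
    return hashlib.md5(feature.encode("utf-8")).hexdigest()

# Identical content gives an identical 32-hex-digit digest; any
# difference in the content gives a completely different one.
print(fingerprint("search engine de-duplication"))
print(fingerprint("search engine de-duplication!"))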

Piecewise Signature Algorithm

This algorithm divides a webpage into N segments according to certain rules and signs each segment, forming one information fingerprint per segment. If two webpages share at least M of these N information fingerprints (M is a threshold defined by the system), they are considered copies of each other.

This algorithm works well for judging duplicates on a small scale. For a massive search engine like Google, however, its computational cost is quite high.
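A minimal sketch of this idea, assuming equal-length segmentation and the illustrative values N = 8 and M = 6 (the article specifies neither the segmentation rule nor the threshold):

import hashlib

def segment_fingerprints(text: str, n: int) -> list[str]:
    """Split the noise-filtered page text into n roughly equal
    segments and sign each segment with MD5."""
    size = max(1, len(text) // n)
    chunks = [text[i:i + size] for i in range(0, len(text), size)][:n]
    return [hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks]

def is_copy(page_a: str, page_b: str, n: int = 8, m: int = 6) -> bool:
    """Treat two pages as copies when at least m of their n segment
    fingerprints coincide (m is the system-defined threshold)."""
    shared = set(segment_fingerprints(page_a, n)) & set(segment_fingerprints(page_b, n))
    return len(shared) >= m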

Keyword-based duplicate webpage algorithm

Search engines like Google record the following information when crawling webpages:

1. the keywords appearing on the webpage (obtained through Chinese word segmentation) and the weight of each keyword (its keyword density);
2. the meta description of each webpage, or, failing that, the first 512 bytes of valid text.

Regarding point 2, Baidu differs from Google: Google extracts your meta description and falls back to the 512 bytes of text related to the query keywords only when there is no meta description, whereas Baidu directly extracts the latter. Anyone who has used both has probably noticed this.

In the following algorithm description, we define several information fingerprint variables:

Pi denotes webpage i;
Ti = {t1, t2, ..., tn} denotes the set of the N highest-weight keywords on webpage i, and Wi = {w1, w2, ..., wn} denotes their weights;
Des(Pi) denotes the summary information of the page, Con(Ti) denotes the string formed by concatenating the N keywords in descending-weight order, and Sort(Ti) denotes the string formed after sorting the N keywords.

The above fingerprint information is then hashed with the MD5 function.
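To make the notation concrete, here is a small sketch; the keywords, weights, and summary text are invented for illustration:

import hashlib

def md5(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

# Top-N keywords of page i in descending-weight order (sample data).
keywords = ["engine", "search", "fingerprint"]                 # Ti
weights = [0.42, 0.31, 0.27]                                   # Wi
summary = "meta description or first 512 bytes of valid text"  # Des(Pi)

des_fp = md5(summary)                     # MD5(Des(Pi))
con_fp = md5("".join(keywords))           # MD5(Con(Ti)): keeps weight order
sort_fp = md5("".join(sorted(keywords)))  # MD5(Sort(Ti)): order-independent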

Keyword-based duplicate webpage algorithms include the following five:
1. MD5(Des(Pi)) = MD5(Des(Pj)): the summary information is identical, so webpages i and j are considered copies;
2. MD5(Con(Ti)) = MD5(Con(Tj)): the top N keywords of the two webpages, including their weight ordering, are identical, so the pages are considered copies;
3. MD5(Sort(Ti)) = MD5(Sort(Tj)): the top N keywords of the two webpages are the same, though their weights may differ; the pages are likewise considered copies;
4. MD5(Con(Ti)) = MD5(Con(Tj)), and in addition the sum of the squared weight differences divided by the sum of the squared weights, Σ(wik - wjk)² / Σ(wik² + wjk²), is smaller than a threshold a; then the two pages are considered copies;
5. MD5(Sort(Ti)) = MD5(Sort(Tj)), together with the same weight-difference condition as in 4.

The threshold a in conditions 4 and 5 exists mainly because many webpages are wrongly flagged as copies under the preceding conditions alone; search engine developers tune this weight-difference ratio to avoid such false positives.
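Here is a sketch of all five checks, assuming the weight-difference ratio reads Σ(wik - wjk)² / Σ(wik² + wjk²) as reconstructed above, and using a placeholder threshold a = 0.1 (the article gives no concrete value):

import hashlib

def md5(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

def weight_ratio(wi: list[float], wj: list[float]) -> float:
    """Sum of squared weight differences over the sum of squared weights."""
    num = sum((a - b) ** 2 for a, b in zip(wi, wj))
    den = sum(a * a + b * b for a, b in zip(wi, wj))
    return num / den if den else 0.0

A = 0.1  # threshold a; a placeholder, tuned by developers in practice

def check1(des_i: str, des_j: str) -> bool:  # 1: identical summaries
    return md5(des_i) == md5(des_j)

def check2(ti: list[str], tj: list[str]) -> bool:  # 2: same keywords and order
    return md5("".join(ti)) == md5("".join(tj))

def check3(ti: list[str], tj: list[str]) -> bool:  # 3: same keywords, any order
    return md5("".join(sorted(ti))) == md5("".join(sorted(tj)))

def check4(ti, wi, tj, wj) -> bool:  # 4: check 2 plus similar weights
    return check2(ti, tj) and weight_ratio(wi, wj) < A

def check5(ti, wi, tj, wj) -> bool:  # 5: check 3 plus similar weights
    return check3(ti, tj) and weight_ratio(wi, wj) < A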

This is the de-duplication algorithm of Peking University's Tianwang (Skynet) search engine (see "Search Engine: Principles, Technologies and Systems"). When these five checks are applied, their effectiveness depends on N, the number of keywords selected: the more keywords, the more accurate the judgment, but the slower the computation, so a balance must be struck between computing speed and de-duplication accuracy. According to Skynet's test results, about 10 keywords is most appropriate.

Postscript
The above certainly does not cover every aspect of how a large search engine detects copied webpages; such engines must use additional auxiliary information-fingerprint checks as well. This article is offered as one line of thinking and as an idea for search engine optimization.

Copied from: sheawey Search Engine
