Webpage de-duplication: special processing of image sites

Source: Internet
Author: User

Keyword-based copy-webpage detection algorithm

I believe the algorithms mentioned above all derive from this document. For large search engines they still leave a performance gap, so further optimizations work on the keywords of a web page or on its meta description. This requires support for the following technologies:

1. extracting the keywords that appear on a webpage (Chinese word segmentation) together with each keyword's weight (keyword density);
2. extracting the meta description, or the first few hundred (for example, 512) bytes of valid text, from each webpage (see the sketch below).
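
A minimal Python sketch of these two steps follows. The naive whitespace split and the frequency-based weights are stand-ins for a real Chinese word segmenter and weighting scheme, and the function names are illustrative, not part of the original system.

```python
# Sketch only: crude keyword extraction and 512-byte summary extraction.
from collections import Counter


def top_keywords(text: str, n: int = 10) -> list[tuple[str, float]]:
    """Return the n most frequent terms with their density (term count / total terms)."""
    terms = text.lower().split()          # stand-in for a real Chinese word segmenter
    total = len(terms) or 1
    counts = Counter(terms)
    return [(term, count / total) for term, count in counts.most_common(n)]


def summary(text: str, meta_description: str = "", max_bytes: int = 512) -> str:
    """Prefer the meta description; otherwise take the first max_bytes bytes of body text."""
    source = meta_description or text
    return source.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
```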

In the following algorithm description, we define several information-fingerprint variables:

Pi denotes page i.
The N highest-weighted keywords on the page form the set Ti = {t1, t2, ..., tn}, with corresponding weights Wi = {w1, w2, ..., wn}.
The summary information is denoted DES(Pi); the string obtained by concatenating the first n keywords is denoted Con(Ti); and the string obtained after sorting those n keywords is denoted Sort(Ti).

Each of these fingerprints is then hashed with the MD5 function.
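
A minimal sketch of how the three fingerprints could be computed, assuming the page summary and the top-N (keyword, weight) list have already been extracted. MD5 is used purely as a fingerprint (hash) function, as in the description above; the helper names are illustrative.

```python
import hashlib


def md5(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def fingerprints(des: str, keywords: list[tuple[str, float]]) -> dict[str, str]:
    """Compute MD5(DES(Pi)), MD5(Con(Ti)) and MD5(Sort(Ti)) for one page."""
    terms = [term for term, _weight in keywords]
    return {
        "des": md5(des),                      # summary fingerprint
        "con": md5("".join(terms)),           # keywords concatenated in weight order
        "sort": md5("".join(sorted(terms))),  # keywords concatenated in sorted order
    }
```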

The keyword-based copy-webpage detection algorithms are the following five (see the sketch after the list):
1. MD5(DES(Pi)) = MD5(DES(Pj)): the summaries are identical, so pages i and j are considered copies.
2. MD5(Con(Ti)) = MD5(Con(Tj)): the first n keywords of the two pages are identical and appear in the same weight order, so the pages are considered copies.
3. MD5(Sort(Ti)) = MD5(Sort(Tj)): the first n keywords of the two pages are identical, though their weights may differ, so the pages are considered copies.
4. MD5(Con(Ti)) = MD5(Con(Tj)) and the normalized weight difference, sum((w_ik - w_jk)^2) / (sum(w_ik^2) + sum(w_jk^2)), is smaller than a threshold a: the pages are considered copies.
5. MD5(Sort(Ti)) = MD5(Sort(Tj)) and the same normalized weight difference is smaller than the threshold a: the pages are considered copies.
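
The sketch below evaluates the five rules for a pair of pages. It assumes each page is given as a (summary, [(keyword, weight), ...]) pair with keywords already ordered by weight; the threshold value, the keyword-aligned weight comparison, and the helper names are illustrative and not taken from the Tianwang implementation. The function reports each rule separately, since a search engine would normally pick one of the five strategies rather than combine them.

```python
import hashlib


def _md5(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def duplicate_checks(page_i, page_j, a: float = 0.1) -> dict[int, bool]:
    """Return, for each of the five rules, whether it flags the pair as copies."""
    des_i, kw_i = page_i
    des_j, kw_j = page_j
    terms_i = [t for t, _ in kw_i]
    terms_j = [t for t, _ in kw_j]

    same_des = _md5(des_i) == _md5(des_j)
    same_con = _md5("".join(terms_i)) == _md5("".join(terms_j))
    same_sort = _md5("".join(sorted(terms_i))) == _md5("".join(sorted(terms_j)))

    # Normalized weight difference used by rules 4 and 5:
    # sum((w_ik - w_jk)^2) / (sum(w_ik^2) + sum(w_jk^2)), aligned by keyword.
    wi, wj = dict(kw_i), dict(kw_j)
    shared = set(wi) & set(wj)
    diff = sum((wi[t] - wj[t]) ** 2 for t in shared)
    norm = sum(w ** 2 for w in wi.values()) + sum(w ** 2 for w in wj.values()) or 1.0
    weight_gap = diff / norm

    return {
        1: same_des,
        2: same_con,
        3: same_sort,
        4: same_con and weight_gap < a,
        5: same_sort and weight_gap < a,
    }
```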

The threshold a in rules 4 and 5 exists mainly because the simpler conditions above wrongly flag many pages as copies (false positives); search engine developers adjust the weight distribution ratio to reduce such false positives.

This is the de-duplication algorithm of Peking University's Tianwang (Skynet) search engine (see "Search Engine: Principles, Technologies and Systems"). When these five algorithms run, their effectiveness depends on N, the number of keywords. The more keywords you select, the more accurate the judgment, but the slower the computation, so a balance must be struck between speed and de-duplication accuracy. According to Tianwang's test results, about 10 keywords is the most appropriate choice.

Resource:
SCAM (Stanford Copy Analysis Mechanism): http://infolab.stanford.edu/~shiva/scam/scaminfo.html
