Duplicate web content is a real problem for search engines. Duplicate pages mean the engine processes the same content more than once, and worse, the index may end up holding two identical pages, so a query can return duplicate links in the results. Duplicate pages therefore hurt both the search experience and the efficiency and quality of the search system.
Web page duplicate detection grew out of copy detection technology, that is, techniques for determining whether the content of one file plagiarizes or copies one or more other files.
In 1993, Manber of the University of Arizona (now a vice president of engineering at Google) released the SIF tool for finding similar files. In 1995, Sergey Brin (later a Google founder) and Garcia-Molina of Stanford University, working on the digital library project, first proposed a text copy detection mechanism, the COPS (copy protection) system, and its corresponding algorithms [Sergey Brin et al. 1995]. Copy detection was later applied to search engines, and the basic core techniques remain similar.
Web pages differ from plain documents: a web page carries special attributes such as content and formatting tags, so similarity in content and similarity in format give rise to four types of similar web pages.
1. The two pages have identical content and identical format.
2. The two pages have identical content but different formats.
3. The two pages share part of their content, with the same format.
4. The two pages share an important part of their content, but with different formats.
Implementation method:
For duplicate detection, a web page is first reorganized into a document consisting of a title and a body, which makes deduplication easier; for this reason, web page duplicate detection is also called "document duplicate detection". Document duplicate detection generally proceeds in three steps, sketched as a pipeline below:
First, feature extraction.
Second, similarity calculation and evaluation.
Third, duplicate elimination.
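Viewed together, the three steps form a simple pipeline. The sketch below only illustrates that flow in Python; `extract_features`, `similarity`, and the threshold are placeholder assumptions that the concrete algorithms in the following sections would fill in.

```python
def is_duplicate(doc_a, doc_b, extract_features, similarity, threshold=0.9):
    """Three-step duplicate check: extract features from each document,
    compute a similarity score, then decide whether to eliminate one copy.
    The feature extractor, similarity measure, and threshold are
    placeholders; concrete choices (I-Match, shingles/Jaccard) follow below.
    """
    features_a = extract_features(doc_a)        # step 1: feature extraction
    features_b = extract_features(doc_b)
    score = similarity(features_a, features_b)  # step 2: similarity calculation
    return score >= threshold                   # step 3: elimination decision
```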
1. Feature Extraction
When judging similarity, we usually compare invariant features, so the first step of duplicate detection is feature extraction: the document's content is decomposed and represented by a set of features, and this feature set is what the later similarity computation compares.
There are many feature extraction methods; here we focus on two classic algorithms, the I-Match algorithm and the shingle algorithm.
The "I-match algorithm" is not dependent on the complete information analysis, but uses the statistical characteristics of the data set to extract the main features of the document and discard the non main features.
The "shingle algorithm" is used to extract multiple feature words and compare the similarity of two feature sets to achieve document weight checking.
2. Similarity Calculation and Evaluation
After feature extraction, the features must be compared, so the second step of web page duplicate detection is similarity calculation and evaluation.
The I-Match algorithm produces only one feature per document. Given an input document, terms are filtered by their IDF (inverse document frequency): words with a particularly high or low frequency in an article usually do not reflect what the article is about, so the high-frequency and low-frequency words are removed from the document, and a single hash value is computed over what remains (a hash simply maps a data value to an address: the data value is the input, and the computation yields an address value). Documents with the same hash value are considered duplicates.
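As a rough illustration of the I-Match idea, here is a minimal Python sketch. The `idf` table, the cut-off thresholds, and the helper names are illustrative assumptions, not a reference implementation of the published algorithm.

```python
import hashlib

def i_match_fingerprint(text, idf, low_cutoff=1.5, high_cutoff=8.0):
    """Compute a single I-Match-style fingerprint for a document.

    Terms whose IDF falls outside [low_cutoff, high_cutoff] -- i.e. very
    common or very rare words -- are discarded, and one hash is computed
    over the sorted set of remaining terms.  The thresholds and the `idf`
    lookup table are illustrative assumptions.
    """
    terms = {w.lower() for w in text.split()}
    kept = sorted(t for t in terms if low_cutoff <= idf.get(t, 0.0) <= high_cutoff)
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

# Two documents are treated as duplicates when their fingerprints match.
idf = {"search": 2.3, "engine": 2.1, "duplicate": 4.7, "the": 0.1, "zyxgbq": 9.9}
doc_a = "the search engine removes duplicate pages"
doc_b = "search engine removes the duplicate pages zyxgbq"
print(i_match_fingerprint(doc_a, idf) == i_match_fingerprint(doc_b, idf))  # True
```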
The shingle algorithm extracts multiple features for comparison, so its processing is more complex. The comparison counts the shingles the two documents have in common, then divides that count by the total number of shingles in the two documents minus the number of shared shingles. The value computed this way is the Jaccard coefficient, which measures the similarity of two sets: the Jaccard coefficient is the size of the intersection of the sets divided by the size of their union.
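A minimal sketch of shingling and the Jaccard coefficient, assuming word-level 3-shingles; the shingle size and tokenization are illustrative choices, not prescribed by the text.

```python
def shingles(text, k=3):
    """Return the set of word-level k-shingles (k consecutive words)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(set_a, set_b):
    """Jaccard coefficient: |A intersect B| / |A union B|.

    Equivalently, shared shingles divided by (total shingles in both
    documents minus the shared ones), as described above.
    """
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(round(jaccard(a, b), 2))  # 0.4: the two shingle sets overlap partially
```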
3. Duplicate Elimination
When deleting duplicate content, a search engine could weigh many factors, so the simplest and most practical rule is used: keep the page the crawler fetched first, which also gives high priority to preserving the original web page.
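A hedged sketch of that keep-the-first-crawled-copy rule, assuming each page record carries a content fingerprint and a crawl timestamp; the field names are illustrative.

```python
def eliminate_duplicates(pages):
    """Keep only the earliest-crawled page for each content fingerprint.

    `pages` is assumed to be an iterable of dicts with `fingerprint`,
    `url`, and `crawl_time` fields; these names are illustrative.
    """
    kept = {}
    for page in sorted(pages, key=lambda p: p["crawl_time"]):
        # The first page seen for a fingerprint wins; later copies are dropped.
        kept.setdefault(page["fingerprint"], page)
    return list(kept.values())

pages = [
    {"url": "http://a.example/post", "fingerprint": "abc", "crawl_time": 1},
    {"url": "http://b.example/copy", "fingerprint": "abc", "crawl_time": 5},
    {"url": "http://c.example/other", "fingerprint": "def", "crawl_time": 3},
]
print([p["url"] for p in eliminate_duplicates(pages)])
# ['http://a.example/post', 'http://c.example/other']
```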
Web page duplicate detection is an indispensable part of the system. Deleting duplicate pages spares the other components of the search engine a great deal of unnecessary work: it saves index storage space, reduces query cost, improves the efficiency of PageRank computation, and makes the search engine more convenient for its users.