The shingle algorithm is one of the basic algorithms search engines use to remove identical or near-identical pages. When building SEO aggregation pages, how do you keep the pages from duplicating one another, and how do you deal with duplication when it occurs? Working the shingle algorithm backwards offers some ideas.
In English, shingle [ˈʃɪŋgəl] means tiles that overlap one another. First, an example to illustrate the shingle algorithm:
Suppose there are two document titles, A and B. Title A (21 characters in the original Chinese) is: "The Ming telephone booking train tickets can be national access to take the ticket time delay 12 hours". Title B (20 characters in the original Chinese) is: "Train ticket telephone booking to achieve nationwide access to the online pre-sale period extension".
How does a search engine know whether these two titles are duplicates? One approach is to cut the text into shingles of 2 Chinese characters each:
For a document of length L, every N consecutive Chinese characters form one shingle, which gives L-N+1 shingles in total. Title A is cut into L-N+1 = 21-2+1 = 20 shingles, and title B is cut into L-N+1 = 20-2+1 = 19 shingles.
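The cutting step can be sketched in a few lines of Python. This is a minimal illustration, not the search engine's actual implementation; the function name shingles and the placeholder string are assumptions made for the example, since the original Chinese titles are not reproduced here.

```python
def shingles(text: str, n: int = 2) -> list[str]:
    """Return every run of n consecutive characters in text, in order.

    A text of length L yields L - n + 1 shingles (none if L < n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Placeholder text, not one of the article's titles.
sample = "abcdefg"
parts = shingles(sample, 2)
print(parts)
print(len(parts) == len(sample) - 2 + 1)
```

A 21-character title cut this way gives 20 shingles and a 20-character title gives 19, matching the counts above.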
Titles A and B have 7 shingles in common (shown in bold in the chart), each a two-character fragment in the original Chinese: telephone, word order, train, ticket, national, Country pass, pass take.
Together, titles A and B have 20+19-7 = 32 distinct shingles.
The number of shingles the two titles have in common, divided by the total number of distinct shingles across both titles, is the Jaccard coefficient of the two titles. It can be used to judge how similar titles A and B are.
Jaccard coefficient of titles A and B = 7/(20+19-7) = 0.21875
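The same arithmetic can be checked with Python sets. This is a small sketch under stated assumptions: the function name jaccard is made up for the example, the integers are stand-ins for shingles (the real Chinese shingles are not available here), and returning 0.0 for two empty sets is an arbitrary choice.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as not similar
    return len(a & b) / len(a | b)


# Stand-in shingle sets with the same sizes as the example:
# 20 shingles, 19 shingles, 7 of them shared.
title_a = set(range(0, 20))    # 20 elements
title_b = set(range(13, 32))   # 19 elements, 13..19 overlap with title_a

print(len(title_a & title_b))  # 7
print(len(title_a | title_b))  # 32
print(jaccard(title_a, title_b))  # 0.21875
```

Note that 20+19-7 equals the size of the union only when no shingle repeats inside a single title, which is what using sets assumes.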
The same approach extends from two titles to two full pages, and then to N pages: whether one page is similar to another is decided by whether their Jaccard coefficient meets a set threshold.
This is the shingle algorithm: the size of the intersection of two sets divided by the size of their union gives the Jaccard coefficient, and two sets are judged to be duplicates when the coefficient is greater than a certain value.
Working the shingle algorithm backwards: if the Jaccard coefficient is less than a certain value, the pages are not duplicates. So first cut each document into its set of shingles, then compute the Jaccard coefficient pairwise; if it is below that value, the page can be generated.
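One way to picture the reverse use is a simple filter that accepts a candidate page only if its Jaccard coefficient against every page already accepted stays below the threshold. The sketch below is only an illustration: the function names, the greedy first-come order, the default threshold of 0.3, and the sample strings are all assumptions, not something the method prescribes.

```python
def shingle_set(text: str, n: int = 2) -> set[str]:
    """Set of all n-character shingles in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def select_distinct(docs: list[str], threshold: float = 0.3, n: int = 2) -> list[str]:
    """Keep a document only if it is below the threshold against all kept ones."""
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        s = shingle_set(doc, n)
        if all(jaccard(s, other) < threshold for _, other in kept):
            kept.append((doc, s))
    return [doc for doc, _ in kept]


# Placeholder documents: the second overlaps heavily with the first.
print(select_distinct(["abcdef", "abcdeg", "uvwxyz"]))
```

Each new candidate is compared with every kept page, so the cost grows with the square of the number of pages; that is acceptable for a few hundred pages but not for a web-scale index.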
In an earlier project I used a method that is rather crude but practical, so I will share it:
Suppose the Beijing Film category has 100 group-buy listings, and aggregation pages are to be designed for the words on the right of the figure below, with each page showing 10 listings. Assuming that a Jaccard coefficient greater than 0.3 means two pages are duplicates, how do we generate pages that do not duplicate one another?
The following figure shows the title and long title of each listing (assume the SEO aggregation pages use the long title, because the long-title text is more varied and there is more of it):
Each ID is unique, and each ID's title and long title can be treated as approximately unique as well, so the duplication problem can be simplified to limiting how many listings with the same ID two pages may share.
That is, with 10 listings per page, no two pages may have 3.33 or more IDs in common. Page IDs are compared pairwise: a page can be generated if all its IDs differ from the other page's, or if only 1, 2, or 3 IDs are the same; the page is not generated if 4 or more IDs are the same.
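The ID-based rule above might be sketched as follows. The article does not say how candidate pages are put together, so drawing them at random is an assumption made here, as are the function name generate_pages and the parameters max_pages, attempts, and seed.

```python
import random


def generate_pages(ids, page_size=10, max_shared=3, max_pages=20,
                   attempts=10_000, seed=0):
    """Build pages of page_size IDs; any two pages share at most max_shared IDs."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pages = []
    for _ in range(attempts):
        if len(pages) >= max_pages:
            break
        # Assumption: candidates are random samples of the listing IDs.
        candidate = frozenset(rng.sample(ids, page_size))
        # Pairwise check against every page accepted so far.
        if all(len(candidate & page) <= max_shared for page in pages):
            pages.append(candidate)
    return [sorted(page) for page in pages]


listing_ids = list(range(1, 101))  # 100 hypothetical listing IDs
pages = generate_pages(listing_ids)
print(len(pages))
print(pages[0])
```

The cut-off of 3 shared IDs follows the rule as stated above. For reference, the strict Jaccard coefficient of two 10-ID pages sharing k IDs is k/(20-k), which is 0.25 at k = 4, so the stated rule is stricter than a 0.3 Jaccard threshold would require.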
In future I will spend most of my spare time on algorithms, technology, and SEO discussion, and I hope to have more good things to share with everyone.
Questions are welcome; write to Hui on Weibo: http://1.t.qq.com/chenhui8com