Finding document pairs with high similarity among 1 million documents

When we want to find the pairs of documents with high similarity among 1 million documents, we have to compare every pair, which is C(10^6, 2) ≈ 5 × 10^11 comparisons, about half a trillion. If each comparison takes 1 microsecond, that is about 5 × 10^5 seconds, roughly six days of computation.
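A quick sanity check of that arithmetic, as a minimal Python sketch:

```python
# Back-of-the-envelope cost of comparing all pairs of 1 million documents.
n_docs = 1_000_000
n_pairs = n_docs * (n_docs - 1) // 2      # C(10^6, 2) ≈ 5 * 10^11 pairs
seconds = n_pairs * 1e-6                  # 1 microsecond per comparison
print(f"{n_pairs:.2e} pairs, {seconds / 86_400:.1f} days")  # 5.00e+11 pairs, 5.8 days
```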

Applications of the problem:

1. Plagiarism detection for papers. I had already heard this term when I was in college, and it is an application of exactly this problem. There, one article is compared against more than 10 million archived articles rather than all articles against each other, but the principle is the same.

2. Same-source documents. When we open a few pages from a Baidu search, we may find that many of them are similar, and some content is outright duplicated. For example, many blogs on CSDN are copied from elsewhere, and news stories on different websites are often identical or nearly so. A site that aggregates daily news must be able to recognize two articles with near-identical content and select only one of them.

Similarity definition:

Jaccard similarity: the ratio of the size of the intersection of sets S and T to the size of their union, |S ∩ T| / |S ∪ T|. For example, if document S contains the three letters A, B, C and document T contains the five letters B, C, D, E, F, then the intersection has 2 elements and the union has 6, so the similarity of S and T is 2/6 = 1/3.
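This definition translates directly into a few lines of Python; the function below is a minimal sketch using built-in set operations:

```python
def jaccard(s: set, t: set) -> float:
    """Jaccard similarity: |S ∩ T| / |S ∪ T|."""
    if not s and not t:
        return 1.0  # convention: two empty sets are identical
    return len(s & t) / len(s | t)

# The example from the text: S = {A, B, C}, T = {B, C, D, E, F}.
print(jaccard({"A", "B", "C"}, {"B", "C", "D", "E", "F"}))  # 0.333...
```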

Solution approach

1. Processing a single document

Step 1 -- Shingling

Documents are generally very long and cannot be compared one character at a time. The most effective solution is to split the whole document into a set of short substrings of length k, called shingles. After this processing, the more elements two shingle sets have in common, the higher the similarity, and sentence order can be ignored (people who copy a paper often just reorder the sentences).

For example, if the document is abcdabd and we choose k = 2, the shingle set is {ab, bc, cd, da, bd}.
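A minimal shingling sketch in Python:

```python
def shingles(doc: str, k: int) -> set:
    """Split a document into its set of k-character shingles."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("abcdabd", 2))  # {'ab', 'bc', 'cd', 'da', 'bd'}
```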

Of course, k = 2 is certainly too small: the universe of 2-shingles has at most 26^2 = 676 elements, so almost any two long documents would be judged similar.

What should k actually be? If the documents are emails, k = 5 is enough; for documents as large as research papers, k = 9 is better.

In addition, documents contain many stop words such as "the", "and", and "to"; these are generally ignored because they carry no information about the topic of the article.

Step 2 -- Hashing

If k = 9, the universe of shingles has up to 26^9 elements and each shingle takes 9 bytes to represent, so a document's shingle set occupies about 9 times the document length. We therefore hash each shingle into one of 2^32 buckets, so that every shingle can be represented in 4 bytes. This works better than simply choosing k = 4 in the first place: with k = 4 the universe has at most 26^4 elements, and in practice closer to 20^4, because letters such as z and j occur very rarely, whereas hashed 9-byte shingles are drawn from a universe of up to 26^9 and remain far more discriminating.
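A sketch of this step, assuming the hashed tokens are 32-bit integers; crc32 is used here purely as a convenient 32-bit hash, not as a method prescribed by the text:

```python
import zlib

def hashed_shingles(doc: str, k: int = 9) -> set:
    """Map each k-character shingle to a 4-byte token in [0, 2^32)."""
    # zlib.crc32 is just one readily available 32-bit hash; any
    # well-mixed 32-bit hash function would serve the same purpose.
    return {zlib.crc32(doc[i:i + k].encode("utf-8"))
            for i in range(len(doc) - k + 1)}
```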

First thanks to the hash algorithm.

Step 3 -- MinHash

With 4-byte shingles, must we really store about 4 times the size of every document? The goal of this step is to replace each large set with a much smaller "signature", and to estimate the similarity of the original sets from the similarity of their signatures. For example, a 50 KB document yields a hashed-shingle set of roughly 200 KB, while a signature of only about 1 KB typically estimates the true similarity to within a few percentage points.

Assume there are M documents whose sets contain N distinct elements in total (the union of all the sets; N is very large). The collection can then be represented as a matrix with N rows and M columns, where a position holds 1 when the corresponding set contains that element and 0 otherwise.

We choose n (usually a few hundred) as the signature length and construct for each set S its minhash signature vector [h1(S), h2(S), ..., hn(S)].

The procedure is as follows:

Randomly choose n hash functions h1, ..., hn, and initialize the signature matrix SIG (n rows by M columns) to positive infinity. Then process each row r of the characteristic matrix as follows:

(1) Compute h1(r), ..., hn(r).

(2) For each column c: if the characteristic matrix has 0 in row r, column c, do nothing; if it has 1, replace SIG(i, c) with the minimum of its current value and hi(r), for every i.

That is, after iterating over all N rows, the original N × M characteristic matrix is reduced to an n × M signature matrix (for each document, N entries shrink to n).

This estimation method has a theoretical basis: the probability that the minhash values of two sets agree is equal to the Jaccard similarity of the two sets.
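A minimal sketch of the procedure above, assuming the sets contain 32-bit integer shingle tokens and using hash functions of the form h(r) = (a·r + b) mod p, with p = 4294967311, the smallest prime above 2^32 (both choices are illustrative, not mandated by the text):

```python
import random

def minhash_signatures(sets, n=100, prime=4294967311):
    """Compute an n-row minhash signature matrix for a list of
    integer shingle sets, following the row-by-row procedure above."""
    rng = random.Random(42)
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n)]
    sig = [[float("inf")] * len(sets) for _ in range(n)]  # SIG starts at +infinity
    for r in set().union(*sets):                  # each row of the matrix
        hs = [(a * r + b) % prime for a, b in coeffs]
        for c, s in enumerate(sets):              # columns with a 1 in row r
            if r in s:
                for i in range(n):
                    if hs[i] < sig[i][c]:
                        sig[i][c] = hs[i]
    return sig

def estimate_similarity(sig, c1, c2):
    """Fraction of signature rows on which two columns agree."""
    return sum(row[c1] == row[c2] for row in sig) / len(sig)

s1, s2 = {1, 2, 3, 4}, {2, 3, 4, 5, 6}            # toy token sets
print(estimate_similarity(minhash_signatures([s1, s2], n=200), 0, 1))
# ≈ Jaccard(s1, s2) = 3/6 = 0.5, up to sampling noise
```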

Thanks to the hash algorithm again.

2. Processing the document pairs

The per-document work above is not the bottleneck; the number of document pairs to compare is. In practice we only care about pairs whose similarity exceeds some threshold, so the many low-similarity pairs need not be compared at all.

Processing method: Locality-Sensitive Hashing (LSH)

We hash each item several times, in such a way that similar items are more likely to land in the same bucket than dissimilar ones, and then compare only the document pairs that share a bucket. Dissimilar pairs that happen to hash to the same bucket are false positives, and truly similar pairs that never share a bucket are false negatives; we want both to be as rare as possible.

An effective construction is to divide the n × M signature matrix into b bands of r rows each (n = b × r). Within each band, hash each column's sequence of r values into a large number of buckets; the matrix is thereby reduced to b × M bucket numbers. Two columns form a candidate pair if they land in the same bucket in at least one band (a minimal sketch of this banding step appears after the summary below). The method is quite accurate; for a detailed theoretical analysis of LSH, see other references. This filtering phase eliminates most of the dissimilar pairs, greatly shortening the query computation and improving efficiency.

Thanks to the hash algorithm once more.

Summary

Finally, the common approach to this problem:

1. Choose k, build the shingle set, and hash each shingle to a shorter bucket number.

2. Compute the minhash signatures.

3. Use LSH to build the candidate pairs.

Every step uses a hash algorithm, and every step reduces the complexity.
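As promised above, a minimal sketch of the banding step, reusing the signature matrix layout from the minhash sketch (Python's tuple hashing stands in for the hash to a large number of buckets):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(sig, b, r):
    """Split the n x M signature matrix (n = b * r) into b bands of
    r rows; columns that collide in at least one band are candidates."""
    assert len(sig) == b * r
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(len(sig[0])):
            # This band's r signature values form the bucket key.
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates
```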
