Here I want to discuss the following algorithms for de-duplication of web pages, reproduced from (http://blog.csdn.net/beta2/article/details/5014530):
1. I-match
2. shingling
3. simhashing (locality-sensitive hashing)
4. Random projection
5. spotsig
6. Combined
I-Match Algorithm
The I-Match algorithm rests on a basic assumption: very rare words and very frequent words contribute little to the semantics of a document, so these words can be removed.
The basic idea of the algorithm is to hash the remaining, semantically meaningful words of a document into a single number; the similarity of those numbers then expresses the similarity of the documents.
The algorithm framework is:
1. Obtain the document (or its main content)
2. Break the document into a token stream and strip formatting tags
3. Use a term threshold (IDF) to keep only the meaningful tokens
4. Insert the tokens into a tree in ascending (sorted) order
5. Compute the SHA-1 hash of the sorted tokens
6. Insert the tuple (doc_id, SHA-1 hash) into a dictionary keyed by the hash; if two documents collide on the same hash, they are considered near-duplicates.
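Below is a minimal Python sketch of this pipeline. The IDF filter of step 3 is replaced by a hypothetical is_meaningful predicate, since the post does not give concrete thresholds, and tokenization is a plain whitespace split.

```python
import hashlib

def i_match_signature(text, is_meaningful):
    """I-Match signature: SHA-1 over the sorted set of retained tokens."""
    tokens = text.lower().split()                    # step 2: naive tokenization
    kept = {t for t in tokens if is_meaningful(t)}   # step 3: IDF-style filter (stubbed)
    canonical = " ".join(sorted(kept))               # step 4: ascending order
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()  # step 5

def find_duplicates(docs, is_meaningful):
    """Step 6: documents whose signatures collide are reported as duplicates."""
    seen, duplicates = {}, []
    for doc_id, text in docs.items():
        sig = i_match_signature(text, is_meaningful)
        if sig in seen:
            duplicates.append((seen[sig], doc_id))
        else:
            seen[sig] = doc_id
    return duplicates

# Example: word order does not matter, only the retained token set does.
docs = {"a": "the quick brown fox jumps", "b": "jumps quick the brown fox"}
print(find_duplicates(docs, is_meaningful=lambda t: len(t) > 3))  # [('a', 'b')]
```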
One disadvantage of the algorithm is poor stability: if even a single retained word in the document changes, the final hash value changes completely. The algorithm is also meaningless for empty documents (or documents whose tokens are all filtered out).
One remedy is lexicon randomization; see "Lexicon randomization for near-duplicate detection with I-match". The details are not covered here.
Shingling Algorithm
Seen next to shingling, it becomes clear that I-Match hashes individual words and therefore ignores word order within a document. A shingle, in contrast, is a string of several consecutive words.
The shingling algorithm has a simple mathematical background. If the shingle length is k, a document of length n has n - k + 1 shingles. Each shingle can be represented as a fingerprint with MD5 or another hash, and the similarity of two documents is measured by Jaccard similarity: the similarity of two sets is the size of their intersection divided by the size of their union.
To estimate the similarity of two documents, n - k + 1 fingerprints per document are sometimes too many, so take m fingerprint functions instead; for each function f_i, compute the n - k + 1 fingerprints and keep the smallest one, called the i-th min-value. A document is then represented by m min-values. Broder proved that, on average, the fraction of identical unique shingles in two documents equals the fraction of matching min-values between them.
Shingling's algorithm framework is:
1. Obtain the document (or its main content)
2. Split the document into n - k + 1 shingles; take m fingerprint functions and compute the min-value for each fingerprint function
3. Combine the m min-values into a smaller number m' of supershingles
4. Count how many supershingles the two documents have in common
5. If that count is at least some threshold B (say 2), the two documents are considered Jaccard-similar.
Typical parameters are m = 84, m' = 6, B = 2.
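A minimal Python sketch of the whole scheme, under a few assumptions the post leaves open: the m fingerprint functions are simulated with seeded MD5 hashes, and supershingles are formed by hashing consecutive groups of min-values.

```python
import hashlib

def shingles(text, k=4):
    """All k-word shingles of a document (assumes the document has >= k words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def fingerprint(shingle, seed):
    """Simulate the i-th fingerprint function with a seeded MD5 hash."""
    return int(hashlib.md5(f"{seed}:{shingle}".encode()).hexdigest(), 16)

def min_values(text, m=84, k=4):
    """For each of the m fingerprint functions, keep the smallest fingerprint."""
    sh = shingles(text, k)
    return [min(fingerprint(s, i) for s in sh) for i in range(m)]

def supershingles(mvals, m_prime=6):
    """Group the m min-values into m' supershingles by hashing each group."""
    group = len(mvals) // m_prime
    return {hashlib.md5(repr(mvals[i * group:(i + 1) * group]).encode()).hexdigest()
            for i in range(m_prime)}

def near_duplicate(doc_a, doc_b, b=2):
    """Declare near-duplicates if at least b supershingles match."""
    common = supershingles(min_values(doc_a)) & supershingles(min_values(doc_b))
    return len(common) >= b
```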
Simhash Algorithm
Locality-sensitive hashing (LSH) is a vast and profound subject. The basic idea is that if two items are similar, an LSH function projects them to nearby points in the hash space. For near-duplicate detection, the algorithm framework is:
1. Convert the document into a set of features, each with a weight.
2. Use the LSH function to convert the feature vector into an f-bit fingerprint, for example f = 64.
3. Compare documents by the Hamming distance between their fingerprints.
Haha, look how simple and clear it is. The remaining questions are how to pick a good LSH function and how to quickly find the fingerprints that lie within a small Hamming distance.
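Here is a minimal sketch of the framework in Python, assuming word counts as the weighted features and the classic bit-voting construction for a 64-bit fingerprint; the post itself does not fix these choices.

```python
import hashlib
from collections import Counter

F = 64  # fingerprint width in bits

def simhash(text):
    """Steps 1-2: weighted features vote +w / -w on each bit; keep the signs."""
    features = Counter(text.lower().split())   # features with term-frequency weights
    totals = [0] * F
    for feature, weight in features.items():
        h = int(hashlib.md5(feature.encode()).hexdigest(), 16)
        for bit in range(F):
            totals[bit] += weight if (h >> bit) & 1 else -weight
    return sum(1 << bit for bit in range(F) if totals[bit] > 0)

def hamming(a, b):
    """Step 3: Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")

# Compare the fingerprints of two slightly different sentences.
print(hamming(simhash("the quick brown fox jumps over the lazy dog"),
              simhash("the quick brown fox jumped over the lazy dog")))
```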
Random Projection Algorithm
Shingling pays attention to word order but ignores word frequency, so random projection says: I want to take the frequencies in the document into account.
Random projection is also an interesting algorithm, and a randomized one. A brief description:
1. Map each token to a b-dimensional vector whose components are chosen at random (typically +1/-1); the same projection function is used for every page.
2. The b-dimensional vector of a page is simply the sum of the projections of all of its tokens.
3. Finally, every positive component of the b-dimensional vector is written as 1, and every negative or zero component as 0.
4. Compare two pages by the number of identical bits in their b-bit vectors.
Charikar proved that the fraction of matching bits between two such b-bit vectors determines the angle between the document vectors, and therefore estimates their cosine similarity. The mathematics behind this is quite interesting; if you are curious, see M. S. Charikar, "Similarity estimation techniques from rounding algorithms" (2002).
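A minimal Python sketch of the construction, with two assumptions of mine: the random components per dimension are +1/-1, and token weights are plain word counts. The last function shows the Charikar estimate, recovering cosine similarity from the fraction of matching bits.

```python
import math
import random
from collections import Counter

B = 64  # number of projection dimensions / output bits

def token_projection(token, b=B):
    """The same random +/-1 projection for every page, derived from a per-token seed."""
    rng = random.Random(token)
    return [rng.choice((-1, 1)) for _ in range(b)]

def page_bits(text, b=B):
    """Sum the projections of all tokens (step 2), then keep only the signs (step 3)."""
    totals = [0] * b
    for token, count in Counter(text.lower().split()).items():
        projection = token_projection(token, b)
        for i in range(b):
            totals[i] += count * projection[i]
    return [1 if t > 0 else 0 for t in totals]

def estimated_cosine(text_a, text_b, b=B):
    """Charikar: P[bits agree] = 1 - angle/pi, so cosine ~ cos(pi * (1 - p))."""
    bits_a, bits_b = page_bits(text_a, b), page_bits(text_b, b)
    p = sum(x == y for x, y in zip(bits_a, bits_b)) / b  # step 4: matching bits
    return math.cos(math.pi * (1 - p))
```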
Spotsig Algorithm
Ref: SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections
SpotSig is an interesting algorithm. Why should I pay attention to every word? Which words actually carry semantics? Well, think about it: the words that follow stopwords (function words) are the ones worth attention. A spot is the short word string that comes after such a stopword. Every document then yields many spots, so a document becomes a set of spots, and the similarity between two documents is the Jaccard similarity of their spot sets. The algorithm itself is simple, but I want to focus on two engineering/performance tricks that are more useful.
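Before the two tricks, here is a minimal sketch of the spot representation itself. The antecedent stopword list and the two-word chain length are illustrative choices of mine, not taken from the paper.

```python
# Simplified spot extraction: a "spot" is the chain of `chain_len`
# non-stopwords that follows each antecedent stopword.
ANTECEDENTS = {"the", "a", "an", "is", "was", "there"}

def spots(text, chain_len=2):
    """Extract the spot set of a document."""
    words = text.lower().split()
    result = set()
    for i, word in enumerate(words):
        if word in ANTECEDENTS:
            chain = [w for w in words[i + 1:] if w not in ANTECEDENTS][:chain_len]
            if len(chain) == chain_len:
                result.add(" ".join(chain))
    return result

def jaccard(a, b):
    """Similarity of two spot sets = |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard(spots("the quick brown fox jumps over the lazy dog"),
              spots("a quick brown fox jumped over the lazy dog")))
```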
1. Optimal Partition
sim(A, B) = |A ∩ B| / |A ∪ B| <= min(|A|, |B|) / max(|A|, |B|) = |A| / |B|, assuming |A| <= |B|.
This gives a good pruning condition: if the ratio of the smaller spot set to the larger one is already below the similarity threshold, the intersection can be skipped entirely, because the Jaccard similarity can never reach the threshold. Optimal partition means: sort the documents by the length of their spot vectors and assign them to buckets, choosing the boundaries so that any pair d1, d2 with |d1| <= |d2| that could still satisfy |d1| / |d2| >= r lands in the same or adjacent buckets, and so that each bucket also satisfies a minimum-size condition. With such a partition, each document only has to be compared against at most three adjacent buckets.
2. Inverted index pruning
If two documents are similar, they share at least one spot. The inverted index uses each spot as the key and the documents containing it as the value, so candidate pairs only come from documents that co-occur under some spot.
With these two tools, the amount of computation drops significantly, because pairs that cannot possibly be duplicates are never compared at all. A combined sketch of both tricks follows.
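A minimal sketch combining both tricks, with one simplification of mine: the optimal bucket partition is omitted, and the length-ratio bound is applied directly to each candidate pair produced by the inverted index.

```python
from collections import defaultdict

def near_duplicates(doc_spots, threshold=0.7):
    """doc_spots: dict mapping doc_id -> spot set (e.g. built with spots() above)."""
    # Tool 2: inverted index. Only documents sharing at least one spot can be
    # similar, so candidate pairs are generated from the index, not from all pairs.
    index = defaultdict(set)
    for doc_id, spot_set in doc_spots.items():
        for spot in spot_set:
            index[spot].add(doc_id)

    candidates = {(a, b) for docs in index.values()
                  for a in docs for b in docs if a < b}

    results = []
    for a, b in candidates:
        sa, sb = doc_spots[a], doc_spots[b]
        small, large = sorted((len(sa), len(sb)))
        # Tool 1: Jaccard <= min/max of the set sizes, so skip the expensive
        # intersection when that upper bound already falls below the threshold.
        if small / large < threshold:
            continue
        sim = len(sa & sb) / len(sa | sb)
        if sim >= threshold:
            results.append((a, b, sim))
    return results
```

Bucketing documents by spot-set length, as described under optimal partition, would shrink the candidate set even before the index is consulted.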