MinHash algorithm for text deduplication

Source: Internet
Author: User

1. Overview


Like SimHash, MinHash is a locality-sensitive hashing (LSH) scheme that can be used to quickly estimate the similarity of two sets. MinHash was proposed by Andrei Broder and was originally used to detect duplicate web pages in search engines. It can also be applied to large-scale clustering problems.

2. Jaccard index

Before introducing MinHash, we first introduce the Jaccard index.

The Jaccard index is a metric used to measure the similarity of two sets (its complement, 1 - J, serves as a distance). Given two sets A and B,

J(A, B) = |A ∩ B| / |A ∪ B|

That is, the Jaccard coefficient of sets A and B equals the number of elements in the intersection of A and B divided by the number of elements in their union. Clearly, the Jaccard coefficient lies in the interval [0, 1].

3. MinHash

First we define some notation:

h(x): a hash function that maps x to an integer.
hmin(S): the element of the set S that has the smallest hash value under h(x).
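The definitions so far can be made concrete with a minimal Python sketch. The function names `jaccard`, `h`, and `hmin` mirror the notation in the text; the choice of MD5 truncated to 8 bytes as the hash function, and the convention that two empty sets have similarity 1, are assumptions made only for this illustration.

```python
import hashlib

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)

def h(x):
    """Map x to an integer (assumption: first 8 bytes of the MD5 digest)."""
    digest = hashlib.md5(str(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hmin(s):
    """Return the element of s with the smallest hash value under h."""
    return min(s, key=h)

A = {"the", "quick", "brown", "fox"}
B = {"the", "lazy", "brown", "dog"}
similarity = jaccard(A, B)   # 2 shared words out of 6 distinct words
print(similarity)
print(hmin(A), hmin(B))
```

Here A and B share two of six distinct words, so the Jaccard index is 1/3.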

For sets A and B, hmin(A) = hmin(B) holds exactly when the element with the smallest hash value in A ∪ B also lies in A ∩ B.

Here we assume that h(x) is a good hash function: it is uniform and maps different elements to different integers. Under this assumption every element of A ∪ B is equally likely to have the smallest hash value, so the probability that this element falls in A ∩ B is |A ∩ B| / |A ∪ B|.

Therefore Pr[hmin(A) = hmin(B)] = J(A, B); that is, the similarity of sets A and B is the probability that their minimum hash values are equal.
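The relation Pr[hmin(A) = hmin(B)] = J(A, B) can be checked empirically. The sketch below is only an illustration: it treats SHA-1 with a seed prefix as a family of independent hash functions (an assumption, not something the original article specifies), and the names `seeded_hash`, `hmin_under`, and `collision_rate` are hypothetical helpers.

```python
import hashlib

def seeded_hash(seed, x):
    """Hash x to an integer; each seed plays the role of a different hash function."""
    data = f"{seed}:{x}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def hmin_under(seed, s):
    """Element of s with the smallest hash value under the seeded hash function."""
    return min(s, key=lambda x: seeded_hash(seed, x))

def collision_rate(a, b, trials=2000):
    """Fraction of hash functions for which hmin(A) equals hmin(B)."""
    hits = sum(hmin_under(seed, a) == hmin_under(seed, b) for seed in range(trials))
    return hits / trials

A = set(range(0, 60))
B = set(range(30, 90))
true_j = len(A & B) / len(A | B)   # 30 shared elements out of 90
est = collision_rate(A, B)
print(true_j, est)
```

With enough trials the observed collision rate should lie close to the true Jaccard index of 1/3.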

With the above conclusion, we can estimate the similarity of two sets using MinHash. There are generally two methods.

The first: using multiple hash functions

To estimate the probability that sets A and B have the same minimum hash value, we can choose a certain number of hash functions, say k. We then hash sets A and B with each of the k hash functions, so that each set yields k minimum values, for example Min(A)k = {a1, a2, ..., ak} and Min(B)k = {b1, b2, ..., bk}. The similarity of A and B is then estimated as the fraction of hash functions whose minima agree, |{i : ai = bi}| / k, that is, the ratio of the number of identical entries in Min(A)k and Min(B)k to the total number of entries k.

The second: using a single hash function

The first method has an obvious drawback: its computational cost is high, because every element must be hashed k times. How can a single hash function solve this problem? Earlier we defined hmin(S) as the element of S with the smallest hash value; in the same way we can define hmink(S) as the k elements of S with the smallest hash values. Each set then only needs to be hashed once, after which we keep its k smallest elements. The similarity of sets A and B is estimated as the ratio of the size of the intersection to the size of the union of the k smallest elements of A and the k smallest elements of B. A more accurate variant takes the k smallest elements of hmink(A) ∪ hmink(B) and counts the fraction of them that appear in both hmink(A) and hmink(B).

After reading the above, you probably have a good idea of how MinHash works. But what is its benefit? To compute the similarity of two documents, couldn't we simply count the words they share and the total number of distinct words, and evaluate the Jaccard index directly? Indeed, for just two documents MinHash offers no advantage and only complicates the problem. However, when similarities are needed across a large number of documents, for example when a recommender system computes the similarity of items, comparing every pair directly is too expensive.
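Before moving on, the two estimation methods described above can be sketched in Python. This is a minimal illustration under stated assumptions: seeded SHA-1 stands in for the family of hash functions, signatures store hash values rather than the elements themselves, the first method compares the two signatures position by position, and the second uses the variant that looks at the k smallest values of the merged sketches. All function names are hypothetical, and both inputs are assumed to be non-empty.

```python
import hashlib

def seeded_hash(seed, x):
    """Hash x to an integer; each seed acts as a different hash function."""
    data = f"{seed}:{x}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def signature_k_hashes(s, k):
    """Method 1: one minimum hash value per hash function, k functions in total."""
    return [min(seeded_hash(seed, x) for x in s) for seed in range(k)]

def estimate_k_hashes(sig_a, sig_b):
    """Fraction of hash functions whose minima agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def sketch_bottom_k(s, k):
    """Method 2: the k smallest hash values under a single hash function."""
    return sorted(seeded_hash(0, x) for x in s)[:k]

def estimate_bottom_k(sk_a, sk_b, k):
    """Among the k smallest values of the merged sketches, the share present in both."""
    a, b = set(sk_a), set(sk_b)
    smallest = sorted(a | b)[:k]
    shared = sum(1 for v in smallest if v in a and v in b)
    return shared / len(smallest)

A = set(range(0, 600))
B = set(range(300, 900))
k = 200
true_j = len(A & B) / len(A | B)   # 300 shared elements out of 900

est1 = estimate_k_hashes(signature_k_hashes(A, k), signature_k_hashes(B, k))
est2 = estimate_bottom_k(sketch_bottom_k(A, k), sketch_bottom_k(B, k), k)
print(true_j, est1, est2)
```

The first method hashes every element k times, while the second hashes every element once and sorts, which is where the saving comes from.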
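Let's see how MinHash solves this problem. Take the element set {a, b, c, d, e}, with S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, S4 = {a, c, d}. The matrix representation of these four sets, with one row per element and one column per set, is:

Element  S1  S2  S3  S4
a        1   0   0   1
b        0   0   1   0
c        0   1   0   1
d        1   0   1   1
e        0   0   1   0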
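The matrix representation of these four sets can also be built programmatically. The following is a minimal Python sketch; the function name `characteristic_matrix` is a hypothetical helper introduced for illustration, with one row per element and one column per set.

```python
universe = ["a", "b", "c", "d", "e"]
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}

def characteristic_matrix(universe, sets):
    """Return rows of 0/1 flags: entry is 1 if the row's element is in the column's set."""
    names = list(sets)  # column order follows the insertion order of the dict
    return [[1 if elem in sets[name] else 0 for name in names] for elem in universe]

matrix = characteristic_matrix(universe, sets)
print("  ", " ".join(sets))
for elem, row in zip(universe, matrix):
    print(elem, " ", "  ".join(str(v) for v in row))
```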

To compute the MinHash of a set, we randomly choose a permutation of the rows of the matrix above; the MinHash value of a set is then the row label of the first row, in permuted order, in which that set's column contains a 1. For example, if we choose the row order b, e, a, d, c for the matrix above, the permuted matrix is:

Element  S1  S2  S3  S4
b        0   0   1   0
e        0   0   1   0
a        1   0   0   1
d        1   0   1   1
c        0   1   0   1

So h(S1) = a, and likewise h(S2) = c, h(S3) = b, h(S4) = a.

If only a single row permutation is used, the resulting similarity estimate is obviously unreliable. We therefore choose several row permutations, compute the MinHash under each, and finally compute the similarity from these values using the Jaccard index formula.

However, generating permutations is itself expensive, especially for large matrices. We can instead design random hash functions that simulate a permutation by mapping the row numbers 0 to n randomly onto 0 to n, for example h(0) = 100, h(1) = 3, and so on. Collisions are of course unavoidable; when one occurs, the value can be hashed a second time. If the chosen random hash function is uniform enough and n is large, the probability of a collision remains relatively low. For random permutation algorithms, see the article "Random permutation generation algorithm of some Caprice".

So far we have only discussed the concrete procedure for using MinHash to compute similarity over a massive document collection, but how does it reduce the complexity? Suppose there are n documents, each of dimension m. We choose k permutations for MinHash; since each permutation maps a document to one integer, the k permutations yield k integers per document. The MinHash matrix therefore has dimension n*k, whereas the original matrix has dimension n*m. When k << m, the amount of computation drops substantially.

4. References

(1) http://en.wikipedia.org/wiki/MinHash
(2) http://fuliang.iteye.com/blog/1025638

Original address: http://my.oschina.net/u/576409/blog/65210
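As a supplement to the row-permutation procedure in section 3, the sketch below reproduces the example with the row order b, e, a, d, c and then simulates permutations with random hash functions. The linear form (a * r + b) mod p with the prime p = 2^31 - 1 is an assumption chosen for illustration (it never collides on row numbers smaller than p, so no re-hashing is needed); the function names and the fixed random seed are likewise hypothetical.

```python
import random

universe = ["a", "b", "c", "d", "e"]
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}

def minhash_by_permutation(order, s):
    """Return the first element, in permuted row order, that belongs to s."""
    for elem in order:
        if elem in s:
            return elem
    raise ValueError("an empty set has no MinHash value")

order = ["b", "e", "a", "d", "c"]
values = {name: minhash_by_permutation(order, s) for name, s in sets.items()}
print(values)  # {'S1': 'a', 'S2': 'c', 'S3': 'b', 'S4': 'a'}

P = 2_147_483_647  # the Mersenne prime 2**31 - 1, larger than any row number here

def make_hash_functions(k, seed=42):
    """Draw k pairs (a, b) defining h(r) = (a * r + b) % P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]

def signature(s, universe, coeffs):
    """k MinHash values for s: the minimum hashed row number per hash function."""
    rows = [r for r, elem in enumerate(universe) if elem in s]
    return [min((a * r + b) % P for r in rows) for a, b in coeffs]

def estimate(sig_x, sig_y):
    """Fraction of positions at which two signatures agree."""
    return sum(x == y for x, y in zip(sig_x, sig_y)) / len(sig_x)

coeffs = make_hash_functions(k=100)
sigs = {name: signature(s, universe, coeffs) for name, s in sets.items()}
print(estimate(sigs["S1"], sigs["S4"]))  # exact Jaccard index of S1 and S4 is 2/3
print(estimate(sigs["S2"], sigs["S3"]))  # disjoint sets never share a minimum
```

Each set is reduced to k integers regardless of the number of rows m, which is the n*k versus n*m saving described in the text.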
