Calculating the similarity of two sets using the MinHash algorithm

Source: Internet
Author: User

Calculating the similarity of two sets is a common problem. For example, suppose we know who has watched The Legend of Mi Yue (芈月传) and who has watched Nirvana in Fire (琅琊榜, "Langya List"), and we want to know what proportion of the people who have watched at least one of them have watched both. That is, we want the similarity between the two sets:

Set A = people who have watched The Legend of Mi Yue
Set B = people who have watched Nirvana in Fire
Similarity = |A∩B| / |A∪B| = (number of people who have watched both The Legend of Mi Yue and Nirvana in Fire) / (number of people who have watched The Legend of Mi Yue or Nirvana in Fire)

When the sets contain few elements, we can compare them element by element: find the people who appear in both set A and set B, count them, and divide by the number of people who appear in at least one of the two sets to get the similarity.
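The direct comparison described above can be sketched in Python as follows. This is only a minimal illustration: the function name `jaccard`, the viewer names, and the choice to return 0 for two empty sets are assumptions made for the example, not part of the original article.

```python
def jaccard(a: set, b: set) -> float:
    """Exact similarity |A ∩ B| / |A ∪ B| computed by direct comparison."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are given similarity 0
    return len(a & b) / len(union)


# Hypothetical viewer names, made up for illustration.
mi_yue_viewers = {"ann", "bob", "cat", "dan"}
langya_viewers = {"cat", "dan", "eve"}

similarity = jaccard(mi_yue_viewers, langya_viewers)
print(similarity)  # 2 shared viewers out of 5 distinct viewers -> 0.4
```

The cost of this approach grows with the size of the sets, because every element has to be examined.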

However, when the sets contain many elements, for example millions or tens of millions each, the O(N) comparison method becomes too slow for scenarios that need the result immediately.

With the MinHash algorithm, once a fixed-size summary of each set has been computed in advance, the cost of each similarity calculation can be reduced to a constant.

Let me explain the MinHash algorithm through an example. Suppose you are an administrator at a wildlife park, responsible for a monkey mountain. The mountain has no fence, so although most monkeys are used to staying on their own hill, your monkeys can wander out of your jurisdiction and outside monkeys can visit your hill.

Suppose you suddenly want to know how much the monkey group at the beginning of the month overlaps with the monkey group at the end of the month, that is, what proportion of the "monkey population" stayed the same. This is a similarity problem. Following the comparison approach above, you can catch all the monkeys at the beginning of the month, count them (C1), and release each one wearing a number tag. At the end of the month you go up the mountain and count all the monkeys again: C2a monkeys have a tag and C2b monkeys do not. Your answer is then C2a / (C1 + C2b), done.
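As a quick sanity check of the tag-counting formula, here is a small Python sketch. The monkey IDs and the group sizes are invented for illustration; the point is only that C2a / (C1 + C2b) agrees with the set definition of similarity, since C2a is the size of the intersection and C1 + C2b is the size of the union.

```python
def tagged_similarity(c1: int, c2a: int, c2b: int) -> float:
    """Similarity from the three head counts: C2a / (C1 + C2b)."""
    return c2a / (c1 + c2b)


# Hypothetical monkey IDs: 100 monkeys at the start of the month,
# 20 of them leave and 20 newcomers arrive by the end of the month.
start_of_month = set(range(0, 100))
end_of_month = set(range(20, 120))

c1 = len(start_of_month)                   # tagged at the start: 100
c2a = len(end_of_month & start_of_month)   # found with a tag: 80
c2b = len(end_of_month - start_of_month)   # found without a tag: 20

from_counts = tagged_similarity(c1, c2a, c2b)
from_sets = len(start_of_month & end_of_month) / len(start_of_month | end_of_month)
print(from_counts, from_sets)  # both are 80 / 120
```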

However, this method is far too exhausting, so you decide to switch to a simple but rough one. You look over the hill, find the smallest monkey, catch it, tag it with a number, release it, and record its number. Then you find the largest monkey and do the same, then the thinnest monkey, the fattest monkey, the monkey with the longest tail, the monkey with the shortest tail, the monkey with the reddest bottom, and all sorts of other "most xx" monkeys. You end up with a record like this:

Feature | Number
Smallest | 1
Largest | 2
Thinnest | 3
...
Smartest | 19
Stupidest | 20

At the end of the month you go up the hill with your notebook and find the smallest monkey. If it wears a tag and the number is 1, it is the same monkey as at the beginning of the month, and you draw a circle in the notebook; otherwise you draw a cross. Then you find the largest monkey and do the same, and so on through all the "most xx" monkeys. You look down at the notebook: 15 circles and 5 crosses, so the similarity is 15 / (15 + 5) = 75%.

Why?

Put simply, each time you look for a "most xx" monkey, you are in effect building an index over the monkeys, sorted by that feature. When the first element of an index is the same in the two sets, the two sets have a certain similarity: not only do both contain that element, but every other element of both sets ranks at or after it from that point of view. Since each index looks at the monkeys from a different angle, if we compare the first entry of 20 different indexes and 15 of them turn out to be the same element, we can say the two sets are very similar.

Now let's put this in mathematical language. Given a set X, let h(x) be the function behind one index, and let hmin(X) be the smallest value of h(x) over the elements of X, that is, the first entry of that index. Then the following formula holds:

Pr[hmin(A) = hmin(B)] = |A∩B| / |A∪B|
Note: Pr denotes probability, taken over the choice of h.
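The formula can also be checked empirically. The sketch below is not a proof and not the author's code: it assumes that each h is a uniformly random ranking of the elements of A∪B (simulated with `random.shuffle`), and the function name, the sets, the trial count, and the seed are all chosen only for illustration.

```python
import random


def collision_rate(a: set, b: set, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of random rankings h for which A and B share the same first element."""
    rng = random.Random(seed)
    universe = sorted(a | b)
    hits = 0
    for _ in range(trials):
        ranks = list(range(len(universe)))
        rng.shuffle(ranks)                 # a fresh random ranking plays the role of h
        h = dict(zip(universe, ranks))
        # Ranks are distinct, so "same minimum value" means "same first element".
        if min(a, key=h.get) == min(b, key=h.get):
            hits += 1
    return hits / trials


set_a = set(range(0, 60))
set_b = set(range(40, 100))
exact = len(set_a & set_b) / len(set_a | set_b)   # 20 / 100
estimate = collision_rate(set_a, set_b)
print(exact, estimate)  # the estimate should land close to the exact value
```

With more trials the estimate tends to move closer to the exact ratio, which is the behaviour the formula predicts.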

I'm not going to prove the formula, but we can check it in an extreme case. Suppose A∪B has N elements, numbered 1 to N, and we ideally pick N functions h1(x), h2(x), ..., hi(x), ..., hN(x) such that hi(k) = i when element k = i and hi(k) = ∞ when k ≠ i. Then himin(X) = i when element i is in the set X, and himin(X) = ∞ when it is not. Because every element i belongs to A∪B, it is in at least one of A and B, so himin(A) = himin(B) holds if and only if element i is in both A and B. The number of functions h(x) that satisfy hmin(A) = hmin(B) is therefore equal to |A∩B|, namely:

|A∩B| / |A∪B| = (number of h(x) that satisfy hmin(A) = hmin(B)) / (total number of h(x)) ≈ Pr[hmin(A) = hmin(B)]
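The extreme case can be replayed directly in Python. In this sketch the helper names `make_indicator_hash` and `h_min` and the two small integer sets are assumptions made for illustration; the construction itself follows the text, with one function per element of A∪B that returns the element's number for that element and infinity for everything else.

```python
import math


def make_indicator_hash(i: int):
    """Build h_i with h_i(k) = i when k == i and h_i(k) = infinity otherwise."""
    def h(k: int) -> float:
        return i if k == i else math.inf
    return h


def h_min(h, s: set) -> float:
    """Smallest value of h over the elements of s."""
    return min(h(k) for k in s)


a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
union = a | b

hashes = [make_indicator_hash(i) for i in sorted(union)]
matches = sum(1 for h in hashes if h_min(h, a) == h_min(h, b))

print(matches, len(hashes))                               # 2 6
print(matches / len(hashes) == len(a & b) / len(union))   # True
```

Only the functions built for elements 3 and 4, the two elements shared by both sets, produce equal minima, so the ratio of matching functions equals the exact similarity.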

That is the explanation of the MinHash algorithm. In practice, once the elements of a set A are known, we apply a series of n predefined functions h(x) to obtain an array [h1min(A), h2min(A), ..., himin(A), ..., hnmin(A)] and store it as intermediate data. Whenever we need the similarity of any two sets A and B, we read the stored array of each set and define, for each position i, the following random variable R:

R = 1 if himin(A) = himin(B) else 0

The similarity is then:

∑R / n

In my application, n = 2048. So no matter whether a set has millions or tens of millions of elements, it is reduced in advance to an array of 2048 elements (everything 2048, everything 2048, you can't lose out, you can't be cheated ^_^), and the cost of each subsequent similarity calculation naturally becomes a constant.
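The whole procedure can be sketched in Python as below. This is one possible arrangement under stated assumptions, not the implementation used in the application described above: the family of functions h is simulated by salting BLAKE2b from the standard `hashlib` module with the index i, the names `h`, `signature`, and `estimate_similarity` are hypothetical, the sets must be non-empty, and the example uses n = 512 rather than 2048 only so that it runs quickly.

```python
import hashlib


def h(i: int, element: str) -> int:
    """The i-th hash function, simulated by salting BLAKE2b with i (an assumption)."""
    digest = hashlib.blake2b(
        element.encode("utf-8"),
        digest_size=8,
        salt=i.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(s: set, n: int = 2048) -> list:
    """Array [h1min(S), ..., hnmin(S)]; computed once per non-empty set and stored."""
    return [min(h(i, x) for x in s) for i in range(n)]


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Sum of R over all positions divided by n, where R = 1 when the minima agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    r = [1 if x == y else 0 for x, y in zip(sig_a, sig_b)]
    return sum(r) / len(r)


# Hypothetical user IDs; the exact similarity is 200 / 400 = 0.5.
set_a = {f"user{i}" for i in range(0, 300)}
set_b = {f"user{i}" for i in range(100, 400)}

sig_a = signature(set_a, n=512)
sig_b = signature(set_b, n=512)
print(estimate_similarity(sig_a, sig_b))  # an estimate that should be near 0.5
```

Building a signature still touches every element of the set, but it is done once per set. Each later comparison only reads two arrays of length n, which is why its cost no longer depends on the size of the sets.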

Reference:
https://en.wikipedia.org/wiki/MinHash
