Mining of Massive Datasets: Finding Similar Items

Document directory
  • 1 Applications of Near-Neighbor Search
  • 2 Shingling of Documents
  • 3 Similarity-Preserving Summaries of Sets
  • 4 Locality-Sensitive Hashing for Documents
  • 5 Distance Measures
  • 6 The Theory of Locality-Sensitive Functions
  • 7 LSH Families for Other Distance Measures

In the previous blog post (http://www.cnblogs.com/fxjwind/archive/2011/07/05/2098642.html), I recorded notes on a related problem with massive documents; here I record notes on the large-scale data mining techniques for finding similar items...

 

1 Applications of Near-Neighbor Search

The Jaccard similarity of sets S and T is |S ∩ T| / |S ∪ T|, that is, the ratio of the size of the intersection of S and T to the size of their union. We shall denote the Jaccard similarity of S and T by SIM(S, T).
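
As a minimal sketch of this definition (in Python; the function name is mine, not from the text):

    def jaccard_similarity(s, t):
        # SIM(S, T) = |S intersect T| / |S union T|
        if not s and not t:
            return 1.0  # convention: two empty sets are considered identical
        return len(s & t) / len(s | t)

    print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 2 / 4 = 0.5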

 

Similarity of documents: finding textually similar documents in a large corpus such as the Web or a collection of news articles.

 

Collaborative filtering, a process whereby we recommend to users items that were liked by other users who have exhibited similar tastes.

On-line purchases: Amazon.com has millions of customers and sells millions of items. Its database records which items have been bought by which customers. We can say two customers are similar if their sets of purchased items have a high Jaccard similarity.

 

2 Shingling of Documents

2.1 k-Shingles

A document is a string of characters. Define a k-shingle for a document to be any substring of length k found within the document.

Example 3.3: Suppose our document D is the string abcdabd, and we pick k = 2. Then the set of 2-shingles for D is {ab, bc, cd, da, bd}.
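
A small sketch of shingling, reproducing Example 3.3 under the substring definition above:

    def k_shingles(doc, k):
        # every substring of length k found within the document
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    print(k_shingles("abcdabd", 2))  # {'ab', 'bc', 'cd', 'da', 'bd'}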

 

2.2 Choosing the Shingle Size

Thus, if our corpus of documents is emails, picking k = 5 should be fine.

Assuming roughly 27 characters (letters plus a general white-space character), there are 27^5 = 14,348,907 possible shingles. Since the typical email is much smaller than 14 million characters long, we would expect k = 5 to work well, and indeed it does.

For large documents, such as research articles, the choice k = 9 is considered safe.

 

2.3 Hashing Shingles

Instead of using substrings directly as shingles, we can pick a hash function that maps strings of length k to some number of buckets and treat the resulting bucket number as the shingle. The set representing a document is then the set of integers that are bucket numbers of one or more k-shingles appearing in the document.

Not only has the data been compacted, but we can now manipulate (hashed) shingles with single-word machine operations.
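
A sketch of hashed shingles; CRC-32 is my choice purely for illustration, since it yields the four-byte bucket numbers mentioned above:

    import zlib

    def hashed_shingles(doc, k):
        # replace each k-shingle (a string) by a 4-byte bucket number (an int)
        return {zlib.crc32(doc[i:i + k].encode("utf-8"))
                for i in range(len(doc) - k + 1)}

    print(len(hashed_shingles("abcdabd", 2)))  # 5 hashed shingles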

 

2.4 Shingles Built from Words

An alternative form of shingle has proved effective for the problem of identifying similar news articles.

News articles, and most prose, have a lot of stop words, the most common words such as "and," "you," "to," and so on.

Defining a shingle to be a stop word followed by the next two words, regardless of whether or not they were stop words, formed a useful set of shingles.

Example 3.5: An ad might have the simple text "Buy Sudzo." However, a news article with the same idea might read something like "A spokesperson for the Sudzo Corporation revealed today that studies have shown it is good for people to buy Sudzo products."

The first three shingles, each made from a stop word and the two words following it, are:

A spokesperson for
for the Sudzo
the Sudzo Corporation

 

3 Similarity-Preserving Summaries of Sets

Sets of shingles are large. Even if we hash them to four bytes each, the space needed to store a set is still roughly four times the space taken by the document. If we have millions of documents, it may well not be possible to store all the shingle-sets in main memory.

Our goal in this section is to replace large sets by much smaller representations called "signatures."

 

3.1 Matrix Representation of Sets

Example 3.6: An example of a matrix representing sets chosen from the universal set {a, b, c, d, e}. Here, S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, and S4 = {a, c, d}. The characteristic matrix (a 1 means the row's element belongs to the column's set) is:

Element | S1 | S2 | S3 | S4
a       |  1 |  0 |  0 |  1
b       |  0 |  0 |  1 |  0
c       |  0 |  1 |  0 |  1
d       |  1 |  0 |  1 |  1
e       |  0 |  0 |  1 |  0

Each column could serve directly as the signature of its set. However, because the universal set typically has very many elements, such a signature would be far too long. The following sections describe how to use minhashing to compress the signature.

 

3.2 Minhashing

 

To minhash a set represented by a column of the characteristic matrix, pick a permutation of the rows; the minhash value of a column is the first row, in the permuted order, in which the column has a 1.

Example 3.7: Let us suppose we pick the order of rows beadc for the matrix of Example 3.6. This permutation defines a minhash function h that maps sets to rows. In this matrix, we can read off the values of h by scanning from the top until we come to a 1. Thus, we see that h(S1) = a, h(S2) = c, h(S3) = b, and h(S4) = a.
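
A sketch of a minhash function under an explicit permutation, reproducing Example 3.7:

    def minhash(permutation, s):
        # first row, in the permuted order, that belongs to the set
        return next(row for row in permutation if row in s)

    perm = ["b", "e", "a", "d", "c"]
    sets = {"S1": {"a", "d"}, "S2": {"c"}, "S3": {"b", "d", "e"}, "S4": {"a", "c", "d"}}
    for name, s in sets.items():
        print(name, minhash(perm, s))  # S1 a, S2 c, S3 b, S4 a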

 

3.3 Minhashing and Jaccard Similarity

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the jaccard similarity of those sets.

In other words, for a random permutation of the rows, the probability that the minhash function gives two sets the same value equals their Jaccard similarity.

For example, SIM(S1, S4) = 2/3, since S1 ∩ S4 = {a, d} while S1 ∪ S4 = {a, c, d}; accordingly, over random permutations, the probability that h(S1) = h(S4) is also 2/3.
This theorem establishes the correctness of minhashing. Jaccard similarity is the basic criterion for deciding whether two sets are similar: the higher it is, the more similar the two sets.

By this theorem, the more often the minhash values of two sets agree, the more similar the two sets are.

 

3.4 Minhash Signatures

Perhaps 100 permutations, or several hundred, will do. Call the minhash functions determined by these permutations h1, h2, ..., hn. From the column representing set S, construct the minhash signature for S: the vector [h1(S), h2(S), ..., hn(S)].

Each permutation yields one minhash value, so n random permutations yield n values, which together form the signature of the set. The principle is feature sampling, which greatly compresses the representation; the signature length is determined by the number of permutations chosen.
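
A sketch of a signature built from explicit permutations (feasible only for a tiny universal set; the next subsection removes that limitation). The fraction of agreeing components estimates the Jaccard similarity:

    import random

    def minhash(permutation, s):
        return next(row for row in permutation if row in s)

    universe = ["a", "b", "c", "d", "e"]
    perms = [random.sample(universe, len(universe)) for _ in range(100)]
    sig1 = [minhash(p, {"a", "d"}) for p in perms]       # signature of S1
    sig4 = [minhash(p, {"a", "c", "d"}) for p in perms]  # signature of S4
    agree = sum(x == y for x, y in zip(sig1, sig4)) / len(perms)
    print(agree)  # close to SIM(S1, S4) = 2/3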

 

3.5 Computing Minhash Signatures

It is not feasible to permute a large characteristic matrix explicitly. Even picking a random permutation of millions or billions of rows is time-consuming, and the necessary sorting of the rows would take even more time.

Actually permuting the matrix is inefficient, but we can simulate a permutation. The idea is simple: pick a random hash function that maps row numbers to as many buckets as there are rows.

Thus, instead of picking n random permutations of rows, we pick n randomly chosen hash functions h1, h2, ..., hn on the rows.

Using random hash functions to scramble the row numbers simulates a random permutation of the matrix.
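
A sketch of the resulting one-pass algorithm, using hash functions of the illustrative form h(r) = ((a·r + b) mod p) mod n_rows, with p a prime; these constants are my assumption, not prescribed by the text:

    import random

    def minhash_signatures(sets, n_rows, n, p=2**31 - 1, seed=0):
        # simulate n random permutations with n random hash functions
        rng = random.Random(seed)
        hs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(n)]
        sig = {c: [float("inf")] * n for c in sets}
        for r in range(n_rows):                        # one pass over the rows
            hr = [((a * r + b) % p) % n_rows for a, b in hs]
            for c, members in sets.items():
                if r in members:                       # column c has a 1 in row r
                    sig[c] = [min(s, h) for s, h in zip(sig[c], hr)]
        return sig

    # the sets of Example 3.6, with rows a..e numbered 0..4
    sets = {"S1": {0, 3}, "S2": {2}, "S3": {1, 3, 4}, "S4": {0, 2, 3}}
    print(minhash_signatures(sets, n_rows=5, n=4))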

 

4 Locality-Sensitive Hashing for Documents

Even though we can use minhashing to compress large documents into small signatures, it still may be impossible to find the pairs with greatest similarity efficiently. The reason is that the number of pairs of documents may be far too large, even if there are not too many documents.

Although minhashing compresses large documents into small signatures, which greatly speeds up the comparison of any two documents, it does nothing to reduce the number of comparisons; to find duplicate documents in a billion-document collection, other techniques are required.

If our goal is to compute the similarity of every pair, there is nothing we can do to reduce the work, although parallelism can reduce the elapsed time. However, often we want only the most similar pairs, or all pairs that are above some lower bound in similarity. If so, then we need to focus our attention only on pairs that are likely to be similar, without investigating every pair. There is a general theory of how to provide such focus, called locality-sensitive hashing (LSH) or near-neighbor search.

The basic idea: comparing every pair of documents is hopelessly slow, so we must first filter, and compare only those pairs that may be duplicates or near-duplicates.

 

4.1 LSH for Minhash Signatures

One general approach to LSH is to "hash" items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. We then consider any pair that hashed to the same bucket, for any of the hashings, to be a candidate pair. We check only the candidate pairs for similarity.

We should try to avoid hashing different items to the same bucket, where they would mistakenly become candidates (false positives), and try to ensure that similar items hash to the same bucket at least once, so that no candidate pair is missed (false negatives).

In my opinion, for typical tasks we should minimize false negatives while tolerating some false positives, because false positives can be filtered out later, whereas false negatives can never be recovered.

 

If we have minhash signatures for the items, an effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rows each.

This is the "hash multiple times" idea above: if each document has an n-component minhash signature, divide it into b bands of r rows each (b × r = n).

For each band, there is a hash function that takes vectors of r integers (the portion of one column within that band) and hashes them to some large number of buckets.

Hash the i-th band (1 ≤ i ≤ b) of each document separately. Only when the hash values of two documents agree in some band are the two documents considered possibly similar (a candidate pair).

The more similar two columns are, the more likely it is that they will be identical in some band. Thus, intuitively, the banding strategy makes similar columns much more likely to be candidate pairs than dissimilar pairs.
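
A sketch of the banding step. For simplicity, the tuple of r values itself serves as the bucket key, in place of hashing it to a large bucket array; note that each band gets its own fresh buckets:

    from collections import defaultdict
    from itertools import combinations

    def lsh_candidate_pairs(signatures, b, r):
        # signatures: {doc_id: length-n signature}, with n = b * r
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)                # fresh buckets per band
            for doc, sig in signatures.items():
                buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
            for docs in buckets.values():              # same bucket in some band
                candidates.update(combinations(sorted(docs), 2))
        return candidates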

 

4.2 Analysis of the Banding Technique

Suppose we use b bands of r rows each, and suppose that a particular pair of documents have Jaccard similarity s.

According to the earlier theorem:

The probability that the minhash signatures for these documents agree in any one particular row of the signature matrix is s.

The probability that all r rows of a given band agree is s^r.

The probability that at least one of the r rows in a given band disagrees is 1 − s^r.

The probability that every one of the b bands has at least one disagreeing row is (1 − s^r)^b.

Therefore, the probability that the signatures agree in all rows of at least one band, making the documents a candidate pair, is 1 − (1 − s^r)^b.

Example 3.11: with b = 20 and r = 5, if s = 0.8 then 1 − (1 − s^r)^b = 1 − (1 − 0.8^5)^20 ≈ 0.99965.

This shows that the banding technique is effective.
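
A worked check of this analysis (the s values other than Example 3.11's 0.8 are my additions):

    def candidate_probability(s, b, r):
        # probability of agreeing in all rows of at least one band
        return 1 - (1 - s ** r) ** b

    for s in (0.2, 0.4, 0.6, 0.8):
        print(s, round(candidate_probability(s, b=20, r=5), 4))
    # s = 0.8 gives ~0.9996 (cf. Example 3.11); s = 0.2 gives ~0.0064:
    # the S-curve that separates similar pairs from dissimilar ones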

 

4.3 Combining the Techniques

We can now give an approach to finding the set of candidate pairs for similar documents and then discovering the truly similar documents among them.

1. Pick a value of k and construct from each document the set of k-shingles. Optionally, hash the k-shingles to shorter bucket numbers.

2. Sort the document-shingle pairs to order them by shingle.

3. Pick a length n for the minhash signatures. Feed the sorted list to the algorithm of Section 3.5 to compute the minhash signatures for all the documents.

4. Choose a threshold t that defines how similar documents have to be in order for them to be regarded as a desired "similar pair." Pick a number of bands b and a number of rows r such that br = n, and the threshold t is approximately (1/b)^(1/r). If avoidance of false negatives is important, you may wish to select b and r to produce a threshold lower than t; if speed is important and you wish to limit false positives, select b and r to produce a higher threshold.

5. Construct candidate pairs by applying the LSH technique of Section 4.1.

6. Examine each candidate pair's signatures and determine whether the fraction of components in which they agree is at least t.

7. Optionally, if the signatures are sufficiently similar, go to the documents themselves and check that they are truly similar, rather than documents that, by luck, had similar signatures.

This solves two problems: documents themselves are unwieldy to compare directly, and we must somehow select, out of a huge collection, only the candidates worth comparing.

Steps 1-3 solve the first problem: document → shingles → hashed shingles → minhash signature is a progressive compression, and the resulting minhash signatures are well suited to pairwise comparison, since the fraction of agreeing components directly reflects the similarity between documents (their Jaccard similarity).

Steps 4-7 solve the second problem: LSH filters the candidates to be compared, making comparison across massive document collections efficient.
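
A compact end-to-end sketch joining the pieces above; all parameter values and helper names here are illustrative assumptions, not the book's prescriptions:

    import random
    from collections import defaultdict
    from itertools import combinations

    def similar_pair_candidates(docs, k=5, n=100, b=20, r=5, seed=0):
        assert b * r == n
        rng = random.Random(seed)
        # steps 1-2: k-shingle each document (assumes every document is longer than k)
        shingles = {d: {t[i:i + k] for i in range(len(t) - k + 1)}
                    for d, t in docs.items()}
        rows = {sh: i for i, sh in enumerate(sorted(set().union(*shingles.values())))}
        # step 3: minhash signatures via simulated permutations
        p = 2 ** 31 - 1
        hs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(n)]
        sigs = {d: [min(((a * rows[sh] + c) % p) for sh in s) for a, c in hs]
                for d, s in shingles.items()}
        # steps 4-5: banding to produce candidate pairs
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)
            for d, sig in sigs.items():
                buckets[tuple(sig[band * r:(band + 1) * r])].append(d)
            for ds in buckets.values():
                candidates.update(combinations(sorted(ds), 2))
        return candidates  # steps 6-7 (verification) would follow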

 

5 Distance Measures

We now take a short detour to study the general notion of distance measures.

5.1 Definition of a Distance Measure

A distance measure on a space is a function d(x, y) that takes two points in the space as arguments and produces a real number, and satisfies the following axioms:

1. d(x, y) ≥ 0 (no negative distances)
2. d(x, y) = 0 if and only if x = y
3. d(x, y) = d(y, x) (symmetry)
4. d(x, y) ≤ d(x, z) + d(z, y) (the triangle inequality)

 

5.2 Euclidean Distances

The most familiar distance measure is the one we normally think of as "distance": the straight-line distance between two points, d([x1, ..., xn], [y1, ..., yn]) = sqrt(Σ (xi − yi)^2).
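
A one-line sketch:

    import math

    def euclidean_distance(x, y):
        # square root of the sum of squared coordinate differences
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    print(euclidean_distance((0, 0), (3, 4)))  # 5.0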

5.3 Jaccard Distance

The Jaccard distance of sets is defined by d(x, y) = 1 − SIM(x, y). That is, the Jaccard distance is 1 minus the ratio of the sizes of the intersection and union of sets x and y.

It is used to measure how dissimilar two sets are.

 

5.4 Cosine Distance

The cosine distance between two points is the angle that the vectors to those points make. This angle will be in the range 0 to 180 degrees.

The dot product of two vectors x and y is x · y = Σ xi·yi; the cosine of the angle between them is x · y divided by the product of their lengths, and the cosine distance is the arc-cosine of that value.

Unlike the Euclidean distance, this measures not the absolute distance between two points but the angle between their vectors; the two measures therefore suit different scenarios, with the Euclidean distance being the more common in practice.
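
A sketch computing the angle directly, reporting degrees to match the 0-180 range above:

    import math

    def cosine_distance(x, y):
        # angle between the vectors to the two points, in degrees
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        return math.degrees(math.acos(dot / (nx * ny)))

    print(cosine_distance([1, 0], [0, 1]))  # 90.0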

 

5.5 Edit Distance

This distance makes sense when points are strings. The distance between two strings x = x1x2···xn and y = y1y2···ym is the smallest number of insertions and deletions of single characters that will convert x to y.

Example 3.14: the edit distance between the strings x = abcde and y = acfdeg is 3. To convert x to y:
1. Delete b.
2. Insert f after c.
3. Insert g after e.

One way to calculate the edit distance d(x, y) is to compute a longest common subsequence (LCS) of x and y.

The edit distance d(x, y) can then be calculated as the length of x plus the length of y minus twice the length of their LCS.

The LCS (longest common subsequence) is computed by dynamic programming; see (http://www.cnblogs.com/fxjwind/archive/2011/07/04/2097752.html).

Example 3.15: the strings x = abcde and y = acfdeg from Example 3.14 have a unique LCS, which is acde.

The edit distance is thus 5 + 6 − 2 × 4 = 3.
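
A sketch of this LCS-based computation, reproducing Examples 3.14-3.15:

    def lcs_length(x, y):
        # classic dynamic-programming table for the longest common subsequence
        dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                dp[i][j] = (dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                            else max(dp[i - 1][j], dp[i][j - 1]))
        return dp[-1][-1]

    def edit_distance(x, y):
        # insertions/deletions only: |x| + |y| - 2 * |LCS(x, y)|
        return len(x) + len(y) - 2 * lcs_length(x, y)

    print(edit_distance("abcde", "acfdeg"))  # 5 + 6 - 2*4 = 3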

 

5.6 Hamming Distance

Given a space of vectors, we define the Hamming distance between two vectors to be the number of components in which they differ.

Example 3.16: the Hamming distance between the vectors 10101 and 11110 is 3.
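
The corresponding sketch, with vectors of bits represented as strings for brevity:

    def hamming_distance(x, y):
        # number of positions in which two equal-length vectors differ
        return sum(a != b for a, b in zip(x, y))

    print(hamming_distance("10101", "11110"))  # 3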

 

6 The Theory of Locality-Sensitive Functions

We shall need other families of functions, besides the minhash functions, that can serve to produce candidate pairs efficiently.

Our first step is to define "locality-sensitive functions" generally.

6.1 Locality-Sensitive Functions

We shall consider functions that take two items and render a decision about whether these items should be a candidate pair.

A collection of functions of this form will be called a family of functions.

Let d1 < d2 be two distances according to some distance measure d. A family F of functions is said to be (d1, d2, p1, p2)-sensitive if for every f in F:
1. If d(x, y) ≤ d1, then the probability that f(x) = f(y) is at least p1.
2. If d(x, y) ≥ d2, then the probability that f(x) = f(y) is at most p2.

In other words, for any function in the family:

if two points are close, with d(x, y) ≤ d1, then the probability that the function gives the two points equal values is at least p1;

if the two points are far apart, with d(x, y) ≥ d2, then the probability that the function gives them equal values is at most p2.

Simply put, the closer two items are, the higher the probability that their hash values agree. Such a family F is a family of locality-sensitive functions, and is (d1, d2, p1, p2)-sensitive.

 

6.2 Locality-Sensitive Families for Jaccard Distance

Following the general definition of locality-sensitive functions above, let us first give a concrete locality-sensitive family based on the Jaccard distance: the minhash functions.

Use the family of minhash functions, and assume that the distance measure is the Jaccard distance.

The family of minhash functions is a (d1, d2, 1−d1, 1−d2)-sensitive family for any d1 and d2, where 0 ≤ d1 < d2 ≤ 1.

This is not difficult to prove from the theorem below, together with the fact that Jaccard distance is 1 minus Jaccard similarity:

The Jaccard similarity of x and y is equal to the probability that a minhash function will hash x and y to the same value.

For instance, if d(x, y) ≤ d1 then SIM(x, y) = 1 − d(x, y) ≥ 1 − d1, so the probability that a minhash function sends x and y to the same value is at least 1 − d1.

 

6.3 Amplifying a Locality-Sensitive Family

How do we amplify a locality-sensitive family?

Suppose we are given a (d1, d2, p1, p2)-sensitive family F. Can we construct a better family F′ from it?

The method is to combine several functions of the family F into a single new function.

Each member of F′ consists of r members of F for some fixed r. Under the AND-construction, if f is in F′ and f is constructed from the set {f1, f2, ..., fr} of members of F, we say f(x) = f(y) if and only if fi(x) = fi(y) for all i = 1, 2, ..., r. Under the OR-construction, f is instead built from b members of F, and f(x) = f(y) if and only if fi(x) = fi(y) for at least one i.

The two ways of combining, AND and OR, yield different families:

AND-construction

F′ is a (d1, d2, (p1)^r, (p2)^r)-sensitive family.

OR-construction

F′ is a (d1, d2, 1−(1−p1)^b, 1−(1−p2)^b)-sensitive family.

 

The AND-construction lowers all probabilities, but if we choose F and r judiciously, we can make the small probability p2 get very close to 0, while the higher probability p1 stays significantly away from 0.

The OR-construction makes all probabilities rise, but by choosing F and b judiciously, we can make the larger probability p1 approach 1 while the smaller probability p2 remains bounded away from 1.

We can cascade AND- and OR-constructions in any order to make the low probability close to 0 and the high probability close to 1. A balance is needed here to achieve reasonable false positives and false negatives.

Thus, starting from one locality-sensitive family, we can construct any number of amplified families, and different AND/OR combinations achieve different trade-offs.
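
A worked check of one cascade, an r-way AND followed by a b-way OR; the choice r = b = 4 is illustrative:

    def and_then_or(p, r, b):
        # AND turns p into p**r; OR then turns that into 1 - (1 - p**r)**b
        return 1 - (1 - p ** r) ** b

    for p in (0.8, 0.2):
        print(p, round(and_then_or(p, r=4, b=4), 4))
    # 0.8 -> 0.8785 while 0.2 -> 0.0064: the gap between p1 and p2 widens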

 

7 LSH Families for Other Distance Measures

In addition to the Jaccard distance, the following describes LSH families based on other distance measures.

7.1 LSH Families for Hamming Distance

Suppose we have a space of d-dimensional vectors, and h(x, y) denotes the Hamming distance between vectors x and y. If we take any one position of the vectors, say the i-th position, we can define the function fi(x) to be the i-th bit of vector x. Then fi(x) = fi(y) if and only if vectors x and y agree in the i-th position.

Then the probability that fi(x) = fi(y) for a randomly chosen i is exactly 1 − h(x, y)/d.

This is easy to see from the definition of Hamming distance: x and y differ in exactly h(x, y) of the d positions and agree in the remaining d − h(x, y), so a randomly chosen position agrees with probability 1 − h(x, y)/d.

The family F consisting of the functions {f1, f2, ..., fd} is a (d1, d2, 1 − d1/d, 1 − d2/d)-sensitive family of hash functions, for any d1 < d2.

Note that Section 6.1 gave only the abstract definition, without a concrete family meeting the conditions; for the Jaccard distance the concrete family was the minhash functions, and here, for the Hamming distance, it is the bit-sampling functions {f1, ..., fd}.
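
A sketch of this family; representing the vectors as strings of bits is my choice for brevity:

    import random

    def bit_sampling_family(d):
        # the Hamming-distance family: f_i(x) = the i-th bit of x
        return [lambda x, i=i: x[i] for i in range(d)]

    f = random.choice(bit_sampling_family(5))
    print(f("10101") == f("11110"))  # True with probability 1 - h(x, y)/d = 2/5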

 

7.2 Random Hyperplanes and the Cosine Distance

Each hash function f in our locality-sensitive family F is built from a randomly chosen vector vf. Given two vectors x and y, we say f(x) = f(y) if and only if the dot products vf · x and vf · y have the same sign. Then F is a locality-sensitive family for the cosine distance.

That is, F is a (d1, d2, (180 − d1)/180, (180 − d2)/180)-sensitive family of hash functions.

Charikar's simhash, described in a previous article, should be an LSH family for the cosine distance.
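
A sketch of one member of the random-hyperplane family, drawing the vector vf from a Gaussian (a common choice, though the text only requires a random vector):

    import random

    def random_hyperplane_hash(dim, seed=None):
        rng = random.Random(seed)
        v = [rng.gauss(0, 1) for _ in range(dim)]               # the random vector vf
        return lambda x: sum(a * b for a, b in zip(v, x)) >= 0  # sign of vf . x

    f = random_hyperplane_hash(3, seed=42)
    print(f([1.0, 2.0, 3.0]), f([1.1, 2.1, 2.9]))  # nearby vectors usually agree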

7.4 LSH Families for Euclidean Distance

It is also based on a projection method: each hash function corresponds to a randomly chosen line, divided into segments of some fixed length, and points are hashed to the segment containing their projection onto the line...
