Learning Locality-Sensitive Hashing

Source: Internet
Author: User

1. Applications of Near-Neighbor Search

1.1 Jaccard Similarity of Sets

How do we define similarity? One natural notion is the size of the intersection of the relevant attribute sets: the larger the intersection, the more similar. This gives a precise mathematical definition: the Jaccard similarity of sets.

The Jaccard similarity of sets \(S\) and \(T\) is defined as \(|S \cap T| / |S \cup T|\), that is, the ratio of the size of their intersection to the size of their union. We denote it by \(\mathrm{SIM}(S,T)\).
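As a quick check of the definition, here is a minimal MATLAB sketch (sets represented as cell arrays of strings; the variable names are ours, not from the source):

S = {'a','d'}; T = {'a','c','d'};
sim = numel(intersect(S, T)) / numel(union(S, T))   % SIM(S,T) = 2/3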

1.2 Similarity of files

Searching a large corpus for similar documents is an important application of similarity analysis:

    1. Finding plagiarized documents (similarity queries)
    2. Detecting mirrors of the same web pages maintained on different hosts
    3. Finding articles from the same source that have since been modified
2. Shingling of Documents

This section introduces a common way to represent documents as sets for similarity detection.

2.1 K-shingles

Any document is a string of characters. Define the k-shingles of a document as the set of all substrings of length \(k\) that appear in it.
For example, suppose we have the document \(\mathcal{D} = abcdabd\) and take \(k=2\); then the set of 2-shingles of \(\mathcal{D}\) is \(\{ab, bc, cd, da, bd\}\).
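Shingle extraction is a few lines of MATLAB. A minimal sketch (the helper name kshingles is ours, e.g. saved as kshingles.m):

function shingles = kshingles(doc, k)
% Return the set of distinct length-k substrings of the string doc.
n = numel(doc) - k + 1;
s = cell(1, n);
for i = 1:n
    s{i} = doc(i:i+k-1);
end
shingles = unique(s);
end

Calling kshingles('abcdabd', 2) returns {'ab','bc','bd','cd','da'}, the five 2-shingles above (in sorted order).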

2.2 Choosing the Shingle size
    • \(k\) should be chosen large enough that the probability of any given shingle appearing in any given document is low.

For example, for documents made of ASCII characters, suppose only the 26 letters and the space matter; choosing \(k=5\) gives \(27^5 = 14{,}348{,}907\) possible shingles. Since ordinary emails rarely contain that many characters, \(k=5\) should be a good choice for emails. However, because letters occur with very different frequencies, the effective alphabet is smaller; a common rule of thumb is to count about 20 characters, giving roughly \(20^{k}\) distinct shingles. Research suggests that \(k=9\) is a safe choice for large documents.

2.3 Hashing Shingles

For ease of use, we apply a hash function that maps each substring of length \(k\) to one of a number of buckets, and represent the shingle by its bucket number.
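For illustration, a simple polynomial hash in MATLAB (the particular hash is our assumption; any map from strings to bucket numbers will do):

shingles = kshingles('abcdabd', 2);   % from the sketch above
nbuckets = 2^32;
ids = cellfun(@(s) mod(sum(double(s) .* 31.^(0:numel(s)-1)), nbuckets), shingles);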

3. Similarity-Preserving Summaries of Sets

3.1 Matrix Representation of Sets

Rows represent the elements of the universal set, and columns represent the sets:

    Element  S1  S2  S3  S4
    a         1   0   0   1
    b         0   0   1   0
    c         0   1   0   1
    d         1   0   1   1
    e         0   0   1   0

In this example, the universal set is {\(a,b,c,d,e\)}, and the sets are \(S_{1}=\{a,d\}\), \(S_{2}=\{c\}\), \(S_{3}=\{b,d,e\}\), \(S_{4}=\{a,c,d\}\).
A matrix is well suited to visualizing the sets, but because it is almost always sparse, it is not an appropriate way to store them.
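A minimal MATLAB sketch of this characteristic matrix and its sparse storage (variable names are ours):

M = [1 0 0 1;   % a
     0 0 1 0;   % b
     0 1 0 1;   % c
     1 0 1 1;   % d
     0 0 1 0];  % e
Msp = sparse(M);   % store only the nonzero entries, as suggested above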

3.2 Minimum Hash

How do we compute a minhash value to summarize a set? Pick a permutation of the rows; the minhash value of a column is then the first row, in the permuted order, in which that column has a 1. Different permutations give different minhash functions.
Here is an example, with the rows permuted into the order \(b, e, a, d, c\):

    Element  S1  S2  S3  S4
    b         0   0   1   0
    e         0   0   1   0
    a         1   0   0   1
    d         1   0   1   1
    c         0   1   0   1
Continuing the example, the minhash function \(h\) defined by this permutation maps each set to the first element, in this order, that the set contains:
\(h(S_{1})=a\), \(h(S_{2})=c\), \(h(S_{3})=b\), \(h(S_{4})=a\).

3.3 The Relationship between Minhash and Jaccard Similarity
    • For a randomly chosen permutation of the rows, the probability that the minhash function gives the same value to two sets equals the Jaccard similarity of those two sets.
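This is easy to check empirically. A minimal MATLAB sketch (our own, using the example matrix above): the fraction of random permutations on which \(S_1\) and \(S_4\) receive the same minhash should approach \(\mathrm{SIM}(S_1,S_4) = 2/3\).

M = [1 0 0 1; 0 0 1 0; 0 1 0 1; 1 0 1 1; 0 0 1 0];   % the matrix above
trials = 100000; agree = 0;
for t = 1:trials
    p = randperm(5);              % a random permutation of the 5 rows
    h1 = find(M(p, 1), 1);        % first row (in permuted order) where S1 has a 1
    h4 = find(M(p, 4), 1);        % likewise for S4
    agree = agree + (h1 == h4);
end
agree / trials                    % approaches SIM(S1,S4) = 2/3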
3.4 Minimum Hash signature

Given the characteristic matrix \(M\), we pick \(n\) minhash functions \(h_1, \dots, h_n\) at random. For each set (column) \(S\) we compute the vector \([h_1(S), h_2(S), \dots, h_n(S)]\); collecting these columns gives the signature matrix of \(M\). The signature matrix is generally much smaller than \(M\): the number of columns is unchanged, but the number of rows drops to \(n\).

3.5 Computing Minhash Signatures

Permuting a large matrix explicitly is expensive, so to facilitate the calculation we simulate a random row permutation with a random hash function that maps row numbers to the same number of buckets. Here is an example, using \(h_1(x) = (x+1) \bmod 5\) and \(h_2(x) = (3x+1) \bmod 5\) (rows 0 through 4 correspond to elements \(a\) through \(e\)):

    Row  S1  S2  S3  S4  h1(x)  h2(x)
    0     1   0   0   1    1      1
    1     0   0   1   0    2      4
    2     0   1   0   1    3      2
    3     1   0   1   1    4      0
    4     0   0   1   0    0      3
After choosing these hash functions, we compute the minhash signature as follows:
1. Let \(SIG(i,c)\) denote the element of the signature matrix for the \(i\)-th hash function and column \(c\).
2. Initialize every \(SIG(i,c)\) to positive infinity.
3. For each row \(r\), first compute \(h_1(r), \dots, h_n(r)\); then for each column \(c\), do:

    • If \(c\) has 0 in row \(r\), do nothing.
    • If \(c\) has 1 in row \(r\), then for each \(i = 1, \dots, n\), set \(SIG(i,c) = \min(SIG(i,c), h_i(r))\).

Example:
After initialization:

          S1   S2   S3   S4
    h1:    ∞    ∞    ∞    ∞
    h2:    ∞    ∞    ∞    ∞
We process row 0 and find \(h_1(0)=h_2(0)=1\); this row has 1 in \(S_1\) and \(S_4\), so we update those two columns:

          S1   S2   S3   S4
    h1:    1    ∞    ∞    1
    h2:    1    ∞    ∞    1
Moving on to row 1, only \(S_3\) changes:

          S1   S2   S3   S4
    h1:    1    ∞    2    1
    h2:    1    ∞    4    1
Next is row 2, which changes only \(S_2\):

          S1   S2   S3   S4
    h1:    1    3    2    1
    h2:    1    2    4    1
Then comes row 3, where \(S_1\), \(S_3\), and \(S_4\) have 1s. Because we always take the smaller value, \(h_1(3)=4\) is never assigned, and only the \(h_2\) entries change:

          S1   S2   S3   S4
    h1:    1    3    2    1
    h2:    0    2    0    0
Last is row 4, which changes only \(S_3\), giving the final signature matrix:

          S1   S2   S3   S4
    h1:    1    3    0    1
    h2:    0    2    0    0

Finally, let's check how well this signature matrix estimates similarity.
The columns for \(S_1\) and \(S_4\) are identical, so the signature estimates \(\mathrm{SIM}(S_1,S_4)=1.0\), while the true Jaccard similarity is \(2/3\); this is the only estimation error in the example. With only two hash functions the estimates are necessarily coarse; longer signatures reduce the error.
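The whole computation fits in a few lines of MATLAB. A minimal sketch (our own code, using the two hash functions assumed above) that reproduces the example:

% Characteristic matrix: rows 0-4 are elements a-e, columns are S1-S4.
M = [1 0 0 1;
     0 0 1 0;
     0 1 0 1;
     1 0 1 1;
     0 0 1 0];
h = {@(x) mod(x+1, 5), @(x) mod(3*x+1, 5)};   % the assumed hash functions
SIG = inf(numel(h), size(M, 2));              % step 2: initialize to infinity
for r = 0:size(M, 1)-1
    hv = cellfun(@(f) f(r), h)';              % h_i(r) for this row
    for c = find(M(r+1, :))                   % columns with a 1 in row r
        SIG(:, c) = min(SIG(:, c), hv);       % step 3: take the minimum
    end
end
disp(SIG)   % prints [1 3 0 1; 0 2 0 0], the signature matrix above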

4. Locality-sensitive Hashing for Documents

Minhash signatures compress the data, but on their own they are not enough: with a large number of items, comparing every pair of signatures is still infeasible.

4.1 LSH for Minhash Signatures

The basic idea of LSH is to hash each item several times, in such a way that similar items are more likely to be hashed into the same bucket. We then examine only pairs of items that fall into the same bucket, which greatly reduces the amount of computation.
We hope that dissimilar items never land in the same bucket, and that similar items never end up only in different buckets.

    • A dissimilar pair that lands in the same bucket is called a false positive.
    • A similar pair that never shares a bucket is called a false negative.

An efficient way to apply this to minhash signatures is the following: divide the rows of the signature matrix into \(b\) bands of \(r\) rows each, so the total number of rows (the number of hash functions) is \(n = b\times r\). For each band, we use the same hash function to map the band's \(r\)-vector of integers to a large number of buckets.

(Figure: the first band of a signature matrix; the second and fourth columns both contain the vector \([0, 2, 1]\).)

That is, we hash the signature matrix once more, band by band. In the first band, the second and fourth columns are both \([0; 2; 1]\), so the probability that they are hashed into the same bucket is \(100\%\). Regardless of the other bands, these two items will certainly be selected as a candidate pair.

A pair that does not share a bucket in one band can still become a candidate: it is enough that the pair is hashed together in at least one other band.
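A minimal MATLAB sketch of the banding step (the helper lsh_candidates is ours, e.g. saved as lsh_candidates.m; grouping identical \(r\)-vectors with unique plays the role of the per-band hash):

function cand = lsh_candidates(SIG, b, r)
% SIG: n-by-m signature matrix with n = b*r. cand(i,j) is true if
% columns i and j share a bucket in at least one band.
m = size(SIG, 2);
cand = false(m, m);
for k = 1:b
    band = SIG((k-1)*r+1 : k*r, :);
    [~, ~, bucket] = unique(band', 'rows');   % equal r-vectors, same bucket
    for u = unique(bucket)'
        cols = find(bucket == u);
        cand(cols, cols) = true;              % all pairs in this bucket
    end
end
cand = cand & ~eye(m);   % a column is not a candidate pair with itself
end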

4.2 Analysis of the Banding Technique

Suppose the probability that two documents get the same minhash value in any particular row is \(s\). Then the probability that they become a candidate pair can be derived step by step:
    • All \(r\) rows in one band agree: \(s^r\)
    • At least one row in that band disagrees: \(1-s^r\)
    • No band agrees in all of its rows: \((1-s^r)^b\)
    • At least one band agrees in all of its rows: \(1-(1-s^r)^b\)
(Figure: the S-curve \(1-(1-s^r)^b\), the probability of becoming a candidate pair as a function of \(s\).)
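The curve is easy to reproduce. A minimal MATLAB sketch, for one assumed choice of \(b\) and \(r\):

s = linspace(0, 1, 200);
b = 20; r = 5;                         % 20 bands of 5 rows, so n = 100
plot(s, 1 - (1 - s.^r).^b);
xlabel('similarity s'); ylabel('probability of becoming a candidate');

The curve rises steeply around \(s \approx (1/b)^{1/r}\), which acts as a similarity threshold.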
How to calculate \(s\) is a question left for later.

4.3 Combining the techniques
    1. Select an appropriate value of \(k\) and build the \(k\)-shingles of each document. Optional: hash the \(k\)-shingles to shorter bucket numbers.
    2. Sort the document-shingle pairs by shingle.
    3. Select a signature length \(n\) and construct the minhash signature matrix.
    4. Select a threshold \(t\) that defines how similar a pair must be to count as similar. Choose \(b\) and \(r\) satisfying \(n = b \times r\), such that \(t\) approximately satisfies \(t = (\frac{1}{b})^{\frac{1}{r}}\).
    5. Use LSH to construct candidate pairs.
    6. Check each candidate pair to ensure that its signature similarity is at least \(t\).
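A worked instance of step 4 (the numbers are assumed, for illustration only): with \(n = 100\) minhash functions split into \(b = 20\) bands of \(r = 5\) rows, the threshold is \(t \approx (1/20)^{1/5} \approx 0.55\), so pairs with Jaccard similarity above roughly 0.55 are likely to become candidates. To catch less similar pairs, lower the threshold by increasing \(b\) and decreasing \(r\).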
5. Distance Measures

5.1 Definition of Distance

A distance measure \(d\) must satisfy:
    • \(d(x,y) \ge 0\) (no negative distances)
    • \(d(x,y) = 0\) if and only if \(x = y\)
    • \(d(x,y) = d(y,x)\) (symmetry)
    • \(d(x,y) \le d(x,z) + d(z,y)\) (the triangle inequality)

5.2 Euclidean distance

The \(n\)-dimensional Euclidean distance is defined as:
\(d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}\)
Note that this uses the \(L_2\) norm; the distance under the general \(L_r\) norm is:
\(d(x,y) = \left(\sum_{i=1}^{n}|x_i-y_i|^r\right)^{1/r}\)
When \(r=1\) we get the Manhattan distance.
Another interesting case is \(r \to \infty\): in the limit, the distance is the maximum of \(|x_i-y_i|\) over all dimensions \(i\).
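These are one-liners in MATLAB (a small sketch; x and y are numeric vectors of equal length, values assumed):

x = [1 2 3]; y = [2 4 0];
d2   = sqrt(sum((x - y).^2));        % Euclidean (L2) distance
dr   = sum(abs(x - y).^3)^(1/3);     % L3 distance, one example of Lr
d1   = sum(abs(x - y));              % Manhattan (L1) distance
dinf = max(abs(x - y));              % the r -> infinity limit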

5.3 Jaccard Distance

The Jaccard distance is defined as \(d(x,y) = 1 - \mathrm{SIM}(x,y)\).

5.4 Cosine distance

The cosine distance between two points is the angle between the vectors drawn from the origin to those points.
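In MATLAB (a sketch; we measure the angle in radians, which is one common convention):

x = [1 2 3]; y = [2 4 0];
cosdist = acos(dot(x, y) / (norm(x) * norm(y)));   % angle between x and y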

5.5 Edit Distance

The edit distance from \(x\) to \(y\) is the smallest number of single-character insertions and deletions needed to turn \(x\) into \(y\). For example, the edit distance from \(x = abcde\) to \(y = acfdeg\) is 3: delete \(b\), insert \(f\) after \(c\), and insert \(g\) at the end.
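A minimal dynamic-programming sketch in MATLAB (insert/delete edit distance; the helper name is ours, e.g. saved as edit_distance.m):

function d = edit_distance(x, y)
% Smallest number of single-character insertions and deletions
% that transforms string x into string y.
m = numel(x); n = numel(y);
D = zeros(m+1, n+1);
D(:,1) = (0:m)'; D(1,:) = 0:n;      % distance to/from the empty prefix
for i = 1:m
    for j = 1:n
        if x(i) == y(j)
            D(i+1,j+1) = D(i,j);                       % characters match
        else
            D(i+1,j+1) = 1 + min(D(i,j+1), D(i+1,j));  % delete or insert
        end
    end
end
d = D(m+1, n+1);
end

edit_distance('abcde', 'acfdeg') returns 3, matching the example above.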

5.6 Hamming Distance

The Hamming distance of two vectors is the number of components in which they differ. For example, the Hamming distance between 10101 and 11110 is 3.
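In MATLAB this is a single comparison (a sketch; the vectors here are character strings):

hd = sum('10101' ~= '11110')   % = 3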

6. The Theory of Locality-Sensitive Functions

Section 4 described just one family of LSH functions (the minhash functions), combined with the banding technique. This section covers the theory more generally.

All families of LSH functions need to satisfy:
1. Closer pairs must be more likely to become candidates than distant pairs.
2. The functions must be statistically independent of one another, so probabilities of combined events can be estimated by multiplication.
3. They must be efficient: they must identify candidate pairs much faster than exhaustive comparison, and they must combine well, so that combinations reduce both false positives and false negatives.

6.1 Locality-sensitive Functions

Definition:
\(f(x) = f(y)\) means \(f\) declares \(x, y\) a candidate pair;
\(f(x) \neq f(y)\) means \(f\) does not.
All such functions \(f\) make up a family of functions. Let \(d_1 < d_2\) be two distances under some distance measure. We say a family \(\mathbb{F}\) is \((d_1,d_2,p_1,p_2)\)-sensitive when every function \(f\) in \(\mathbb{F}\) satisfies:
    • When \(d(x,y) \le d_1\), the probability that \(f(x) = f(y)\) is at least \(p_1\).
    • When \(d(x,y) \ge d_2\), the probability that \(f(x) = f(y)\) is at most \(p_2\).

6.2 locality-sensitive Families for Jaccard Distance

So far we have seen only one way to construct an LSH family: use the minhash functions, with the Jaccard distance as the distance measure.
    • The family of minhash functions is \((d_1,d_2,1-d_1,1-d_2)\)-sensitive for any \(d_1 < d_2\). This clearly matches the criteria above.
For example, setting \(d_1=0.3\) and \(d_2=0.6\), the minhash family is \((0.3, 0.6, 0.7, 0.4)\)-sensitive. That is, when the Jaccard distance of two items is at most 0.3 (their similarity is at least 0.7), the probability that a minhash function gives them the same value is at least 0.7; when their distance is at least 0.6 (similarity at most 0.4), that probability is at most 0.4.

6.3 Applying LSH in Practice

Here is a translation of the documentation for the LSH MATLAB code of Indyk et al.

Code: http://ttic.uchicago.edu/~gregory/download.html

    • Main functions

lsh.m        build an LSH data structure on a set of input points
lshprep.m    set up and initialize the hash functions
lshfunc.m    set up and initialize the hash tables based on the hash functions
lshins.m     insert data into the hash tables
lshlookup.m  nearest-neighbor search using LSH
lshstats.m   examine a constructed LSH data structure

A simple example dataset is included in the standalone folder lshtst/:
patches.mat: 59,500 grayscale image patches of 20x20 pixels each (reshaped into a 400x59500 matrix, one patch per column)

First, load the dataset:

>> load patches;

Next we build our first LSH data structure: 20 hash tables using 24-bit keys, with the simple LSH scheme:

>> T1=lsh('lsh',20,24,size(patches,1),patches,'range',255);
B UNLIMITED 24 keys 20 tables
13598 distinct buckets
Table 1 adding 13598 buckets (now 13598)
Table 1: 59500 elements
11540 distinct buckets
Table 2 adding 11540 buckets (now 11540)
Table 2: 59500 elements
11303 distinct buckets
Table 3 adding 11303 buckets (now 11303)
Table 3: 59500 elements
12046 distinct buckets
Table 4 adding 12046 buckets (now 12046)
Table 4: 59500 elements
12011 distinct buckets
Table 5 adding 12011 buckets (now 12011)
Table 5: 59500 elements
10218 distinct buckets
... ... ... ...
Table 20 adding 15435 buckets (now 15435)
Table 20: 59500 elements

The input arguments of the lsh() function are, in order: 1. the scheme to use; 2. the number of hash tables; 3. the number of bits in the hash key; 4. the dimensionality of the data; 5. the data itself; 6. the 'range' option (apparently the value range of the data).

We can see in turn how the 59,500 data points are distributed in each hash table. For example, in the first table all the data falls into 13,598 buckets. The per-table lines show how many of the original points were inserted; since we keep the bucket capacity unlimited, this number always equals the size of the dataset.

We use lshstats() to examine the statistics of the LSH data structure we just built:

>> lshstats(T1);
20 tables, 24 keys
Table 1: 59500 in 13598 bkts, med 1, max 2687, avg 520.01
Table 2: 59500 in 11540 bkts, med 1, max 3605, avg 626.10
Table 3: 59500 in 11303 bkts, med 1, max 4685, avg 912.04
Table 4: 59500 in 12046 bkts, med 1, max 3385, avg 652.34
Table 5: 59500 in 12011 bkts, med 1, max 2393, avg 510.91
Table 6: 59500 in 10218 bkts, med 1, max 2645, avg 618.73
Table 7: 59500 in 11632 bkts, med 1, max 3472, avg 779.28
Table 8: 59500 in 16140 bkts, med 1, max 5474, avg 828.26
Table 9: 59500 in 13289 bkts, med 1, max 3300, avg 543.68
Table 10: 59500 in 13905 bkts, med 1, max 3087, avg 671.83
Table 11: 59500 in 13165 bkts, med 1, max 3914, avg 714.18
Table 12: 59500 in 12855 bkts, med 1, max 4510, avg 759.96
Table 13: 59500 in 15263 bkts, med 1, max 3414, avg 641.39
Table 14: 59500 in 12601 bkts, med 1, max 3228, avg 707.47
Table 15: 59500 in 14790 bkts, med 1, max 4412, avg 725.62
Table 16: 59500 in 11448 bkts, med 1, max 4144, avg 696.01
Table 17: 59500 in 17118 bkts, med 1, max 6394, avg 901.37
Table 18: 59500 in 15205 bkts, med 1, max 2971, avg 566.43
Table 19: 59500 in 11527 bkts, med 1, max 2901, avg 609.99
Table 20: 59500 in 15435 bkts, med 1, max 6199, avg 931.30
Total 59500 elements

The output shows, for each table, the number of data points, the number of buckets, and the median/maximum/average number of points per bucket.
With the command lshstats(T1,100), you can get more information:

>> lshstats(T1,100);
20 tables, 24 keys
Table 1: 59500 in 13598 bkts, med 1, max 2687, avg 520.01, (25458) > 100
Table 2: 59500 in 11540 bkts, med 1, max 3605, avg 626.10, (28790) > 100
Table 3: 59500 in 11303 bkts, med 1, max 4685, avg 912.04, (31263) > 100
Table 4: 59500 in 12046 bkts, med 1, max 3385, avg 652.34, (28664) > 100
Table 5: 59500 in 12011 bkts, med 1, max 2393, avg 510.91, (27325) > 100
Table 6: 59500 in 10218 bkts, med 1, max 2645, avg 618.73, (31034) > 100
Table 7: 59500 in 11632 bkts, med 1, max 3472, avg 779.28, (30374) > 100
Table 8: 59500 in 16140 bkts, med 1, max 5474, avg 828.26, (24696) > 100
Table 9: 59500 in 13289 bkts, med 1, max 3300, avg 543.68, (27940) > 100
Table 10: 59500 in 13905 bkts, med 1, max 3087, avg 671.83, (26870) > 100
Table 11: 59500 in 13165 bkts, med 1, max 3914, avg 714.18, (26659) > 100
Table 12: 59500 in 12855 bkts, med 1, max 4510, avg 759.96, (26986) > 100
Table 13: 59500 in 15263 bkts, med 1, max 3414, avg 641.39, (25801) > 100
Table 14: 59500 in 12601 bkts, med 1, max 3228, avg 707.47, (28723) > 100
Table 15: 59500 in 14790 bkts, med 1, max 4412, avg 725.62, (25435) > 100
Table 16: 59500 in 11448 bkts, med 1, max 4144, avg 696.01, (30972) > 100
Table 17: 59500 in 17118 bkts, med 1, max 6394, avg 901.37, (24456) > 100
Table 18: 59500 in 15205 bkts, med 1, max 2971, avg 566.43, (25890) > 100
Table 19: 59500 in 11527 bkts, med 1, max 2901, avg 609.99, (31533) > 100
Table 20: 59500 in 15435 bkts, med 1, max 6199, avg 931.30, (25821) > 100
Total 59500 elements

Here, for example, "(25821) > 100" in the last line means that in this hash table, 49 buckets each contain more than 100 data points, holding 25,821 points in total.

We can also run a test to evaluate the performance of the LSH structure: the third argument is the original data indexed by the structure, the fourth is the set of queries, and the fifth is the radius for the nearest-neighbor search.

>> lshstats(T1,'test',patches,patches(:,1:1000),2);
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 2809.38, max 10795, failures: 4

Out of 59,500 data points, each query was compared against 2,809.38 points on average (10,795 at most), and 4 lookups failed to find a neighbor.

When we reduce the number of hash tables (using only the first five tables of T1):

>> lshstats(T1(1:5),'test',patches,patches(:,1:1000),2);
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 1071.22, max 8239, failures: 36

The average and maximum numbers of comparisons drop significantly, but the number of failed lookups rises.

If we use the first 10 tables:

>> lshstats(T1(1:10),'test',patches,patches(:,1:1000),2);
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 1728.92, max 9379, failures: 11

Evidently the number of hash tables trades off the number of comparisons against the failure rate.

Next we examine the effect of the number of key bits, and test again:

>> T2=lsh('lsh',20,50,size(patches,1),patches,'range',255);
>> lshstats(T2,'test',patches,patches(:,1:1000),2);
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 629.64, max 5716, failures: 401

The number of comparisons drops further, but the failure rate soars. Let's add 20 more hash tables and test again:

>> T2(21:40) = lsh('lsh',20,50,size(patches,1),patches,'range',255);
>> lshstats(T2,'test',patches,patches(:,1:1000),2);
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 832.36, max 5934, failures: 320

That does not seem to help much.

Next we'll look at how to use the lookup function. This is our query image:

>> figure(1);imagesc(reshape(patches(:,50),20,20));colormap gray;axis image

(Figure: the query patch, patches(:,50).)

Let's retrieve its 10 nearest neighbors (we pass 11 because the query will certainly find itself) and display them:

>> tic; [nnlsh,numcand]=lshlookup(patches(:,50),patches,T2,'k',11,'distfun','lpnorm','distargs',{1}); toc
>> figure(2); clf;
>> for k=1:10, subplot(2,5,k); imagesc(reshape(patches(:,nnlsh(k+1)),20,20)); colormap gray; axis image; end

(Figure: the 10 nearest neighbors returned by lshlookup.)

For comparison, the same lookup by exhaustive search:

>> tic; d=sum(abs(bsxfun(@minus,patches(:,50),patches))); [ignore,ind]=sort(d); toc;
Elapsed time is 0.220885 seconds.

(Figure: the 10 nearest neighbors found by exhaustive search.)

The results differ slightly, but acceptably so.

If you want to use the E2LSH scheme instead:

>> Te=lsh('e2lsh',50,30,size(patches,1),patches,'range',255,'w',-4);

Note that -4 is the value of the interval parameter 'w'; here we use 50 hash tables with 30 projections each.
