LSH︱Python implementation of a locality-sensitive random projection forest -- LSHForest/sklearn (I)


I previously implemented the locality-sensitive hashing (LSH) algorithm in R, but because its performance in R was poor, I abandoned LSH for similarity retrieval. After picking up Python I found that many modules implement it, and that random projection forests make queries much faster, so I think it is worth trying at scale for data similarity retrieval + deduplication scenarios.

In my view, text similarity can be divided into two categories: mechanical similarity and semantic similarity.
Mechanical similarity measures how much the literal content of two texts overlaps, e.g. the similarity between "Hello" and "hello"; it purely reflects whether the characters coincide, and its typical application scenario is article deduplication.
Semantic similarity measures how close two texts are in meaning, e.g. "Apple" and "company"; it is not discussed in this article.

Previously written blog posts on the R implementation:
R implementation ︱ locality-sensitive hashing (LSH) for the mechanical similarity of text (I, basic principles)
R implementation ︱ locality-sensitive hashing (LSH) for the mechanical similarity of text (II, introduction to textreuse)

The Python version of mechanical similarity comes in four parts:
LSH︱Python implementation of a locality-sensitive random projection forest -- LSHForest/sklearn (I)
LSH︱Python implementation of locality-sensitive hashing -- lshash (II)
Similarity︱Python + OpenCV implementation of the pHash algorithm + Hamming distance (SimHash) (III)
LSH︱Python implementation of MinHash-LSH and MinHash LSH Forest -- datasketch (IV)
.

I. Random projection forests

This section draws on the paper "Research on scene text image clustering based on random projection" and the blog post "Random projection forest -- an approximate nearest neighbor (ANN) method".

1. Random projection forest theory and implementation pseudo-code

When the dataset is large, searching for KNN by linear scan takes too long, and it is unrealistic to read all the data into memory. Therefore, in practical engineering, approximate nearest neighbor (ANN) search is used instead.
One approach is to use a random projection tree to partition all the data, reducing the number of points to search and compute per query to an acceptable range, and then to build multiple random projection trees to form a random projection forest, taking the combined results of the whole forest as the final result.

The process of building a random projection tree is as follows (taking two-dimensional space as an example):

    • Randomly select a vector starting from the origin.
    • The line perpendicular to this vector divides the points in the plane into two parts.
    • Assign the points belonging to these two parts to the left and right subtrees respectively.

Mathematically, compute the dot product of each point with the chosen vector: points whose dot product is less than 0 go to the left subtree, and points whose dot product is greater than 0 go to the right subtree.
Note that in the figure, the straight line without an arrow is the boundary that separates the left and right subtrees, while the vector with an arrow is the one used to compute the dot products. In this way, the original set of points is split into two parts, as illustrated below:

However, the number of points within each partition may still be large, so the division continues: another vector is chosen at random, and the line perpendicular to it splits the points again. The legend is as follows:

Note that this division is performed on top of the previous one.
That is, the points in the diagram are now divided into four parts, corresponding to a tree of depth 2 with four leaf nodes. This continues until the number of points in each leaf node is sufficiently small. Note that the resulting tree is not necessarily a complete tree.
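A minimal sketch of this construction, assuming NumPy and hypothetical helper names (it illustrates the idea above and is not the code of any particular library):

import numpy as np

def build_rp_tree(points, indices=None, max_leaf_size=10, rng=None):
    # points: (n, d) array; indices: which rows of `points` belong to this node.
    rng = np.random.default_rng() if rng is None else rng
    if indices is None:
        indices = np.arange(len(points))
    # Stop splitting once the node is small enough: this becomes a leaf.
    if len(indices) <= max_leaf_size:
        return {"leaf": True, "indices": indices}
    # Randomly choose a direction from the origin (a unit vector).
    direction = rng.normal(size=points.shape[1])
    direction /= np.linalg.norm(direction)
    projections = points[indices] @ direction
    # The text splits at 0 (the sign of the dot product); splitting at the median
    # keeps the tree balanced. Either choice partitions the node into two parts.
    threshold = np.median(projections)
    left = indices[projections < threshold]
    right = indices[projections >= threshold]
    if len(left) == 0 or len(right) == 0:  # degenerate split, stop here
        return {"leaf": True, "indices": indices}
    return {"leaf": False, "direction": direction, "threshold": threshold,
            "left": build_rp_tree(points, left, max_leaf_size, rng),
            "right": build_rp_tree(points, right, max_leaf_size, rng)}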

Building a random projection forest requires two parameters: the depth of a single tree and the number of trees in the forest.
These two parameters determine how finely the dataset is partitioned and how many leaf-node dimensions are obtained after random projection.

To find the nearest neighbors of a new point with such a tree, we first locate the leaf node it belongs to by computing its dot product with the vector used at each split, and then run the exact nearest-neighbor computation only on the points in that leaf node.
This is the query process of a single random projection tree; using the same method, multiple random projection trees are built to form a random forest, and the combined results over the whole forest are taken as the final result. A sketch of this query step follows.
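Continuing the sketch above, a query descends each tree by recomputing the same dot products, collects the candidate points from the leaves it lands in, and runs the exact nearest-neighbor computation only on that much smaller candidate set (again a hypothetical illustration, reusing build_rp_tree from the previous sketch):

import numpy as np

def query_tree(tree, point):
    # Walk down one tree; the leaf's indices are this tree's candidates.
    node = tree
    while not node["leaf"]:
        go_left = point @ node["direction"] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["indices"]

def query_forest(points, trees, point, k=5):
    # Union the candidates of every tree, then do exact k-NN on that subset.
    candidates = np.unique(np.concatenate([query_tree(t, point) for t in trees]))
    dists = np.linalg.norm(points[candidates] - point, axis=1)
    order = np.argsort(dists)[:k]
    return candidates[order], dists[order]

# Example: a forest of 10 independent trees over the same data.
# points = np.random.rand(1000, 20)
# trees = [build_rp_tree(points, max_leaf_size=20) for _ in range(10)]
# neighbors, distances = query_forest(points, trees, points[0], k=5)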
.

2. Related extensions

Wright et al. applied random projection to face recognition, Nowak et al. used random projection to learn a similarity measure over visual words, and Freund et al. applied random projection to handwriting recognition, all with good results.
.

3. Random projection forests for feature vectors + clustering

In "Research on scene text image clustering based on random projection", each leaf node is treated as one feature dimension, the number of feature points falling into that leaf node is used as its value, and in this way the feature vector of a test image is obtained.
This is a little like the Huffman tree in word2vec.
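A rough sketch of that idea, reusing the tree structure from the earlier sketch: number the leaves of a tree, then count how many of an image's local descriptors fall into each leaf; the resulting histogram is the image's feature vector (hypothetical helpers; the paper's actual pipeline uses SIFT/ASIFT descriptors):

import numpy as np

def collect_leaves(tree, leaves=None):
    # Enumerate the leaf nodes of one random projection tree in a fixed order.
    if leaves is None:
        leaves = []
    if tree["leaf"]:
        leaves.append(tree)
    else:
        collect_leaves(tree["left"], leaves)
        collect_leaves(tree["right"], leaves)
    return leaves

def leaf_histogram(tree, descriptors):
    # One dimension per leaf; its value is the number of descriptors landing in it.
    leaves = collect_leaves(tree)
    leaf_index = {id(leaf): i for i, leaf in enumerate(leaves)}
    hist = np.zeros(len(leaves))
    for d in descriptors:
        node = tree
        while not node["leaf"]:
            go_left = d @ node["direction"] < node["threshold"]
            node = node["left"] if go_left else node["right"]
        hist[leaf_index[id(node)]] += 1
    return hist  # the image's feature vector for this tree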

The experimental results in this paper are as follows:

The forest size in all the experiments is 10 trees.

    • The first group of experiments uses SIFT local feature descriptors and measures recognition accuracy at different tree depths (deep), where F = (2 * R * P) / (R + P), i.e. the F-measure computed from R and P; overall, a depth of deep = 8 looks reasonable.
    • The second group of experiments compares AP clustering and K-means clustering at different depths. The data is a Google image set, local features are described with the ASIFT method, and AP and K-means are used for clustering respectively. Because the number of classes produced by AP clustering is determined by the values on the diagonal of the similarity matrix, several trials are needed; in the end the median of the similarity matrix is used as the diagonal element value to control the number of clusters, and the reported AP clustering evaluation indices are averages over many runs. The K-means results come from multiple runs over different iteration counts and class numbers, with the best clustering result taken as the final result.

    • The third group of experiments also uses the Google image set, with AP clustering, and compares the clustering results obtained with different local feature descriptors (ASIFT and SIFT); the ASIFT local features beat SIFT by more than 10% on every index.

Thus ASIFT describes the local features of text-region images in natural scenes better and more accurately than SIFT. This is because SIFT has only scale and rotation invariance and cannot match the same text under viewpoint changes, whereas ASIFT is not only scale- and rotation-invariant but also affine-invariant, which is more practical for text in natural scenes.
Details of ASIFT versus SIFT can be found in the paper.
.

II. LSHForest/sklearn

LSHForest = LSH + random projection trees.
LSHForest can be used from Python's sklearn; the official documentation is at sklearn.neighbors.LSHForest.

1. The main class: LSHForest
class sklearn.neighbors.LSHForest(n_estimators=10, radius=1.0, n_candidates=50, n_neighbors=5, min_hash_match=4, radius_cutoff_ratio=0.9, random_state=None)

Random projection forests are an alternative method for nearest-neighbor search.
The LSH Forest data structure uses sorted arrays, binary search, and 32-bit fixed-length hash representations; the random projection approximates the cosine distance.
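As a rough illustration of how random projections approximate the cosine distance, here is a minimal sketch of the standard random-hyperplane (SimHash-style) estimate; it conveys the idea but is not the exact internal hashing scheme of LSHForest:

import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 32                       # 32-bit fixed-length hashes
planes = rng.normal(size=(n_bits, d))    # one random hyperplane per bit

def cosine_hash(x):
    # Each bit records on which side of a random hyperplane the vector falls.
    return (planes @ x) > 0

def approx_cosine_distance(x, y):
    # The fraction of differing bits estimates the angle between x and y,
    # and hence approximates the cosine distance 1 - cos(theta).
    hx, hy = cosine_hash(x), cosine_hash(y)
    theta = np.mean(hx != hy) * np.pi
    return 1 - np.cos(theta)

x, y = rng.normal(size=d), rng.normal(size=d)
print(approx_cosine_distance(x, y))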

n_estimators : int (default = 10) -- the number of trees.
min_hash_match : int (default = 4) -- the minimum length of the matched hash prefix; the search stops once it falls below this.
n_candidates : int (default = 50) -- the minimum number of candidates evaluated per tree, so that every tree contributes at least this many.
n_neighbors : int (default = 5) -- the default number of neighbors returned at retrieval time, in case you forget to set it in the query.
radius : float, optional (default = 1.0) -- the distance radius of neighbors at retrieval time.
radius_cutoff_ratio : float, optional (default = 0.9) -- the cut-off for radius search: searching stops when the ratio of neighbors found within the radius drops below this threshold, or when the matched hash prefix becomes shorter than min_hash_match.
random_state : int, RandomState instance or None, optional (default = None) -- the seed used by the random number generator; by default no seed is fixed.

Included attributes:

trees_ : array, shape (n_estimators, n_samples) -- each tree corresponds to one sorted hash array; what is stored are the hash values (n_estimators trees, n_samples hashes each).
original_indices_ : array, shape (n_estimators, n_samples) -- each tree corresponds to one sorted hash array; what is stored are the original row indices of the data.

trees_ and original_indices_ are two views of the same state: trees_ holds each tree's sorted hash values, and original_indices_ holds the original indices in that sorted order.
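Note that LSHForest has since been deprecated and removed from newer scikit-learn releases, so the snippets in this section assume an older version that still ships it. A short sketch of instantiating it with the parameters above and inspecting the two fitted attributes:

import numpy as np
from sklearn.neighbors import LSHForest  # assumes an older scikit-learn that still has LSHForest

X_train = np.random.rand(100, 16)

lshf = LSHForest(n_estimators=10,        # number of trees
                 n_candidates=50,        # candidates evaluated per tree
                 n_neighbors=5,          # default number of neighbors returned
                 min_hash_match=4,       # stop once the matched hash prefix is shorter than this
                 radius=1.0,
                 radius_cutoff_ratio=0.9,
                 random_state=42)
lshf.fit(X_train)

print(lshf.trees_.shape)             # (n_estimators, n_samples): sorted hash values
print(lshf.original_indices_.shape)  # (n_estimators, n_samples): original row indices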
.

2. LSHForest related methods
    • fit(X[, y])

Fit the LSH Forest on the data.
Loads the data into the projection trees.

    • get_params([deep])

Get parameters for this estimator.
Gets the relevant parameters of the estimator.

    • kneighbors(X, n_neighbors=None, return_distance=True)

Retrieval function; n_neighbors is the desired number of neighbors (if not set, the value given at initialization is used), and return_distance controls whether the approximate cosine distances are returned along with the results.
Returns two arrays: one of distances and one of neighbor indices.

    • kneighbors_graph([X, n_neighbors, mode])

Computes the (weighted) graph of k-neighbors for points in X.
Neighbor-search graph; n_neighbors is the desired number of neighbors (defaults to the initialization value), and mode='connectivity' by default.

    • partial_fit(X[, y])

Adds new data to the existing trees; best done in batches.

    • radius_neighbors(X[, radius, return_distance])

Finds the neighbors within a given radius of a point or points.
Radius retrieval: searches for neighbors within the given radius; return_distance controls whether the distances are returned.

    • radius_neighbors_graph([X, radius, mode])

Computes the (weighted) graph of neighbors for points in X.
Radius-based neighbor graph.

    • set_params(**params)

Set the parameters of this estimator.
Resets some of the parameters.
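A short sketch that exercises the remaining methods under the same assumption (an older scikit-learn that still ships LSHForest):

import numpy as np
from sklearn.neighbors import LSHForest

X_train = np.random.rand(200, 8)
X_new = np.random.rand(20, 8)
X_test = np.random.rand(3, 8)

lshf = LSHForest(random_state=42).fit(X_train)

# Radius search: every neighbor whose approximate cosine distance is <= radius.
distances, indices = lshf.radius_neighbors(X_test, radius=0.3, return_distance=True)

# Sparse connectivity graph of the k nearest neighbors of each query point.
graph = lshf.kneighbors_graph(X_test, n_neighbors=5, mode='connectivity')

# Incrementally add new points to the already-fitted index.
lshf.partial_fit(X_new)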
.

3. Example one
>>> from sklearn.neighbors import LSHForest
>>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
>>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
>>> lshf = LSHForest(random_state=42)
>>> lshf.fit(X_train)
LSHForest(min_hash_match=4, n_candidates=50, n_estimators=10, n_neighbors=5,
          radius=1.0, radius_cutoff_ratio=0.9, random_state=42)
>>> distances, indices = lshf.kneighbors(X_test, n_neighbors=2)
>>> distances
array([[ 0.069...,  0.149...],
       [ 0.229...,  0.481...],
       [ 0.004...,  0.014...]])
>>> indices
array([[1, 2],
       [2, 0],
       [4, 0]])

LSHForest(random_state=42) initializes the trees;
lshf.fit(X_train) loads the data into the initialized trees;
lshf.kneighbors(X_test, n_neighbors=2) finds the top 2 (n_neighbors) most similar items for each element of X_test.
Note that what is returned here is the cosine distance, not the similarity; for a more intuitive number, subtract it from 1.
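For example, converting the returned distance array into similarities is just a subtraction:

similarities = 1 - distances   # cosine similarity = 1 - cosine distance
print(similarities)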
.

4. Example two

From: Comparing the similarity of two documents with docsim/doc2vec/LSH

# Use LSH for processing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LSHForest

tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2),
                                   use_idf=1, smooth_idf=1, sublinear_tf=1)

train_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_with_space(item_text)
    train_documents.append(item_str)
X_train = tfidf_vectorizer.fit_transform(train_documents)

test_data_1 = 'Hello, I would like to ask: I want a divorce but he does not, and he says he will not give up the child; is the divorce automatic after six months?'
test_cut_raw_1 = util_words_cut.get_class_words_with_space(test_data_1)
x_test = tfidf_vectorizer.transform([test_cut_raw_1])

lshf = LSHForest(random_state=42)
lshf.fit(X_train.toarray())

distances, indices = lshf.kneighbors(x_test.toarray(), n_neighbors=3)
print(distances)
print(indices)

In general, LSH is better suited to comparing short texts.

.

Related operations:

Getting the related attributes:

# Attributes
lshf.trees_              # each tree's sorted hash values
lshf.hash_functions_     # each tree's hash function
lshf.original_indices_   # each tree's sorted original indices

Graph of nearest neighbor search: kneighbors_graph

lshf.kneighbors_graph(X_test, n_neighbors=5, mode='connectivity')

Add data to the tree:

lshf.partial_fit(X_test)

