LSH︱Python implementation of a locality-sensitive random projection forest -- LSHForest/sklearn (I)


About the locality-sensitive hashing (LSH) algorithm: I implemented it in R before, but performance in R was too low, so I gave up using LSH for similarity search there. While learning Python I found that many modules implement it, and that random projection forests make querying data much faster. It seems worth trying in large-scale data similarity retrieval and deduplication scenarios.

In my view, text similarity can be divided into two categories: mechanical similarity and semantic similarity.
Mechanical similarity measures how related the content of two texts is, purely in terms of characters that co-occur, for example the similarity of "How are you" and "Hello"; its application scenario is article deduplication.
Semantic similarity measures how similar two texts are in meaning, for example the similarity of "Apple" and "Company"; this article does not discuss it.

Earlier blog posts on the R implementation:
R language implementation︱ locality-sensitive hashing (LSH) for the mechanical similarity of text (I, basic principles)
R language implementation︱ locality-sensitive hashing (LSH) for the mechanical similarity of text (II, introduction to textreuse)

The four parts of the mechanical-similarity Python series:
LSH︱Python implementation of a locality-sensitive random projection forest -- LSHForest/sklearn (I)
LSH︱Python implementation of locality-sensitive hashing -- LSHash (II)
Similarity︱Python + OpenCV implementation of the pHash algorithm + Hamming distance (SimHash) (III)
LSH︱Python implementation of MinHash LSH and MinHash LSH Forest -- datasketch (IV)
.

First, the random projection forest

This section draws on the paper "Research on scene text image clustering based on random projection" and the blog post "Random projection forest: an approximate nearest neighbor (ANN) method".

1. Random projection forest theory and implementation pseudo-code

When the amount of data is large, linear search for the k nearest neighbors (KNN) takes too much time, and it is unrealistic to read all the data into memory, so practical projects use approximate nearest neighbor (ANN) search instead.
One method is the random projection tree: the whole dataset is partitioned so that the number of points searched and computed per query falls into an acceptable range, then several random projection trees are built to form a random projection forest, and the combined result of the forest is taken as the final result.

The process of building a random projection tree is as follows (in two-dimensional space):

    • Randomly select a vector starting from the origin.
    • The line perpendicular to this vector divides the points in the plane into two parts.
    • Assign the points in these two parts to the left and right subtrees respectively.

Mathematically, this step is completed by computing the dot product of each point with that vector: points whose dot product is greater than 0 are assigned to the left subtree, and points whose dot product is less than 0 are assigned to the right subtree.
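As a minimal sketch of this single split step (the variable names are my own, assuming 2-D points stored in a NumPy array):

import numpy as np

points = np.random.randn(100, 2)   # points in the plane
v = np.random.randn(2)             # a random vector chosen from the origin
dots = points @ v                  # dot product of every point with v
left = points[dots > 0]            # dot product > 0  -> left subtree
right = points[dots <= 0]          # dot product <= 0 -> right subtree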


Note: the straight line without an arrow is the basis for dividing the left and right subtrees, and the vector with the arrow is what the dot product is computed against. In this way the original points are divided into two parts, as shown below:

However, the number of points within each part is still much too high, so we continue to divide: another vector is selected at random, and the line perpendicular to it divides all the points again, as shown below:

Note that the division at this time is based on the previous division.


That is, the points in the graph have now been divided into four parts, corresponding to a tree of depth 2 with four leaf nodes.

And so on, until the number of points in each leaf node is sufficiently small.

Notice that the tree is not a complete tree.

Building a random projection forest requires two parameters: the depth of each tree and the number of trees in the forest.

These two parameters determine how finely the dataset is partitioned and the number of vector dimensions obtained after a random projection.

To find the nearest neighbors of a new point using such a tree, first compute its dot product with the vector used at each split to find the leaf node it belongs to, and then run the nearest-neighbor algorithm only on the points contained in that leaf node.


This is the computation process of a single random projection tree. Using the same method, multiple random projection trees are built to form a random forest, and the combined result of the whole forest is taken as the final result.
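As a rough illustration of the whole process, here is my own simplified sketch in Python/NumPy (not the pseudo-code of the cited paper) of building such a tree and querying a forest of them:

import numpy as np

def build_tree(points, indices, max_leaf_size=10, rng=None):
    # Recursively split the points by the sign of their dot product with a random vector.
    rng = np.random.default_rng() if rng is None else rng
    if len(indices) <= max_leaf_size:
        return {"leaf": indices}
    v = rng.normal(size=points.shape[1])            # random direction from the origin
    proj = points[indices] @ v                      # dot product of each point with v
    left, right = indices[proj > 0], indices[proj <= 0]
    if len(left) == 0 or len(right) == 0:           # degenerate split: stop early
        return {"leaf": indices}
    return {"vector": v,
            "left": build_tree(points, left, max_leaf_size, rng),
            "right": build_tree(points, right, max_leaf_size, rng)}

def query_tree(node, q):
    # Descend to the leaf node that the query point q falls into.
    while "leaf" not in node:
        node = node["left"] if q @ node["vector"] > 0 else node["right"]
    return node["leaf"]

def query_forest(forest, points, q, k=5):
    # Union the candidate leaves of all trees, then run an exact search on the candidates only.
    candidates = np.unique(np.concatenate([query_tree(t, q) for t in forest]))
    dists = np.linalg.norm(points[candidates] - q, axis=1)
    return candidates[np.argsort(dists)[:k]]

X = np.random.default_rng(0).normal(size=(1000, 16))
forest = [build_tree(X, np.arange(len(X))) for _ in range(10)]   # a forest of 10 trees
print(query_forest(forest, X, X[0], k=5))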


.

2. Related extensions

Wright et al. applied random projection to face recognition under changing viewpoints, Nowak et al. used random projection to study similarity measures for visual words, and Freund et al. applied random projection to handwriting recognition with very good results.
.

3. Building feature vectors with a random projection forest + clustering

In the research on clustering scene text images based on random projection, each leaf node is treated as a one-dimensional feature, the number of feature points that fall into a leaf node is used as that leaf node's descriptor value, and the feature vector of the test image is finally obtained in this way.


A little like the Huffman tree in Word2vec.
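A rough sketch of this idea follows; the helper assign_leaf is hypothetical and stands for routing one local descriptor (e.g. a SIFT/ASIFT vector) down a projection tree and returning the index of the leaf it lands in:

import numpy as np

def image_feature_vector(descriptors, assign_leaf, n_leaves):
    # Each leaf node is one dimension of the feature vector; its value is the number
    # of the image's local feature points that fall into that leaf.
    # assign_leaf is a hypothetical helper mapping one descriptor to a leaf index.
    vec = np.zeros(n_leaves)
    for d in descriptors:
        vec[assign_leaf(d)] += 1
    return vec

The resulting vectors are what get clustered (with AP or K-means) in the experiments below.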

The experimental results in this paper are as follows:

The forest size is 10 trees.

    • The first set of experiments uses SIFT local feature descriptors and measures recognition accuracy at different tree depths, using F = (2 * R * P) / (R + P) (the F-measure, where R is recall and P is precision); overall, a depth of deep = 8 is reasonable.
    • The second set of experiments compares AP clustering and K-means clustering at different depths. The experimental data is a Google image set, the local features are described with the ASIFT method, and the data are clustered with AP and K-means respectively. Since the number of classes produced by AP clustering is determined by the values of the diagonal elements of the similarity matrix, several trials are needed; in the end, the median of the similarity matrix is used as the value of its diagonal elements to control the number of clusters. The reported AP clustering evaluation indices are averages over many runs, while K-means is run many times with different iteration counts and numbers of classes, and the best clustering result is taken as the final one.
    • The third set of experiments also uses the Google image set, with AP as the clustering algorithm, comparing the clustering results obtained with different local feature description methods (ASIFT and SIFT); the ASIFT local feature description beats the SIFT method by more than 10% on every index.

Thus, ASIFT describes the local features of text-region images in natural scenes better and more accurately than SIFT. This is because SIFT has only scale and rotation invariance: for the same text under a change of viewpoint, matching descriptors cannot be obtained. ASIFT has not only the scale and rotation invariance of the image but also affine invariance, which makes it more useful for processing text in natural scenes.
The details of ASIFT and SIFT can be found in the paper.


.

Second, LSHForest / sklearn

LSHForest = LSH + random projection trees
LSHForest can be used via Python's sklearn.

Official documentation: sklearn.neighbors.LSHForest

1. Main class: LSHForest
class sklearn.neighbors.LSHForest(n_estimators=10, radius=1.0, n_candidates=50, n_neighbors=5, min_hash_match=4, radius_cutoff_ratio=0.9, random_state=None)

The random projection forest is an alternative method for nearest neighbor search.


LSH forest data structures use sorted arrays, binary searches, and 32-bit fixed-length hash representations.

The random projection approximates the cosine distance between vectors.

n_estimators : int (default = 10)
    The number of trees.
min_hash_match : int (default = 4)
    The minimum hash search length/count; searching stops below this.
n_candidates : int (default = 50)
    The minimum number of candidates evaluated per tree; every tree gets evaluated at least this many times, so the work is spread evenly.
n_neighbors : int (default = 5)
    The minimum number of neighbors returned when retrieving, in case you forget to set the search quantity in the query.
radius : float, optional (default = 1.0)
    The distance radius of neighboring items when retrieving.
radius_cutoff_ratio : float, optional (default = 0.9)
    The lower limit of the radius when retrieving; roughly, searching stops when the similarity probability falls below this threshold, or when the minimum hash search length is less than min_hash_match.
random_state : int, RandomState instance or None, optional (default = None)
    The seed used by the random number generator; not set by default.
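For instance, a small sketch of setting a few of these parameters explicitly (the values here are illustrative only, not recommendations):

from sklearn.neighbors import LSHForest

lshf = LSHForest(n_estimators=20,      # more trees: better recall, slower queries
                 n_candidates=200,     # evaluate more candidates per tree
                 n_neighbors=10,       # default number of neighbors to return
                 min_hash_match=4,
                 radius=1.0,
                 radius_cutoff_ratio=0.9,
                 random_state=42)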

Included Properties:

trees_ : array, shape (n_estimators, n_samples)
Each tree corresponds to one hash array, and that hash array is sorted; what is shown are the hash values.

n_estimators trees, n_samples hashes each.

original_indices_ : array, shape (n_estimators, n_samples)
Each tree corresponds to one sorted hash array; what is shown are the indices of the original data.

trees_ and original_indices_ are two attributes: trees_ is each tree's sorted hash values, and original_indices_ is each tree's corresponding original data indices.
.

2. LSHForest-related functions
    • fit(X[, y])

Fit the LSH forest on the data.
Loads the data into the projection trees.

    • get_params([deep])

Get parameters for this estimator.
Gets the relevant parameters of the trees.

    • kneighbors(X, n_neighbors=None, return_distance=True)

Retrieval function. n_neighbors is the desired number of neighbors; if it is not set, the number given at initialization is used. return_distance controls whether the cosine distance of each sample is returned.
Returns two arrays: one of distances and one of neighbor indices.

    • kneighbors_graph([X, n_neighbors, mode])

Computes the (weighted) graph of k-neighbors for points in X.
Neighbor-search graph; n_neighbors is the desired number of neighbors (if not set, the number given at initialization is used), and mode='connectivity' by default.

    • partial_fit(X[, y])

Adds data to the trees; batch imports are preferable.

    • radius_neighbors(X[, radius, return_distance])

Finds the neighbors within a given radius of a point or points.
Radius retrieval: finds the nearest neighbors within the given radius. return_distance controls whether the distances are returned.

    • radius_neighbors_graph([X, radius, mode])

Computes the (weighted) graph of neighbors for points in X.
Radius-based search graph.

    • set_params(**params)

Set the parameters of this estimator.
Resets some of the parameters.
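A small sketch of the radius-based calls, assuming a fitted forest lshf and a query matrix X_test as in the example of the next section (the radius value 0.3 is arbitrary):

distances, indices = lshf.radius_neighbors(X_test, radius=0.3, return_distance=True)
graph = lshf.radius_neighbors_graph(X_test, radius=0.3, mode='connectivity')
lshf.set_params(n_candidates=100)   # reset part of the parameters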
.

3. A case
>>> from sklearn.neighbors import LSHForest
>>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
>>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
>>> lshf = LSHForest(random_state=42)
>>> lshf.fit(X_train)
LSHForest(min_hash_match=4, n_candidates=50, n_estimators=10, n_neighbors=5,
          radius=1.0, radius_cutoff_ratio=0.9, random_state=42)
>>> distances, indices = lshf.kneighbors(X_test, n_neighbors=2)
>>> distances
array([[ 0.069...,  0.149...],
       [ 0.229...,  0.481...],
       [ 0.004...,  0.014...]])
>>> indices
array([[1, 2],
       [2, 0],
       [4, 0]])

LSHForest(random_state=42) initializes the trees;
lshf.fit(X_train) loads the data into the initialized trees;
lshf.kneighbors(X_test, n_neighbors=2) finds, for each element of X_test, the top 2 (n_neighbors) most similar items.


Note that these are cosine distances, not similarities; if you want something more intuitive, you can subtract them from 1.
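For example (reusing the distances array returned above):

similarity = 1 - distances   # rough cosine similarity, since kneighbors returns cosine distances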


.

4. Case two

From: The similarity between two documents with Docsim/doc2vec/lsh

# Use LSH to process
tfidf_vectorizer = TfidfVectorizer(min_df=3, max_features=None, ngram_range=(1, 2),
                                   use_idf=1, smooth_idf=1, sublinear_tf=1)
train_documents = []
for item_text in raw_documents:
    item_str = util_words_cut.get_class_words_with_space(item_text)
    train_documents.append(item_str)
x_train = tfidf_vectorizer.fit_transform(train_documents)

test_data_1 = 'Hello, I want to ask: I want a divorce but he does not; he says no to the child, and that in six months he will initiate the divorce himself'
test_cut_raw_1 = util_words_cut.get_class_words_with_space(test_data_1)
x_test = tfidf_vectorizer.transform([test_cut_raw_1])

lshf = LSHForest(random_state=42)
lshf.fit(x_train.toarray())

distances, indices = lshf.kneighbors(x_test.toarray(), n_neighbors=3)
print(distances)
print(indices)
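A possible follow-up (my own addition, reusing the variables above) to see which training documents were matched and roughly how similar they are:

# For the single test document, show its 3 nearest training documents.
for dist, idx in zip(distances[0], indices[0]):
    print(round(1 - dist, 3), raw_documents[idx])   # 1 - cosine distance as a rough similarity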

In general, LSH is better suited to short texts.

.

Related extensions:

Getting the related attributes:

# Attributes
lshf.trees_               # for each tree, the sorted hash values
lshf.hash_functions_      # the hash function of each tree
lshf.original_indices_    # for each tree, the original data indices of the sorted hashes

Graph of nearest-neighbor searches: kneighbors_graph

lshf.kneighbors_graph(X_test, n_neighbors=5, mode='connectivity')

Add data to the tree:

lshf.partial_fit(X_test)

