ML Clustering: Nearest Neighbor Search


The problem is this: say I'm reading an article and I like it, and I want to find similar articles among the many books on my bookshelf so I can keep reading. What should I do?

The first thing that comes to mind is the brute-force solution: compare the target article against every other article, one by one, and keep the similar ones.

The nearest-neighbor idea is easy to understand: compute the distance between every article and the target article, then take the article at the smallest distance as the most similar candidate, or take the few articles at the smallest distances as a candidate set.

Let's translate this into a slightly more mathematical statement: given a query article x_q, find its nearest neighbor

    x_NN = argmin_i distance(x_q, x_i)

So this is really a question of measuring similarity: how do we measure the similarity between two articles? To turn this idea into something workable, we need to address two challenges:

    1. Vector representation of documents
    2. Distance calculation

For representing documents, the approach you hear about most often is the bag-of-words model.

The bag-of-words model treats a text as an unordered collection of words, ignoring grammar and word order and assuming each word is independent of the others. An article can then be represented as a long vector indexed by vocabulary words, where each element is a word count (term frequency). Obviously, raw word counts do not capture how important a distinctive word is, which is why we introduce the TF-IDF algorithm next.
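To make the representation concrete, here is a minimal bag-of-words sketch in Python (the sample sentence and the whitespace tokenizer are just for illustration):

    from collections import Counter

    def bag_of_words(text):
        # Unordered word counts: grammar and word order are discarded
        return Counter(text.lower().split())

    print(bag_of_words("The cat sat on the mat"))
    # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})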

The main idea of TF-IDF: if a word or phrase appears with high frequency (TF) in one article but is rarely seen in other articles, it is considered to have good discriminating power and is well suited for classification.
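In formula form, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. A minimal sketch over a toy corpus (the corpus and the unsmoothed IDF are illustrative; libraries such as scikit-learn add smoothing):

    import math
    from collections import Counter

    corpus = ["the cat sat on the mat",
              "the dog chased the cat",
              "dogs and cats are pets"]
    docs = [doc.split() for doc in corpus]
    N = len(docs)

    # Document frequency: how many documents contain each term
    df = Counter(term for doc in docs for term in set(doc))

    def tf_idf(doc):
        tf = Counter(doc)
        # Terms frequent here but rare elsewhere get the largest weights
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    print(tf_idf(docs[0]))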

Distance calculation was studied in detail in the earlier article on similarity (http://www.cnblogs.com/sxbjdl/p/5708681.html), so I won't belabor it here.
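For completeness: a common choice for comparing TF-IDF vectors is cosine distance. A minimal sketch over sparse {term: weight} dicts (this is one reasonable choice, not necessarily the one the linked article settles on):

    import math

    def cosine_distance(u, v):
        # 1 - cosine similarity; 0 means same direction, 1 means orthogonal
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return 1.0 - dot / (norm_u * norm_v)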

OK, on to the next problem, the search algorithm: how do we search over the articles?

The simplest implementation of k-nearest-neighbor search is a linear scan: compute the distance between the input instance and every training instance. When the training set is very large, this is very time-consuming and becomes infeasible. To improve the efficiency of k-nearest-neighbor search, we can store the training data in special data structures that reduce the number of distance computations.
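The linear-scan baseline takes only a few lines, which makes the cost obvious: one distance computation per training instance. A sketch (the distance argument is any callable, e.g. the cosine_distance above):

    import heapq

    def knn_linear_scan(query, dataset, k, distance):
        # O(n) distance calls per query: fine for small n, painful at scale
        return heapq.nsmallest(k, dataset, key=lambda x: distance(query, x))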

There are many such structures; here we look at the k-d tree method (see http://blog.csdn.net/likika2012/article/details/39619687).
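Rather than implementing a k-d tree by hand (the linked post walks through that), a quick way to try one is SciPy's implementation; a sketch, assuming SciPy is available:

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    data = rng.random((100000, 3))      # 100,000 points in 3 dimensions
    tree = cKDTree(data)                # build the tree once

    dist, idx = tree.query(rng.random(3), k=5)  # 5 nearest neighbors,
    print(idx, dist)                            # without scanning every point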

KD-trees are cool, but...

  • Non-trivial to implement efficiently
  • Problems with high-dimensional data

This is what leads us to LSH.

Locality-Sensitive Hashing (LSH)
See blog http://blog.csdn.net/icvpr/article/details/12342159
For a low-dimensional, small data set we can find similar items with a linear scan, but on high-dimensional data this method is very inefficient (very time-consuming). To solve this problem, we use indexing techniques to speed up the lookup, i.e. approximate nearest neighbor (ANN) search. LSH is one family of ANN methods.

The basic idea: apply a hash mapping to the raw data and hope that two points which were close in the original space land in the same bucket, i.e. get the same bucket number. After hashing the entire data set we obtain a hash table; the raw data points are scattered across its buckets, and the points sharing a bucket are likely to be neighbors (of course, some non-neighboring points will also be hashed into the same bucket).

So if we can find hash functions such that, after the hash mapping, points that were adjacent in the original space end up in the same bucket, then finding nearest neighbors becomes easy: hash the query point to get its bucket number, take out all the data in that bucket, and run a linear match over just those points to find the ones adjacent to the query.
It is important to note that LSH does not guarantee finding the point closest to the query; rather, it reduces the number of points that need to be matched while keeping a high probability of finding the nearest neighbor.
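As a concrete illustration, here is a minimal sketch of one classic LSH family: random hyperplanes for cosine similarity. Nearby vectors form a small angle, so they tend to fall on the same side of each random hyperplane and thus into the same bucket (the parameters and data here are illustrative; real systems use several hash tables to boost recall):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    dim, n_planes = 100, 16
    planes = rng.normal(size=(n_planes, dim))  # one random hyperplane per hash bit

    def bucket(v):
        # Bucket id = the pattern of sign bits, one per hyperplane
        return ((planes @ v) >= 0).tobytes()

    # Index: hash every point once, grouping indices by bucket
    data = rng.normal(size=(10000, dim))
    table = defaultdict(list)
    for i, v in enumerate(data):
        table[bucket(v)].append(i)

    # Query: linearly match only the candidates in the query's bucket
    q = data[42] + 0.01 * rng.normal(size=dim)
    candidates = table[bucket(q)]
    if candidates:  # LSH is approximate: the bucket can miss the true neighbor
        best = min(candidates, key=lambda i: np.linalg.norm(data[i] - q))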
LSH application scenarios: near-duplicate detection, image retrieval, music retrieval, fingerprint matching.
