ML Clustering: Nearest Neighbor Search


The problem is this: say I'm reading an article and I like it, and I want to find similar articles among the many books on my bookshelf so I can keep reading. What should I do?

The first thing that comes to mind is the brute-force solution: compare the target article against every other article, one by one, and keep the similar ones.

The nearest-neighbor idea is easy to understand: compute the distance between every article and the target article, then take the article at the smallest distance as the most similar candidate, or take the few articles at the smallest distances as a candidate set.

Let's translate this into a slightly more mathematical statement: given a query article x_q, find its nearest neighbor

    x_NN = argmin_i distance(x_q, x_i)

So this is really a question of measuring similarity: how do we measure the similarity between two articles? To turn this idea into something workable, we need to address two challenges:

    1. Vector representation of documents
    2. Distance calculation

For representing documents, the approach you hear about most often is the bag-of-words model.

The bag-of-words model treats a text as an unordered collection of words, ignoring grammar and word order and assuming each word is independent of the others. An article can then be represented as a long vector indexed by vocabulary words, where each element is a word count (term frequency). Obviously, raw word counts do not capture how important a distinctive word is, which is why we introduce the TF-IDF algorithm next.
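To make the representation concrete, here is a minimal bag-of-words sketch in Python (the sample sentence and the whitespace tokenizer are just for illustration):

    from collections import Counter

    def bag_of_words(text):
        # Unordered word counts: grammar and word order are discarded
        return Counter(text.lower().split())

    print(bag_of_words("The cat sat on the mat"))
    # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})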

The main idea of TF-IDF: if a word or phrase appears with high frequency (TF) in one article but is rarely seen in other articles, it is considered to have good discriminating power and is well suited for classification.
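In formula form, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. A minimal sketch over a toy corpus (the corpus and the unsmoothed IDF are illustrative; libraries such as scikit-learn add smoothing):

    import math
    from collections import Counter

    corpus = ["the cat sat on the mat",
              "the dog chased the cat",
              "dogs and cats are pets"]
    docs = [doc.split() for doc in corpus]
    N = len(docs)

    # Document frequency: how many documents contain each term
    df = Counter(term for doc in docs for term in set(doc))

    def tf_idf(doc):
        tf = Counter(doc)
        # Terms frequent here but rare elsewhere get the largest weights
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    print(tf_idf(docs[0]))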

Distance calculation was studied in detail in the earlier article on similarity (http://www.cnblogs.com/sxbjdl/p/5708681.html), so I won't belabor it here.
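For completeness: a common choice for comparing TF-IDF vectors is cosine distance. A minimal sketch over sparse {term: weight} dicts (this is one reasonable choice, not necessarily the one the linked article settles on):

    import math

    def cosine_distance(u, v):
        # 1 - cosine similarity; 0 means same direction, 1 means orthogonal
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return 1.0 - dot / (norm_u * norm_v)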

OK, on to the next problem, the search algorithm: how do we search over the articles?

The simplest implementation of k-nearest-neighbor search is a linear scan: compute the distance between the input instance and every training instance. When the training set is very large, this is very time-consuming and becomes infeasible. To improve the efficiency of k-nearest-neighbor search, we can store the training data in special data structures that reduce the number of distance computations.
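The linear-scan baseline takes only a few lines, which makes the cost obvious: one distance computation per training instance. A sketch (the distance argument is any callable, e.g. the cosine_distance above):

    import heapq

    def knn_linear_scan(query, dataset, k, distance):
        # O(n) distance calls per query: fine for small n, painful at scale
        return heapq.nsmallest(k, dataset, key=lambda x: distance(query, x))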

There are many such structures; here we look at the k-d tree method (see http://blog.csdn.net/likika2012/article/details/39619687).
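Rather than implementing a k-d tree by hand (the linked post walks through that), a quick way to try one is SciPy's implementation; a sketch, assuming SciPy is available:

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    data = rng.random((100000, 3))      # 100,000 points in 3 dimensions
    tree = cKDTree(data)                # build the tree once

    dist, idx = tree.query(rng.random(3), k=5)  # 5 nearest neighbors,
    print(idx, dist)                            # without scanning every point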

KD-trees are cool, but...

  • Non-trivial to implement efficiently
  • Problems with high-dimensional data

This is what leads us to LSH.

Locality-Sensitive Hashing (LSH)
See blog http://blog.csdn.net/icvpr/article/details/12342159
For a low-dimensional, small data set we can find similar items with a linear scan, but on high-dimensional data this method is very inefficient (very time-consuming). To solve this problem, we use indexing techniques to speed up the lookup, i.e. approximate nearest neighbor (ANN) search. LSH is one family of ANN methods.

The basic idea: apply a hash mapping to the raw data and hope that two points which were close in the original space land in the same bucket, i.e. get the same bucket number. After hashing the entire data set we obtain a hash table; the raw data points are scattered across its buckets, and the points sharing a bucket are likely to be neighbors (of course, some non-neighboring points will also be hashed into the same bucket).

So if we can find hash functions such that, after the hash mapping, points that were adjacent in the original space end up in the same bucket, then finding nearest neighbors becomes easy: hash the query point to get its bucket number, take out all the data in that bucket, and run a linear match over just those points to find the ones adjacent to the query.
It is important to note that LSH does not guarantee finding the point closest to the query; rather, it reduces the number of points that need to be matched while keeping a high probability of finding the nearest neighbor.
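As a concrete illustration, here is a minimal sketch of one classic LSH family: random hyperplanes for cosine similarity. Nearby vectors form a small angle, so they tend to fall on the same side of each random hyperplane and thus into the same bucket (the parameters and data here are illustrative; real systems use several hash tables to boost recall):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)
    dim, n_planes = 100, 16
    planes = rng.normal(size=(n_planes, dim))  # one random hyperplane per hash bit

    def bucket(v):
        # Bucket id = the pattern of sign bits, one per hyperplane
        return ((planes @ v) >= 0).tobytes()

    # Index: hash every point once, grouping indices by bucket
    data = rng.normal(size=(10000, dim))
    table = defaultdict(list)
    for i, v in enumerate(data):
        table[bucket(v)].append(i)

    # Query: linearly match only the candidates in the query's bucket
    q = data[42] + 0.01 * rng.normal(size=dim)
    candidates = table[bucket(q)]
    if candidates:  # LSH is approximate: the bucket can miss the true neighbor
        best = min(candidates, key=lambda i: np.linalg.norm(data[i] - q))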
LSH application scenarios: near-duplicate detection, image retrieval, music retrieval, fingerprint matching.
