Understanding the KNN Algorithm


I. The Algorithm

1. The KNN algorithm is also known as the k-nearest neighbor classification algorithm.


The simplest, most naive classifier is a rote classifier that memorizes all the training data: new data is matched directly against the training data, and if a training record with identical attributes exists, its class is used as the class of the new data. This approach has an obvious drawback: very often no exactly matching training record can be found.
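As a tiny illustration of such a rote classifier (a sketch; the dictionary lookup stands in for "remembering all the training data"):

    def rote_classify(x, train):
        # train: dict mapping attribute tuples to class labels.
        # Fails exactly as described above: an unseen attribute
        # combination has no match and yields no prediction.
        return train.get(tuple(x))  # None when no exact match exists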



The KNN algorithm finds the K records in the training set that are closest to the new data, then assigns the new data the majority category among those K records. The algorithm involves three main factors: the training set, the distance or similarity measure, and the size of K.



2. Representative Paper
Discriminant Adaptive Nearest Neighbor Classification
Trevor Hastie and Robert Tibshirani
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 6, June 1996
http://www.stanford.edu/~hastie/Papers/dann_IEEE.pdf

3. Industry Applications
Customer churn prediction, fraud detection, etc. (KNN is well suited to classifying rare events).

II. Key Points of the Algorithm

1. Guiding Idea
The guiding idea of KNN is the proverb "he who stays near vermilion turns red; he who stays near ink turns black": judge an object's category by the categories of its neighbors.

The calculation proceeds as follows (a minimal sketch in Python follows the list):
1) Compute distances: given a test object, compute its distance to every object in the training set.
2) Find neighbors: take the K training objects with the smallest distances as the test object's nearest neighbors.
3) Classify: assign the test object to the majority category among its K nearest neighbors.
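A minimal sketch of these three steps (the function and variable names are illustrative, not from the original text):

    import numpy as np
    from collections import Counter

    def knn_classify(x, X_train, y_train, k=5):
        # 1) Compute distances from the test object to every training object.
        dists = np.linalg.norm(X_train - x, axis=1)
        # 2) Find neighbors: the k training objects with the smallest distances.
        neighbor_idx = np.argsort(dists)[:k]
        # 3) Classify by the majority category among the k neighbors.
        votes = Counter(y_train[i] for i in neighbor_idx)
        return votes.most_common(1)[0][0]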

2. Distance or Similarity Measure
What is an appropriate distance measure? A smaller distance should mean a greater likelihood that the two points belong to the same class.
Common measures include Euclidean distance, the cosine of the angle between vectors, and so on.
For text categorization, cosine similarity is usually more appropriate than Euclidean distance.
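For reference, the two measures mentioned above can be computed as follows (a sketch using NumPy):

    import numpy as np

    def euclidean(a, b):
        return np.linalg.norm(a - b)

    def cosine_similarity(a, b):
        # Cosine compares direction only, ignoring vector length,
        # which is why it suits text vectors whose magnitudes vary
        # with document size.
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))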

3. Determining the Classification
Majority voting: the minority yields to the majority; the test object is assigned to whichever category holds the most of its K nearest neighbors.
Weighted voting: each neighbor's vote is weighted by its distance; the closer the neighbor, the larger the weight (e.g., the weight is the reciprocal of the squared distance).
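A sketch of weighted voting with 1/d^2 weights (the small epsilon guarding against zero distance is an implementation detail the text does not specify):

    from collections import Counter

    def weighted_vote(neighbors, eps=1e-12):
        # neighbors: list of (distance, label) pairs for the k nearest.
        tally = Counter()
        for d, label in neighbors:
            tally[label] += 1.0 / (d ** 2 + eps)  # closer => larger weight
        return tally.most_common(1)[0][0]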

III. Advantages and Disadvantages

1. Strengths
Simple, easy to understand, easy to implement; no parameters to estimate and no training phase.
Well suited to classifying rare events (e.g., building a churn prediction model when the churn rate is very low, say below 0.5%).
Especially suitable for multi-class problems (multi-modal, objects carrying multiple category labels), such as inferring functional classes from genetic characteristics, where KNN performs better than SVM.

2. Weaknesses
Lazy algorithm: classifying a test sample is computationally heavy, with high memory overhead and slow scoring.
Its decisions cannot be explained with interpretable rules the way a decision tree's can.

IV. Frequently Asked Questions

1. How large should K be?
If K is too small, the classification is sensitive to noise points; if K is too large, the neighborhood may include too many points from other categories. (Distance weighting can reduce the influence of the choice of K.)
The value of K is usually determined by cross-validation, with k=1 as the baseline (a sketch follows).
Rule of thumb: K is usually smaller than the square root of the number of training samples.
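A minimal sketch of selecting K by cross-validation, assuming scikit-learn is available (the dataset and candidate range are illustrative):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # Try odd values of K up to roughly sqrt(n), per the rule of thumb above.
    candidates = range(1, int(np.sqrt(len(X))) + 1, 2)
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=5).mean()
              for k in candidates}
    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])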

2. How is the most appropriate category determined?
Simple majority voting ignores how close each neighbor is, even though the nearest neighbors arguably deserve more say in the final classification, so weighted voting is usually more appropriate.

3. How should the distance measure be chosen?
The effect of high dimensionality on distance: it is well known that the more variables there are, the less discriminating Euclidean distance becomes.
The effect of variable scale on distance: variables with larger ranges tend to dominate the distance calculation, so variables should be standardized first (a sketch follows).
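A minimal sketch of such standardization (z-scoring each column so no single variable dominates the distance):

    import numpy as np

    def standardize(X):
        # Rescale each column to zero mean and unit variance.
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        sigma[sigma == 0] = 1.0  # guard against constant columns
        return (X - mu) / sigma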

4. Should all training samples be treated equally?
Some samples in the training set may be more trustworthy than others.
Different weights can be applied to different samples, strengthening the weight of trustworthy samples and reducing the influence of untrustworthy ones.
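A sketch of this idea, assuming a per-sample trust weight is available (the trust values and their combination with distance weighting are my assumption, not specified in the text):

    from collections import Counter

    def vote_with_sample_weights(neighbors):
        # neighbors: list of (distance, label, trust) tuples;
        # each vote is scaled by both closeness and sample trust.
        tally = Counter()
        for d, label, trust in neighbors:
            tally[label] += trust / (d ** 2 + 1e-12)
        return tally.most_common(1)[0][0]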



5. What about performance?
KNN is a lazy algorithm: it does not study during the term and crams only at exam time (finding the K nearest neighbors when a test sample must be classified).
The consequence of laziness is that building the model is trivial, but classifying a test sample carries a high system overhead, because every training sample must be scanned and its distance computed.
There are ways to improve computational efficiency, such as compressing the training set.



6. Can the training set be drastically reduced while maintaining classification accuracy?
Condensing techniques (a sketch follows)
Editing techniques
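As a rough sketch of the condensing idea, here is Hart's condensed nearest neighbor rule (an assumption about which condensing variant is meant):

    import numpy as np

    def condense(X, y):
        # Keep a subset S such that 1-NN over S classifies every
        # training sample correctly (Hart's CNN rule).
        keep = [0]
        changed = True
        while changed:
            changed = False
            for i in range(len(X)):
                if i in keep:
                    continue
                dists = np.linalg.norm(X[keep] - X[i], axis=1)
                nearest = keep[int(np.argmin(dists))]
                if y[nearest] != y[i]:
                    keep.append(i)  # misclassified: must be kept
                    changed = True
        return X[keep], y[keep]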


KNN can also be used for recommendation:

Here KNN is not used for classification; instead we apply its most basic idea: for each content item, find the K most similar items and recommend them to the user (a sketch follows).
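A minimal sketch of this item-similarity recommendation, assuming items are represented as float feature vectors such as TF-IDF (the representation is my assumption):

    import numpy as np

    def recommend_similar(items, target_idx, k=3):
        # items: matrix of item feature vectors, one row per item.
        # Returns the indices of the k items most similar to the
        # target by cosine similarity.
        target = items[target_idx]
        norms = np.linalg.norm(items, axis=1) * np.linalg.norm(target)
        sims = items @ target / np.maximum(norms, 1e-12)
        sims[target_idx] = -np.inf  # never recommend the item itself
        return np.argsort(sims)[::-1][:k]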
