Understanding the KNN Algorithm


I. Algorithm Overview

1. KNN is also called the k-nearest neighbor classification algorithm.
The simplest, most naive classifier is the rote classifier: memorize all the training data, and match new data directly against it; if a training record with identical attributes exists, use its class as the class of the new data. The obvious drawback of this approach is that an exactly matching training record may well not exist.

The KNN algorithm instead finds the K records in the training set closest to the new data, and assigns the new data to the majority class among them. The algorithm involves three main factors: the training set, the distance (or similarity) measure, and the size of K.
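To make the three factors concrete, here is a minimal sketch using scikit-learn (assumed installed); the tiny dataset is invented purely for illustration:

    # Minimal KNN classification sketch (scikit-learn assumed installed).
    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]  # training set
    y_train = ["A", "A", "B", "B"]                              # known classes

    clf = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean by default
    clf.fit(X_train, y_train)                   # "fitting" just stores the data
    print(clf.predict([[0.9, 0.9]]))            # -> ['A'] by majority vote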

2. Representative paper
Discriminant Adaptive Nearest Neighbor Classification
Trevor Hastie and Robert Tibshirani
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 6, June 1996
http://www.stanford.edu/~hastie/Papers/dann_IEEE.pdf

3. Industry applications
Customer churn prediction, fraud detection, etc. (KNN is well suited to classifying rare events)

II. Key Points of the Algorithm

1. Guiding idea
The guiding idea of the KNN algorithm is the proverb "one takes on the color of one's company": an object is judged by the categories of its neighbors.

The calculation proceeds in three steps, as sketched in code after this list:
1) Compute distances: given a test object, compute its distance to every object in the training set
2) Find neighbors: take the K training objects nearest to the test object as its nearest neighbors
3) Classify: assign the test object to the majority category among its K nearest neighbors
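The same three steps written out from scratch (an illustrative sketch; the function name and the Euclidean metric are my own choices):

    # From-scratch sketch of the three steps above (illustrative, not optimized).
    import math
    from collections import Counter

    def knn_classify(test_point, train_points, train_labels, k):
        # 1) Distance: from the test object to every training object
        dists = [math.dist(test_point, p) for p in train_points]
        # 2) Find neighbors: the K training objects with the smallest distance
        nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
        # 3) Classify: majority category among the K nearest neighbors
        return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]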

2. Distance or similarity measure
What makes a good distance measure? A smaller distance should mean a greater likelihood that the two points belong to the same class.
Common measures include Euclidean distance, the cosine of the angle between vectors, and so on.
For text classification, cosine similarity is usually more appropriate than Euclidean distance.
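Both measures in plain Python, for reference (a sketch; inputs are assumed to be equal-length numeric sequences):

    # Euclidean distance vs. cosine similarity (plain-Python sketch).
    import math

    def euclidean(a, b):
        return math.dist(a, b)               # smaller = more alike

    def cosine_similarity(a, b):             # larger = more alike
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))  # assumes nonzero vectors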

3. Deciding the classification
Majority voting: the minority yields to the majority; the test object is assigned to the category that holds the most of its nearest neighbors.
Weighted voting: each neighbor's vote is weighted by its distance, with closer neighbors receiving larger weights (e.g., weight equal to the inverse of the squared distance)
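A sketch of the weighted rule (the small epsilon is my addition, guarding against a zero distance):

    # Weighted voting sketch: each neighbor votes with weight 1 / d^2.
    from collections import defaultdict

    def weighted_vote(neighbor_labels, neighbor_distances):
        scores = defaultdict(float)
        for label, d in zip(neighbor_labels, neighbor_distances):
            scores[label] += 1.0 / (d * d + 1e-12)  # closer -> heavier vote
        return max(scores, key=scores.get)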

III. Advantages and Disadvantages

1. Strengths
Simple, easy to understand and implement; no parameters to estimate; no training phase required
Well suited to classifying rare events (e.g., building a churn prediction model when the churn rate is very low, say under 0.5%)
Especially suitable for multi-class problems (multi-modal problems, where an object can carry multiple class labels); for example, in inferring functional classes from gene expression profiles, KNN has been reported to perform better than SVM

2. Weaknesses
A lazy algorithm: classifying test samples requires heavy computation and large memory, and scoring is slow
Its decisions are not interpretable; unlike a decision tree, it produces no explicit rules

IV. Frequently Asked Questions

1. How large should K be?
If K is too small, the classification is sensitive to noise points; if K is too large, the neighborhood may include too many points from other categories. (Distance weighting can reduce the influence of the choice of K.)
The value of K is usually determined by cross-validation (with k=1 as the baseline).
Rule of thumb: K should stay below the square root of the number of training samples.
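Both heuristics combined in a sketch (scikit-learn assumed installed; the fold count is an arbitrary choice):

    # Sketch: pick K by cross-validation, searching up to sqrt(n_samples).
    import math
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def pick_k(X, y, folds=5):
        upper = max(1, int(math.sqrt(len(X))))   # rule of thumb as upper bound
        best_k, best_score = 1, -1.0
        for k in range(1, upper + 1):            # k = 1 is the baseline
            score = cross_val_score(
                KNeighborsClassifier(n_neighbors=k), X, y, cv=folds).mean()
            if score > best_score:
                best_k, best_score = k, score
        return best_k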

2. How to determine the most appropriate category?
Simple voting ignores how near each neighbor is, yet the nearest neighbors arguably deserve more influence over the final classification, so weighted voting is more appropriate.

3. How to choose a suitable distance measure?
The impact of high dimensionality: it is well known that the more variables there are, the weaker the discriminating power of Euclidean distance becomes.
The impact of variable scales: variables with larger ranges tend to dominate the distance computation, so variables should be standardized first.
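A sketch of that standardization step (NumPy assumed installed):

    # Z-score standardization so no single variable dominates the distance.
    import numpy as np

    def standardize(X):
        X = np.asarray(X, dtype=float)
        std = X.std(axis=0)
        std[std == 0] = 1.0                 # constant columns: avoid divide-by-0
        return (X - X.mean(axis=0)) / std   # each column: mean 0, std 1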

4. Should all training samples be treated equally?
Some samples in the training set may be more reliable than others.
Different weights can be applied to different samples, increasing the weight of reliable samples and reducing the influence of unreliable ones.
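A sketch combining this with distance weighting (the per-sample reliability weights are assumed to be given; how to obtain them is a separate question):

    # Sketch: a neighbor's vote = its reliability weight / squared distance.
    from collections import defaultdict

    def reliability_weighted_vote(neighbor_labels, neighbor_distances,
                                  neighbor_weights):
        scores = defaultdict(float)
        for label, d, w in zip(neighbor_labels, neighbor_distances,
                               neighbor_weights):
            scores[label] += w / (d * d + 1e-12)  # trusted + close -> heavy vote
        return max(scores, key=scores.get)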

5. What about performance?
KNN is a lazy algorithm: it does not study in advance and only crams at exam time (it searches for the K nearest neighbors only when a test sample must be classified).
The consequence of laziness: building the model is trivial, but classifying a test sample is expensive, because all training samples must be scanned and their distances computed.
There are ways to improve efficiency, such as compressing the training set (see the next question) or indexing it, as sketched below.
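A sketch of the indexing approach using SciPy's KD-tree (SciPy assumed installed; the data is made up):

    # Sketch: a KD-tree index avoids scanning every training sample per query.
    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    X_train = rng.random((10_000, 3))            # made-up training points
    tree = cKDTree(X_train)                      # built once, up front

    dist, idx = tree.query(rng.random(3), k=5)   # the 5 nearest neighbors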

6. Can the training set be drastically reduced while maintaining classification accuracy?
Condensing: retain only the samples needed to classify correctly, typically those near class boundaries (see the sketch after this list)
Editing: remove noisy or atypical samples, e.g., those misclassified by their own neighbors
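A rough single-pass sketch in the spirit of condensing (Hart's condensed nearest neighbor rule; the full rule repeats the pass until no more samples are added):

    # Condensing sketch: keep a sample only if the retained set, used as a
    # 1-NN classifier, would misclassify it.
    import math

    def condense(points, labels):
        kept_pts, kept_lbls = [points[0]], [labels[0]]
        for p, lbl in zip(points[1:], labels[1:]):
            i = min(range(len(kept_pts)),
                    key=lambda j: math.dist(p, kept_pts[j]))
            if kept_lbls[i] != lbl:          # retained set gets it wrong -> keep
                kept_pts.append(p)
                kept_lbls.append(lbl)
        return kept_pts, kept_lbls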


KNN can also be used for recommendation:

Here we do not use KNN for classification; we borrow the rawest form of the KNN idea: for each content item, find the K items most similar to it and recommend those to the user.
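A sketch of that idea (NumPy assumed installed; each content item is assumed to already have a nonzero feature vector):

    # Sketch: for every item, the indices of its K most similar items
    # by cosine similarity.
    import numpy as np

    def top_k_similar(item_vectors, k):
        X = np.asarray(item_vectors, dtype=float)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
        sims = X @ X.T                                    # pairwise cosines
        np.fill_diagonal(sims, -np.inf)          # an item never recommends itself
        return np.argsort(-sims, axis=1)[:, :k]  # K most similar per item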

