Understanding the KNN Algorithm

I. Overview of Algorithms

1. KNN is also known as the k-nearest neighbor classification algorithm. The simplest, most naive classifier imaginable is a rote classifier: memorize all the training data and classify a new record by matching it directly against the training set, reusing the class of any training record with identical attributes. The obvious drawback of this approach is that an exactly matching training record often does not exist.
Instead, the KNN algorithm finds the K records in the training set closest to the new record and assigns the new record the majority class among them. The algorithm involves three main factors: the training set, the distance (or similarity) measure, and the size of K.
2. Representative paper: Trevor Hastie and Robert Tibshirani, "Discriminant Adaptive Nearest Neighbor Classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 6, June 1996. http://www.stanford.edu/~hastie/Papers/dann_IEEE.pdf
3. Industry applications include customer churn prediction, fraud detection, and the like (KNN is especially suitable for classifying rare events).
II. Key Points of the Algorithm
1. The guiding idea of KNN is the proverb "one who stays near vermilion turns red, one who stays near ink turns black": infer an object's category from the categories of its neighbors.
The computation proceeds in three steps (see the sketch below):
1) Compute distances: given a test object, compute its distance to every object in the training set.
2) Find neighbors: take the K training objects closest to the test object as its nearest neighbors.
3) Classify: assign the test object to the majority category among its K nearest neighbors.
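
A minimal sketch of these three steps in plain NumPy (the function name knn_classify and the toy data are illustrative, not from the original article):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    # 1) Distance: Euclidean distance from x to every training object
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2) Neighbors: indices of the k closest training objects
    nearest = np.argsort(dists)[:k]
    # 3) Classify: majority vote among the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two clusters in 2-D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([1.1, 1.0]), X_train, y_train, k=3))  # -> 0
```
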
2. Distance or similarity measure. What is the right distance measure? A smaller distance should mean a greater likelihood that the two points belong to the same category. Common measures include Euclidean distance and the cosine of the angle between vectors. For text classification, cosine similarity is usually more appropriate than Euclidean distance, as the example below shows.
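
In the toy example below, the two "documents" have identical word proportions but different lengths, so cosine similarity treats them as identical while Euclidean distance does not (the vectors are invented for illustration):

```python
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two "documents" with identical word proportions, one twice as long:
doc_a = np.array([2.0, 1.0, 0.0])
doc_b = np.array([4.0, 2.0, 0.0])
print(euclidean_distance(doc_a, doc_b))  # 2.236...: penalizes the length difference
print(cosine_similarity(doc_a, doc_b))   # 1.0: same direction, same "topic"
```
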
3. Deciding the category. Majority voting: the minority yields to the majority, and the test object is assigned to the category that most of its nearest neighbors belong to. Weighted voting: each neighbor's vote is weighted by its distance, the closer the neighbor the larger the weight (typically the inverse of the squared distance), as in the sketch below.
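
A sketch of the weighted-voting variant with the inverse-squared-distance weight described above (the small eps term is an added safeguard, not part of the original description):

```python
import numpy as np
from collections import defaultdict

def knn_weighted(x, X_train, y_train, k=5, eps=1e-8):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nearest:
        # vote weight is the inverse of the squared distance;
        # eps avoids division by zero when x coincides with a training point
        scores[y_train[i]] += 1.0 / (dists[i] ** 2 + eps)
    return max(scores, key=scores.get)
```
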
III. Advantages and Disadvantages
1. Advantages: simple, easy to understand, and easy to implement; no parameters to estimate and no training phase. KNN is well suited to classifying rare events (for example, building a churn prediction model when the churn rate is very low, say under 0.5%) and is particularly suitable for multi-class problems (multi-modal data, where objects carry multiple category labels); for instance, in assigning genes to functional classes from expression characteristics, KNN has performed better than SVM.
2. Disadvantages: as a lazy algorithm, KNN makes classifying test samples computationally heavy and memory-intensive, so scoring is slow; interpretability is also poor, since it cannot produce rules the way a decision tree does.
IV. Frequently Asked Questions
1. How large should K be? If K is too small, the classification is easily swayed by noise points; if K is too large, the neighborhood may contain too many points from other categories. (Distance weighting can reduce the sensitivity to the choice of K.) K is usually determined by cross-validation (for example, starting from K=1). Rule of thumb: K should be smaller than the square root of the number of training samples. The sketch below combines both ideas.
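
One common way to search for K in practice, sketched here with scikit-learn on the iris dataset (both the library and the dataset are assumptions for illustration, not prescribed by the article):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
rule_of_thumb = int(np.sqrt(len(X)))  # keep K below sqrt(n) = 12 here

for k in range(1, rule_of_thumb + 1, 2):  # odd K reduces the chance of tied votes
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"K={k}: mean CV accuracy {acc:.3f}")
```
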
2. How should the final category be determined? Plain majority voting ignores how close each neighbor is, yet the closest neighbors arguably should have more say in the final classification, so weighted voting is usually more appropriate.
3. How should a suitable distance measure be chosen? High dimensionality hurts distance measures: it is well known that the more variables there are, the less discriminating Euclidean distance becomes. Variable ranges matter too: variables with larger ranges tend to dominate the distance computation, so the variables should be normalized first, as sketched below.
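
A minimal min-max normalization sketch (the feature values are invented; z-score standardization would serve the same purpose):

```python
import numpy as np

def min_max_normalize(X):
    # rescale each column to [0, 1]; assumes no column is constant
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

# age in years vs. income in dollars: unscaled, income dominates the distance
X = np.array([[25.0,  50_000.0],
              [40.0, 120_000.0],
              [33.0,  80_000.0]])
print(min_max_normalize(X))  # both columns now span [0, 1]
```
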
4. Should all training samples be treated equally? Some samples in the training set may be more trustworthy than others. Different weights can be assigned to the samples, strengthening the influence of trusted samples and reducing the impact of unreliable ones.
5. What about performance? KNN is a lazy algorithm: it does no studying ahead of time and only crams for the exam (finding the K nearest neighbors on the spot) when a test sample must be classified. The consequence of this laziness is that building the model is trivial, but classifying a test sample is expensive, because every training sample must be scanned and its distance computed. There are several ways to improve efficiency, such as compressing the training set or indexing it, as sketched below.
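
A sketch of one standard speed-up, using SciPy's k-d tree to index the training set (the specific indexing method is an assumption; the article does not name one):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 3))

tree = cKDTree(X_train)                      # build the index once, up front
dists, idx = tree.query(rng.random(3), k=5)  # K-nearest lookup without a full scan
print(idx)                                   # indices of the 5 nearest samples
```
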
6. Can the training set be drastically reduced while maintaining classification accuracy? Yes, using condensing techniques and editing techniques; a condensing sketch follows.
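
A rough sketch of the condensing idea, in the spirit of Hart's condensed nearest neighbor rule (an interpretation; the article only names the technique). It keeps only the samples that a 1-NN classifier over the kept set would misclassify:

```python
import numpy as np

def condense(X, y):
    keep = [0]  # seed the condensed set with the first sample
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            # classify sample i by 1-NN over the current condensed set
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep][np.argmin(d)] != y[i]:
                keep.append(i)  # misclassified, so it carries information: keep it
                changed = True
    return X[keep], y[keep]
```
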
References: Wikipedia: http://zh.wikipedia.org/wiki/%E6%9C%80%E9%82%BB%E8%BF%91%E6%90%9C%E7%B4%A2; Baidu Encyclopedia: http://baike.baidu.com/view/1485833.htm

KNN can also be used for recommendation:

Here KNN is not used for classification; rather, we use the algorithm's most basic idea: for each piece of content, find the K pieces of content most similar to it and recommend them to the user, as in the sketch below.
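
A minimal sketch of that idea, with invented item vectors and cosine similarity assumed as the similarity measure:

```python
import numpy as np

def top_k_similar(item_vectors, item_id, k=2):
    v = item_vectors[item_id]
    sims = item_vectors @ v / (
        np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(v))
    sims[item_id] = -np.inf            # exclude the item itself
    return np.argsort(sims)[::-1][:k]  # the K most similar items

items = np.array([[1.0, 0.0, 0.5],    # invented content feature vectors
                  [0.9, 0.1, 0.4],
                  [0.0, 1.0, 0.8],
                  [0.1, 0.9, 0.9]])
print(top_k_similar(items, item_id=0))  # recommend these to viewers of item 0
```
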

Reposted from: http://blog.csdn.net/jmydream/article/details/8644004
