Basic Classification Method -- KNN (k-nearest neighbor) algorithm

Source: Internet
Author: User
Tags: svm

In this article, http://www.cnblogs.com/charlesblc/p/6193867.html, the KNN algorithm is mentioned while discussing SVM. It sounded familiar, and a quick search showed it is the k-nearest neighbor algorithm, one of the entry-level algorithms in machine learning.

The reference content below is summarized from: http://www.cnblogs.com/charlesblc/p/6193867.html

1. The KNN algorithm is also called the k-nearest neighbor classification algorithm.
The simplest, most naive classifier is a rote classifier: it memorizes all the training data and matches new data directly against it; if a training record with identical attributes exists, its class is used for the new data. The obvious drawback of this approach is that an exact matching training record often cannot be found.

The KNN algorithm instead finds the K records in the training set that are closest to the new data, and assigns the new data to the majority class among them. The algorithm involves three main factors:

the training set, the distance or similarity measure, and the size of K.

3. Industry Applications
Customer churn prediction, fraud detection, etc. (KNN is particularly suitable for classifying rare events).

1. Guiding Idea
The guiding idea of the KNN algorithm is the proverb "he who stays near vermilion gets stained red; he who stays near ink gets stained black": infer an object's category from the categories of its neighbors.

The calculation steps are as follows (a code sketch follows the list):
1) Compute distances: given the test object, compute its distance to every object in the training set.
2) Find neighbors: take the K closest training objects as the test object's nearest neighbors.
3) Classify: assign the test object to the majority category among those K nearest neighbors.
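
A minimal Python sketch of these three steps, using Euclidean distance and a plain majority vote (the toy data, labels, and function name below are illustrative, not from the original article):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify a single point by majority vote among its k nearest neighbors."""
    # 1) Distance: Euclidean distance from the test object to every training object
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2) Find neighbors: indices of the k closest training objects
    nearest = np.argsort(dists)[:k]
    # 3) Classify: majority vote over the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([0.1, 0.2]), k=3))  # -> "B"
```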

2. Measuring Distance or Similarity
What is an appropriate distance measure? A smaller distance should mean a greater likelihood that the two points belong to the same class.
Common measures include Euclidean distance, the cosine of the angle between vectors, and so on.
For text categorization, cosine similarity is usually more appropriate than Euclidean distance.
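
A small illustration of why cosine suits text better: two term-frequency vectors pointing in the same direction but with different lengths (say, a short and a long document on the same topic) are far apart in Euclidean distance yet have cosine similarity 1. The vectors below are made-up examples:

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two "documents" as term-frequency vectors: same topic, different lengths.
doc_short = np.array([2.0, 1.0, 0.0])
doc_long  = np.array([20.0, 10.0, 0.0])
print(euclidean(doc_short, doc_long))          # large: document length dominates
print(cosine_similarity(doc_short, doc_long))  # 1.0: identical direction
```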

3. Deciding the Category
Majority voting: the minority yields to the majority, and the test object is assigned to the class that most of its nearest neighbors belong to.
Weighted voting: each nearest neighbor's vote is weighted by its distance, with closer neighbors receiving larger weights (a common choice is the inverse of the squared distance).
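
A sketch of the weighted voting rule with weight = 1 / distance², assuming the same kind of NumPy training arrays as in the earlier sketch; the small epsilon is an added safeguard against a zero distance:

```python
import numpy as np
from collections import defaultdict

def knn_weighted_predict(X_train, y_train, x_new, k=5, eps=1e-12):
    """Weighted vote: each neighbor's vote counts 1 / distance**2."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += 1.0 / (dists[i] ** 2 + eps)  # eps avoids division by zero
    return max(scores, key=scores.get)
```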

1. Advantages
Simple, easy to understand and implement; no parameters to estimate and no training phase.
Well suited to classifying rare events (for example, building a churn prediction model when the churn rate is low, say below 0.5%).
Especially suitable for multi-class problems (multi-modal data, where objects carry multiple category labels); for example, when assigning functional categories based on gene characteristics, KNN performs better than SVM.

2. Disadvantages
Lazy algorithm: classifying test samples is computationally heavy, with large memory overhead and slow scoring.
Unlike decision trees, it cannot be explained as a set of rules.

IV. Frequently Asked Questions

1. How large should K be?
If K is too small, the classification result is easily affected by noise points; if K is too large, the neighborhood may include too many points from other categories. (With distance weighting, the effect of the K setting is reduced.)
The K value is usually determined by cross-validation (with K=1 as the baseline).
Rule of thumb: K is generally lower than the square root of the number of training samples.
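
A possible way to pick K by cross-validation while respecting the square-root rule of thumb, sketched with scikit-learn and the Iris dataset purely as an illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
max_k = int(np.sqrt(len(X)))  # rule of thumb: try K up to about sqrt(n)
best_k, best_score = 1, 0.0
for k in range(1, max_k + 1):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```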

2. How to determine the most appropriate category?
Simple voting ignores how close each neighbor is, yet the closer neighbors arguably should have more influence on the final classification, so weighted voting is more appropriate.

3. How to choose a suitable distance measure?
The effect of high dimensionality on distance: it is well known that the more variables there are, the less discriminating Euclidean distance becomes.
The effect of variable range on distance: variables with larger value ranges tend to dominate the distance computation, so the variables should be normalized first.
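
A minimal min-max normalization sketch showing why scaling matters: without it, an income-like variable would swamp an age-like variable in the distance (the columns and values below are hypothetical):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column to [0, 1] so no single variable dominates the distance."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs - mins == 0, 1.0, maxs - mins)  # guard constant columns
    return (X - mins) / ranges

# Income (tens of thousands) would otherwise dominate age (tens) in Euclidean distance.
X = np.array([[25, 30000.0], [40, 90000.0], [35, 60000.0]])
print(min_max_normalize(X))
```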

4. Should training samples be treated equally?
In the training set, some samples may be more trustworthy than others.
Different weights can be assigned to different samples, increasing the weight of reliable samples and reducing the influence of unreliable ones.
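
One simple way to realize this, sketched below under the assumption that a per-sample reliability weight is available: let each neighbor vote with its weight instead of counting every neighbor equally.

```python
import numpy as np
from collections import defaultdict

def knn_sample_weighted(X_train, y_train, sample_weight, x_new, k=5):
    """Each neighbor votes with its own reliability weight instead of counting equally."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += sample_weight[i]  # trusted samples get a larger say
    return max(scores, key=scores.get)
```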

5. Performance problems?
KNN is a lazy algorithm: it does not study ahead of time and only crams at exam time (it looks for the K nearest neighbors only when a test sample needs to be classified).
The consequence of laziness: building the model is trivial, but classifying a test sample is expensive, because every training sample must be scanned and its distance computed.
There are several ways to improve the efficiency of the computation, such as compressing the training set.
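
Beyond compressing the training set, one practical speed-up is to let a library build a spatial index (e.g. a KD-tree) at fit time so each query avoids a full linear scan; a sketch with scikit-learn's KNeighborsClassifier, again using Iris only as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# 'kd_tree' builds an index at fit time so each query avoids scanning every sample
clf = KNeighborsClassifier(n_neighbors=5, weights="distance", algorithm="kd_tree")
clf.fit(X, y)
print(clf.predict(X[:3]))
```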

6. Can the training sample size be drastically reduced while maintaining classification accuracy?
Condensing techniques (a sketch follows this list)
Editing techniques
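
A rough sketch of the condensing idea (in the spirit of Hart's condensed nearest neighbor): keep only the training samples that the currently kept subset misclassifies, discarding the redundant ones. The function name and the seeding choice below are illustrative assumptions, not the article's prescription:

```python
import numpy as np
from collections import Counter

def condense(X, y, k=1):
    """Condensing sketch: grow a subset by adding samples it currently misclassifies."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    keep = [0]  # seed the condensed set with the first sample
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = np.argsort(dists)[:k]
            pred = Counter(y[keep][j] for j in nearest).most_common(1)[0][0]
            if pred != y[i]:      # misclassified -> this sample is needed
                keep.append(i)
                changed = True
    return np.array(keep)  # indices of the retained training samples
```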

KNN can also be used for recommendation:

Here KNN is not used for classification; instead, its most basic idea is applied directly: for each piece of content, find the K pieces of content most similar to it and recommend them to the user (sketched below).

(Note the difference from collaborative filtering. Collaborative filtering adds another layer: it first finds users who consumed the same content, and then makes recommendations from the content those users liked. Fundamentally, it is used when similarity between content items cannot be computed directly, so users who consumed the same content serve as a proxy for that similarity.)
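
A minimal sketch of this "find the K most similar contents" idea, assuming each content item is already represented as a feature vector and using cosine similarity; the item names and vectors are made up for illustration:

```python
import numpy as np

def recommend_similar(item_vectors, item_ids, query_id, k=3):
    """Recommend the k items whose vectors are most cosine-similar to the query item."""
    vecs = np.asarray(item_vectors, dtype=float)
    norms = np.linalg.norm(vecs, axis=1)
    q = item_ids.index(query_id)
    sims = vecs @ vecs[q] / (norms * norms[q])
    sims[q] = -np.inf  # never recommend the item to itself
    top = np.argsort(sims)[::-1][:k]
    return [item_ids[i] for i in top]

items = ["article_a", "article_b", "article_c", "article_d"]
vectors = [[1, 0, 1], [1, 0, 0.9], [0, 1, 0], [0.9, 0.1, 1]]
print(recommend_similar(vectors, items, "article_a", k=2))  # the two most similar items
```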
