K-Nearest Neighbor (KNN) and K-means (with Source Code)

GitHub blog address: http://shuaijiang.github.io/2014/10/18/knn_kmeans/

Introduction

The K-nearest neighbor method (KNN) is a basic classification and regression method, while K-means is a simple and effective clustering method. Although the two are used differently and solve different problems, their algorithms have much in common, so they are presented together here to better compare their similarities and differences.

Algorithm Description

KNN

Algorithm ideas:
If most of the k training samples closest to a given sample in the feature space (that is, its nearest neighbors) belong to a certain category, then the sample also belongs to that category.

The K-nearest neighbor model has three basic elements:
- Choice of k: the value of k has a significant effect on the result. A smaller k reduces the approximation error but increases the estimation error, while a larger k reduces the estimation error but increases the approximation error. Cross-validation is generally used to select the optimal k.
- Distance measure: the distance reflects the similarity of two instances in the feature space; Euclidean distance, Manhattan distance, and so on can be used.
- Classification decision rule: majority vote is most often used.
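As an illustration of these three elements, here is a minimal KNN classifier sketch in Python. The function and variable names are hypothetical (not from the author's repository), and it assumes numeric feature vectors, Euclidean distance, and majority voting.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, train_points, train_labels, k=5):
    # Sort training samples by distance to the query point
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pair: euclidean(query, pair[0]))
    # Majority vote among the k nearest labels
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical usage: (height, weight) -> gender
train = [(180, 75), (165, 55), (175, 70), (160, 50)]
labels = ["male", "female", "male", "female"]
print(knn_classify((172, 68), train, labels, k=3))  # prints "male"
```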

K-means

Algorithm steps:
1. Randomly select k objects from the n data points as the initial cluster centers;
2. Compute the distance between each data point and each cluster center (the mean of the cluster's objects), and assign each point to the nearest center;
3. Recompute the mean of each changed cluster and take it as the new cluster center;
4. Repeat steps 2 and 3 until the clusters no longer change.
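The steps above map directly onto a short implementation. The following is a minimal sketch in plain Python (names are hypothetical), assuming numeric points and Euclidean distance:

```python
import random
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    # Component-wise mean of a non-empty list of points
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def kmeans(points, k=2, max_iter=200):
    # Step 1: pick k random points as the initial centers
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[idx].append(p)
        # Step 3: recompute each center as the mean of its cluster
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 4: stop when the centers no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```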

The K-means method has two basic elements:
- Choice of k: that is, determining the number of clusters; the considerations are similar to choosing k in the K-nearest neighbor method.
- Distance measure: Euclidean distance, Manhattan distance, and so on can be used.

Application Example

Problem Description

The gender, height, and weight of a number of people are known. Given a person's height and weight, consider using the K-nearest neighbor algorithm to classify gender, and the K-means method to cluster the samples by gender.

Data

Dataset: https://github.com/shuaijiang/FemaleMaleDatabase

The dataset contains a training set and a test set, used here for gender classification with the K-nearest neighbor algorithm and gender clustering with the K-means method.

Plotting the training data makes it easier to observe the relationships and differences between samples, as well as the differences between the genders.
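One way to produce such a plot is sketched below. The file name and the whitespace-separated "label height weight" line format are assumptions about the dataset, not confirmed from the repository:

```python
import matplotlib.pyplot as plt

# Assumed format: one sample per line, "label height weight",
# e.g. "male 178.0 70.5" (hypothetical; adjust to the real file layout).
males, females = [], []
with open("train.txt") as f:
    for line in f:
        label, height, weight = line.split()
        point = (float(height), float(weight))
        (males if label == "male" else females).append(point)

plt.scatter([h for h, w in males], [w for h, w in males], label="male")
plt.scatter([h for h, w in females], [w for h, w in females], label="female")
plt.xlabel("height")
plt.ylabel("weight")
plt.legend()
plt.show()
```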
(Figure: display of the training data)

KNN Classification Results

Basic settings of the KNN algorithm:
- k = 5
- Distance measure: Euclidean distance
- Classification decision rule: majority vote
- Test set: https://github.com/shuaijiang/FemaleMaleDatabase/blob/master/test0.txt

Running the KNN algorithm on the test set produces the confusion matrix below. As the table shows, all the men in the test set are classified correctly; one of the women is misclassified and the rest are classified correctly.

Confusion matrix   Test:male   Test:female
Result:male        20          1
Result:female      0           14

(Note: Test:male and Test:female denote the true genders in the test set; Result:male and Result:female denote the predicted genders. The entry in the Test:male column and Result:male row is the number of test samples that are male and classified as male; the other entries are read analogously.)
From the table, the classification accuracy is (20 + 14) / (20 + 14 + 1) ≈ 97.14%.
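For completeness, here is a hedged sketch of how the confusion matrix and accuracy above could be computed, reusing the hypothetical knn_classify function and the assumed file format from the earlier sketches:

```python
from collections import defaultdict

def load(path):
    # Assumed "label height weight" format, as in the plotting sketch
    labels, points = [], []
    with open(path) as f:
        for line in f:
            label, height, weight = line.split()
            labels.append(label)
            points.append((float(height), float(weight)))
    return points, labels

train_points, train_labels = load("train.txt")  # hypothetical file names
test_points, test_labels = load("test0.txt")

confusion = defaultdict(int)
for point, truth in zip(test_points, test_labels):
    predicted = knn_classify(point, train_points, train_labels, k=5)
    confusion[(predicted, truth)] += 1

correct = sum(n for (pred, truth), n in confusion.items() if pred == truth)
print(dict(confusion), correct / len(test_labels))  # matrix and accuracy
```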

K-means Clustering Results

Basic settings of the K-means algorithm:
- k = 2
- Distance measure: Euclidean distance
- Maximum number of iterations: 200
- Category decision rule: each cluster's category is decided by the majority gender within it
- Test set: https://github.com/shuaijiang/FemaleMaleDatabase/blob/master/test0.txt

Running the K-means method on the test set gives the confusion matrix below.

Confusion matrix   Test:male   Test:female
Result:male        20          1
Result:female      0           14

(Note: this table is read the same way as the previous one.)

Because the initial centers are chosen at random, the clustering result differs from run to run. In the best case the samples are clustered entirely correctly; in the worst case the two genders are not separated at all and, under the majority-vote rule, both clusters end up assigned the same category.
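A hedged sketch of the majority-vote labeling step described above, reusing the hypothetical kmeans function and data loader from the earlier sketches:

```python
from collections import Counter

def label_clusters(clusters, truth):
    # truth: dict mapping each point to its true gender label.
    # Each cluster takes the majority label of its members; with a bad
    # random initialization both clusters can receive the same label,
    # which is the "not separated" worst case described above.
    return [Counter(truth[p] for p in c).most_common(1)[0][0] if c else None
            for c in clusters]

# Hypothetical usage with the earlier sketches:
# points, labels = load("test0.txt")
# truth = dict(zip(points, labels))
# centers, clusters = kmeans(points, k=2, max_iter=200)
# print(label_clusters(clusters, truth))
```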

KNN vs. K-means

Similarities between the two:
- The choice of k is similar.
- The underlying idea is similar: judge a sample's properties from the samples nearest to it.

Differences between the two:
- Application scenario: the former solves classification or regression problems, the latter clustering problems.
- Algorithm complexity: the former is O(n^2), the latter O(kmn), where k is the number of clusters, m the number of iterations, and n the number of samples.
- Stability: the former is stable, the latter is not (its result depends on the random initialization).

Summary

This article described the K-nearest neighbor algorithm and the K-means algorithm in detail and compared their steps. On this basis, both methods were applied to a practical problem to compare their similarities, differences, strengths, and weaknesses more concretely. The author also implemented both algorithms, applied them to the problem above, and reported the results.
The above content inevitably contains mistakes; corrections are welcome.

Source Code

KNN: https://github.com/shuaijiang/KNN
K-means: https://github.com/shuaijiang/k-means

Reference

Li Hang. Statistical Learning Methods. Tsinghua University Press, 2012.
