GitHub Blog Address: http://shuaijiang.github.io/2014/10/18/knn_kmeans/
Introduction
The K-nearest neighbor (KNN) method is a basic classification and regression method. K-means is a simple and effective clustering method. Although the two are used differently and solve different problems, their algorithms share many similarities, so they are presented together here to better compare their similarities and differences.
Algorithm Description
KNN
Algorithm ideas:
If the majority of the K samples most similar to a given sample (i.e., its nearest neighbors in feature space) belong to a certain category, then the sample also belongs to that category.
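As an illustration, this idea can be sketched in a minimal Python classifier. This is an illustrative sketch, not the author's implementation; it assumes Euclidean distance and a majority vote, and the feature vectors and labels are hypothetical.

```python
# Minimal KNN classifier sketch (illustrative, not the author's implementation).
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, labels, query, k=5):
    # Sort training samples by distance to the query point
    neighbors = sorted(zip(train, labels), key=lambda p: euclidean(p[0], query))
    # Majority vote among the k nearest neighbors
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]
```

For example, with training samples of (height, weight) pairs labeled "male" or "female", `knn_classify(train, labels, (179, 72), k=3)` returns the majority label among the three nearest training samples.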
The three basic elements of the K-nearest neighbor model:
- Choice of K: the choice of K has a significant effect on the result. A smaller K reduces the approximation error but increases the estimation error, while a larger K reduces the estimation error but increases the approximation error. Cross-validation is generally used to select the optimal K.
- Distance measure: the distance reflects the similarity of two instances in feature space. Euclidean distance, Manhattan distance, and so on can be used.
- Classification decision rule: majority vote is most often used.
K-means
Algorithm steps:
1. Randomly select K objects from n data as the initial cluster centers;
2. Compute the distance between each data point and each cluster center (the mean of each cluster's objects), and assign each point to the nearest center according to the minimum-distance criterion;
3. Recalculate the mean of each changed cluster and take it as the new cluster center;
4. Repeat steps 2 and 3 until the clusters no longer change.
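The four steps above can be sketched as follows. This is an illustrative Python sketch, not the author's implementation; it assumes points are numeric tuples, recomputes each center as the cluster mean, and caps the loop with a maximum iteration count.

```python
# Minimal K-means sketch (illustrative, not the author's implementation).
import random

def kmeans(points, k, max_iter=200, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly select k points as the initial cluster centers
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Step 3: recompute each center as the mean of its cluster
        new_centers = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centers.append(tuple(sum(xs) / len(cl) for xs in zip(*cl)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        # Step 4: stop once the centers no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

On well-separated data the loop typically converges long before `max_iter`; the final clusters still depend on the random initial centers, which is the instability discussed later in this post.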
The basic elements of the K-means method:
- Choice of K: that is, determining the number of clusters, similar to the choice of K in K-nearest neighbor.
- Distance measure: Euclidean distance, Manhattan distance, and so on can be used.
Application Example
Problem Description
The gender, height, and weight of a number of people are known. Given a new person's height and weight, we consider using the K-nearest neighbor algorithm to classify gender, and K-means to cluster by gender.
Data
Dataset: https://github.com/shuaijiang/FemaleMaleDatabase
The dataset contains a training set and a test set, which are used with the K-nearest neighbor algorithm and the K-means method to perform gender classification and clustering, respectively.
Plotting the training data makes it easier to observe the relationships and differences between data samples, as well as the differences between genders.
Data Display
KNN Classification Results
Basic settings of the KNN algorithm:
- k = 5
- Distance measure: Euclidean distance
- Classification decision rule: majority vote
- Test set: https://github.com/shuaijiang/FemaleMaleDatabase/blob/master/test0.txt
Using the KNN algorithm, the results on the test set are shown in the confusion matrix below. As can be seen from the table, all males in the test set are classified correctly; one female in the test set is classified incorrectly and the others are classified correctly.
| Confusion matrix | Test:male | Test:female |
| --- | --- | --- |
| Result:male | 20 | 1 |
| Result:female | 0 | 14 |
(Note: Test:male and Test:female denote males and females in the test set; Result:male and Result:female denote males and females in the classification results. The first cell of the table, in column Test:male and row Result:male, gives the number of test samples that are male and classified as male; the other cells are read analogously.)
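The classification accuracy can be computed directly from this confusion matrix; a quick sketch using the numbers in the table:

```python
# Accuracy from the confusion matrix above:
# rows = predicted (male, female), columns = actual (male, female)
confusion = [[20, 1],
             [0, 14]]

correct = confusion[0][0] + confusion[1][1]   # diagonal: correctly classified samples
total = sum(sum(row) for row in confusion)    # all test samples
accuracy = correct / total
print(f"{accuracy:.2%}")  # 97.14%
```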
From the table above, the classification accuracy can be calculated: (20+14)/(20+1+0+14) = 97.14%.
K-means Clustering Results
Basic settings of the K-means algorithm:
- k = 2
- Distance measure: Euclidean distance
- Maximum number of iterations: 200
- Category decision rule: each cluster's category is decided by the majority of its members
- Test set: https://github.com/shuaijiang/FemaleMaleDatabase/blob/master/test0.txt
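The majority-based category decision rule can be sketched as follows. This is an illustrative helper, not the author's implementation; the cluster assignments and labels in the example are hypothetical.

```python
# Assign a category to each cluster by majority vote over its members' true labels.
from collections import Counter

def cluster_categories(assignments, labels):
    # assignments[i] = cluster index of sample i; labels[i] = true label of sample i
    by_cluster = {}
    for c, y in zip(assignments, labels):
        by_cluster.setdefault(c, []).append(y)
    # each cluster takes the most common label among its members
    return {c: Counter(ys).most_common(1)[0][0] for c, ys in by_cluster.items()}
```

For instance, if cluster 0 contains two males and one female, cluster 0 is decided to be the "male" cluster, and its female member counts as a clustering error.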
| Confusion matrix | Test:male | Test:female |
| --- | --- | --- |
| Result:male | 20 | 1 |
| Result:female | 0 | 14 |
(Note: this table is read the same way as the table above.)
Since the initial centers are selected randomly, the clustering result differs from run to run. In the best case the clusters are completely correct; in the worst case the two clusters are not separated at all, and since each cluster's category is decided by majority vote, both end up labeled with the same category.
KNN vs. K-means
Similarities between the two:
- The choice of K is similar
- Similar idea: a sample's properties are determined by the samples nearest to it
Differences between the two:
- Application scenario: the former solves classification or regression problems, the latter clustering problems;
- Algorithm complexity: the former is O(n^2), the latter O(kmn) (k is the number of clusters, m the number of iterations, n the number of samples);
- Stability: the former is stable, the latter is unstable (the result depends on the random initial centers).
Summary
This post described the K-nearest neighbor algorithm and the K-means algorithm in detail and compared their steps. On this basis, by applying the two methods to a practical problem, their similarities, differences, and respective strengths and weaknesses were compared in more depth. The author also implemented the K-nearest neighbor algorithm and the K-means algorithm, applied them to the specific problem, and obtained the final results.
The above content inevitably contains mistakes and errors; corrections are welcome.
Source Code
KNN: https://github.com/shuaijiang/KNN
K-means: https://github.com/shuaijiang/k-means
Reference
Li Hang. Statistical Learning Methods. Tsinghua University Press, 2012.