There are many classification algorithms: Bayesian methods, decision trees, support vector machines, KNN, and so on; neural networks can also be used for classification. This article describes the KNN classification algorithm.
1. Introduction
KNN is short for K Nearest Neighbors: the K nearest neighbors vote for the class label of a new instance. KNN is an instance-based learning algorithm. Unlike Bayesian or decision tree algorithms, KNN requires no training phase; for this reason it is also called lazy learning. When a new instance appears, the algorithm finds the K nearest instances in the training set and assigns the new instance to the class that holds the majority among those K instances. Its classification accuracy is high when the class boundaries are clean. The value of K must be chosen manually, that is, we must decide how many nearest instances to look for, and different K values may lead to different classification results.
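The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not an optimized implementation; the `((x, y), label)` pair format for training instances is an assumption made here for simplicity.

```python
from collections import Counter
import math


def knn_classify(new_point, training_data, k):
    """Classify new_point by majority vote among its k nearest
    training instances, using Euclidean distance.

    training_data is a list of ((x, y), label) pairs -- a format
    assumed here purely for illustration.
    """
    # No training phase: all the work happens at query time,
    # which is why KNN is called "lazy learning".
    neighbors = sorted(
        training_data,
        key=lambda item: math.dist(new_point, item[0]),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

For example, a point surrounded by three instances of class `'a'` would be assigned to `'a'` even if distant `'b'` instances exist elsewhere in the training set.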
2. Example
As shown in the figure, the training dataset is divided into three classes (drawn in three different colors). Now a new instance (the green dot in the figure) appears. Suppose K = 5, that is, we look for the five closest instances, where distance is defined as Euclidean distance. To find the K nearest instances, draw a circle centered on the green point and grow its radius until the circle contains exactly K points.
Inside the red circle, three points belong to Category 2 and two to Category 3, while none belong to Category 1. Voting on the majority-rules principle, the new green instance should be assigned to Category 2.
3. Selecting the K value
As mentioned above, the choice of K affects the classification result, so K must be chosen sensibly. Continuing with the classification example above, now set K = 7, as shown in the figure:
We can see that with K = 7, the seven nearest points include three in Category 1, two in Category 2, and two in Category 3. In this case the new green instance should be assigned to Category 1, which differs from the result when K = 5.
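The flip in the predicted class can be demonstrated with a small vote-counting sketch. The label sequence below is hypothetical (it is not the exact tally from the figures); it simply shows that enlarging K can change the winner of the vote.

```python
from collections import Counter


def majority(labels):
    """Return the most common label in the list."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical neighbour labels, already sorted by distance
# (nearest first). Not taken from the article's figures.
sorted_labels = [2, 2, 1, 1, 3, 1, 1]

print(majority(sorted_labels[:3]))  # K = 3 -> class 2
print(majority(sorted_labels[:7]))  # K = 7 -> class 1
```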
There is no absolute standard for selecting K, but intuitively a larger K does not necessarily improve accuracy, and since finding the K nearest neighbors is an O(K · n) algorithm (for n training instances), a large K also makes the algorithm slower.
Because the choice of K affects the result, some people consider the algorithm unstable. In practice the effect is not that large: it matters only for instances near the class boundaries, while instances near a class center are barely affected. For such an instance the result is the same whether K = 3, K = 5, or K = 11.
Finally, note that when the dataset is imbalanced, you may need to weight the votes by the proportion of each class, so that the accuracy on small classes does not become too low.
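One simple way to implement such proportional voting is to weight each neighbor's vote by the inverse of its class size. The sketch below assumes the per-class training counts are available in a `class_counts` mapping; the function name and data format are illustrative, not a standard API.

```python
def weighted_vote(neighbor_labels, class_counts):
    """Vote weighted by inverse class frequency.

    class_counts maps each label to the number of training instances
    in that class, so a vote from a rare class counts for more and a
    small class is not drowned out by an abundant one.
    """
    scores = {}
    for label in neighbor_labels:
        scores[label] = scores.get(label, 0.0) + 1.0 / class_counts[label]
    return max(scores, key=scores.get)
```

For example, with a 90-instance majority class and a 10-instance minority class, three majority neighbors contribute 3/90 ≈ 0.033 while two minority neighbors contribute 2/10 = 0.2, so the minority class wins a vote that plain majority voting would have lost.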