Introduction to the KNN algorithm
KNN (K-Nearest Neighbor) is one of the simplest classification algorithms in data mining. Its guiding idea is the proverb "he who stays near vermilion turns red; he who stays near ink turns black": a sample's category is inferred from the categories of its neighbors.
The implementation principle of the KNN nearest-neighbor classification algorithm: to determine the category of an unknown sample, all samples of known category are used as a reference. The distances between the unknown sample and every known sample are calculated, the K known samples closest to the unknown sample are selected, and, following the majority-voting rule, the unknown sample is assigned to the category that appears most often among those K nearest samples.
The above is the basic principle of the KNN algorithm in classification tasks. The letter K denotes the number of nearest sample instances to consider. In scikit-learn, the K value is set through the n_neighbors parameter, whose default value is 5.
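A minimal sketch of this in scikit-learn; the iris dataset and the train/test split are used purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small example dataset and hold out part of it for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_neighbors is the K value; 5 is the default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out split
```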
Consider the classic illustration of a green circle surrounded by red triangles and blue squares: which category should the green circle belong to, the red triangles or the blue squares? If K = 3, the red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red-triangle category. If K = 5, the blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue-square category.
Since the KNN nearest-neighbor classification algorithm decides the category of a sample based only on the categories of the one or several nearest samples, rather than by discriminating class domains, KNN is better suited than other methods to sample sets whose class domains have considerable crossover or overlap.
Key points of the KNN algorithm:
(1) All sample features must be quantified as comparable values
If the sample features include non-numerical types, measures must be taken to quantify them as numerical values. For example, if a sample feature contains a color, the distance calculation can be made possible by converting the color to a grayscale value.
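A minimal sketch of one way to do this, assuming the color is stored as an RGB triple (the feature names and values here are made up for illustration):

```python
def rgb_to_gray(rgb):
    """Convert an (R, G, B) triple to a single grayscale value (BT.601 luma)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

# A 'color' feature becomes a numeric feature usable in distance calculations
sample = {"weight": 120.0, "color": (200, 30, 30)}
numeric_features = [sample["weight"], rgb_to_gray(sample["color"])]
print(numeric_features)
```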
(2) Sample features need to be normalized
A sample has multiple features, each with its own domain and value range, so their influence on the distance calculation differs: a feature with large values will overwhelm one with small values. The sample features must therefore be scaled, and the simplest approach is to normalize all feature values.
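A minimal sketch of min-max normalization with scikit-learn; the array below is made-up example data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features with very different ranges: height in cm and income in dollars
X = np.array([[170.0, 30000.0],
              [160.0, 90000.0],
              [180.0, 45000.0]])

# Rescale every feature to [0, 1] so no single feature dominates the distance
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```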
(3) A distance function is needed to calculate the distance between two samples
Commonly used distance functions include Euclidean distance, cosine distance, Hamming distance, and Manhattan distance. Euclidean distance is generally chosen as the distance measure, but it is only applicable to continuous variables. For discrete variables, such as in text classification, Hamming distance can be used instead. In general, using specialized metric-learning algorithms, such as large-margin nearest neighbor (LMNN) or neighborhood components analysis (NCA), can significantly improve the accuracy of K-nearest-neighbor classification.
Take the distance between A(x1, y1) and B(x2, y2) in two-dimensional space as an example: the Euclidean distance is sqrt((x1 - x2)^2 + (y1 - y2)^2), while the Manhattan distance is |x1 - x2| + |y1 - y2|.
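A minimal sketch of these two distance functions in Python; the example points are made up:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

A, B = (1.0, 2.0), (4.0, 6.0)
print(euclidean_distance(A, B))  # 5.0
print(manhattan_distance(A, B))  # 7.0
```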
(4) Determine the value of K
If K is too large, the model tends to underfit; if K is too small, it tends to overfit. Cross-validation is needed to choose an appropriate K.
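A minimal sketch of choosing K by cross-validation with scikit-learn; the candidate range of 1 to 15 is an arbitrary choice for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate each candidate K with 5-fold cross-validation and keep the best one
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```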
Advantages of the KNN algorithm:
1. Simple, easy to understand and implement; no parameters to estimate and no training phase;
2. Suitable for classifying rare events;
3. Especially suitable for multi-class (multi-modal) problems in which objects carry multiple category labels; on such problems kNN often performs better than SVM.
Disadvantages of the KNN algorithm:
The main disadvantage of KNN in classification is its behavior on imbalanced data: when one class has very many samples while the other classes have very few, the majority class may dominate the K neighbors of a new input. Because the algorithm only considers the nearest samples, a class with many samples may either not be close to the target sample at all, or be very close to it; either way, sheer quantity should not decide the result. This can be improved with a weighting scheme in which neighbors closer to the sample receive a larger weight.
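One way to apply such a weighting in scikit-learn is the weights="distance" option of KNeighborsClassifier; a minimal sketch, again on a toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# weights="distance" makes closer neighbors count more than distant ones,
# which softens the effect of a class that merely has more samples
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```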
Another shortcoming of this method is its high computational cost: for every sample to be classified, the distance to all known samples must be computed in order to find its K nearest neighbors.
Interpretability is also poor; the algorithm cannot produce explicit rules the way a decision tree can.
KNN algorithm implementation
It is not difficult to implement the KNN algorithm yourself. There are three main steps (a sketch in Python follows the list):
Calculate distances: given a sample to be classified, calculate its distance to every sample in the labeled set;
Find neighbors: take the K labeled samples closest to the sample to be classified as its nearest neighbors;
Classify: assign the sample to the class to which most of the K neighbors belong.
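A minimal from-scratch sketch of these three steps in Python; the training points and labels below are made-up example data:

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k=3):
    """Classify `query` by majority vote among its k nearest labeled samples."""
    # Step 1: calculate the distance from the query to every labeled sample
    distances = [
        (math.sqrt(sum((q - s) ** 2 for q, s in zip(query, sample))), label)
        for sample, label in zip(samples, labels)
    ]
    # Step 2: find the k nearest neighbors
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # Step 3: majority vote over the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up example data: two clusters in 2-D
samples = [(1.0, 1.2), (0.8, 0.9), (1.1, 1.0), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_classify((1.0, 1.1), samples, labels, k=3))  # expected "A"
print(knn_classify((5.0, 5.0), samples, labels, k=3))  # expected "B"
```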