KNN Supplement:
1. How large should K be set?
If K is too small, the classification result is susceptible to noise points; if K is too large, the neighborhood may contain too many points from other classes.
(Distance weighting can reduce the sensitivity to the choice of K.)
The value of K is usually determined by cross-validation (with k=1 as the baseline).
Rule of thumb: K is generally set below the square root of the number of training samples.
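Choosing K by cross-validation can be sketched as follows. This is a minimal pure-Python illustration; the helper names `knn_predict` and `cv_accuracy` are my own, not from the original text.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    # majority vote among the k training points nearest to x (Euclidean)
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, k, folds=5):
    # simple k-fold cross-validation accuracy for a given K:
    # fold f holds out every index i with i % folds == f
    n = len(X)
    correct = 0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        test = [i for i in range(n) if i % folds == f]
        tX = [X[j] for j in train]
        ty = [y[j] for j in train]
        for i in test:
            if knn_predict(tX, ty, X[i], k) == y[i]:
                correct += 1
    return correct / n
```

One would then try a range of K values (say, 1 up to the square root of the sample count) and keep the K with the best cross-validated accuracy.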
2. How to determine the most appropriate category?
Weighted voting is usually more appropriate than a simple majority vote; how to weight the votes needs to be explored based on the specific business and data characteristics.
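A minimal sketch of one common weighting scheme, inverse-distance voting; the 1/(d + eps) formula here is an illustrative assumption, not a prescription from the text:

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, x, k, eps=1e-9):
    # each of the k nearest neighbors votes with weight 1/(distance + eps),
    # so closer neighbors count for more than distant ones
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))
    scores = defaultdict(float)
    for d, label in dists[:k]:
        scores[label] += 1.0 / (d + eps)
    return max(scores, key=scores.get)
```

With this scheme, a single very close neighbor can outvote two distant ones, even though a plain majority vote would go the other way.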
3. How to choose the right distance metric?
The impact of high dimensionality on distance measures: it is well known that as the number of variables grows, Euclidean distance becomes less and less discriminating.
The effect of variable range on distance: variables with larger ranges tend to dominate the distance calculation, so the variables should be normalized first.
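As an illustration of the normalization step, min-max scaling rescales every variable to [0, 1] before distances are computed (the helper name is my own):

```python
def min_max_normalize(X):
    # rescale each column to [0, 1] so that no variable dominates the
    # distance calculation purely because of its measurement range
    cols = list(zip(*X))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, mins, maxs))
        for row in X
    ]
```

Without this step, a feature measured in the thousands (e.g. income) would swamp a feature measured in single digits (e.g. years of tenure) in the Euclidean distance.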
4. Should training samples be treated equally?
In the training set, some samples may be more trustworthy than others; in other words, this is a question of sample data quality.
Different weights can be applied to different samples, increasing the weight of reliable samples and reducing the impact of unreliable ones.
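The idea of per-sample weights can be sketched by letting each neighbor vote with its own reliability weight. The `sample_w` array and the helper name are assumptions of this sketch, not from the original text:

```python
import math
from collections import defaultdict

def sample_weighted_knn(train_X, train_y, sample_w, x, k):
    # sample_w[i] is a per-sample reliability weight (hypothetical input);
    # each of the k nearest neighbors votes with its own weight
    dists = sorted(
        (math.dist(x, xi), yi, wi)
        for xi, yi, wi in zip(train_X, train_y, sample_w)
    )
    scores = defaultdict(float)
    for _, label, w in dists[:k]:
        scores[label] += w
    return max(scores, key=scores.get)
```

How the reliability weights themselves are obtained (manual labeling, source quality, agreement with neighbors, etc.) is the business-specific part.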
5. Performance problems?
KNN is a lazy algorithm: it does not study hard during training, and only crams at test time, temporarily searching for the K nearest neighbors when a test sample has to be classified.
The consequence of laziness: building the model is simple, but the system overhead of classifying a test sample is large, because all training samples must be scanned and their distances computed.
There are a number of ways to improve computational efficiency, such as compressing the training samples.
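One classic way to compress the training set is Hart's condensed nearest neighbor rule, sketched below in simplified form (one option among several; the helper name is my own):

```python
import math

def condense(train_X, train_y):
    # Hart's condensed nearest neighbor (simplified): grow a kept subset,
    # adding only samples that a 1-NN classifier built on the kept subset
    # would currently misclassify; repeat until no more additions occur
    kept_X, kept_y = [train_X[0]], [train_y[0]]
    changed = True
    while changed:
        changed = False
        for x, y in zip(train_X, train_y):
            # classify x with 1-NN on the current kept subset
            _, pred = min(
                (math.dist(x, xk), yk) for xk, yk in zip(kept_X, kept_y)
            )
            if pred != y:
                kept_X.append(x)
                kept_y.append(y)
                changed = True
    return kept_X, kept_y
```

For well-separated classes this can shrink the scan set dramatically while preserving 1-NN decisions; spatial indexes such as KD-trees are another standard speed-up.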