I. Overview
The k-nearest neighbor (k-Nearest Neighbor, KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. Its idea: if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space) belong to a certain category, then the sample also belongs to that category. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. The method decides the category of the sample to be classified based only on the category of the nearest one or few samples. Although KNN relies on the limit theorem in principle, the classification decision involves only a small number of adjacent samples. Because KNN relies mainly on the limited number of surrounding samples, rather than on discriminating class domains, to determine the category, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily.
II. Principle
The basic idea: if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space) belong to a certain category, then the sample also belongs to that category. The method decides the category of the sample to be classified based only on the category of the nearest one or few samples.
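As a minimal sketch of this idea, assuming scikit-learn is installed (KNeighborsClassifier is not part of the original text; a from-scratch version appears in Section VI):

from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]  # training samples
y = ['A', 'A', 'B', 'B']                        # their known categories
clf = KNeighborsClassifier(n_neighbors=3)       # k = 3
clf.fit(X, y)                                   # lazy learning: fit just stores the data
print(clf.predict([[0, 0.2]]))                  # -> ['B']: two of its three nearest neighbors are 'B'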
III. Advantages and Disadvantages
1. Advantages:
1) High accuracy, insensitive to outliers, no assumptions about the input data;
2) The KNN algorithm itself is simple and effective. It is a lazy-learning algorithm: the classifier needs no training on the training set, so its training time complexity is effectively zero.
3) Since the KNN method relies mainly on the limited number of surrounding neighboring samples, rather than on discriminating class domains, to determine the category, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily.
4) The KNN algorithm can be used not only for classification but also for regression (a regression sketch follows the disadvantages list below).
2. Disadvantages:
1) High computational complexity: the cost of classifying with KNN is proportional to the number of documents in the training set, i.e., if the training set contains N documents, the time complexity of classifying one sample is O(N).
2) High space complexity: the entire training set must be stored and scanned at classification time. A further issue is class imbalance: when one class has a very large sample capacity and the others are very small, the k nearest neighbors of a newly input sample may be dominated by the large class (see Section IV.3).
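Regarding advantage 4) above, here is a minimal sketch of KNN regression, which replaces the majority vote with the mean target value of the k nearest neighbors (the toy data is an assumption for illustration):

import numpy as np

def knnRegress(inX, dataSet, targets, k):
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))  # Euclidean distances
    nearest = distances.argsort()[:k]                        # indices of the k closest points
    return targets[nearest].mean()                           # average instead of majority vote

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.1, 1.9, 3.2])
print(knnRegress(np.array([1.5]), X, y, 2))  # -> 1.5, the mean of 1.1 and 1.9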
IV. Points to Note
1. The choice of k
The choice of k has a significant impact on the results of the algorithm. A smaller k means that only training instances close to the input instance influence the prediction, which keeps the approximation error low but makes the model sensitive to noise and prone to overfitting. A larger k reduces the estimation error of learning, but the approximation error increases, because training instances far from the input instance also influence the prediction. In practice, k is usually chosen to be a fairly small value, and cross-validation is used to select the optimal k. As the number of training instances tends to infinity with k = 1, the error rate of the nearest-neighbor rule does not exceed twice the Bayes error rate; if k also tends to infinity, the error rate tends to the Bayes error rate.
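One hedged illustration of selecting k by cross-validation, assuming scikit-learn and its bundled iris dataset are available (neither is part of the original text):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
bestK, bestScore = 1, 0.0
for k in range(1, 16):                               # try a range of small k values
    clf = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(clf, X, y, cv=5).mean()  # mean 5-fold cross-validation accuracy
    if score > bestScore:
        bestK, bestScore = k, score
print(bestK, bestScore)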
2. Distance metrics
Commonly used metrics include the Euclidean distance, the Mahalanobis distance, the cosine distance, and so on.
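A minimal sketch of these three metrics using NumPy only (the covariance matrix used for the Mahalanobis distance is an illustrative assumption):

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

euclidean = np.sqrt(((x - y) ** 2).sum())                        # straight-line distance
cosine = 1 - x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))  # 1 minus the cosine of the angle
VI = np.linalg.inv(np.array([[2.0, 0.3], [0.3, 1.0]]))           # inverse covariance (assumed values)
mahalanobis = np.sqrt((x - y).dot(VI).dot(x - y))
print(euclidean, cosine, mahalanobis)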
3. Class sample balance
When the classes are unbalanced, e.g., one class has a very large sample capacity while other classes are very small, the k nearest neighbors of a newly input sample may be dominated by the large-capacity class. In that case, one can compress the categories that have more samples, or use weighting coefficients to decide which category the test point belongs to (a sketch follows below).
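A hedged sketch of distance-weighted voting; the 1/d weight is one common choice rather than the only one (scikit-learn exposes the same idea as weights='distance'):

import numpy as np

def weightedClassify(inX, dataSet, labels, k):
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    votes = {}
    for i in distances.argsort()[:k]:
        w = 1.0 / (distances[i] + 1e-8)                   # closer neighbors vote more strongly
        votes[labels[i]] = votes.get(labels[i], 0.0) + w
    return max(votes, key=votes.get)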
V. Algorithm Steps
1) Compute distances: compute the distance from each point in the known-category data set to the current point;
2) Find neighbors: find the k training objects closest to the test object, which serve as its nearest neighbors;
3) Classify: assign the test object to the category that occurs most frequently among its k nearest neighbors.
VI. Sample Code (Python)
from numpy import array, tile
import operator

def classify0(inX, dataSet, labels, k):
    # 1) Compute the Euclidean distance from inX to every point in dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # replicate inX into a matrix and subtract
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # 2) Find the k nearest neighbors: indices sorted by ascending distance
    sortedDistIndicies = distances.argsort()
    # 3) Take a majority vote among the labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))  # -> 'B'
VII. Industry Applications of the Algorithm
VIII. Related Improvements to the Algorithm
k-d trees, sample compression techniques.
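As one sketch of the k-d tree idea, assuming SciPy is available: scipy.spatial.cKDTree replaces the O(N) brute-force distance scan with a tree whose average query cost is O(log N).

import numpy as np
from scipy.spatial import cKDTree

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

tree = cKDTree(group)                    # build the tree once over the training set
dists, idx = tree.query([0, 0], k=3)     # k nearest neighbors of the query point
votes = [labels[i] for i in idx]
print(max(set(votes), key=votes.count))  # -> 'B', matching classify0 in Section VI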
IX. References
http://www.stanford.edu/~hastie/Papers/dann_IEEE.pdf
Machine Learning II: k-Nearest Neighbor (KNN) Algorithm