1. Introduction

KNN is a classification algorithm. Its main application is recognizing unknown objects: given an unknown sample, it judges which known class the sample belongs to by finding, based on Euclidean distance, the known samples whose features are closest to it. The idea is this: if most of the K most similar samples in feature space (that is, the nearest neighbors in the feature space) belong to a certain category, then the sample belongs to that category. In the KNN algorithm, the selected neighbors are objects that have already been correctly classified, and the method decides the category of the new sample based on the category of the nearest one or several samples. Although KNN relies in principle on a limit theorem, the classification decision involves only a small number of neighboring samples. Because KNN determines the category mainly from a limited number of surrounding samples rather than from discriminating class regions, it is better suited than other methods to problems where class regions overlap.
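The idea above can be shown with a minimal sketch (the function and sample data here are illustrative, not the implementation given later in this post):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs; squared Euclidean
    distance is used for ranking, matching the description above.
    """
    neighbors = sorted(
        train,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two clusters: points near (1, 1) are 'A', points near (0, 0) are 'B'.
data = [((1.0, 1.1), 'A'), ((1.0, 1.0), 'A'), ((0.0, 0.0), 'B'), ((0.0, 0.1), 'B')]
print(knn_predict(data, (0.9, 0.9), k=3))  # → A
```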

2. Model

The three elements of KNN are: the distance metric, the choice of k, and the classification decision rule. The commonly used distance is the Lp (Minkowski) distance, Lp(xi, xj) = (sum_l |xi(l) - xj(l)|^p)^(1/p): when p = 1 it is the Manhattan distance; when p = 2, the Euclidean distance; as p approaches infinity it becomes the Chebyshev distance max_l |xi(l) - xj(l)|. Choice of k: the smaller k is, the smaller the approximation error but the larger the estimation error; the model is more complex and tends to overfit. The larger k is, the smaller the estimation error but the larger the approximation error; the model is simpler and tends to underfit. The value of k is generally chosen by cross-validation and is usually small. For the classification decision, the common rule is majority voting; under 0-1 loss, minimizing the probability of misclassification is equivalent to minimizing the empirical risk.
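The distance element can be illustrated with a small sketch of the Lp distance (a hypothetical helper, assuming plain Python tuples as vectors):

```python
def lp_distance(x, y, p):
    """Minkowski (Lp) distance between two equal-length vectors.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance;
    as p grows the value approaches the Chebyshev distance max|x_i - y_i|.
    """
    if p == float('inf'):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (1.0, 1.0), (4.0, 5.0)
print(lp_distance(x, y, 1))             # → 7.0 (Manhattan)
print(lp_distance(x, y, 2))             # → 5.0 (Euclidean)
print(lp_distance(x, y, float('inf')))  # → 4.0 (Chebyshev)
```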

In practice, KNN is generally implemented with either linear scanning or a kd-tree. A kd-tree is a binary tree that partitions k-dimensional space: it recursively splits the data with hyperplanes perpendicular to a coordinate axis, at each split choosing the median of the data along that axis as the splitting point.
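A minimal construction sketch under these rules (the `KDNode` class is illustrative; the point set is a common textbook example):

```python
class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build_kdtree(points, depth=0):
    """Build a kd-tree by cycling through the axes and splitting at the
    median along the current axis, as described above."""
    if not points:
        return None
    axis = depth % len(points[0])          # splitting axis cycles per level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point becomes this node
    return KDNode(
        points[mid], axis,
        left=build_kdtree(points[:mid], depth + 1),
        right=build_kdtree(points[mid + 1:], depth + 1),
    )

root = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(root.point)  # → (7, 2), the median along the first axis
```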

The kd-tree is used in KNN in two phases, construction and search. The construction phase builds the binary tree according to the partitioning criterion above. The search process for a query point `q` is as follows:

(1) Starting from the root node, recursively descend to the leaf node containing `q`, at each level comparing the corresponding coordinate `xi` of `q` with the node's splitting value.

(2) Take this leaf node as the current "approximate nearest point".

(3) Recursively back up. If the sphere centered at `q` with radius equal to the distance to the current approximate nearest point intersects the splitting boundary of a node's other subtree, that subtree may contain a point closer to `q`; enter that subtree, search for such a point, and update the approximate nearest point.

(4) Repeat step (3) until the sphere no longer intersects the other subtree's region, or the root node is reached.

(5) The final "approximate nearest point" is the true nearest point to `q`.
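The search steps above can be sketched as a self-contained toy implementation for the 1-nearest-neighbor case (the names and the point set are illustrative):

```python
class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build(points, depth=0):
    """Median-split kd-tree construction, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build(points[:mid], depth + 1),
                  build(points[mid + 1:], depth + 1))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, q, best=None):
    """Return the point nearest to q, following steps (1)-(5): descend to a
    leaf, then back up, entering the far subtree only when the sphere around
    q (radius = distance to the current best) crosses the splitting plane."""
    if node is None:
        return best
    if best is None or sq_dist(q, node.point) < sq_dist(q, best):
        best = node.point                      # update approximate nearest point
    axis = node.axis
    near, far = ((node.left, node.right) if q[axis] < node.point[axis]
                 else (node.right, node.left))
    best = nearest(near, q, best)              # steps (1)-(2): the near side first
    if (q[axis] - node.point[axis]) ** 2 < sq_dist(q, best):
        best = nearest(far, q, best)           # steps (3)-(4): sphere crosses the split
    return best

root = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(root, (3, 4.5)))  # → (2, 3)
```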

3. Summary

From the simple introduction above, the average search complexity of KNN with a kd-tree is O(log n), and the algorithm is suitable when the number of instances is much larger than the dimensionality (the number of attributes). Analyzing both complexity and effectiveness, KNN is a fairly efficient algorithm. Let the test sample be x and its nearest neighbor be z; then the error probability is P(err) = 1 - sum_c P(c|x)P(c|z). Comparing with the result of the Bayes-optimal classifier, the generalization error rate of the nearest-neighbor rule does not exceed twice that of the Bayes-optimal classifier.
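The bound quoted above follows from a short derivation (the standard argument, assuming the nearest neighbor $z$ is close enough to $x$ that $P(c \mid z) \approx P(c \mid x)$, with $c^*$ denoting the Bayes-optimal class):

```latex
P(err) = 1 - \sum_{c} P(c \mid x)\, P(c \mid z)
       \simeq 1 - \sum_{c} P(c \mid x)^2
       \le 1 - P(c^* \mid x)^2
       = \bigl(1 + P(c^* \mid x)\bigr)\bigl(1 - P(c^* \mid x)\bigr)
       \le 2\bigl(1 - P(c^* \mid x)\bigr).
```

The last factor, $1 - P(c^* \mid x)$, is exactly the error rate of the Bayes-optimal classifier, which gives the factor-of-two bound.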

Related blogs: http://blog.csdn.net/jmydream/article/details/8644004, https://my.oschina.net/u/1412321/blog/194174

Here is an implementation of the code:

```python
from numpy import *
import operator
from os import listdir


def classify0(inX, dataSet, labels, k):
    """Classify inX by majority vote among its k nearest rows of dataSet."""
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())      # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))    # prepare matrix to return
    classLabelVector = []                    # prepare labels to return
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector


def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals


def datingClassTest():
    hoRatio = 0.50      # hold out 50%
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)


def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
```

KNN (k-nearest Neighbor) algorithm