Algorithm principle: we start from a set of training samples, each of which carries a tag (label); that is, we know the correspondence between every sample in the set and the category it belongs to. When new, unlabeled data arrives, we compare each feature of the new data with the corresponding features of the samples in the set and take the class labels of the most similar samples. Typically we select the k most similar samples in the set and assign the new data to the category that appears most frequently among those k labels. Put simply, the k-nearest neighbor algorithm classifies by measuring the distance between feature values.
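The voting idea above can be shown on a toy example. The points, labels, and function name below are made up purely for illustration; the real classifier for handwritten digits follows later in the post.

```python
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = [((x - query[0]) ** 2 + (y - query[1]) ** 2) ** 0.5
             for x, y in train_points]
    # indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # most frequent label among the k nearest neighbors wins
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ['A', 'A', 'A', 'B', 'B', 'B']
print(knn_predict(points, labels, (1.1, 1.0)))  # query near cluster A -> prints A
```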
Algorithm advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Algorithm disadvantage: high time and space complexity, because the distance between each item to be classified and every sample in the sample set must be computed feature by feature.
Algorithm implementation (handwriting recognition)
1. Data preparation: the samples are 32*32-pixel black-and-white images of the digits 0-9, with about 200 samples per digit; the trainingDigits set is used to train the classifier and the testDigits set to test it. To make things easier to follow, the images have been converted to text format.
2. Code implementation:
Converting an image to a vector: we flatten the 32*32 binary image matrix into a 1*1024 vector with a function vector2d, as in the following code:
def vector2d(filename):
    rows = 32
    cols = 32
    imgVector = zeros((1, rows * cols))
    fileIn = open(filename)
    for row in xrange(rows):
        # each line of the text file is one row of 32 '0'/'1' characters
        lineStr = fileIn.readline()
        for col in xrange(cols):
            imgVector[0, row * 32 + col] = int(lineStr[col])
    fileIn.close()
    return imgVector
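To see what this flattening does, here is a version-neutral sketch of the same row * cols + col indexing on plain Python lists; the 4*4 size, the function name, and the in-memory "text image" are made up for illustration:

```python
def flatten_image(lines, rows=4, cols=4):
    # same indexing as vector2d: pixel (row, col) lands at position row * cols + col
    vec = [0] * (rows * cols)
    for row in range(rows):
        for col in range(cols):
            vec[row * cols + col] = int(lines[row][col])
    return vec

# a tiny made-up 4*4 "text image"
lines = ["0110",
         "1001",
         "1001",
         "0110"]
print(flatten_image(lines))
# prints [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
```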
Loading the trainingData and testData sets
def loadDataSet():
    print '.... Getting TrainingData'
    dataSetDir = 'd:/pythoncode/mlcode/knn/'
    trainingFileList = os.listdir(dataSetDir + 'trainingdigits')
    numSamples = len(trainingFileList)

    train_x = zeros((numSamples, 1024))
    train_y = []
    for i in xrange(numSamples):
        filename = trainingFileList[i]
        train_x[i, :] = vector2d(dataSetDir + 'trainingdigits/%s' % filename)
        # the file name encodes the digit before the underscore
        label = int(filename.split('_')[0])
        train_y.append(label)

    print '.... Getting TestingData...'
    testFileList = os.listdir(dataSetDir + 'testdigits')
    numSamples = len(testFileList)
    test_x = zeros((numSamples, 1024))
    test_y = []
    for i in xrange(numSamples):
        filename = testFileList[i]
        test_x[i, :] = vector2d(dataSetDir + 'testdigits/%s' % filename)
        label = int(filename.split('_')[0])
        test_y.append(label)

    return train_x, train_y, test_x, test_y
Constructing the classifier
from numpy import *
import os

def knnClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]

    # Euclidean distance between the new input and every training sample
    diff = tile(newInput, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = sum(squaredDiff, axis=1)
    distance = squaredDist ** 0.5

    # indices of the samples sorted by increasing distance
    sortedDistIndex = argsort(distance)

    # majority vote among the k nearest neighbors
    classCount = {}
    for i in xrange(k):
        votedLabel = labels[sortedDistIndex[i]]
        classCount[votedLabel] = classCount.get(votedLabel, 0) + 1

    maxValue = 0
    for key, value in classCount.items():
        if maxValue < value:
            maxValue = value
            maxIndex = key
    return maxIndex
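As an aside, the hand-rolled max-vote loop at the end of knnClassify can also be written with the standard library's collections.Counter; this is an alternative sketch, not the code used above:

```python
from collections import Counter

def majority_vote(voted_labels):
    # most_common(1) returns [(label, count)] for the top-voted label
    return Counter(voted_labels).most_common(1)[0][0]

print(majority_vote([3, 3, 5]))  # 3 wins with two votes -> prints 3
```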
Classification test
def testHandWritingClass():
    print 'Load Data....'
    train_x, train_y, test_x, test_y = loadDataSet()
    print 'Training ...'

    print 'Testing'
    numTestSamples = test_x.shape[0]
    matchCount = 0.0
    for i in xrange(numTestSamples):
        predict = knnClassify(test_x[i], train_x, train_y, 3)
        if predict != test_y[i]:
            print 'The predict is', predict, 'the target value is', test_y[i]
        else:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    print 'The accuracy is: %.2f%%' % (accuracy * 100)
Test results
testHandWritingClass()
Load Data....
.... Getting TrainingData
.... Getting TestingData...
Training ...
Testing
The predict is 7 the target value is 1
The predict is 9 the target value is 3
The predict is 9 the target value is 3
The predict is 3 the target value is 5
The predict is 6 the target value is 5
The predict is 6 the target value is 8
The predict is 3 the target value is 8
The predict is 1 the target value is 8
The predict is 1 the target value is 8
The predict is 1 the target value is 9
The predict is 7 the target value is 9
The accuracy is: 98.84%
Note: the code above was run under Python 2.7.11.
From the results above we can see that KNN classifies quite well. In my view, KNN is simple and crude: it compares the features of unclassified data against the features of already-classified data and takes the label of the most similar samples as its own. But a problem arises: if the new data's features are rare in the sample set, the chance of misclassification is high; conversely, if one class is heavily represented in the sample set, new data is more likely to be assigned to that class. To keep the classification fair, the votes need to be weighted.
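One common weighting scheme, sketched below, gives each neighbor a vote proportional to the inverse of its distance, so a few close samples can outvote many distant ones. The function name, the toy data, and the epsilon constant are my own choices, not part of the original code:

```python
def weighted_knn(train_x, train_y, query, k=3):
    """Distance-weighted kNN vote on plain Python lists (illustrative sketch)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, query)) ** 0.5
             for row in train_x]
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    weights = {}
    for i in nearest:
        # 1e-6 avoids division by zero when a training point coincides with the query
        weights[train_y[i]] = weights.get(train_y[i], 0.0) + 1.0 / (dists[i] + 1e-6)
    # label with the largest accumulated weight wins
    return max(weights, key=weights.get)

train_x = [[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]]
train_y = ['near', 'near', 'far']
print(weighted_knn(train_x, train_y, [0.05, 0.0], k=3))  # prints near
```

Here both 'near' points are so close to the query that their combined weight dwarfs the single distant point, even though all three samples are inside the k = 3 neighborhood.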
Data Source: http://download.csdn.net/download/qq_17046229/7625323
Simple implementation of KNN algorithm