kNN algorithm: Python implementation and simple digit recognition
Advantages and disadvantages of the kNN algorithm. Advantages: high accuracy, insensitive to outliers, and no assumptions about the input data. Disadvantages: both the time complexity and the space complexity are high. Applicable data range: numeric and nominal values.

Idea of the algorithm: kNN (k-Nearest Neighbor) is very simple. Simply put, it rests on the idea that similar things cluster together: given a target sample, find the k samples in the known training set that are closest to it, look at which class is most common among those k neighbors, and use that class as the classification result.

Function parsing: the library function tile(). tile(A, n) repeats A n times:

>>> import numpy as np
>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
>>> np.tile(a, (2, 2))
array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])
>>> np.tile(a, (2, 1, 2))
array([[[0, 1, 2, 0, 1, 2]],
       [[0, 1, 2, 0, 1, 2]]])
>>> B = np.array([[1, 2], [3, 4]])
>>> np.tile(B, 2)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
>>> np.tile(B, (2, 1))
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])

Self-implemented functions: createDataSet() generates a small test dataset; kNNclassify(inputX, dataSet, labels, k) is the classification function, where inputX is the input vector, dataSet is the training set, labels are the training set labels, and k is the number of nearest neighbors.

# coding=utf-8
from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# inputX: the input vector (the sample we want to classify)
# dataSet: the training samples
# labels: the training sample labels
# k: the nearest-neighbor parameter, i.e. how many neighbors to consider
def kNNclassify(inputX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples
    # compute the Euclidean distances
    diffMat = tile(inputX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # sum over each row of the matrix
    distances = sqDistances ** 0.5       # the Euclidean distances
    sortedDistance = distances.argsort()
    classCount = {}
    for i in xrange(k):
        voteLabel = labels[sortedDistance[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    res = max(classCount, key=classCount.get)  # the label with the most votes
    return res

def main():
    group, labels = createDataSet()
    t = kNNclassify([0.2, 0.3], group, labels, 3)  # example query point (the original input was left blank)
    print t

if __name__ == '__main__':
    main()

kNN application instance: a handwriting recognition system.

Dataset: two datasets, training and test. The classification label is encoded in the file name, and each sample is a 32*32 pixel image stored as a text file.

Method: the same kNN classifier is used, but the distance computation is heavier (1024 features per sample). The main problem to solve is how to read the data into vectors; once that is done, kNNclassify can be called directly for the comparison.

Speed: the speed is still relatively slow.
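Because each sample has 1024 features, the distance computation dominates the running time. One possible way to shave a little time and memory (not part of the original code, just a sketch) is to rely on numpy broadcasting instead of tile(), which avoids materializing the tiled copy of the input; the helper name fastClassify below is illustrative only.

# Not from the original post: a sketch of the same per-query classification
# using numpy broadcasting instead of tile(). fastClassify is an illustrative name.
from numpy import asarray

def fastClassify(inputX, dataSet, labels, k):
    diff = asarray(dataSet) - asarray(inputX)   # broadcasting replaces tile()
    distances = (diff ** 2).sum(axis=1) ** 0.5  # Euclidean distance to every training sample
    classCount = {}
    for i in distances.argsort()[:k]:           # indices of the k nearest neighbors
        voteLabel = labels[i]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    return max(classCount, key=classCount.get)  # majority vote among the k neighbors

With the toy dataset from createDataSet(), fastClassify([0.2, 0.3], group, labels, 3) should return the same label as kNNclassify.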
Here the dataset is about 2,000 training samples and 900+ test samples; on an i5 CPU with k = 3 the run takes 32 s+.

# coding=utf-8
from numpy import *
import operator
import os
import time

def createDataSet():
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# inputX: the input vector (the sample we want to classify)
# dataSet: the training samples
# labels: the training sample labels
# k: the nearest-neighbor parameter, i.e. how many neighbors to consider
def kNNclassify(inputX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples
    # compute the Euclidean distances
    diffMat = tile(inputX, (dataSetSize, 1)) - dataSet
    # alternative: diffMat = inputX.repeat(dataSetSize, axis=0) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # sum over each row of the matrix
    distances = sqDistances ** 0.5       # the Euclidean distances
    sortedDistance = distances.argsort()
    classCount = {}
    for i in xrange(k):
        voteLabel = labels[sortedDistance[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    res = max(classCount, key=classCount.get)  # the label with the most votes
    return res

# read one 32*32 text image into a 1*1024 vector
def img2vec(filename):
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])
    return returnVec

def handwritingClassTest(trainingFloder, testFloder, k):
    hwLabels = []
    trainingFileList = os.listdir(trainingFloder)
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileName = trainingFileList[i]
        fileStr = fileName.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])  # the label is the part of the file name before '_'
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vec(trainingFloder + '/' + fileName)
    testFileList = os.listdir(testFloder)
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileName = testFileList[i]
        fileStr = fileName.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vec(testFloder + '/' + fileName)
        classifierResult = kNNclassify(vectorUnderTest, trainingMat, hwLabels, k)
        # print classifierResult, ' ', classNumStr
        if classifierResult != classNumStr:
            errorCount += 1
    print 'total error', errorCount
    print 'error rate', errorCount / mTest

def main():
    t1 = time.clock()
    handwritingClassTest('trainingdigits', 'testdigits', 3)
    t2 = time.clock()
    print 'execute', t2 - t1

if __name__ == '__main__':
    main()
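To run the script, the two folders passed to handwritingClassTest() ('trainingdigits' and 'testdigits' here) should sit next to it and contain one text file per digit image, named so that the true label comes before the underscore (the code recovers it with fileStr.split('_')[0]). A quick sanity check of img2vec is sketched below; the file name 3_107.txt is only a hypothetical example, not a file guaranteed to exist in your copy of the dataset.

# Hypothetical usage sketch: check that one digit file is read correctly.
# '3_107.txt' is an example name; substitute any file actually present in testdigits/.
vec = img2vec('testdigits/3_107.txt')
print vec.shape        # should be (1, 1024)
print int(vec.sum())   # number of non-zero pixels in the 32*32 image

The full evaluation is then handwritingClassTest('trainingdigits', 'testdigits', 3), exactly as main() does.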