kNN algorithm: a Python implementation and a simple digit-recognition example
Advantages and disadvantages of the kNN algorithm:
- Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
- Disadvantages: high time complexity and high space complexity
- Applicable data types: numeric and nominal
Algorithm ideas:
The idea behind the kNN (k-Nearest Neighbors) algorithm is very simple: given a query point, find the k samples in a known training set that are closest to it, then take the majority class among those k neighbors as the predicted class of the query point.
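The voting idea can be sketched with plain Python before any NumPy is involved (the toy points and the helper name `knn_predict` are illustrative, not from the original post):

```python
from collections import Counter
import math

def knn_predict(query, points, labels, k):
    # Sort training indices by Euclidean distance to the query point
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(query, points[i]))
    # Majority vote among the labels of the k nearest points
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

points = [(1.0, 0.9), (1.0, 1.0), (0.1, 0.2), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']
print(knn_predict((0.9, 1.1), points, labels, 3))  # prints: A
```

With k = 3 the query's three nearest neighbors are two 'A' points and one 'B' point, so 'A' wins the vote.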
Library function reference: np.tile
```python
>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
>>> np.tile(a, (2, 2))
array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])
>>> np.tile(a, (2, 1, 2))
array([[[0, 1, 2, 0, 1, 2]],
       [[0, 1, 2, 0, 1, 2]]])
>>> b = np.array([[1, 2], [3, 4]])
>>> np.tile(b, 2)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
>>> np.tile(b, (2, 1))
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])
```
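In the classifier, np.tile is used exactly this way: the query vector is repeated once per training row so the two arrays can be subtracted elementwise, giving one Euclidean distance per training sample. A small sketch (the sample points are made up):

```python
import numpy as np

dataSet = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
inputX = np.array([0.9, 1.1])

# Repeat the query once per training row, subtract, square, sum, sqrt
diffMat = np.tile(inputX, (dataSet.shape[0], 1)) - dataSet
distances = (diffMat ** 2).sum(axis=1) ** 0.5
print(distances)  # one Euclidean distance per training sample
```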
Self-implemented Functions
createDataSet()
Generate Test Array
kNNclassify(inputX, dataSet, labels, k)
Classification function
- inputX: the input vector to classify
- dataSet: the training set
- labels: labels of the training samples
- k: number of nearest neighbors to use
```python
# coding: utf-8
from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# inputX: the input vector (the point whose class we want to determine)
# dataSet: the training samples
# labels: the training sample labels
# k: the nearest-neighbor parameter; the k closest samples are used
def kNNclassify(inputX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples
    # Compute the Euclidean distance to every training sample
    diffMat = tile(inputX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # sum over each row of the matrix
    distances = sqDistances ** 0.5  # the Euclidean distances
    sortedDistance = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistance[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # The class with the most votes among the k nearest neighbors
    res = max(classCount, key=classCount.get)
    return res

def main():
    group, labels = createDataSet()
    t = kNNclassify([0.9, 1.1], group, labels, 3)  # sample query point
    print(t)

if __name__ == '__main__':
    main()
```
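As an aside, the tile call is not strictly necessary: NumPy broadcasting subtracts the query from every row directly. This sketch (not part of the original post) shows the two approaches give identical distances:

```python
import numpy as np

dataSet = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
inputX = np.array([0.9, 1.1])

# Broadcasting subtracts inputX from every row without materializing copies
distances = np.sqrt(((dataSet - inputX) ** 2).sum(axis=1))

# Identical to the tile-based version
tiled = np.tile(inputX, (dataSet.shape[0], 1)) - dataSet
assert np.allclose(distances, np.sqrt((tiled ** 2).sum(axis=1)))
```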
kNN application instance: a handwriting recognition system
Dataset:
Two datasets, training and test. The classification label is encoded in each file name. Every digit is a 32*32 grid of pixels stored as text, one character per pixel.
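Since no sample file survives here, the sketch below fabricates one such 32*32 text digit (the file name `0_13.txt` and its contents are invented for illustration) and flattens it into a 1x1024 feature vector, which is what the conversion step in the program does:

```python
import numpy as np
import os, tempfile

# Build a synthetic 32x32 digit file: zeros with a 16x8 block of ones,
# mimicking the "0_13.txt"-style files (the name encodes the label).
rows = []
for i in range(32):
    row = ['1' if 8 <= i < 24 and 12 <= j < 20 else '0' for j in range(32)]
    rows.append(''.join(row))

path = os.path.join(tempfile.mkdtemp(), '0_13.txt')
with open(path, 'w') as f:
    f.write('\n'.join(rows))

# Flatten the 32x32 grid of characters into a 1x1024 feature vector
vec = np.zeros((1, 1024))
with open(path) as f:
    for i in range(32):
        line = f.readline()
        for j in range(32):
            vec[0, 32 * i + j] = int(line[j])

print(vec.shape, int(vec.sum()))  # (1, 1024) with 16*8 = 128 ones
```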
Method:
kNN is used, but the distance computation is heavier here (each sample has 1024 features). The main work is reading the data into vectors; once that is done, the classifier from above can be called directly.
Speed:
Still fairly slow. The dataset here has 2000+ training samples and 900+ test samples; on an i5 CPU with k = 3, a full run takes 32+ seconds.
```python
# coding: utf-8
from numpy import *
import operator
import os
import time

# inputX: the input vector (the point whose class we want to determine)
# dataSet: the training samples
# labels: the training sample labels
# k: the nearest-neighbor parameter; the k closest samples are used
def kNNclassify(inputX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples
    # Compute the Euclidean distance to every training sample
    diffMat = tile(inputX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # sum over each row of the matrix
    distances = sqDistances ** 0.5  # the Euclidean distances
    sortedDistance = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistance[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    res = max(classCount, key=classCount.get)
    return res

def img2vec(filename):
    # Convert one 32x32 text image into a 1x1024 feature vector
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])
    return returnVec

def handwritingClassTest(trainingFolder, testFolder, k):
    hwLabels = []
    trainingFileList = os.listdir(trainingFolder)
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileName = trainingFileList[i]
        fileStr = fileName.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])  # label is encoded in the file name
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vec(trainingFolder + '/' + fileName)
    testFileList = os.listdir(testFolder)
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileName = testFileList[i]
        fileStr = fileName.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vec(testFolder + '/' + fileName)
        classifierResult = kNNclassify(vectorUnderTest, trainingMat, hwLabels, k)
        # print(classifierResult, ' ', classNumStr)
        if classifierResult != classNumStr:
            errorCount += 1
    print('total errors:', errorCount)
    print('error rate:', errorCount / mTest)

def main():
    t1 = time.time()
    handwritingClassTest('trainingdigits', 'testdigits', 3)
    t2 = time.time()
    print('execute', t2 - t1)

if __name__ == '__main__':
    main()
```
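Most of those 32 seconds go into the per-sample Python loop. A common speed-up (not in the original post) is to vectorize further: compute every test-to-training distance in a single matrix expression instead of calling the classifier once per test sample. A sketch with made-up cluster data:

```python
import numpy as np

def knn_predict_all(testMat, trainMat, trainLabels, k):
    # Pairwise squared distances via (a - b)^2 = a^2 - 2ab + b^2,
    # computed for every test/train pair in one shot
    d2 = ((testMat ** 2).sum(axis=1)[:, None]
          - 2 * testMat @ trainMat.T
          + (trainMat ** 2).sum(axis=1)[None, :])
    nearest = np.argsort(d2, axis=1)[:, :k]  # k nearest indices per test row
    labels = np.asarray(trainLabels)[nearest]
    # Majority vote along each row (labels must be non-negative ints,
    # which holds for the digit labels 0-9)
    return [int(np.bincount(row).argmax()) for row in labels]

# Tiny synthetic check: two well-separated clusters labeled 0 and 1
train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
test = np.array([[0.05, 0.1], [5.05, 4.9]])
print(knn_predict_all(test, train, [0, 0, 1, 1], 3))  # prints: [0, 1]
```

With 2000+ training and 900+ test vectors of length 1024, this replaces roughly two million Python-level loop iterations with a few BLAS-backed matrix operations.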