KNN algorithm python implementation and simple digital recognition

Source: Internet
Author: User

Advantages and disadvantages of the kNN algorithm:

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high time complexity and high space complexity.
Applicable data range: numeric and nominal values.

Idea of the algorithm: kNN (k-Nearest Neighbor) is very simple. Given a set of known training samples, find the k samples closest to the target point, then see which class is most common among those k neighbors and use that class as the prediction.

Function notes: the NumPy library function tile(A, reps) repeats A. For example:

a = np.array([0, 1, 2])
np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
np.tile(a, (2, 2))
array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])
np.tile(a, (2, 1, 2))
array([[[0, 1, 2, 0, 1, 2]],
       [[0, 1, 2, 0, 1, 2]]])
B = np.array([[1, 2], [3, 4]])
np.tile(B, 2)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
np.tile(B, (2, 1))
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])

Self-implemented functions: createDataSet() generates a small test dataset; kNNclassify(inputX, dataSet, labels, k) is the classification function, where inputX is the input vector, dataSet the training set, labels the training labels, and k the number of nearest neighbors.

# coding=utf-8
from numpy import *

def createDataSet():
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# inputX: the input vector (the point whose class we want to determine)
# dataSet: the training samples
# labels: the training sample labels
# k: the nearest-neighbor parameter; the k closest samples are used
def kNNclassify(inputX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]              # number of training samples
    # compute the Euclidean distances
    diffMat = tile(inputX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)         # sum over each row of the matrix
    distances = sqDistances ** 0.5              # the Euclidean distances
    sortedDistance = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistance[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    res = max(classCount, key=classCount.get)   # majority vote among the k neighbors
    return res

def main():
    group, labels = createDataSet()
    t = kNNclassify([0.0, 0.0], group, labels, 3)   # example query point
    print(t)

if __name__ == '__main__':
    main()

kNN application instance: a handwriting recognition system.

Dataset: two directories, training and test; the classification label is encoded in the file name. Each image is 32*32 pixels, stored as text. The data looks like rows of 0s and 1s.

Method: kNN again, but the distance computation is heavier (1024 features per sample); the main problem to solve is how to read the data into vectors. For comparison, a library implementation of kNN can also be called directly on the same data.

Speed: still relatively slow. With a dataset of 2000+ training samples and 900+ test samples, k = 3 takes 32+ seconds (i5 CPU).

# coding=utf-8
from numpy import *
import os
import time

def createDataSet():
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# inputX: the input vector (the point whose class we want to determine)
# dataSet: the training samples
# labels: the training sample labels
# k: the nearest-neighbor parameter; the k closest samples are used
def kNNclassify(inputX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]              # number of training samples
    # compute the Euclidean distances
    diffMat = tile(inputX, (dataSetSize, 1)) - dataSet
    # diffMat = inputX.repeat(dataSetSize, axis=0) - dataSet   # alternative
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)         # sum over each row of the matrix
    distances = sqDistances ** 0.5              # the Euclidean distances
    sortedDistance = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistance[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    res = max(classCount, key=classCount.get)   # majority vote among the k neighbors
    return res

# read one 32x32 text image into a 1x1024 vector
def img2vec(filename):
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])
    fr.close()
    return returnVec

def handwritingClassTest(trainingFolder, testFolder, k):
    hwLabels = []
    trainingFileList = os.listdir(trainingFolder)
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileName = trainingFileList[i]
        fileStr = fileName.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])   # label is encoded in the file name
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vec(trainingFolder + '/' + fileName)
    testFileList = os.listdir(testFolder)
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileName = testFileList[i]
        fileStr = fileName.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vec(testFolder + '/' + fileName)
        classifierResult = kNNclassify(vectorUnderTest, trainingMat, hwLabels, k)
        # print(classifierResult, ' ', classNumStr)
        if classifierResult != classNumStr:
            errorCount += 1
    print('total errors', errorCount)
    print('error rate', errorCount / mTest)

def main():
    t1 = time.time()
    handwritingClassTest('trainingdigits', 'testdigits', 3)
    t2 = time.time()
    print('execute', t2 - t1)

if __name__ == '__main__':
    main()
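One detail worth checking in the voting step of the classifier above: on a plain dict of vote counts, max(classCount) compares the keys, not the counts, so it returns the lexicographically largest label rather than the most common one. The majority vote needs a key= function (or collections.Counter). A quick standalone check:

```python
from collections import Counter

votes = ['B', 'A', 'A']           # labels of the 3 nearest neighbors
count = Counter(votes)            # {'A': 2, 'B': 1}

print(max(count))                 # -> B  (largest key: wrong answer)
print(max(count, key=count.get))  # -> A  (most common label: correct)
```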

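On the speed problem mentioned above: the classifier tiles the query vector and loops in Python once per test sample, which is costly with 1024 features. A minimal sketch of a fully vectorized alternative using NumPy broadcasting; knn_classify_batch is a name introduced here for illustration and is not part of the original code:

```python
import numpy as np

def knn_classify_batch(queries, train, labels, k):
    """Majority-vote kNN for many query points at once.

    All pairwise squared distances are computed in one broadcasted
    expression instead of tiling each query inside a Python loop.
    """
    labels = np.asarray(labels)
    # (nq, 1, d) - (1, nt, d) broadcasts to (nq, nt, d); summing over d
    # yields the (nq, nt) squared-Euclidean distance matrix.
    sq_dists = ((queries[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    # indices of the k nearest training samples for each query
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    results = []
    for idx in nearest:
        values, counts = np.unique(labels[idx], return_counts=True)
        results.append(str(values[np.argmax(counts)]))  # most common label
    return results

# usage with the same toy dataset as createDataSet()
train = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
queries = np.array([[0.9, 1.0], [0.05, 0.15]])
print(knn_classify_batch(queries, train, labels, 3))  # -> ['A', 'B']
```

The square root can be skipped entirely, since sorting squared distances gives the same neighbor ordering; together with the removed per-sample tile, this moves most of the work out of Python-level loops.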