kNN Algorithm: Python Implementation and Simple Digit Recognition

Source: Internet
Author: User
Advantages and disadvantages of the kNN algorithm:

  • Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
  • Disadvantage: high time complexity and high space complexity
  • Applicable data types: numeric and nominal
Algorithm idea:

The kNN (k-Nearest Neighbors) algorithm is very simple: given a training set of labeled samples, find the k training samples closest to the target point, then see which class occurs most often among those k neighbors, and use that majority class as the prediction for the target.
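This neighbor-search-and-vote idea fits in a few lines. Here is a minimal sketch (separate from the full implementation later in the post) that uses NumPy broadcasting for the distances and collections.Counter for the vote:

```python
from collections import Counter

import numpy as np

def knn_predict(x, data, labels, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    # Euclidean distance from x to every training row (broadcasting)
    dists = np.sqrt(((data - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = dists.argsort()[:k]
    # Most common label among those k neighbors
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

data = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_predict(np.array([0.9, 1.0]), data, labels, k=3))  # A
```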

Function notes: library functions (NumPy)
  • tile()

    tile(A, n) repeats A n times; with a tuple of repetition counts, it tiles A along each axis:

```python
>>> import numpy as np
>>> a = np.array([0, 1, 2])
>>> np.tile(a, 2)
array([0, 1, 2, 0, 1, 2])
>>> np.tile(a, (2, 2))
array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])
>>> np.tile(a, (2, 1, 2))
array([[[0, 1, 2, 0, 1, 2]],

       [[0, 1, 2, 0, 1, 2]]])
>>> b = np.array([[1, 2], [3, 4]])
>>> np.tile(b, 2)
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
>>> np.tile(b, (2, 1))
array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])
```
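The classifier below uses tile to copy the input vector once per training row before subtracting. Worth noting as an aside: NumPy broadcasting produces the same difference matrix without the explicit copies, so tile is not strictly required here. A quick sketch of the equivalence (variable names match the classifier's):

```python
import numpy as np

dataSet = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
inputX = np.array([0.0, 0.0])

# tile copies inputX once per training row before subtracting ...
diff_tile = np.tile(inputX, (dataSet.shape[0], 1)) - dataSet
# ... while broadcasting subtracts the rows directly, no copies needed
diff_broadcast = inputX - dataSet

print(np.array_equal(diff_tile, diff_broadcast))  # True
```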
Self-implemented functions

createDataSet() generates a small test array.
kNNclassify(inputX, dataSet, labels, k) is the classification function:

  • inputX: the input vector to classify
  • dataSet: the training set
  • labels: the labels of the training set
  • k: the number of nearest neighbors
```python
# coding=utf-8
from numpy import *

def createDataSet():
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# inputX is the input vector (the point whose class we want to determine)
# dataSet is the training sample matrix
# labels are the training sample labels
# k is the nearest-neighbor parameter: the k closest samples vote
def kNNclassify(inputX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples
    # Compute the Euclidean distance to every training sample
    diffMat = tile(inputX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # sum over each row of the matrix
    distances = sqDistances ** 0.5  # Euclidean distances
    sortedDistance = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistance[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Return the label with the most votes (max by count, not by key)
    res = max(classCount, key=classCount.get)
    return res

def main():
    group, labels = createDataSet()
    # Example input vector (the original passed an empty list, which fails)
    t = kNNclassify([0.0, 0.0], group, labels, 3)
    print(t)  # B

if __name__ == '__main__':
    main()
```
Implementation of the kNN application instance: a handwriting recognition system

Dataset:
Two datasets, training and test. Each sample is a 32*32 pixel image stored as a text file of 0/1 characters, and the classification label is encoded in the file name (the digit before the underscore, e.g. 3_12.txt is a sample of the digit 3).
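To make the format concrete, here is a sketch that writes a synthetic 32x32 text "digit" (a hollow square, purely illustrative, as is the file name digit_sample.txt) and flattens it into a 1x1024 feature vector, the same conversion that img2vec performs in the code below:

```python
import numpy as np

# Build a synthetic 32x32 text "image": a hollow square of 1s (illustrative only)
rows = []
for i in range(32):
    if i in (8, 23):
        rows.append('0' * 8 + '1' * 16 + '0' * 8)
    elif 8 < i < 23:
        rows.append('0' * 8 + '1' + '0' * 14 + '1' + '0' * 8)
    else:
        rows.append('0' * 32)

with open('digit_sample.txt', 'w') as f:
    f.write('\n'.join(rows))

# Flatten the 32x32 characters into a 1x1024 feature vector, as img2vec does
vec = np.zeros((1, 1024))
with open('digit_sample.txt') as f:
    for i in range(32):
        line = f.readline()
        for j in range(32):
            vec[0, 32 * i + j] = int(line[j])

print(vec.shape)       # (1, 1024)
print(int(vec.sum()))  # 60 "on" pixels in this synthetic square
```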
Method:
Plain kNN is used; the only complication is that the distance computation is heavier here (1024 features per sample). Most of the work is in reading the data into vectors, after which the classifier from the previous section can be called directly.
Speed:
Still relatively slow: with roughly 2,000 training samples and 900 test samples, k = 3 takes 32+ seconds on an i5 CPU.
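Much of that time goes into recomputing distances one test sample at a time in a Python loop. As a hedged sketch of one way to speed this up (not part of the original post; names are illustrative), all test-to-training distances can be computed in a single matrix expression, with np.argpartition avoiding a full sort:

```python
import numpy as np

def knn_predict_batch(test, train, train_labels, k=3):
    """Classify every row of `test` at once using vectorized distances."""
    # Squared Euclidean distances between all test/train pairs:
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2   (shape: n_test x n_train)
    d2 = (
        (test ** 2).sum(axis=1)[:, None]
        - 2 * test @ train.T
        + (train ** 2).sum(axis=1)[None, :]
    )
    # argpartition brings the k smallest distances first without a full sort
    nearest = np.argpartition(d2, k, axis=1)[:, :k]
    neighbor_labels = np.asarray(train_labels)[nearest]
    # Majority vote per test row
    preds = []
    for row in neighbor_labels:
        vals, counts = np.unique(row, return_counts=True)
        preds.append(vals[counts.argmax()])
    return np.array(preds)

group = np.array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_predict_batch(np.array([[1.0, 1.0], [0.0, 0.0]]), group, labels))  # ['A' 'B']
```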
```python
# coding=utf-8
from numpy import *
import os
import time

def createDataSet():
    group = array([[1.0, 0.9], [1.0, 1.0], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# inputX is the input vector (the point whose class we want to determine)
# dataSet is the training sample matrix
# labels are the training sample labels
# k is the nearest-neighbor parameter: the k closest samples vote
def kNNclassify(inputX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples
    # Compute the Euclidean distance to every training sample
    diffMat = tile(inputX, (dataSetSize, 1)) - dataSet
    # diffMat = inputX.repeat(dataSetSize, axis=1) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # sum over each row of the matrix
    distances = sqDistances ** 0.5  # Euclidean distances
    sortedDistance = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistance[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Return the label with the most votes
    res = max(classCount, key=classCount.get)
    return res

def img2vec(filename):
    # Flatten one 32x32 text image into a 1x1024 feature vector
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])
    return returnVec

def handwritingClassTest(trainingFolder, testFolder, k):
    hwLabels = []
    trainingFileList = os.listdir(trainingFolder)
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileName = trainingFileList[i]
        fileStr = fileName.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])  # label is the digit before '_'
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vec(trainingFolder + '/' + fileName)
    testFileList = os.listdir(testFolder)
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileName = testFileList[i]
        fileStr = fileName.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vec(testFolder + '/' + fileName)
        classifierResult = kNNclassify(vectorUnderTest, trainingMat, hwLabels, k)
        # print(classifierResult, ' ', classNumStr)
        if classifierResult != classNumStr:
            errorCount += 1
    print('total errors:', errorCount)
    print('error rate:', errorCount / mTest)

def main():
    # time.clock() was removed in Python 3.8; perf_counter() replaces it
    t1 = time.perf_counter()
    handwritingClassTest('trainingdigits', 'testdigits', 3)
    t2 = time.perf_counter()
    print('execution time:', t2 - t1)

if __name__ == '__main__':
    main()
```
From Weizhi note (Wiz)


