Building a handwriting recognition system with the k-nearest neighbor algorithm

Source: Internet
Author: User

For simplicity, the system built here recognizes only the digits 0 to 9. The digits to be recognized have already been processed with image-editing software into a uniform color and size: black-and-white images 32 pixels wide and 32 pixels high. Storing images as text does not use memory efficiently, but we convert them to text format because it is easier to understand.
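To make the text format concrete, here is a minimal sketch of how a 32*32 black-and-white image could be written out as 32 lines of 32 characters. The image below is made up for illustration (a thick vertical stroke), and the variable names are assumptions, not part of the original dataset tooling.

```python
import numpy as np

# Hypothetical 32x32 black-and-white image: 1 = ink, 0 = background.
image = np.zeros((32, 32), dtype=int)
image[4:28, 14:18] = 1  # a thick vertical stroke, roughly a "1"

# One text line per image row, one character per pixel.
lines = [''.join(str(pixel) for pixel in row) for row in image]
text = '\n'.join(lines)

print(lines[0])  # a background-only row
print(lines[4])  # a row that crosses the stroke
```

Each pixel takes a full character here instead of a single bit, which is why the format is wasteful but easy to read in any text editor.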

---1. Collecting data: Providing text files

The data is a modified version of the "Optical Recognition of Handwritten Digits" dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml, October 3, 2010).

---2. Preparing data: Converting an image to a test vector

The trainingdigits directory contains about 2000 examples, roughly 200 for each digit, and the testdigits directory contains about 900 test examples. The two sets do not overlap.

First we format each image as a vector: a 32*32 binary image matrix becomes a 1*1024 vector.

We write the function img2vector to do the conversion. It creates a 1*1024 NumPy array, opens the specified file, loops over the first 32 lines of the file, stores the first 32 character values of each line in the array, and returns the array.
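The 32*32 to 1*1024 conversion is a row-by-row flattening. The following sketch, which uses a made-up image rather than a file from the dataset, shows that pixel (i, j) ends up at column 32*i+j of the vector:

```python
import numpy as np

# Hypothetical 32x32 binary image: a filled square in the middle.
image = np.zeros((32, 32))
image[8:24, 8:24] = 1

# Flatten row by row into a 1x1024 vector.
vector = image.reshape(1, 1024)

# Pixel (i, j) of the image lands at column 32*i + j of the vector.
i, j = 10, 12
print(vector.shape)
print(vector[0, 32 * i + j] == image[i, j])
```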

#!/usr/bin/python
# -*- coding: utf-8 -*-
from numpy import *       # NumPy, the scientific computing package
from os import listdir
import operator           # standard-library operator module


# Core of the algorithm
# inx: the input vector to classify
# dataset: the training sample set
# labels: the label vector
# k: the number of nearest neighbors that vote
def classifyo(inx, dataset, labels, k):
    # Compute distances
    datasetsize = dataset.shape[0]    # number of rows, i.e. the number of training samples
    # tile is a NumPy function; here it repeats inx once per training sample,
    # so diffmat holds the difference between the input and every training sample
    diffmat = tile(inx, (datasetsize, 1)) - dataset
    sqdiffmat = diffmat ** 2          # square each element
    sqdistances = sqdiffmat.sum(axis=1)
    distances = sqdistances ** 0.5    # square root gives the distance
    sorteddistindicies = distances.argsort()    # indices in ascending order of distance
    # Vote among the k closest points
    classcount = {}
    for i in range(k):
        voteilabel = labels[sorteddistindicies[i]]
        classcount[voteilabel] = classcount.get(voteilabel, 0) + 1
    # Sort by vote count, highest first
    sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedclasscount[0][0]


def img2vector(filename):
    returnvect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            linestr = fr.readline()
            for j in range(32):
                returnvect[0, 32 * i + j] = int(linestr[j])
    return returnvect
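The distance step can also be written with NumPy broadcasting instead of tile, and the vote with collections.Counter. This is only a sketch of an equivalent formulation, not the article's code: the function name knn_classify and the four-point training set are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(inx, dataset, labels, k):
    """Hypothetical variant of classifyo using broadcasting and Counter."""
    # Broadcasting subtracts inx from every row of dataset without tile.
    distances = np.sqrt(((dataset - inx) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]          # indices of the k closest samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]          # label with the most votes

# Tiny made-up training set: two points near (1, 1), two near (0, 0).
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
group_labels = ['A', 'A', 'B', 'B']

print(knn_classify(np.array([0.0, 0.2]), group, group_labels, 3))  # B
```

For the query (0, 0.2) the three nearest points are the two labeled B and one labeled A, so the majority vote returns B.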

Test the img2vector function by entering the following commands at the Python prompt, then compare the result with the same file opened in a text editor:

>>> import knn
>>> testvector = knn.img2vector('digits/testdigits/0_13.txt')  # adjust the path to your own directory
>>> testvector[0, 0:31]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.])
>>> testvector[0, 32:63]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.])

---3. Test algorithm: Using K-Nearest neighbor algorithm to recognize handwritten numerals

The data is now in a format the classifier can handle, so we can feed it to the classifier and check the results. The function handwritingclasstest() tests the classifier; add it to the knn.py file. Before adding it, make sure the line from os import listdir appears at the top of the file. That line imports the listdir function from the os module, which lists the file names in a given directory.

def handwritingclasstest():
    hwlabels = []
    # Adjust the directories below to your own layout. The relative paths passed
    # to img2vector assume the script runs from the folder that contains digits/.
    trainingfilelist = listdir('E:\\python excise\\digits\\trainingdigits')
    m = len(trainingfilelist)
    trainingmat = zeros((m, 1024))
    for i in range(m):
        filenamestr = trainingfilelist[i]
        filestr = filenamestr.split('.')[0]         # drop the .txt extension
        classnumstr = int(filestr.split('_')[0])    # the digit encoded in the file name
        hwlabels.append(classnumstr)
        trainingmat[i, :] = img2vector('digits/trainingdigits/%s' % filenamestr)
    testfilelist = listdir('E:/python excise/digits/testdigits')
    errorcount = 0.0
    mtest = len(testfilelist)
    for i in range(mtest):
        filenamestr = testfilelist[i]
        filestr = filenamestr.split('.')[0]
        classnumstr = int(filestr.split('_')[0])
        vectorundertest = img2vector('digits/testdigits/%s' % filenamestr)
        classifierresult = classifyo(vectorundertest, trainingmat, hwlabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierresult, classnumstr))
        if classifierresult != classnumstr:
            errorcount += 1.0
    print("\nthe total number of the error is: %d" % errorcount)
    print("\nthe total error rate is: %f" % (errorcount / float(mtest)))

Explanation: the code stores the names of the files in the E:\python excise\digits\trainingdigits directory in the list trainingfilelist, then stores the number of files in the variable m. Next, it creates a training matrix with m rows and 1024 columns, where each row holds one image. The class of each image can be parsed from its file name, because the files are named according to a fixed rule: for example, the file 9_45.txt belongs to class 9 and is the 45th instance of the digit 9. The class is stored in the hwlabels vector, and the img2vector function described earlier loads the image.
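The file-name rule can be checked in isolation. In this small sketch, parse_label is a hypothetical helper that is not part of knn.py; it splits a name such as 9_45.txt into the digit and the instance number:

```python
import os

def parse_label(filename):
    """Hypothetical helper: return (digit, instance) from a name like '9_45.txt'."""
    stem = os.path.splitext(filename)[0]    # '9_45'
    digit, instance = stem.split('_')       # '9', '45'
    return int(digit), int(instance)

print(parse_label('9_45.txt'))  # (9, 45)
```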

Next, the code performs similar operations on the files in the E:/python excise/digits/testdigits directory. Instead of loading these files into a matrix, it uses the classifyo() function to classify each file in turn. Since every value in the files is already either 0 or 1, no normalization is necessary.

At the Python prompt, enter knn.handwritingclasstest() to see the output of the function. Depending on the speed of the machine, loading the dataset may take a long time; the function then tests each file in turn:

>>> knn.handwritingclasstest()
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
...
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
...
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9

the total number of the error is: 11

the total error rate is: 0.011628

Summary
The k-nearest neighbor algorithm recognizes the handwritten digits dataset with an error rate of about 1.2%. The error rate can be affected by changing the value of k, by modifying handwritingclasstest to select training samples at random, and by changing the number of training samples.
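A rough way to explore these effects without the digit files is sketched below. Everything in it is an assumption made for illustration: the data is synthetic (three random 1024-bit prototypes with 40% of the bits flipped per sample), the helper names are made up, and np.random.default_rng requires NumPy 1.17 or later. The error rates it prints say nothing about the real dataset.

```python
import numpy as np

def knn_predict(x, train_x, train_y, k):
    # Euclidean distance from x to every training row, then a majority vote.
    distances = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    return np.bincount(train_y[nearest]).argmax()

def error_rate(train_x, train_y, test_x, test_y, k):
    errors = sum(knn_predict(x, train_x, train_y, k) != y
                 for x, y in zip(test_x, test_y))
    return errors / float(len(test_y))

rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(3, 1024))   # three made-up classes

def make_samples(n):
    # Pick a class per sample, then flip 40% of its prototype's bits.
    y = rng.integers(0, 3, size=n)
    flip = rng.random((n, 1024)) < 0.4
    x = np.where(flip, 1 - prototypes[y], prototypes[y])
    return x.astype(float), y

train_x, train_y = make_samples(150)
test_x, test_y = make_samples(50)

# Vary k, and vary the number of training samples used.
for k in (1, 3, 5, 7):
    for n_train in (50, 150):
        rate = error_rate(train_x[:n_train], train_y[:n_train], test_x, test_y, k)
        print("k=%d, training samples=%d, error rate=%f" % (k, n_train, rate))
```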

 
