Contents
1. Application Introduction
1.1 Introduction to the experimental environment
1.2 Application Background Introduction
2. Data sources and preprocessing
2.1 Data sources and formats
2.2 Data preprocessing
3. Algorithm design and implementation
3.1 Handwriting recognition system algorithm implementation process
3.2 Implementation of K nearest neighbor algorithm
3.3 Handwriting recognition system implementation
3.4 Algorithm Improvement and optimization
4. System operation process and results display
1. Application Introduction
1.1 Introduction to the experimental environment
This experiment is implemented mainly in Python (version 2.7) and uses the NumPy library for numerical computation and processing.
1.2 Application Background Introduction
This experiment implements a simple handwriting recognition system: given a handwritten digit image supplied by the user, the system identifies which digit it shows. Each input is a handwritten digit represented as a grid of 0 and 1 characters. Recognition is done with the K nearest neighbor algorithm, which is simple in design, easy to implement, and classifies well on this particular problem. This experiment therefore uses K nearest neighbor to classify and recognize handwritten characters, which is also a process of mining feature data from handwritten images.
2. Data sources and preprocessing
2.1 Data sources and formats
The dataset is a modified version of the "Optical Recognition of Handwritten Digits" dataset, published in the UCI database on October 3, 2010 at HTTP://ARCHIVE.ICS.UCI.EDU/ML.
For simplicity, the system built here recognizes only the digits 0 through 9. The digits to be recognized have already been processed with graphics software to the same color and size: each is a black-and-white image 32 pixels wide and 32 pixels high. Although storing images as text does not use memory efficiently, we convert the images to text format for ease of understanding.
Figure: Data Source: Handwriting data format
2.2 Data preprocessing
First, each image is converted to a vector. The experiment uses about 2000 training examples, roughly 200 samples per digit, in the format shown in the previous figure; the directory testdigits contains about 900 test examples. We use the data in the trainingdigits directory to train the classifier and the data in the testdigits directory to test it. The two sets of data do not overlap. Part of the data is shown below:
Figure: Data directory
The image is formatted as a vector: each 32*32 binary image matrix is converted to a 1*1024 vector so that the K nearest neighbor classifier can process the digit image information.
First, write a function img2vector that converts an image to a vector: the function creates a 1*1024 NumPy array, opens the given file, loops over the first 32 lines of the file, stores the first 32 character values of each line in the NumPy array, and returns the array.
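The conversion described above can be sketched as follows. This is a minimal illustration rather than the original program: it assumes each text file holds at least 32 lines of at least 32 characters, each of them 0 or 1, and it is written to run on Python 2.7 as well as Python 3.

```python
import numpy as np


def img2vector(filename):
    """Read a 32x32 text image of 0/1 characters into a 1x1024 NumPy array."""
    vect = np.zeros((1, 1024))
    with open(filename) as f:
        for i in range(32):          # first 32 lines of the file
            line = f.readline()
            for j in range(32):      # first 32 characters of each line
                vect[0, 32 * i + j] = int(line[j])
    return vect
```

Row i of the image ends up in positions 32*i through 32*i + 31 of the vector, so the pixel layout is preserved and can be recovered by reshaping to 32*32.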
3. Algorithm design and implementation
3.1 Handwriting recognition system algorithm implementation process
(1) Collect data: Provide text files.
(2) Prepare the data: Write a function that converts the image format to the list format used by the classifier classify0().
(3) Analyze data: Check the data at the Python command prompt to make sure it meets the requirements.
(4) Test the algorithm: Write a function that uses part of the provided dataset as test samples. The difference between a test sample and a non-test sample is that the test sample has already been classified; if the predicted class differs from the actual class, it is counted as an error.
3.2 Implementation of K nearest neighbor algorithm
The implementation process of the K nearest neighbor algorithm is:
(1) Calculate the distance between each point in the data set of known categories and the current point;
(2) Sort the points in ascending order of distance;
(3) Select the K points with the smallest distance from the current point;
(4) Determine how often each category occurs among these K points;
(5) Return the category that occurs most often among these K points as the predicted classification of the current point.
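The five steps above can be sketched in Python as follows. This is an illustrative version under stated assumptions, not necessarily the exact code used in the experiment: the parameter spellings inX and dataSet and the dictionary name classCount are assumed, and dataSet is assumed to be a NumPy matrix with one sample per row.

```python
import operator

import numpy as np


def classify0(inX, dataSet, labels, k):
    """Return the majority label among the k rows of dataSet nearest to inX."""
    # (1) Euclidean distance from inX to every row of dataSet
    diff = dataSet - inX
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # (2) indices of the rows, sorted by ascending distance
    sortedIndices = distances.argsort()
    # (3) + (4) count the labels of the k nearest rows
    classCount = {}
    for i in range(k):
        label = labels[sortedIndices[i]]
        classCount[label] = classCount.get(label, 0) + 1
    # (5) sort the (label, count) tuples by count, largest first
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```

As a small usage example with made-up 2-D points, calling classify0 on np.array([0, 0]) with the four training rows [1, 1.1], [1, 1], [0, 0], [0, 0.1], labels A, A, B, B and k = 3 returns B, because two of the three nearest rows carry that label.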
In this system, the K nearest neighbor algorithm is implemented in the classify0 function.
The classify0() function has 4 input parameters: the input vector to classify is inX, the input training sample set is dataSet, the label vector is labels, and the last parameter, k, is the number of nearest neighbors to use. The number of elements in the label vector is the same as the number of rows in the matrix dataSet. The program uses the Euclidean distance formula to calculate the distance between two vector points xa and xb: d = sqrt((xa_1 - xb_1)^2 + (xa_2 - xb_2)^2 + ... + (xa_n - xb_n)^2).
After all the distances have been calculated, the data can be sorted in ascending order of distance. Then the k elements with the smallest distances are taken and their majority class is determined; the input k is always a positive integer. Finally, the classCount dictionary is broken into a list of tuples, and the tuples are sorted by their second element using the itemgetter method of the operator module, which is imported at the top of the program. This sort is in reverse order, that is, from largest to smallest, and the label of the most frequently occurring element is returned.
3.3 Handwriting recognition system implementation
The system calls img2vector for data preprocessing, then calls classify0 for classification, and finally outputs the classification results for the test set and calculates the error rate. The implementation imports listdir from the os module to list the file names in a given directory.
The file names in the trainingdigits directory are stored in a list. The number of files in the directory is then obtained and stored in the variable m. Next, the code creates a training matrix with m rows and 1024 columns, in which each row stores one image. The class digit can be parsed from the file name, because files in this directory are named according to a rule: for example, the file 9_45.txt has class 9 and is the 45th instance of the digit 9.
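The loading and testing flow described in this section can be sketched roughly as below. The helper names label_from_filename, load_digit_dir, and error_rate are hypothetical and chosen only for illustration; the classifier is passed in as a function with the same four parameters as classify0, and every file is assumed to follow the digit_index.txt naming rule and the 32*32 text format.

```python
from os import listdir
from os.path import join

import numpy as np


def label_from_filename(name):
    """'9_45.txt' -> 9: the class is the number before the underscore."""
    return int(name.split(".")[0].split("_")[0])


def load_digit_dir(dirname):
    """Return an m x 1024 matrix and a list of m labels for one directory."""
    names = sorted(listdir(dirname))
    m = len(names)
    mat = np.zeros((m, 1024))
    labels = []
    for i, name in enumerate(names):
        labels.append(label_from_filename(name))
        with open(join(dirname, name)) as f:
            rows = [f.readline()[:32] for _ in range(32)]
        # Flatten the 32 rows of 32 characters into one row of the matrix
        mat[i, :] = [int(ch) for row in rows for ch in row]
    return mat, labels


def error_rate(classify, train_dir, test_dir, k=3):
    """Fraction of test files whose predicted class differs from the true one."""
    train_mat, train_labels = load_digit_dir(train_dir)
    test_mat, test_labels = load_digit_dir(test_dir)
    errors = 0
    for vec, actual in zip(test_mat, test_labels):
        if classify(vec, train_mat, train_labels, k) != actual:
            errors += 1
    return errors / float(len(test_labels))
```

With a four-parameter classifier such as classify0 in scope, a call like error_rate(classify0, "trainingdigits", "testdigits", k=3) would return the error rate as a fraction between 0 and 1.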
3.4 Algorithm improvement and optimization
Changing the value of the variable k, modifying the function handwritingclasstest to select training samples at random, and changing the number of training samples all affect the error rate of the K nearest neighbor algorithm.
In practice, the algorithm is not efficient. For each test vector it must perform about 2000 distance calculations, each involving floating-point operations over 1024 dimensions, and this is repeated for all of the roughly 900 test vectors; we also need to prepare 2 MB of storage for the test vectors. Other algorithms can therefore be used to improve and optimize on this.
4. System operation process and results display
This experiment uses Python's built-in IDE to edit the code and then calls the corresponding function directly to run it. The program input is the trainingdigits and testdigits directories of the handwriting dataset. The output is the recognition result and the correct result for each test sample, followed by the final recognition error rate.
(1) With the K value set to 3, the run and its results are as follows:
The final recognition error rate is 1.2%.
(2) With the K value changed to 6, the run and its results are as follows:
The final error rate is 2%; that is, the error rate rose when the K value was increased from 3 to 6.
(3) With the K value changed to 2, the results are as follows:
The final error rate is 1.4%, which is also higher than with a K value of 3.
Therefore, selecting an appropriate K value can effectively reduce the error and improve the recognition accuracy.
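One way to compare K values systematically is to compute the error rate for each candidate on the same held-out test set and keep the smallest. The sketch below is only an illustration of that idea: the function names knn_predict and error_rate_by_k are hypothetical, and the data is a synthetic stand-in (two noisy 2-D clusters), not the digit dataset, so its numbers say nothing about the results reported above.

```python
import numpy as np


def knn_predict(x, data, labels, k):
    """Majority label among the k rows of data closest to x (Euclidean)."""
    order = np.sqrt(((data - x) ** 2).sum(axis=1)).argsort()[:k]
    votes = {}
    for i in order:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)


def error_rate_by_k(train_x, train_y, test_x, test_y, ks):
    """Map each candidate k to its error rate on the held-out test set."""
    rates = {}
    for k in ks:
        wrong = sum(1 for x, y in zip(test_x, test_y)
                    if knn_predict(x, train_x, train_y, k) != y)
        rates[k] = wrong / float(len(test_y))
    return rates


# Synthetic stand-in data (NOT the digit dataset): two noisy 2-D clusters.
rng = np.random.RandomState(0)
train_x = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
train_y = [0] * 50 + [1] * 50
test_x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
test_y = [0] * 20 + [1] * 20

rates = error_rate_by_k(train_x, train_y, test_x, test_y, ks=[2, 3, 6])
best_k = min(rates, key=rates.get)   # candidate with the lowest error rate
print(rates, best_k)
```

Because the best K depends on the data, the same comparison would need to be rerun on the digit vectors themselves before drawing any conclusion about which value to use.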