Handwritten Digit Recognition with KNN from Machine Learning in Action

Source: Internet
Author: User

The KNN algorithm is an excellent entry-level topic for machine learning. The book explains it as follows: "There is a sample data set, also known as the training sample set, and every item in it has a label, that is, we know which category each item in the sample set belongs to. When new data without a label is entered, each feature of the new data is compared with the corresponding features of the data in the sample set. The algorithm then extracts the class labels of the most similar data (the nearest neighbors) in the sample set. Generally, we only select the top K most similar items in the sample data set; this is where the K in K-Nearest Neighbor comes from. K is usually an integer not greater than 20. Finally, the category that occurs most often among the K most similar items is chosen as the category of the new data."

Advantages: high accuracy, insensitive to outliers, and no assumptions about the input data.

Disadvantages: high computational cost and high memory cost.

Applicable data types: numeric and nominal.

Python implementation of the algorithm:

    def knn(data, dataset, datalabel, k=3, similarity=sim_distance):
        # Distance from the new sample to every training sample, paired with its label.
        scores = [(similarity(data, dataset[i]), datalabel[i]) for i in range(len(dataset))]
        # Smallest distance first; keep only the k nearest neighbors.
        sortedscore = sorted(scores, key=lambda d: d[0], reverse=False)
        scores = sortedscore[0:k]
        # Count how often each label occurs among the k neighbors.
        classcount = {}
        for score in scores:
            classcount[score[1]] = classcount.get(score[1], 0) + 1
        # The most frequent label wins.
        sortedclasscount = sorted(classcount.items(), key=lambda d: d[1], reverse=True)
        return sortedclasscount[0][0]
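To see the voting step at work, here is a minimal sketch that runs a compact restatement of the function on a tiny hand-made data set. The toy points, the labels 'A' and 'B', and the variable names near_a and near_b are made up for illustration; sim_distance is assumed to be the sum of squared differences used in the full listing at the end of this article.

```python
def sim_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(data, dataset, datalabel, k=3, similarity=sim_distance):
    # Pair each training sample's distance with its label, nearest first.
    scores = sorted(
        ((similarity(data, sample), label) for sample, label in zip(dataset, datalabel)),
        key=lambda d: d[0],
    )
    # Majority vote among the k nearest neighbors.
    classcount = {}
    for _, label in scores[:k]:
        classcount[label] = classcount.get(label, 0) + 1
    return max(classcount.items(), key=lambda d: d[1])[0]


# Hypothetical toy data: two small clusters labeled 'A' and 'B'.
toy_data = [[1.0, 1.1], [1.0, 1.0], [0.9, 1.2], [0.0, 0.0], [0.0, 0.1], [0.1, 0.2]]
toy_labels = ['A', 'A', 'A', 'B', 'B', 'B']

near_a = knn([0.9, 0.9], toy_data, toy_labels)
near_b = knn([0.1, 0.1], toy_data, toy_labels)
print(near_a, near_b)
```

With k = 3, all three nearest neighbors of each query point come from a single cluster, so the vote is unanimous in both cases.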

Learning this algorithm involves the following steps:

(1) Prepare the data

(2) Test the algorithm

First, we introduce a handwriting recognition system. For simplicity, the system recognizes only the digits 0-9. The digits to be recognized have already been processed with graphics software so that they share the same color and size: black-and-white images of 32*32 pixels. The trainingdigits directory contains about 2000 training samples, and the testdigits directory contains about 900 test samples.

Step 1: Prepare the data: convert the image data into a vector. This step converts the 32*32 binary image matrix into a 1*1024 vector.

 
    def img2vector(filename):
        # Flatten a 32*32 image stored as lines of '0'/'1' characters into a 1024-element list.
        vec = []
        with open(filename) as f:
            for i in range(32):
                line = f.readline()
                for j in range(32):
                    vec.append(int(line[j]))
        return vec
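The conversion can be tried without the real data set. The sketch below writes a synthetic image to a temporary file and flattens it. It assumes, based on how the code indexes each line, that an image is stored as 32 text lines of 32 '0'/'1' characters; the sample pattern and the temporary file are made up for illustration.

```python
import os
import tempfile


def img2vector(filename):
    # Read the first 32 characters of each of the first 32 lines as 0/1 pixels.
    vec = []
    with open(filename) as f:
        for _ in range(32):
            line = f.readline()
            vec.extend(int(ch) for ch in line[:32])
    return vec


# Hypothetical sample: a 32*32 image whose middle 8 columns are "ink" (1).
rows = ['0' * 12 + '1' * 8 + '0' * 12 for _ in range(32)]

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(rows) + '\n')
    path = tmp.name

vec = img2vector(path)
os.remove(path)

print(len(vec))  # number of features
print(sum(vec))  # number of ink pixels
```

Each row contributes 32 values, so the result has 32 * 32 = 1024 entries, and pixel (row, column) ends up at index row * 32 + column.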

Step 2: Test the algorithm's accuracy. We use the training samples in the trainingdigits directory to classify the samples in the testdigits directory and compute the error rate.

    import os

    def test():
        # Load every training image; the digit before '_' in the file name is its label.
        traindata, trainlabel = [], []
        trainfilelist = os.listdir('digits/trainingdigits/')
        for filename in trainfilelist:
            traindata.append(img2vector('digits/trainingdigits/%s' % filename))
            trainlabel.append(int(filename.split('_')[0]))
        # Classify every test image and count hits and misses.
        succcnt, failcnt = 0, 0
        testfilelist = os.listdir('digits/testdigits')
        for filename in testfilelist:
            data = img2vector('digits/testdigits/%s' % filename)
            num = knn(data, traindata, trainlabel)
            if num == int(filename.split('_')[0]):
                succcnt += 1
                print('succ')
            else:
                failcnt += 1
                print('fail')
        print("error rate is: %f" % (failcnt / float(failcnt + succcnt)))
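The two pieces of bookkeeping in the test, reading the true label from the file name and computing the error rate, can be sketched on their own. The helper names label_from_filename and error_rate, the sample file names, and the predictions are hypothetical; the only thing taken from the code above is that the digit is the part of the file name before the first underscore.

```python
def label_from_filename(filename):
    # Hypothetical helper: the digit is the part before the first underscore.
    return int(filename.split('_')[0])


def error_rate(predicted, actual):
    # Fraction of predictions that differ from the true labels.
    failcnt = sum(1 for p, a in zip(predicted, actual) if p != a)
    return failcnt / float(len(actual))


# Hypothetical file names and predictions, for illustration only.
filenames = ['0_1.txt', '3_17.txt', '7_2.txt', '9_45.txt']
actual = [label_from_filename(name) for name in filenames]
predicted = [0, 3, 1, 9]

rate = error_rate(predicted, actual)
print("error rate is: %f" % rate)
```

Here one of the four made-up predictions is wrong (a 7 classified as 1), so the error rate is 1/4.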

In my test, with K at its default value of 3, the error rate is 0.013742.

I cannot upload files here, so the full code is pasted below. The test data is in chapter 2 of http://download.csdn.net/detail/wyb_009/5649337.

    import os

    def sim_distance(a, b):
        # Sum of squared differences; the square root is skipped because it does not change the ranking.
        sum_of_squares = sum([pow(a[i] - b[i], 2) for i in range(len(a))])
        return sum_of_squares

    def knn(data, dataset, datalabel, k=3, similarity=sim_distance):
        # Distance from the new sample to every training sample, paired with its label.
        scores = [(similarity(data, dataset[i]), datalabel[i]) for i in range(len(dataset))]
        # Smallest distance first; keep only the k nearest neighbors.
        sortedscore = sorted(scores, key=lambda d: d[0], reverse=False)
        scores = sortedscore[0:k]
        # Count how often each label occurs among the k neighbors.
        classcount = {}
        for score in scores:
            classcount[score[1]] = classcount.get(score[1], 0) + 1
        # The most frequent label wins.
        sortedclasscount = sorted(classcount.items(), key=lambda d: d[1], reverse=True)
        return sortedclasscount[0][0]

    def img2vector(filename):
        # Flatten a 32*32 image stored as lines of '0'/'1' characters into a 1024-element list.
        vec = []
        with open(filename) as f:
            for i in range(32):
                line = f.readline()
                for j in range(32):
                    vec.append(int(line[j]))
        return vec

    def test():
        # Load every training image; the digit before '_' in the file name is its label.
        traindata, trainlabel = [], []
        trainfilelist = os.listdir('digits/trainingdigits/')
        for filename in trainfilelist:
            traindata.append(img2vector('digits/trainingdigits/%s' % filename))
            trainlabel.append(int(filename.split('_')[0]))
        print("load train data OK")
        # Classify every test image and count hits and misses.
        succcnt, failcnt = 0, 0
        testfilelist = os.listdir('digits/testdigits')
        for filename in testfilelist:
            data = img2vector('digits/testdigits/%s' % filename)
            num = knn(data, traindata, trainlabel)
            if num == int(filename.split('_')[0]):
                succcnt += 1
                print('succ')
            else:
                failcnt += 1
                print('fail: knn got %d, real is %d' % (num, int(filename.split('_')[0])))
        print("error rate is: %f" % (failcnt / float(failcnt + succcnt)))

    if __name__ == "__main__":
        test()
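One detail of the listing is that sim_distance returns the sum of squared differences rather than the Euclidean distance. Because the square root is monotonic, this does not change which neighbors are nearest. The short check below illustrates that on made-up vectors; the euclidean function, the query, and the samples are hypothetical and exist only for the comparison.

```python
import math


def sim_distance(a, b):
    # Sum of squared differences, as in the listing above (no square root).
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    # True Euclidean distance, for comparison only.
    return math.sqrt(sim_distance(a, b))


# Hypothetical query and samples, for illustration only.
query = [0, 0, 1, 1]
samples = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1], [1, 1, 1, 1]]

order_squared = sorted(range(len(samples)), key=lambda i: sim_distance(query, samples[i]))
order_euclid = sorted(range(len(samples)), key=lambda i: euclidean(query, samples[i]))
print(order_squared, order_euclid)
```

Both orderings list the sample indices from nearest to farthest and come out identical, so skipping the square root saves work without affecting the classification.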




