KNN Algorithm

The book I am studying is excellent entry-level material for machine learning. It explains the algorithm as follows: "There is a sample data set, also known as a training sample set, and every item in the sample set carries a label; that is, we know the correspondence between each piece of data in the sample set and its category. After new data without a label is entered, each feature of the new data is compared with the features of the data in the sample set, and the algorithm extracts the classification labels of the most similar (nearest-neighbor) data. Generally, we select only the K most similar items in the sample data set; this is the source of the K in the K-Nearest Neighbors algorithm, and K is usually an integer not greater than 20. Finally, the category that occurs most frequently among the K most similar items is taken as the category of the new data."
Advantages: high precision, insensitive to outliers, and no assumptions about the input data.
Disadvantages: high computational complexity and high space complexity.
Applicable data range: numeric or nominal.
Python implementation of the algorithm:
def knn(data, dataset, datalabel, k=3, similarity=sim_distance):
    # score every training sample by its distance to the query point
    scores = [(similarity(data, dataset[i]), datalabel[i]) for i in range(len(dataset))]
    # keep the k nearest neighbors
    sortedscore = sorted(scores, key=lambda d: d[0], reverse=False)
    scores = sortedscore[0:k]
    # vote: count how often each label appears among the k neighbors
    classcount = {}
    for score in scores:
        classcount[score[1]] = classcount.get(score[1], 0) + 1
    sortedclasscount = sorted(classcount.items(), key=lambda d: d[1], reverse=True)
    return sortedclasscount[0][0]
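To see the voting behave on something small, here is a minimal self-contained sketch: `sim_distance` (squared Euclidean distance, as in the full listing at the end of this post) and a compact `knn` are repeated so the snippet runs on its own, and the two-cluster toy data is purely hypothetical.

```python
def sim_distance(a, b):
    # squared Euclidean distance between two feature vectors
    return sum((a[i] - b[i]) ** 2 for i in range(len(a)))

def knn(data, dataset, datalabel, k=3, similarity=sim_distance):
    # take the k (distance, label) pairs with the smallest distance
    scores = sorted((similarity(data, dataset[i]), datalabel[i])
                    for i in range(len(dataset)))[:k]
    # majority vote among the k nearest labels
    classcount = {}
    for _, label in scores:
        classcount[label] = classcount.get(label, 0) + 1
    return max(classcount.items(), key=lambda d: d[1])[0]

# hypothetical toy 2-D data: two clusters labeled 'A' and 'B'
dataset = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ['A', 'A', 'B', 'B']
print(knn([0.1, 0.1], dataset, labels, k=3))  # → B
print(knn([0.9, 1.0], dataset, labels, k=3))  # → A
```

With k=3, the query near the origin picks up both 'B' points plus one 'A' point, and the 2-to-1 vote returns 'B'.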
The following steps are used to study this algorithm:
(1) Prepare the data
(2) Test the algorithm
First, we introduce a handwriting recognition system. For simplicity, the system recognizes only the digits 0-9. The digits to be recognized have already been processed with graphics software into images of uniform color and size: 32*32-pixel black-and-white images. The trainingdigits directory contains about 2000 training samples, and the testdigits directory contains about 900 test samples.
Step 1: Prepare the data: convert the image data into a vector. This step converts each 32*32 binary image matrix into a 1*1024 vector.
def img2vector(filename):
    vec = []
    with open(filename) as f:
        # each image file holds 32 lines of 32 characters ('0' or '1')
        for i in range(32):
            line = f.readline()
            for j in range(32):
                vec.append(int(line[j]))
    return vec
Step 2: Test the algorithm's accuracy. We use the training samples in the trainingdigits directory to classify the samples in the testdigits directory and compute the error rate.
def test():
    traindata, trainlabel = [], []
    trainfilelist = os.listdir('digits/trainingdigits/')
    for filename in trainfilelist:
        traindata.append(img2vector('digits/trainingdigits/%s' % filename))
        # the digit label is encoded in the file name, e.g. "9_45.txt"
        trainlabel.append(int(filename.split('_')[0]))
    succcnt, failcnt = 0, 0
    testfilelist = os.listdir('digits/testdigits')
    for filename in testfilelist:
        data = img2vector('digits/testdigits/%s' % filename)
        num = knn(data, traindata, trainlabel)
        if num == int(filename.split('_')[0]):
            succcnt += 1
            print('succ')
        else:
            failcnt += 1
            print('fail')
    print("error rate is: %f" % (failcnt / float(failcnt + succcnt)))
I tested this; with K at its default value of 3, the error rate is 0.013742.
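The error rate depends on the choice of K, so it is worth sweeping a few values. Since the digits data set is not included here, this sketch uses a hypothetical synthetic stand-in (two Gaussian clusters); the loop structure is what matters, not the exact rates.

```python
import random

def sim_distance(a, b):
    # squared Euclidean distance
    return sum((a[i] - b[i]) ** 2 for i in range(len(a)))

def knn(data, dataset, datalabel, k=3, similarity=sim_distance):
    scores = sorted((similarity(data, dataset[i]), datalabel[i])
                    for i in range(len(dataset)))[:k]
    classcount = {}
    for _, label in scores:
        classcount[label] = classcount.get(label, 0) + 1
    return max(classcount.items(), key=lambda d: d[1])[0]

random.seed(0)
# synthetic stand-in for train/test data: clusters centered at 0 and 1
train = [([random.gauss(c, 0.3), random.gauss(c, 0.3)], c)
         for c in (0, 1) for _ in range(50)]
test_pts = [([random.gauss(c, 0.3), random.gauss(c, 0.3)], c)
            for c in (0, 1) for _ in range(20)]
traindata = [p for p, _ in train]
trainlabel = [l for _, l in train]

rates = {}
for k in (1, 3, 5, 7):
    errors = sum(1 for p, l in test_pts
                 if knn(p, traindata, trainlabel, k=k) != l)
    rates[k] = errors / len(test_pts)
    print('k=%d  error rate=%.3f' % (k, rates[k]))
```

On the real digits data the same loop would replace the synthetic points with `img2vector` output, at the cost of one full pass over the test set per K.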
Since I cannot upload files, the complete code is pasted below; the test data can be found in chapter 2 at http://download.csdn.net/detail/wyb_009/5649337.
import os

def sim_distance(a, b):
    # squared Euclidean distance between two feature vectors
    return sum(pow(a[i] - b[i], 2) for i in range(len(a)))

def knn(data, dataset, datalabel, k=3, similarity=sim_distance):
    scores = [(similarity(data, dataset[i]), datalabel[i]) for i in range(len(dataset))]
    sortedscore = sorted(scores, key=lambda d: d[0], reverse=False)
    scores = sortedscore[0:k]
    classcount = {}
    for score in scores:
        classcount[score[1]] = classcount.get(score[1], 0) + 1
    sortedclasscount = sorted(classcount.items(), key=lambda d: d[1], reverse=True)
    return sortedclasscount[0][0]

def img2vector(filename):
    vec = []
    with open(filename) as f:
        for i in range(32):
            line = f.readline()
            for j in range(32):
                vec.append(int(line[j]))
    return vec

def test():
    traindata, trainlabel = [], []
    trainfilelist = os.listdir('digits/trainingdigits/')
    for filename in trainfilelist:
        traindata.append(img2vector('digits/trainingdigits/%s' % filename))
        trainlabel.append(int(filename.split('_')[0]))
    print("load train data OK")
    succcnt, failcnt = 0, 0
    testfilelist = os.listdir('digits/testdigits')
    for filename in testfilelist:
        data = img2vector('digits/testdigits/%s' % filename)
        num = knn(data, traindata, trainlabel)
        if num == int(filename.split('_')[0]):
            succcnt += 1
            print('succ')
        else:
            failcnt += 1
            print('fail: knn got %d, real is %d' % (num, int(filename.split('_')[0])))
    print("error rate is: %f" % (failcnt / float(failcnt + succcnt)))

if __name__ == "__main__":
    test()