1: Simple Algorithm Description
Given a set of training samples and their labels, for each test sample we select the K nearest training samples; the class that occurs most often among those K neighbors is the predicted label of the test sample. This algorithm is called k-Nearest Neighbors (KNN for short). K is generally an integer no greater than 20, and the distance is generally the Euclidean distance.
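The procedure just described can be sketched directly in NumPy (a minimal illustration with made-up points; the full classifier used in the rest of this note is developed in section 2):

```python
import numpy as np

# Four made-up training points in two classes (illustrative data only)
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
test = np.array([0.1, 0.2])
k = 3

# Euclidean distance from the test point to every training point
dists = np.sqrt(((train - test) ** 2).sum(axis=1))
# Labels of the k nearest training points
nearest = [labels[i] for i in dists.argsort()[:k]]
# Majority vote among the k neighbors
prediction = max(set(nearest), key=nearest.count)
print(prediction)  # the test point lies near the 'B' cluster, so 'B'
```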
2: Python code implementation
Create a KNN.py file and put the core code in it.
(1) Create the dataset
from numpy import *
import operator

# Create a small labeled dataset
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
(2) KNN Classifier
# The first KNN classifier
# samples - test data vector
# dataSet - training sample matrix
# labels  - training labels
# k       - number of neighbors to vote
def classify0(samples, dataSet, labels, k):
    # compute the Euclidean distance to every training sample
    dataSetSize = dataSet.shape[0]
    diffMat = tile(samples, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # select the nearest k points and count their labels
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # sort by vote count, descending (use iteritems() in Python 2)
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
Code notes: (a) The tile function: tile(A, i) repeats array A i times along its last axis; tile(A, (i, j)) produces i rows, each containing A repeated j times. For example:
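A quick demonstration of both forms of tile:

```python
import numpy as np

a = np.array([1, 2])
row = np.tile(a, 3)        # repeat along the last axis: [1 2 1 2 1 2]
grid = np.tile(a, (2, 3))  # 2 rows, each containing a repeated 3 times
print(row)
print(grid)                # shape (2, 6)
```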
(b) Python working directory. Import the os module: os.getcwd() shows the current directory, os.chdir('path') changes it, and os.listdir('.') lists all files in the current directory. In addition, after modifying KNN.py you must reload the module (reload(KNN) in Python 2, importlib.reload(KNN) in Python 3) so the updated code takes effect; otherwise Python keeps using the previously loaded KNN module. For example:
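A small sketch of these directory helpers (the reload lines are left commented because they assume a KNN.py exists on your path):

```python
import os

cwd = os.getcwd()        # current working directory
files = os.listdir('.')  # files in the current directory
print(cwd)
print(files)

# After editing KNN.py at an interactive prompt, reload it so the
# changes take effect (Python 2: reload(KNN); Python 3 as below):
# import importlib, KNN
# importlib.reload(KNN)
```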
(c) Squaring an array and summing along an axis: ** 2 squares each element, and .sum(axis=1) adds up each row; together they give the squared distances used in classify0. For example:
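With a made-up difference matrix, the square / row-sum / square-root chain from classify0 looks like this:

```python
import numpy as np

diff = np.array([[3.0, 4.0], [1.0, 1.0]])
sq = diff ** 2             # elementwise square: [[9. 16.], [1. 1.]]
row_sums = sq.sum(axis=1)  # sum each row: [25.  2.]
dists = row_sums ** 0.5    # Euclidean distances: [5. sqrt(2)]
print(dists)
```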
3: case study-dating websites
Case Description:
(1) parse data from text files
# Parse text records into a NumPy matrix and a label vector
def file2matrix(filename):
    # open the file and get the number of lines
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # create the NumPy matrix to return
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # parse each line of the file into the list
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
Code notes: (a) line.strip() first removes the trailing newline characters; then split('\t') uses the tab character to break the whole line obtained in the previous step into a list of elements.
(b) int(listFromLine[-1]): in Python, index -1 refers to the last element of a list. In addition, we must explicitly tell the interpreter to store the value as an integer; otherwise Python treats the element as a string.
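A small illustration with a made-up tab-separated record:

```python
fields = '40920\t8.326976\t0.953952\t3'.split('\t')
print(fields[-1])        # index -1 gives the last element: the string '3'
label = int(fields[-1])  # convert explicitly, otherwise it stays a string
print(label + 1)         # now usable as an integer
```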
(2) Use the plotting tool matplotlib to create a scatter chart and analyze the data
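The original post does not include the plotting code. A minimal sketch of how such a scatter chart might be drawn, coloring and sizing each point by its class label; the synthetic data, column choice, and the Agg backend are all assumptions for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend (assumption; omit in a notebook)
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for the dating data: two feature columns plus labels
rng = np.random.default_rng(0)
data = rng.random((50, 2))
labels = rng.integers(1, 4, size=50)  # classes 1..3, as in the dataset

fig, ax = plt.subplots()
# color and size points by class so the clusters become visible
ax.scatter(data[:, 0], data[:, 1], s=15.0 * labels, c=labels)
ax.set_xlabel('feature 1')
ax.set_ylabel('feature 2')
fig.savefig('scatter.png')
```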
(3) normalized value
To prevent features with large numeric ranges from dominating the prediction (the distance computation is heavily influenced by large feature values), we normalize all feature values to [0, 1] using newValue = (oldValue - min) / (max - min).
# Normalize feature values to the range [0, 1]
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minima
    maxVals = dataSet.max(0)   # column-wise maxima
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
(4) test code
The test code uses 90% of the records as training samples and the remaining 10% as test data.
# Classifier test: hold out 10% of the data for testing
def datingClassTest():
    hoRatio = 0.10  # fraction of the data used as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print('the classifier came back with: %d, the real answer is: %d'
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print('the total error rate is: %f' % (errorCount / float(numTestVecs)))
(5) Input a person's information to predict how much you will like them
# Input a person's information and predict the liking level
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input('percentage of time spent playing video games? '))
    ffMiles = float(input('frequent flier miles earned per year? '))
    iceCream = float(input('liters of ice cream consumed per year? '))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges,
                                 normMat, datingLabels, 3)
    print('you will probably like this person:',
          resultList[classifierResult - 1])
Code note: raw_input (renamed input in Python 3) reads a line of text typed by the user and returns it as a string.
4: case study-Handwriting Recognition System
Here each handwritten character is stored as a 32*32 binary image of 0s and 1s, which we convert into a 1*1024 vector to serve as one training sample; each dimension is one feature value.
(1) convert a 32*32 binary image into a 1*1024 vector.
# Convert a 32*32 binary image matrix to a 1*1024 vector
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
(2) test code of the handwriting recognition system
# Test code for the handwriting recognition system
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # get directory contents
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        # parse the class label from the file name, e.g. '9_45.txt' -> 9
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classStr = int(fileStr.split('_')[0])
        hwLabels.append(classStr)  # training sample label
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print('the classifier came back with: %d, the real answer is: %d'
              % (classifierResult, classStr))
        if classifierResult != classStr:
            errorCount += 1.0
    print('\nthe total number of errors is: %d' % errorCount)
    print('\nthe total error rate is: %f' % (errorCount / float(mTest)))
Note: 1: These notes are based on the book <Machine Learning in Action>.
2: The KNN.py file and the note data can be downloaded here (http://download.csdn.net/detail/lu597203933/7653991).
Source: Small Village Chief, http://blog.csdn.net/lu597203933. You are welcome to reprint or share, but please be sure to declare the source of the article. (Sina Weibo: Small Village Chief Zack. Thank you!)
Machine Learning Practice Note 2 (k-Nearest Neighbor Algorithm)