1: Simple Algorithm Description
Given a set of training samples and their labels, for each test sample we select the K nearest training samples; the class that occurs most often among those K neighbors is the predicted label of the test sample. This algorithm is called k-Nearest Neighbors (KNN for short). K is generally an integer no greater than 20, and the distance is generally the Euclidean distance.
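The procedure just described can be sketched directly in NumPy (a minimal illustration with made-up points; the full classifier used in the rest of this note is developed in section 2):

```python
import numpy as np

# Four made-up training points in two classes (illustrative data only)
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
test = np.array([0.1, 0.2])
k = 3

# Euclidean distance from the test point to every training point
dists = np.sqrt(((train - test) ** 2).sum(axis=1))
# Labels of the k nearest training points
nearest = [labels[i] for i in dists.argsort()[:k]]
# Majority vote among the k neighbors
prediction = max(set(nearest), key=nearest.count)
print(prediction)  # the test point lies near the 'B' cluster, so 'B'
```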
2: Python code implementation
Create a KNN.py file and put the core code in it.
(1) Create the dataset
from numpy import *
import operator

# Create a small labeled dataset
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
(2) KNN Classifier
# The first KNN classifier
# samples - test data vector
# dataSet - training sample matrix
# labels  - training labels
# k       - number of neighbors to vote
def classify0(samples, dataSet, labels, k):
    # compute the Euclidean distance to every training sample
    dataSetSize = dataSet.shape[0]
    diffMat = tile(samples, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # select the nearest k points and count their labels
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # sort by vote count, descending (use iteritems() in Python 2)
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
Code notes: (a) The tile function: tile(A, i) repeats array A i times along its last axis; tile(A, (i, j)) produces i rows, each containing A repeated j times. For example:
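A quick demonstration of both forms of tile:

```python
import numpy as np

a = np.array([1, 2])
row = np.tile(a, 3)        # repeat along the last axis: [1 2 1 2 1 2]
grid = np.tile(a, (2, 3))  # 2 rows, each containing a repeated 3 times
print(row)
print(grid)                # shape (2, 6)
```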
(b) Python working directory. Import the os module: os.getcwd() shows the current directory, os.chdir('path') changes it, and os.listdir('.') lists all files in the current directory. In addition, after modifying KNN.py you must reload the module (reload(KNN) in Python 2, importlib.reload(KNN) in Python 3) so the updated code takes effect; otherwise Python keeps using the previously loaded KNN module. For example:
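A small sketch of these directory helpers (the reload lines are left commented because they assume a KNN.py exists on your path):

```python
import os

cwd = os.getcwd()        # current working directory
files = os.listdir('.')  # files in the current directory
print(cwd)
print(files)

# After editing KNN.py at an interactive prompt, reload it so the
# changes take effect (Python 2: reload(KNN); Python 3 as below):
# import importlib, KNN
# importlib.reload(KNN)
```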
(c) Squaring an array and summing along an axis: ** 2 squares each element, and .sum(axis=1) adds up each row; together they give the squared distances used in classify0. For example:
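With a made-up difference matrix, the square / row-sum / square-root chain from classify0 looks like this:

```python
import numpy as np

diff = np.array([[3.0, 4.0], [1.0, 1.0]])
sq = diff ** 2             # elementwise square: [[9. 16.], [1. 1.]]
row_sums = sq.sum(axis=1)  # sum each row: [25.  2.]
dists = row_sums ** 0.5    # Euclidean distances: [5. sqrt(2)]
print(dists)
```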
3: case study-dating websites
Case Description:
(1) parse data from text files
# Parse text records into a NumPy matrix and a label vector
def file2matrix(filename):
    # open the file and get the number of lines
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # create the NumPy matrix to return
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # parse each line of the file into the list
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
Code notes: (a) line.strip() first removes the trailing newline characters; then split('\t') uses the tab character to break the whole line obtained in the previous step into a list of elements.
(b) int(listFromLine[-1]): in Python, index -1 refers to the last element of a list. In addition, we must explicitly tell the interpreter to store the value as an integer; otherwise Python treats the element as a string.
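A small illustration with a made-up tab-separated record:

```python
fields = '40920\t8.326976\t0.953952\t3'.split('\t')
print(fields[-1])        # index -1 gives the last element: the string '3'
label = int(fields[-1])  # convert explicitly, otherwise it stays a string
print(label + 1)         # now usable as an integer
```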
(2) Use the plotting tool matplotlib to create a scatter chart and analyze the data
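The original post does not include the plotting code. A minimal sketch of how such a scatter chart might be drawn, coloring and sizing each point by its class label; the synthetic data, column choice, and the Agg backend are all assumptions for illustration:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend (assumption; omit in a notebook)
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for the dating data: two feature columns plus labels
rng = np.random.default_rng(0)
data = rng.random((50, 2))
labels = rng.integers(1, 4, size=50)  # classes 1..3, as in the dataset

fig, ax = plt.subplots()
# color and size points by class so the clusters become visible
ax.scatter(data[:, 0], data[:, 1], s=15.0 * labels, c=labels)
ax.set_xlabel('feature 1')
ax.set_ylabel('feature 2')
fig.savefig('scatter.png')
```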
(3) normalized value
To prevent features with large numeric ranges from dominating the prediction (the distance computation is heavily influenced by large feature values), we normalize all feature values to [0, 1] using newValue = (oldValue - min) / (max - min).
# Normalize feature values to the range [0, 1]
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minima
    maxVals = dataSet.max(0)   # column-wise maxima
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
(4) test code
The test code uses 90% of the records as training samples and the remaining 10% as test data.
# Classifier test: hold out 10% of the data for testing
def datingClassTest():
    hoRatio = 0.10  # fraction of the data used as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print('the classifier came back with: %d, the real answer is: %d'
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print('the total error rate is: %f' % (errorCount / float(numTestVecs)))
(5) Input a person's information to predict how much you will like them
# Input a person's information and predict the liking level
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input('percentage of time spent playing video games? '))
    ffMiles = float(input('frequent flier miles earned per year? '))
    iceCream = float(input('liters of ice cream consumed per year? '))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges,
                                 normMat, datingLabels, 3)
    print('you will probably like this person:',
          resultList[classifierResult - 1])
Code note: raw_input (renamed input in Python 3) reads a line of text typed by the user and returns it as a string.
4: case study-Handwriting Recognition System
Here each handwritten character is stored as a 32*32 binary image of 0s and 1s, which we convert into a 1*1024 vector to serve as one training sample; each dimension is one feature value.
(1) convert a 32*32 binary image into a 1*1024 vector.
# Convert a 32*32 binary image matrix to a 1*1024 vector
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
(2) test code of the handwriting recognition system
# Test code for the handwriting recognition system
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # get directory contents
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        # parse the class label from the file name, e.g. '9_45.txt' -> 9
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classStr = int(fileStr.split('_')[0])
        hwLabels.append(classStr)  # training sample label
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print('the classifier came back with: %d, the real answer is: %d'
              % (classifierResult, classStr))
        if classifierResult != classStr:
            errorCount += 1.0
    print('\nthe total number of errors is: %d' % errorCount)
    print('\nthe total error rate is: %f' % (errorCount / float(mTest)))
Note: 1: These notes are based on the book <Machine Learning in Action>.
2: The KNN.py file and the note data can be downloaded here (http://download.csdn.net/detail/lu597203933/7653991).
Source: Small Village Chief, http://blog.csdn.net/lu597203933. You are welcome to reprint or share, but please be sure to declare the source of the article. (Sina Weibo: Small Village Chief Zack. Thank you!)
Machine Learning Practice Note 2 (k-Nearest Neighbor Algorithm)