2 Machine Learning in Action notes (k-nearest neighbors)


1: A brief description of the algorithm

Given a set of training samples and their labels, for each test sample we find the K training samples nearest to it. The class that occurs most often among these K training samples is the predicted label of the test sample.

The method is called KNN for short. K is usually an integer no greater than 20, and the distance is usually the Euclidean distance.
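
As a rough illustration of the description above, the following sketch classifies one point by majority vote among its k nearest neighbors using only the standard library. It assumes Python 3.8 or later (for math.dist). The function name knn_predict and the toy data are made up for this example and are not part of knn.py.

```python
import math
from collections import Counter

def knn_predict(query, samples, labels, k):
    """Return the majority label among the k training samples nearest to query."""
    # Pair each training sample's Euclidean distance to the query with its label
    distances = [(math.dist(query, s), lab) for s, lab in zip(samples, labels)]
    # Keep the k closest samples
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among their labels
    votes = Counter(lab for _, lab in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two points near (1, 1) and two near the origin
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
prediction = knn_predict((0.1, 0.2), samples, labels, k=3)
print(prediction)
```

With k=3 the query point's three nearest neighbors are the two "B" points and one "A" point, so the vote goes to "B".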

2: Python code implementation

Create a knn.py file and put the core code inside it.

(1) Create data

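# Imports used by the functions in knn.py
from numpy import array, tile, zeros, shape
import operator
from os import listdir

# Create the data set
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels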

(2) Build the KNN classifier

# First kNN classifier
# inX: test sample; dataSet: training samples; labels: training labels; k: number of nearest neighbors
def classify0(inX, dataSet, labels, k):
    # Compute the Euclidean distance from inX to every training sample
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Vote among the k points with the smallest distance
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the classes by vote count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

Code commentary: (a) The tile function: tile(inX, i) extends the length by repeating inX i times; in tile(inX, (i, j)), i is the number of copies stacked as rows and j is the number of times each row is repeated along its length.
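
A small sketch of both call forms, assuming NumPy is installed; the array values are arbitrary and chosen only for illustration.

```python
import numpy as np

in_x = np.array([1, 2])

# A scalar repeat count extends the length: in_x is repeated 3 times
row = np.tile(in_x, 3)
print(row)            # [1 2 1 2 1 2]

# A (rows, repeats) tuple stacks copies: 3 rows, each holding in_x once
stacked = np.tile(in_x, (3, 1))
print(stacked.shape)  # (3, 2)

# 2 rows, each holding in_x repeated twice
wide = np.tile(in_x, (2, 2))
print(wide.shape)     # (2, 4)
```

In classify0 the second form is used with (dataSetSize, 1), so the test sample is copied once per training row before the subtraction.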

(b) Python working directory. Import the os module: os.getcwd() returns the current folder, os.chdir(path) changes the folder, and os.listdir('.') lists all the files in the current folder.

Also, if the .py file has been changed, the module must be loaded again in the Python shell with reload(knn) so that the updated content takes effect (in Python 3 the function is importlib.reload). Otherwise Python will continue to use the knn module that was last loaded.
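
The calls mentioned above can be tried as follows. This is a sketch for Python 3, where reload lives in importlib; the standard-library module json stands in for knn only so that the snippet runs in any folder.

```python
import importlib
import json
import os

# Current working directory, returned as a string
cwd = os.getcwd()
print(cwd)

# Names of the entries in the current directory
entries = os.listdir(cwd)
print(len(entries))

# os.chdir changes the working directory; changing to cwd itself is a no-op
os.chdir(cwd)

# Reload an already imported module so that edits to its source take effect.
# With the notes' own file this would be: import knn; importlib.reload(knn)
json = importlib.reload(json)
```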


(c) Note how the array is squared element by element and then summed along each row.
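
A minimal sketch of that step, assuming NumPy; the numbers are arbitrary. A plain Python list does not support the ** operator, so the data has to be a NumPy array for this to work.

```python
import numpy as np

diff = np.array([[1.0, 2.0],
                 [3.0, 4.0]])

# ** 2 squares every element (* 2 would only double it)
squared = diff ** 2

# axis=1 sums across each row, giving one value per sample
row_sums = squared.sum(axis=1)

# The square root of each row sum is the Euclidean length of that row
distances = row_sums ** 0.5
print(row_sums)
print(distances)
```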

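3: Case study: dating site

Case description: each person is described by three features (frequent flier miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per year) and is labeled with one of three degrees of liking.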

(1) Parsing data from a text file

# Parse a text record file into a NumPy matrix
def file2matrix(filename):
    # Open the file and get the number of lines
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # Create the NumPy matrix to return
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # Parse the file data into the matrix and the label list
    for line in arrayOLines:
        line = line.strip()
        listFormLine = line.split('\t')
        returnMat[index, :] = listFormLine[0:3]
        classLabelVector.append(int(listFormLine[-1]))
        index += 1
    return returnMat, classLabelVector

Code commentary: (a) First use line.strip() to strip the trailing carriage return and newline characters, then use the tab character \t to split the line from the previous step into a list of elements.

(b) int(listFormLine[-1]): Python can use the index -1 to refer to the last element in a list. We must also explicitly tell the interpreter that the value stored in the list is an integer; otherwise Python will treat these elements as strings.
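
Both points can be seen on a single record. The line below is a made-up example in the tab-separated layout that file2matrix expects (three feature values followed by an integer label).

```python
# Hypothetical tab-separated record: three features and an integer label
line = "40920\t8.326976\t0.953952\t3\n"

# Strip the newline, then split on tabs
fields = line.strip().split("\t")

features = [float(x) for x in fields[0:3]]
label = int(fields[-1])  # -1 indexes the last element

print(fields[-1] == "3")  # the raw field is still a string
print(label == 3)         # after int() it is an integer
```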

(2) Create a scatter plot with the plotting tool matplotlib to analyze the data
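
The following is only a sketch of how such a plot might be drawn, assuming matplotlib and NumPy are installed. The random data stands in for two feature columns and the labels returned by file2matrix; it is not the real data set, and the axis names are placeholders.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-ins for two feature columns and the class labels 1..3
rng = np.random.default_rng(0)
feature_x = rng.random(30)
feature_y = rng.random(30)
labels = rng.integers(1, 4, size=30)

fig, ax = plt.subplots()
# Marker size and color both depend on the label, so the classes stand apart
points = ax.scatter(feature_x, feature_y, s=15.0 * labels, c=labels)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")

# Write the figure to an in-memory buffer; pass a file name instead to keep it
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```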




(3) Normalize the values

Features with very different numeric ranges would distort the prediction: when the distance is computed, the feature with the largest values dominates the result. To prevent this, we normalize all feature values to [0, 1] with newValue = (oldValue - min) / (max - min).

# Normalize the feature values
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
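
For comparison, the same min-max scaling can be written with NumPy broadcasting instead of tile. This is an alternative sketch, not the code used in these notes; the name min_max_normalize and the sample values are made up, and a column whose values are all equal would cause a division by zero.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of data to [0, 1]; returns (normalized, ranges, mins)."""
    mins = data.min(axis=0)
    ranges = data.max(axis=0) - mins
    # Broadcasting applies the per-column mins and ranges to every row
    return (data - mins) / ranges, ranges, mins

data = np.array([[10.0, 0.5],
                 [20.0, 1.5],
                 [30.0, 1.0]])
normalized, ranges, mins = min_max_normalize(data)
print(normalized)
```

Each column is scaled independently, so after normalization the smallest value in every column is 0 and the largest is 1.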

(4) Test code

The test code uses 90% of the data as training samples and the remaining 10% as test data.

# Test code
def datingClassTest():
    hoRatio = 0.10  # fraction of the data used for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print('the classifier came back with: %d, the real answer is: %d' % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print('the total error rate is: %f' % (errorCount / float(numTestVecs)))

(5) Enter a person's information to get the predicted degree of liking

# Enter a person's information to get the predicted degree of liking
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'large doses']
    # raw_input is the Python 2 name; Python 3 calls it input
    percentTats = float(raw_input("percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print('You will probably like this person: %s' % resultList[classifierResult - 1])

Code commentary: Python's raw_input shows a prompt, lets the user enter a line of text, and returns what the user typed as a string. In Python 3 the function is named input.

4: Case study: handwriting recognition system

A handwritten character can be treated as a 32*32 binary image made up of 0s and 1s. The image is converted into a 1*1024 vector, which serves as one training sample. Each dimension is one feature value.

(1) Convert a 32*32 binary image into a vector of 1*1024

# Convert a 32*32 binary image matrix into a 1*1024 vector
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
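
To see the row-by-row layout of the resulting vector without a data file, the same loop can be run on an in-memory text stream. This is a sketch with a made-up image; the name text_image_to_vector is hypothetical and it reads from any file-like object rather than a file name.

```python
import io

import numpy as np

def text_image_to_vector(stream, size=32):
    """Read size lines of size '0'/'1' characters into a 1 x size*size vector."""
    vect = np.zeros((1, size * size))
    for i in range(size):
        line = stream.readline()
        for j in range(size):
            # Row i, column j of the image lands at position size*i + j
            vect[0, size * i + j] = int(line[j])
    return vect

# Made-up 32x32 image: all zeros except a full row of ones on the second line
rows = ["0" * 32] * 32
rows[1] = "1" * 32
vector = text_image_to_vector(io.StringIO("\n".join(rows) + "\n"))
print(vector.shape)
```

Because the second image row is all ones, positions 32 to 63 of the vector are 1 and every other position is 0.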

(2) Handwriting recognition system test code

# Handwriting recognition system test code
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # get the folder contents
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        # Split the file name to get the class label
        fileStr = fileNameStr.split('.')[0]
        classStr = int(fileStr.split('_')[0])
        hwLabels.append(classStr)  # label of this sample
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print('the classifier came back with: %d, the real answer is: %d' % (classifierResult, classStr))
        if classifierResult != classStr:
            errorCount += 1.0
    print('\nthe total number of errors is: %d' % errorCount)
    print('\nthe total error rate is: %f' % (errorCount / float(mTest)))
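
The two split calls in the test code imply that each file name starts with the class, followed by an underscore and an extension. A small sketch of that parsing step; the helper name and the file names are hypothetical.

```python
def label_from_filename(file_name):
    """Return the class encoded before the underscore, e.g. '3_45.txt' -> 3."""
    stem = file_name.split(".")[0]   # drop the extension: '3_45'
    return int(stem.split("_")[0])   # keep the part before '_': 3

# Hypothetical file names in the <class>_<index>.txt pattern
first = label_from_filename("3_45.txt")
second = label_from_filename("9_0.txt")
print(first, second)
```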

Note: 1: These notes are based on the book Machine Learning in Action.

2: The knn.py file and the data used in these notes can be downloaded here (http://download.csdn.net/detail/lu597203933/7653991).

Author: Small village head. Source: http://blog.csdn.net/lu597203933. You are welcome to reprint or share this article, but be sure to credit the source.

(Sina Weibo: Small mayor Zack. Feel free to get in touch!)

Copyright notice: This is an original article by the blog author and may not be reproduced without consent.
