Machine Learning in Action Chapter 2 Study Notes: The kNN Algorithm


These notes record the contents of the second chapter of "Machine Learning in Action". The book introduces kNN (k-nearest neighbors) with two concrete examples:

    1. Date Object Predictions
    2. Handwritten digit recognition

The "date object prediction" example is enough to understand how the kNN algorithm works. Handwritten digit recognition uses exactly the same algorithm code as date object prediction; only the dataset changes.

Date Object Prediction

1 Requirements

The protagonist, "Zhang San", likes making new friends. Many users like him are registered on "System A", all hoping to meet like-minded friends. At first, "Zhang San" picked people on "System A" who looked promising and asked them out to dinner, but the results were not always what he hoped: some of the people he picked genuinely shared his interests, while others had nothing in common with him. "Zhang San" hopes that "System A" will automatically recommend like-minded candidates to him, improving the odds that a date turns out to be a good match.

2 Analysis Requirements

The system cannot recommend friends to "Zhang San" out of thin air; it needs something that already exists as a basis and reference. That "something" is "Zhang San"'s dating history.

For each date, three attributes are used to describe the other person:

    1. Number of frequent flyer miles earned per year
    2. Time spent playing video games per week
    3. Liters of ice cream consumed per week

In effect, these three attributes represent a person; different people have different values for them. A vector [Feature1, Feature2, Feature3] therefore represents a dating candidate. A date can have one of three outcomes: not satisfied, okay, or very satisfied. The outcome is recorded as the class. Each historical dating record can then be expressed as a vector [Feature1, Feature2, Feature3, class], where:

    • Feature1: number of frequent flyer miles earned per year
    • Feature2: time spent playing video games per week
    • Feature3: liters of ice cream consumed per week
    • Class: the dating result

By now, "Zhang San" has accumulated many dating records of the form [Feature1, Feature2, Feature3, class]. What the system must do is this: given a stranger [Feature1, Feature2, Feature3] that "Zhang San" has not yet dated, use his historical dating records [Feature1, Feature2, Feature3, class] to predict the outcome of a date. If the predicted result is "very satisfied", the system can recommend this stranger to "Zhang San".

With the requirements clear, the usual machine-learning workflow can be applied.

3 Collecting data

Get "Zhang San"'s historical dating data [Feature1, Feature2, Feature3, class]. The author of Machine Learning in Action has prepared the data for us; git address:

https://github.com/pbharrin/machinelearninginaction/tree/master/Ch02

DatingTestSet.txt and DatingTestSet2.txt are the data files. The main difference between the two is how the dating result is represented: DatingTestSet.txt uses strings, while DatingTestSet2.txt uses numbers; they are essentially the same. For example, in DatingTestSet2.txt the data looks like this:

9868    2.694977    0.432818    2
18333   3.951256    0.333300    2
3780    9.856183    0.329181    2
18190   2.068962    0.429927    2
11145   3.410627    0.631838    2
68846   9.974715    0.669787    1
26575   10.650102   0.866627    3
48111   9.134528    0.728045    3
43757   7.882601    1.332446    3
    • First column: frequent flyer miles earned per year
    • Second column: time spent playing video games
    • Third column: liters of ice cream consumed
    • Fourth column: dating result (1: not satisfied, 2: okay, 3: very satisfied)
4 Data Preparation

With the data in hand, it needs to be loaded into the program for further processing.

To convert a file to a data structure that your program requires:

from numpy import *

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

A NumPy array is used to store the data.
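As a quick sanity check of the parsing step, here is a minimal, self-contained run against a tiny two-row sample file. The file name and both rows are made up for illustration; only the format matches DatingTestSet2.txt:

```python
from numpy import zeros

def file2matrix(filename):
    # Parse a tab-separated file: three feature columns plus a class label
    with open(filename) as fr:
        lines = fr.readlines()
    returnMat = zeros((len(lines), 3))
    classLabelVector = []
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector

# Hypothetical two-record sample in the DatingTestSet2.txt format
with open("sample.txt", "w") as f:
    f.write("40920\t8.326976\t0.953952\t3\n")
    f.write("14488\t7.153469\t1.673904\t2\n")

mat, labels = file2matrix("sample.txt")
print(mat.shape)  # (2, 3)
print(labels)     # [3, 2]
```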

5 Preprocessing data

In the first three columns of the file, each column corresponds to one attribute, and the attributes have very different value ranges. When computing the "distance", the attribute with the largest value range would dominate the result. Assuming all attributes are equally important, they need to be normalized. The code is as follows:

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

The book does not discuss what to do when the attributes are not equally important. A natural place to handle that would be this step: before returning normDataSet, multiply each column by a factor representing its weight.
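Such a weighted variant might look like the sketch below. autoNormWeighted, the weight values, and the sample numbers are my own illustration, not from the book; NumPy broadcasting replaces the book's tile() calls:

```python
import numpy as np

def autoNormWeighted(dataSet, weights):
    # Min-max normalize each column, then scale it by an importance weight
    minVals = dataSet.min(0)
    ranges = dataSet.max(0) - minVals
    normDataSet = (dataSet - minVals) / ranges   # broadcasting replaces tile()
    return normDataSet * np.asarray(weights), ranges, minVals

# Hypothetical three-record sample: miles, game hours, ice cream liters
data = np.array([[40920.0, 8.3, 0.95],
                 [14488.0, 7.1, 1.67],
                 [26052.0, 1.4, 0.80]])

# Pretend game time matters twice as much, ice cream half as much
norm, ranges, minVals = autoNormWeighted(data, [1.0, 2.0, 0.5])
print(norm.min(0))  # each column's minimum is still 0
print(norm.max(0))  # each column's maximum equals its weight
```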

6 Analyzing data

This step plots the data with pyplot. The chart gives an intuitive feel for the data and a rough idea of whether there is any regular relationship between the three features and the class. Plotting code:

import matplotlib.pyplot as plt

def plotDatingData():
    datingDataMat, datingLabels = file2matrix("DatingTestSet2.txt")
    normMat, ranges, minVals = autoNorm(datingDataMat)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(normMat[:, 0], normMat[:, 1],
               15.0 * array(datingLabels), 10.0 * array(datingLabels))
    plt.xlabel("Flyer Miles Earned Per Year")
    plt.ylabel("Time Spent Playing Video Games")
    plt.title("Dating History")
    plt.show()

The above code uses only the "miles flown" and "game time" features, where:

    • Red: Very Satisfied
    • Green: You can also
    • Blue: Not satisfied

As the figure shows, the three outcomes occupy roughly distinct regions, which suggests these two features can be used for prediction.

7 KNN algorithm

For a given vector, compute its distance to every vector in the existing dataset; the smaller the distance, the more similar the two vectors. From the existing dataset, find the k vectors closest to the given vector. Each of those k vectors has an associated class; by majority vote, the class that occurs most often among them is the predicted class.

The code is as follows:

import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
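classify0 can be checked end to end on the small four-point toy dataset the chapter uses (the book's createDataSet() example):

```python
import operator
from numpy import array, tile

def classify0(inX, dataSet, labels, k):
    # Euclidean distance from inX to every row of dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    # Majority vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, labels, 3))  # 'B'
```

A point at the origin sits next to the two 'B' samples, so two of its three nearest neighbors vote 'B'.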
8 Algorithm Testing

The "dating history data" used above is effectively training data, from which the prediction method above is built. To test how good the method is, use the first 10% of the historical data as a test set and only the remaining 90% as the training set. Run each record in the test set through the prediction method and compare the result with the actual outcome: if they match, the prediction is correct; otherwise, it is an error. This yields an error rate, and the lower the error rate, the better the prediction method. The code is as follows:

def datingClassTest():
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix("DatingTestSet2.txt")
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
9 Practical use

At this point, the prediction method completed above can be used for an actual "date object prediction". The code is as follows:

def classifyPerson():
    resultMap = {1: 'not at all', 2: 'in small doses', 3: 'in large doses'}
    flierMiles = float(input("flier miles earned per year? "))
    playGameTime = float(input("time spent playing video games? "))
    iceCream = float(input("liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('DatingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([flierMiles, playGameTime, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges,
                                 normMat, datingLabels, 3)
    print("You will probably like this person:", resultMap[classifierResult])
Handwritten digit recognition

This example is essentially the same as the "date object prediction"; only the dataset differs. The dataset is in a zip file; git address:
https://github.com/pbharrin/machinelearninginaction/tree/master/Ch02

1 Convert a single file to a vector

Code:

def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
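To see the flattening at work without the real dataset, a made-up 32x32 file with a single '1' in the top-left corner can be fed through (the file name and contents below are hypothetical; real files in the digits dataset have the same 32-line, 32-character shape):

```python
from numpy import zeros

def img2vector(filename):
    # Flatten a 32x32 text image of '0'/'1' characters into a 1x1024 row vector
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

# Hypothetical sample: a single '1' in the top-left corner, zeros elsewhere
with open("digit_sample.txt", "w") as f:
    f.write("1" + "0" * 31 + "\n")
    for _ in range(31):
        f.write("0" * 32 + "\n")

vect = img2vector("digit_sample.txt")
print(vect.shape)               # (1, 1024)
print(int(vect.sum()))          # 1
```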
2 Algorithm Testing

Code:

import os

def handwritingClassTest():
    hwLabels = []
    trainingFileList = os.listdir("trainingDigits")
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split(".")[0]
        classNumStr = int(fileStr.split("_")[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector(os.path.join("trainingDigits", fileNameStr))
    testFileList = os.listdir("testDigits")
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split(".")[0]
        classNumStr = int(fileStr.split("_")[0])
        vectorUnderTest = img2vector(os.path.join("testDigits", fileNameStr))
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
Summary

1 Advantages

The principle is simple and easy to understand.

Judging from the results of the two examples above, the error rate is very low, which suggests the method works well.

2 Disadvantages

With a large dataset, is it appropriate to load all the data into an in-memory data structure?

A single prediction requires computing a distance to every data vector and then selecting the k nearest, so the computational cost is high.

3 Other Questions

How should the features used for prediction be chosen in practice: from experience, or intuition? How do real-world systems choose their features? Insights from experienced readers would be very welcome.

