Example: suppose we want to build a classifier from 1000 lines of training samples, dividing the data into 3 categories (dislike, general, like). Each sample has 3 features:

A: Number of frequent flyer miles earned per year
B: Percentage of time spent playing video games
C: Litres of ice cream consumed per week
1. Reading the data

```python
from numpy import *

filename = 'D:/machine_learn/ch02/datingTestSet2.txt'

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)         # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))    # prepare matrix to return
    classLabelVector = []                    # prepare labels to return
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]         # first 3 columns are the features
        classLabelVector.append(int(listFromLine[-1]))  # last column is the class label
        index += 1
    return returnMat, classLabelVector

data, labels = file2matrix(filename)
```
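As a quick sanity check, file2matrix can be exercised on a small synthetic tab-separated file; the three rows below are invented for illustration (the real post uses datingTestSet2.txt with 1000 rows), and the function body is repeated so the sketch is self-contained:

```python
import os
import tempfile
from numpy import zeros

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    returnMat = zeros((len(arrayOLines), 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]         # features
        classLabelVector.append(int(listFromLine[-1]))  # label
        index += 1
    fr.close()
    return returnMat, classLabelVector

# three fake rows: miles, game-time %, ice cream litres, label
rows = ["40920\t8.326976\t0.953952\t3",
        "14488\t7.153469\t1.673904\t2",
        "26052\t1.441871\t0.805124\t1"]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(rows))
    path = f.name

data, labels = file2matrix(path)
os.remove(path)
print(data.shape)   # (3, 3)
print(labels)       # [3, 2, 1]
```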
2. Normalization of the data: since the values of feature A are numerically much larger than those of B and C, the data must be normalized so that the 3 features contribute with equal weight.
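The rescaling used below is standard min-max normalization, newValue = (oldValue - min) / (max - min), which maps every feature into [0, 1]. A tiny worked example on one column (the values are invented):

```python
from numpy import array

# one feature column with invented values, e.g. frequent flyer miles
col = array([10000.0, 20000.0, 55000.0])

# min-max normalization: subtract the minimum, divide by the range
normalized = (col - col.min()) / (col.max() - col.min())
print(normalized)   # 10000 -> 0.0, 55000 -> 1.0, 20000 -> 2/9
```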
```python
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # minimum of each column in the matrix
    maxVals = dataSet.max(0)   # maximum of each column in the matrix
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals
```
3. Using KNN algorithm to classify
3.1 The idea of the kNN algorithm: to classify a new sample, compute its distance to every training sample, take the k nearest neighbours, and assign the class label that occurs most often among them.
3.2 Python implements KNN
```python
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference to every training sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                    # Euclidean distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                                # vote among the k nearest neighbours
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),     # iteritems() in Python 2
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
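A toy run of classify0 on four 2-D points makes the voting concrete; the points and labels below are invented, and the function is repeated so the sketch runs on its own:

```python
import operator
from numpy import array, tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDistances = (diffMat ** 2).sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    return sorted(classCount.items(),
                  key=operator.itemgetter(1), reverse=True)[0][0]

# two 'A' points near (1, 1) and two 'B' points near the origin
group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(classify0([0.0, 0.0], group, labels, 3))   # 'B': 2 of 3 neighbours are B
print(classify0([1.0, 1.0], group, labels, 3))   # 'A': 2 of 3 neighbours are A
```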
3.3 Using kNN on the above data and computing the error rate
```python
def datingClassTest():
    hoRatio = 0.50   # hold out 50% of the data as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
```
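Stripped of the file loading, the error-rate computation in datingClassTest is just a comparison of predictions against held-out labels. A minimal sketch with invented predictions:

```python
predictions = [3, 2, 1, 1, 2]   # invented classifier outputs
truth       = [3, 2, 1, 2, 2]   # invented real labels

# count disagreements, then divide by the number of test vectors
errorCount = sum(1.0 for p, t in zip(predictions, truth) if p != t)
errorRate = errorCount / float(len(truth))
print("the total error rate is: %f" % errorRate)   # one miss in five -> 0.2
```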
4. Visualization of classification results
```python
import matplotlib
import matplotlib.pyplot as plt
from numpy import array

fig = plt.figure()
ax = fig.add_subplot(111)
# ax.scatter(data[:, 0], data[:, 1])
ax.set_xlabel('B')
ax.set_ylabel('C')
# size and colour each point by its class label
ax.scatter(data[:, 1], data[:, 2], 15.0 * array(labels), array(labels))
# one reference marker per class, to label in the legend text below
ax.scatter([20, 20, 20], [1.8, 1.6, 1.4],
           15 * array(list(set(labels))), list(set(labels)))
legends = ['dislike', 'smallDoses', 'largeDoses']
ax.text(22, 1.8, '%s' % legends[0])
ax.text(22, 1.6, '%s' % legends[1])
ax.text(22, 1.4, '%s' % legends[2])
plt.show()
```
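When running without a display (e.g. on a server), the same kind of scatter can be rendered to an image file with the Agg backend; the feature values, labels, and output filename below are invented for illustration:

```python
import os
import matplotlib
matplotlib.use('Agg')          # non-interactive backend, no display needed
import matplotlib.pyplot as plt
from numpy import array

labels = [1, 2, 3, 1, 2, 3]                       # invented class labels
b = array([8.3, 7.1, 1.4, 12.0, 3.5, 0.9])        # invented feature B values
c = array([0.95, 1.67, 0.80, 1.20, 0.43, 1.90])   # invented feature C values

fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('B')
ax.set_ylabel('C')
ax.scatter(b, c, 15.0 * array(labels), array(labels))
fig.savefig('knn_scatter.png')                    # hypothetical output path
print(os.path.exists('knn_scatter.png'))
```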
A First Look at Classification Algorithms (1): the kNN (k-Nearest Neighbour) Algorithm