The k-nearest neighbors algorithm (kNN) is a very intuitive classification method: it classifies a sample by measuring the distance between its feature values and those of known samples. This post mainly records the example of improving matches on a dating site with the kNN algorithm.
Task One: Classification algorithm classify0
The idea is to use the distance formula to compute the distance between feature vectors, select the k nearest points, and take the most frequent label among those k points as the predicted value for the sample.
For usage of the NumPy tile function, see here
For the NumPy argsort function, see here
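As a quick illustration of the two NumPy functions mentioned above, the sketch below shows what tile and argsort return on tiny arrays. The values and the variable names are made up for illustration only.

```python
import numpy as np

# Hypothetical query point, for illustration only
in_x = np.array([1.0, 2.0])

# tile repeats in_x 3 times along the rows and once along the columns,
# which gives a 3x2 array whose rows are all equal to in_x
tiled = np.tile(in_x, (3, 1))

# argsort returns the indices that would sort the array in ascending order
distances = np.array([0.9, 0.1, 0.5])
order = distances.argsort()

print(tiled.shape)     # (3, 2)
print(order.tolist())  # [1, 2, 0]
```

In classify0, tile makes the query point the same shape as the data set so the two can be subtracted, and argsort gives the row indices of the nearest samples first.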
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    # shape[0] is the number of rows, i.e. how many samples there are
    dataSetSize = dataSet.shape[0]
    # tile repeats inX so that it has the same shape as dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # ** is the exponent operator
    sqDiffMat = diffMat ** 2
    # sum the squared differences along each row
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # argsort returns the indices that sort the distances in ascending order
    sortedDistIndicies = distances.argsort()
    # count the labels of the k nearest points
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort by the second element (the count) in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the label that occurs most often
    return sortedClassCount[0][0]
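The same voting idea can be sketched more compactly. This is only an alternative sketch, not the book's code: the name knn_classify is hypothetical, it assumes Python 3 with NumPy, and it swaps tile for NumPy broadcasting and the dictionary count for collections.Counter. The toy data is the one returned by createDataSet in kNN.py.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training points nearest to in_x."""
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts in_x from every row, so tile is not needed
    diff = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data set: four points in two classes
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_classify([0.0, 0.0], group, labels, 3))  # B
print(knn_classify([1.0, 1.2], group, labels, 3))  # A
```

For the query point (0, 0) the three nearest points are two of class B and one of class A, so the vote returns B.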
Task Two: Reading in the data
Note that the book is wrong here: the file to read is datingTestSet2.txt, not datingTestSet.txt.
The data and examples from the book can be downloaded here
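def file2matrix(filename):
    # open the file and read it in line by line
    fr = open(filename)
    arrayOLines = fr.readlines()
    fr.close()
    # get the number of lines in the file
    numberOfLines = len(arrayOLines)
    # create a zero matrix with numberOfLines rows and 3 columns
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        # strip leading and trailing whitespace
        line = line.strip()
        # split the line on the tab delimiter
        listFromLine = line.split('\t')
        # store the three feature values of this row
        returnMat[index, :] = listFromLine[0:3]
        # the last column is the class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector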
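For comparison, here is a minimal sketch of the same parsing step using numpy.loadtxt instead of a manual loop. It assumes the file layout described above (three tab-separated features followed by an integer label). The function name load_features_and_labels and the sample rows are hypothetical; the rows are made up so that the sketch runs without the data file.

```python
import io

import numpy as np

# Made-up rows in the assumed layout: three tab-separated features, then a label
sample_text = "1000\t5.0\t0.5\t3\n2000\t2.5\t1.5\t2\n3000\t0.0\t1.0\t1\n"


def load_features_and_labels(source):
    """Parse tab-separated rows into a feature matrix and a list of int labels."""
    # loadtxt accepts a file name or a file-like object
    data = np.loadtxt(source, delimiter='\t', ndmin=2)
    features = data[:, :3]
    class_labels = data[:, -1].astype(int).tolist()
    return features, class_labels


features, class_labels = load_features_and_labels(io.StringIO(sample_text))
print(features.shape)  # (3, 3)
print(class_labels)    # [3, 2, 1]
```

Passing a path such as 'datingTestSet2.txt' in place of the StringIO object should work the same way, provided every row of the file has this layout.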
Task Three: Plotting with Matplotlib
Installing Matplotlib also requires the packages NumPy, dateutil, pytz, pyparsing, six, and setuptools. They can all be downloaded here; the collection is quite complete. Place them in the Python27\lib\site-packages directory.
In PowerShell, cd to the folder that contains datingTestSet2.txt.
Run the python command, then paste in the following commands:
import numpy
import matplotlib
import matplotlib.pyplot as plt
import kNN
# reload is a Python 2 builtin; it picks up any edits made to kNN.py
reload(kNN)
datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# marker size and color both scale with the class label
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))
plt.show()
The resulting plot uses the last two features.
To plot the first two features instead, change the scatter call to:
ax.scatter(datingDataMat[:,0],datingDataMat[:,1],15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))
The resulting plot uses the first two features.
Task Four: Normalization
Normalization removes the outsized effect that features with large numeric values have on the classification by rescaling every value to a number between 0 and 1: newValue = (oldValue - min) / (max - min).
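def autoNorm(dataSet):
    # minimum of each column in the sample set
    minVals = dataSet.min(0)
    # maximum of each column in the sample set
    maxVals = dataSet.max(0)
    # difference between the maximum and the minimum
    ranges = maxVals - minVals
    # create a zero matrix of the same size as the sample set
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # subtract the minimum from every element of the sample set
    normDataSet = dataSet - tile(minVals, (m, 1))
    # divide by the range to normalize
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals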
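The min-max rescaling can also be sketched with NumPy broadcasting instead of tile. This is a minimal sketch under stated assumptions: the name min_max_normalize is hypothetical, the sample values are made up, and no column is constant (a constant column would give a zero range and a division by zero).

```python
import numpy as np


def min_max_normalize(data_set):
    """Rescale each column to [0, 1]; assumes no column is constant."""
    data_set = np.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting applies the per-column minimum and range to every row
    norm_data_set = (data_set - min_vals) / ranges
    return norm_data_set, ranges, min_vals


# Made-up values: the second column is on a much larger scale than the first
sample = np.array([[0.0, 1000.0], [5.0, 2000.0], [10.0, 3000.0]])
norm, ranges, min_vals = min_max_normalize(sample)
print(norm.tolist())  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

After rescaling, both columns contribute equally to the distance, even though the raw values of the second column are far larger.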
Task Five: Testing the classifier
Classify the data provided with the book and check the predictions against the true labels.
def datingClassTest():
    # fraction of the data held out to test the classifier
    hoRatio = 0.10
    # load the data from datingTestSet2.txt
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    # normalize the data
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    # number of test samples
    numTestVecs = int(m * hoRatio)
    # number of misclassified samples
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify test sample i against the remaining samples
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    # compute the error rate
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
With hoRatio = 0.10, the error rate is 5%.
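The hold-out procedure itself can be illustrated without the data file. The sketch below assumes Python 3 with NumPy; the names knn_classify and holdout_error_rate are hypothetical, and the data is synthetic and deliberately well separated, so the result says nothing about the dating data set.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Majority label among the k training rows nearest to in_x."""
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


def holdout_error_rate(data_set, labels, ho_ratio, k):
    """Test on the first ho_ratio share of the rows, train on the rest."""
    m = data_set.shape[0]
    num_test = int(m * ho_ratio)
    errors = 0
    for i in range(num_test):
        predicted = knn_classify(data_set[i], data_set[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / float(num_test)


# Synthetic data: class 1 lies near (0, 0), class 2 lies near (1, 1)
points, point_labels = [], []
for i in range(10):
    points.append([0.01 * i, 0.01 * i])
    point_labels.append(1)
    points.append([1.0 - 0.01 * i, 1.0 - 0.01 * i])
    point_labels.append(2)
points = np.array(points)

print(holdout_error_rate(points, point_labels, 0.5, 3))  # 0.0
```

Because the test rows are simply the first rows of the array, this scheme relies on the rows not being sorted by class; the synthetic data above interleaves the two classes for that reason.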
Finally, the complete kNN.py file:
# encoding: utf-8
from numpy import *
import operator

# create a small sample data set
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# kNN implementation
def classify0(inX, dataSet, labels, k):
    # shape[0] is the number of rows, i.e. how many samples there are
    dataSetSize = dataSet.shape[0]
    # tile repeats inX so that it has the same shape as dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # ** is the exponent operator
    sqDiffMat = diffMat ** 2
    # sum the squared differences along each row
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # argsort returns the indices that sort the distances in ascending order
    sortedDistIndicies = distances.argsort()
    # count the labels of the k nearest points
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort by the second element (the count) in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the label that occurs most often
    return sortedClassCount[0][0]

# read the data in from a txt file
def file2matrix(filename):
    # open the file and read it in line by line
    fr = open(filename)
    arrayOLines = fr.readlines()
    fr.close()
    # get the number of lines in the file
    numberOfLines = len(arrayOLines)
    # create a zero matrix with numberOfLines rows and 3 columns
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        # strip leading and trailing whitespace
        line = line.strip()
        # split the line on the tab delimiter
        listFromLine = line.split('\t')
        # store the three feature values of this row
        returnMat[index, :] = listFromLine[0:3]
        # the last column is the class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

# normalize the data
def autoNorm(dataSet):
    # minimum of each column in the sample set
    minVals = dataSet.min(0)
    # maximum of each column in the sample set
    maxVals = dataSet.max(0)
    # difference between the maximum and the minimum
    ranges = maxVals - minVals
    # create a zero matrix of the same size as the sample set
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # subtract the minimum from every element of the sample set
    normDataSet = dataSet - tile(minVals, (m, 1))
    # divide by the range to normalize
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

def datingClassTest():
    # fraction of the data held out to test the classifier
    hoRatio = 0.50
    # load the data from datingTestSet2.txt
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    # normalize the data
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    # number of test samples
    numTestVecs = int(m * hoRatio)
    # number of misclassified samples
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify test sample i against the remaining samples
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    # compute the error rate
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
[Reading notes] Machine Learning in Action: kNN (1)