Reading notes: Machine Learning in Action, kNN (1)

Last Update:2015-04-10 Source: Internet

Author: User

The k-nearest neighbors algorithm (kNN) is a very intuitive method that classifies a sample by measuring the distance between its feature values and those of known samples. These notes mainly record the book's example of using kNN to improve matches on a dating site.

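Task one: Classification algorithm classify0

The idea is to use the distance formula to compute the distance between the feature values of the input and those of every training sample, select the k nearest points, and take the most frequent label among those k points as the predicted value for the sample.
The function relies on two NumPy functions: tile, which repeats the input vector so that it has the same shape as the data set, and argsort, which returns the indices that would sort an array.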
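
For reference, the distance that classify0 computes is the ordinary Euclidean distance. Writing the input vector as x and one training sample as y, each with n features (these symbols are chosen here for illustration and do not appear in the book's code), it is:

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```

For example, the distance between the points (0, 0) and (1, 2) is sqrt(1 + 4) = sqrt(5), roughly 2.24.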

    from numpy import *
    import operator

    def classify0(inX, dataSet, labels, k):
        # shape[0] is the number of rows, i.e. how many samples there are
        dataSetSize = dataSet.shape[0]
        # tile copies inX so that it has the same size as dataSet
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        # ** is the exponent operator
        sqDiffMat = diffMat ** 2
        # sum the squared differences along each row
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # argsort sorts the distances and returns the index values
        sortedDistIndicies = distances.argsort()
        # dictionary used to count the votes
        classCount = {}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # sort by the second element (the count) in descending order;
        # items() works in Python 2 and 3, iteritems() only in Python 2
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        # return the label that occurs most often
        return sortedClassCount[0][0]
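
The same computation can be written more compactly with NumPy broadcasting in place of tile. The following is only a sketch under that assumption, not the book's code: the name knn_classify and the use of collections.Counter are choices made here for illustration, and the toy data is the one returned by createDataSet() in kNN.py.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training rows nearest to in_x."""
    data_set = np.asarray(data_set, dtype=float)
    # Broadcasting subtracts in_x from every row, so tile() is not needed.
    diff = data_set - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diff ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the most frequent label.
    return votes.most_common(1)[0][0]


# Toy data taken from createDataSet() in kNN.py.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], group, labels, 3))
```

With k = 3 the three points nearest to (0, 0) carry the labels B, B and A, so the call prints B.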

Task two: Reading in the data

Note that the book is wrong here: the file to read is datingTestSet2.txt, not datingTestSet.txt.
The data and examples from the book can be downloaded here.

    def file2matrix(filename):
        # open the file and read it in line by line
        fr = open(filename)
        arrayOLines = fr.readlines()
        # get the number of lines in the file
        numberOfLines = len(arrayOLines)
        # create a zero matrix with m rows and n columns
        returnMat = zeros((numberOfLines, 3))
        classLabelVector = []
        index = 0
        for line in arrayOLines:
            # strip the whitespace around the line
            line = line.strip()
            # split the line on the tab separator
            listFromLine = line.split('\t')
            # store the first three fields of each line
            returnMat[index, :] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        return returnMat, classLabelVector
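
As an alternative to parsing by hand, numpy.loadtxt can read the same layout in one call. This is a hedged sketch rather than the book's method: it assumes every line holds three numeric features followed by an integer label, separated by tabs (as file2matrix expects), and the name load_dating_file and the sample rows are invented for illustration.

```python
import os
import tempfile

import numpy as np


def load_dating_file(filename):
    """Return (features, labels) from a tab-separated file with 4 columns."""
    data = np.loadtxt(filename, delimiter='\t', ndmin=2)
    features = data[:, 0:3]                     # first three columns
    labels = data[:, -1].astype(int).tolist()   # last column as int labels
    return features, labels


# Invented sample rows in the assumed layout, written to a temporary file.
sample = "1000\t5.5\t0.5\t1\n2000\t2.5\t1.5\t3\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

features, labels = load_dating_file(path)
os.remove(path)
print(features.shape, labels)
```

The return values have the same shape as those of file2matrix: an m by 3 array of features and a list of m integer labels.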

Task three: Plotting with Matplotlib

Installing Matplotlib also requires the packages NumPy, dateutil, pytz, pyparsing, six, and setuptools. They can all be downloaded here; the collection is quite complete. Add them to the Python27\lib\site-packages directory.

In PowerShell, cd to the folder that contains datingTestSet2.txt.
Run the python command, then paste the following commands:

    import numpy
    import matplotlib
    import matplotlib.pyplot as plt
    import kNN
    reload(kNN)
    datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(datingDataMat[:,1], datingDataMat[:,2], 15.0*numpy.array(datingLabels), 15.0*numpy.array(datingLabels))
    plt.show()

This plot uses the last two features (columns 1 and 2 of the data matrix).

To plot the first two features instead, change the scatter call to:

ax.scatter(datingDataMat[:,0],datingDataMat[:,1],15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))

This plot uses the first two features (columns 0 and 1).

Task Four: Normalization

Normalization removes the outsized effect that features with large numeric values would otherwise have on the classification, by rescaling every value to a number between 0 and 1.

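    def autoNorm(dataSet):
        # find the minimum value of each column in the sample set
        minVals = dataSet.min(0)
        # find the maximum value of each column in the sample set
        maxVals = dataSet.max(0)
        # difference between the maximum and minimum values
        ranges = maxVals - minVals
        # create a zero matrix the same size as the sample set
        normDataSet = zeros(shape(dataSet))
        m = dataSet.shape[0]
        # subtract the minimum from every element of the sample set
        normDataSet = dataSet - tile(minVals, (m, 1))
        # divide by the range to normalize
        normDataSet = normDataSet / tile(ranges, (m, 1))
        return normDataSet, ranges, minVals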
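
In formula form, autoNorm applies newValue = (oldValue - min) / (max - min) to each column. The effect is easy to check on a tiny array. The sketch below uses broadcasting instead of tile; the name min_max_normalize and the input values are made up for illustration.

```python
import numpy as np


def min_max_normalize(data_set):
    """Rescale each column of data_set to the range [0, 1]."""
    data_set = np.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting applies the per-column minimum and range to every row.
    norm = (data_set - min_vals) / ranges
    return norm, ranges, min_vals


# Hypothetical values: one large-scale column and one small-scale column.
data = np.array([[10000.0, 1.0],
                 [20000.0, 3.0],
                 [40000.0, 2.0]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm)
```

After normalization both columns span exactly 0 to 1, so neither dominates the distance calculation. Note that a column whose values are all equal has a range of zero and would cause a division by zero; neither this sketch nor autoNorm guards against that.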

Task Five: Testing the classifier

Classify the data given in the book and check the predictions against the known labels.

    def datingClassTest():
        # fraction of the data held out to test the classifier
        hoRatio = 0.10
        # get the data from datingTestSet2.txt
        datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
        # normalize the data
        normMat, ranges, minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]
        # number of test samples
        numTestVecs = int(m * hoRatio)
        # number of misclassified samples
        errorCount = 0.0
        for i in range(numTestVecs):
            # run the classification algorithm
            classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
            print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
            if (classifierResult != datingLabels[i]):
                errorCount += 1.0
        # compute the error rate
        print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
        print(errorCount)

The error rate is 5%.
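
The hold-out procedure in datingClassTest (test on the first hoRatio of the rows, train on the rest) can be tried without the book's data file. The following is a minimal sketch, not the author's code: the names nearest_label and holdout_error_rate are hypothetical, the data is synthetic and deliberately well separated, and the normalization step is skipped because both synthetic features already share the same scale.

```python
import numpy as np


def nearest_label(in_x, data_set, labels, k):
    """Majority label among the k rows of data_set closest to in_x."""
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    nearest = labels[distances.argsort()[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]


def holdout_error_rate(data_set, labels, ho_ratio, k):
    """Test on the first ho_ratio of the rows, train on the remaining rows."""
    m = data_set.shape[0]
    num_test = int(m * ho_ratio)
    errors = 0
    for i in range(num_test):
        predicted = nearest_label(data_set[i], data_set[num_test:],
                                  labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / float(num_test)


# Synthetic data (not the dating data): label 1 lies near (0, 0) and
# label 2 near (10, 10), with rows alternating between the two labels.
rng = np.random.RandomState(0)
labels = np.array([1, 2] * 50)
centers = np.where(labels[:, None] == 1, 0.0, 10.0)
data = centers + rng.uniform(-1.0, 1.0, size=(100, 2))
print(holdout_error_rate(data, labels, 0.10, 3))
```

Because the two synthetic clusters do not overlap, this prints 0.0. The 5% figure above applies only to the dating data.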

Finally, the complete kNN.py file:

    # encoding: utf-8
    from numpy import *
    import operator

    # create the data set
    def createDataSet():
        group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return group, labels

    # kNN implementation
    def classify0(inX, dataSet, labels, k):
        # shape[0] is the number of rows, i.e. how many samples there are
        dataSetSize = dataSet.shape[0]
        # tile copies inX so that it has the same size as dataSet
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        # ** is the exponent operator
        sqDiffMat = diffMat ** 2
        # sum the squared differences along each row
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # argsort sorts the distances and returns the index values
        sortedDistIndicies = distances.argsort()
        # dictionary used to count the votes
        classCount = {}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # sort by the second element (the count) in descending order;
        # items() works in Python 2 and 3, iteritems() only in Python 2
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        # return the label that occurs most often
        return sortedClassCount[0][0]

    # read the data in from the txt file
    def file2matrix(filename):
        # open the file and read it in line by line
        fr = open(filename)
        arrayOLines = fr.readlines()
        # get the number of lines in the file
        numberOfLines = len(arrayOLines)
        # create a zero matrix with m rows and n columns
        returnMat = zeros((numberOfLines, 3))
        classLabelVector = []
        index = 0
        for line in arrayOLines:
            # strip the whitespace around the line
            line = line.strip()
            # split the line on the tab separator
            listFromLine = line.split('\t')
            # store the first three fields of each line
            returnMat[index, :] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        return returnMat, classLabelVector

    # normalize the data
    def autoNorm(dataSet):
        # find the minimum value of each column in the sample set
        minVals = dataSet.min(0)
        # find the maximum value of each column in the sample set
        maxVals = dataSet.max(0)
        # difference between the maximum and minimum values
        ranges = maxVals - minVals
        # create a zero matrix the same size as the sample set
        normDataSet = zeros(shape(dataSet))
        m = dataSet.shape[0]
        # subtract the minimum from every element of the sample set
        normDataSet = dataSet - tile(minVals, (m, 1))
        # divide by the range to normalize
        normDataSet = normDataSet / tile(ranges, (m, 1))
        return normDataSet, ranges, minVals

    def datingClassTest():
        # fraction of the data held out to test the classifier
        hoRatio = 0.50
        # get the data from datingTestSet2.txt
        datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
        # normalize the data
        normMat, ranges, minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]
        # number of test samples
        numTestVecs = int(m * hoRatio)
        # number of misclassified samples
        errorCount = 0.0
        for i in range(numTestVecs):
            # run the classification algorithm
            classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
            print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
            if (classifierResult != datingLabels[i]):
                errorCount += 1.0
        # compute the error rate
        print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
        print(errorCount)
