[Reading Notes] Machine Learning in Action: kNN (1)

Source: Internet
Author: User

The k-nearest neighbors (kNN) algorithm is a very intuitive method that classifies a sample by measuring the distance between its feature values and those of known samples. These notes mainly record the book's example of using kNN to improve the matches made by a dating site.

Task one: The classification algorithm classify0
classify0 uses the distance formula to compute the distance between the feature values of the input sample and those of every known sample, selects the k nearest points, and takes the most common label among those k points as the predicted value for the sample.
The implementation relies on NumPy's tile and argsort functions.
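As a quick illustration of what tile and argsort do in the distance calculation, here is a minimal sketch. The query point and the three data rows are made-up values chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# A single query point and a tiny made-up data set (illustrative values only).
in_x = np.array([1.0, 2.0])
data = np.array([[1.0, 2.0],
                 [4.0, 6.0],
                 [0.0, 2.0]])

# tile repeats in_x once per data row, giving an array shaped like data.
tiled = np.tile(in_x, (data.shape[0], 1))

# Row-wise Euclidean distance between the query and every sample.
distances = (((tiled - data) ** 2).sum(axis=1)) ** 0.5

# argsort returns the indices that would sort the distances in ascending order.
order = distances.argsort()

print(tiled.shape)   # (3, 2)
print(distances)     # [0. 5. 1.]
print(order)         # [0 2 1]
```

The first entries of order are the indices of the nearest samples, which is exactly what the voting step needs.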

    from numpy import *
    import operator

    def classify0(inX, dataSet, labels, k):
        # shape[0] is the number of rows, i.e. the number of samples
        dataSetSize = dataSet.shape[0]
        # tile repeats inX so that it has the same shape as dataSet
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        # ** is the exponentiation operator
        sqDiffMat = diffMat ** 2
        # Sum the squared differences along each row
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # argsort returns the indices that sort the distances in ascending order
        sortedDistIndicies = distances.argsort()
        # Count the votes for each label among the k nearest samples
        classCount = {}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # Sort by the second element (the vote count) in descending order;
        # items() works on both Python 2 and 3 (the original used iteritems())
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        # Return the label that occurs most often
        return sortedClassCount[0][0]
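The same idea can be written more compactly. The sketch below is an alternative, not the book's code: it assumes NumPy broadcasting in place of tile and collections.Counter in place of the manual vote dictionary, and the name knn_classify is hypothetical. The toy data is the one from createDataSet in the full kNN.py listing at the end of these notes.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data, labels, k):
    """Return the majority label among the k samples nearest to in_x."""
    # Broadcasting subtracts in_x from every row, so tile is not needed.
    distances = np.sqrt(((data - in_x) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = distances.argsort()[:k]
    # Count the labels of those k neighbors and take the most common one.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data taken from createDataSet.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_classify(np.array([0.0, 0.0]), group, labels, 3))  # B
print(knn_classify(np.array([1.0, 1.2]), group, labels, 3))  # A
```

With k = 3, each query point has two neighbors from its own cluster and one from the other, so the majority vote decides the label.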

Task two: Reading in the data

Note that the book contains an error here: the file to read is datingTestSet2.txt, not datingTestSet.txt.
The data and examples from the book are available for download.

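    # Uses zeros from NumPy (from numpy import *)
    def file2matrix(filename):
        # Open the file and read it in line by line
        with open(filename) as fr:
            arrayOLines = fr.readlines()
        # Number of lines in the file
        numberOfLines = len(arrayOLines)
        # Zero matrix with one row per line and 3 columns
        returnMat = zeros((numberOfLines, 3))
        classLabelVector = []
        index = 0
        for line in arrayOLines:
            # Strip leading and trailing whitespace
            line = line.strip()
            # Split the line on the tab separator
            listFromLine = line.split('\t')
            # Store the first three fields as the feature values of this row
            returnMat[index, :] = listFromLine[0:3]
            # The last field is the class label
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        return returnMat, classLabelVector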
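For comparison, a tab-separated file in this layout (three numeric features followed by an integer label) can also be read with numpy.loadtxt. This is only a sketch of an alternative, not the approach used in the book. The three rows are made up, and a temporary file stands in for the real data file.

```python
import os
import tempfile

import numpy as np

# Three made-up rows: three features, then an integer class label.
sample = "1000\t2.5\t0.5\t1\n3000\t7.0\t1.5\t2\n5000\t9.5\t0.1\t3\n"

# Write the rows to a temporary file that stands in for the data file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # loadtxt parses every tab-separated field as a float.
    raw = np.loadtxt(path, delimiter="\t")
finally:
    os.remove(path)

features = raw[:, 0:3]                          # first three columns
class_labels = raw[:, -1].astype(int).tolist()  # last column as integers

print(features.shape)   # (3, 3)
print(class_labels)     # [1, 2, 3]
```

This only works when every field is numeric, which is why the numeric-label file is the one to read.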

Task three: Plotting with Matplotlib

Matplotlib also requires the NumPy, dateutil, pytz, pyparsing, six, and setuptools packages. Download them and add them to the Python27\lib\site-packages directory.

In PowerShell, cd to the folder that contains datingTestSet2.txt.
Start the interpreter with the python command, then paste the following:

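    import numpy
    import matplotlib
    import matplotlib.pyplot as plt
    import kNN
    # reload is built in on Python 2; on Python 3 run "from importlib import reload" first
    reload(kNN)
    datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Plot columns 1 and 2; the labels set both the marker size and the color
    ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))
    plt.show()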

The resulting figure plots the last two features (columns 1 and 2).

To plot other features, change the scatter call to:

ax.scatter(datingDataMat[:,0],datingDataMat[:,1],15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))

The resulting figure plots the first two features (columns 0 and 1).

Task four: Normalization

Normalization removes the outsized influence of features with large numeric ranges on the classification by scaling every value into the range 0 to 1.

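    # Uses zeros, shape and tile from NumPy (from numpy import *)
    def autoNorm(dataSet):
        # Column-wise minimum of the sample set
        minVals = dataSet.min(0)
        # Column-wise maximum of the sample set
        maxVals = dataSet.max(0)
        # Difference between the maximum and minimum values
        ranges = maxVals - minVals
        # Zero matrix with the same shape as the sample set
        normDataSet = zeros(shape(dataSet))
        m = dataSet.shape[0]
        # Subtract the minimum from every element of the sample set
        normDataSet = dataSet - tile(minVals, (m, 1))
        # Divide by the range to normalize
        normDataSet = normDataSet / tile(ranges, (m, 1))
        return normDataSet, ranges, minVals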
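The formula applied to each column is newValue = (oldValue - min) / (max - min). A minimal sketch of the same computation on a small made-up array, using NumPy broadcasting in place of tile (an assumption of this sketch, not the book's code):

```python
import numpy as np

# Made-up samples: the first column has a much larger range than the second.
data = np.array([[1000.0, 1.0],
                 [3000.0, 2.0],
                 [5000.0, 5.0]])

min_vals = data.min(axis=0)            # column-wise minimum
ranges = data.max(axis=0) - min_vals   # column-wise range (max - min)

# Broadcasting applies (value - min) / range to every row.
normalized = (data - min_vals) / ranges

print(normalized)
# [[0.   0.  ]
#  [0.5  0.25]
#  [1.   1.  ]]
```

After scaling, both columns contribute on the same 0 to 1 scale to the distance. Note that a column whose values are all identical has a range of 0, so it would need special handling before the division.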

Task five: Testing the classifier

Classify the data given in the book and check the predictions against the true labels.

    def datingClassTest():
        # Fraction of the data held out to test the classifier
        hoRatio = 0.10
        # Load the data from datingTestSet2.txt
        datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
        # Normalize the data
        normMat, ranges, minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]
        # Number of test samples
        numTestVecs = int(m * hoRatio)
        # Number of misclassified samples
        errorCount = 0.0
        for i in range(numTestVecs):
            # Classify each test sample against the remaining samples
            classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
            print("The classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
            if classifierResult != datingLabels[i]:
                errorCount += 1.0
        # Compute the error rate
        print("The total error rate is: %f" % (errorCount / float(numTestVecs)))
        print(errorCount)

With hoRatio = 0.10, the error rate is 5%.
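The hold-out procedure itself (test on the first hoRatio fraction of the rows, use the rest as known samples) can be sketched on its own. Everything below is illustrative: the data is made up and deliberately well separated, and nearest_label is a hypothetical helper, not the book's classify0.

```python
import numpy as np


def nearest_label(in_x, data, labels, k):
    """Majority label among the k rows of data closest to in_x."""
    distances = np.sqrt(((data - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[counts.argmax()]


# Made-up data: class 1 lies near (0, 0) and class 2 near (5, 5).
# The rows are interleaved so the first rows contain both classes.
steps = np.arange(10) * 0.1
data = np.empty((20, 2))
data[0::2] = np.column_stack([steps, steps])
data[1::2] = np.column_stack([steps + 5.0, steps + 5.0])
labels = np.array([1, 2] * 10)

ho_ratio = 0.10
m = data.shape[0]
num_test = int(m * ho_ratio)   # the first 2 rows are held out for testing

errors = 0
for i in range(num_test):
    # Classify row i using only the rows that were not held out.
    predicted = nearest_label(data[i], data[num_test:], labels[num_test:], 3)
    if predicted != labels[i]:
        errors += 1

error_rate = errors / float(num_test)
print(num_test, error_rate)   # 2 0.0
```

Because the two made-up clusters are far apart, no test row is misclassified here. On real data the error rate depends on hoRatio, k, and how the rows are ordered.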

Finally, the complete kNN.py file:

    # encoding: utf-8
    from numpy import *
    import operator

    # Create a small sample data set
    def createDataSet():
        group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return group, labels

    # kNN implementation
    def classify0(inX, dataSet, labels, k):
        # shape[0] is the number of rows, i.e. the number of samples
        dataSetSize = dataSet.shape[0]
        # tile repeats inX so that it has the same shape as dataSet
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        # ** is the exponentiation operator
        sqDiffMat = diffMat ** 2
        # Sum the squared differences along each row
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # argsort returns the indices that sort the distances in ascending order
        sortedDistIndicies = distances.argsort()
        # Count the votes for each label among the k nearest samples
        classCount = {}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # Sort by the second element (the vote count) in descending order;
        # items() works on both Python 2 and 3 (the original used iteritems())
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        # Return the label that occurs most often
        return sortedClassCount[0][0]

    # Read the data in from a txt file
    def file2matrix(filename):
        # Open the file and read it in line by line
        with open(filename) as fr:
            arrayOLines = fr.readlines()
        # Number of lines in the file
        numberOfLines = len(arrayOLines)
        # Zero matrix with one row per line and 3 columns
        returnMat = zeros((numberOfLines, 3))
        classLabelVector = []
        index = 0
        for line in arrayOLines:
            # Strip leading and trailing whitespace
            line = line.strip()
            # Split the line on the tab separator
            listFromLine = line.split('\t')
            # Store the first three fields as the feature values of this row
            returnMat[index, :] = listFromLine[0:3]
            # The last field is the class label
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        return returnMat, classLabelVector

    # Normalize the data
    def autoNorm(dataSet):
        # Column-wise minimum of the sample set
        minVals = dataSet.min(0)
        # Column-wise maximum of the sample set
        maxVals = dataSet.max(0)
        # Difference between the maximum and minimum values
        ranges = maxVals - minVals
        # Zero matrix with the same shape as the sample set
        normDataSet = zeros(shape(dataSet))
        m = dataSet.shape[0]
        # Subtract the minimum from every element of the sample set
        normDataSet = dataSet - tile(minVals, (m, 1))
        # Divide by the range to normalize
        normDataSet = normDataSet / tile(ranges, (m, 1))
        return normDataSet, ranges, minVals

    def datingClassTest():
        # Fraction of the data held out to test the classifier
        hoRatio = 0.50
        # Load the data from datingTestSet2.txt
        datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
        # Normalize the data
        normMat, ranges, minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]
        # Number of test samples
        numTestVecs = int(m * hoRatio)
        # Number of misclassified samples
        errorCount = 0.0
        for i in range(numTestVecs):
            # Classify each test sample against the remaining samples
            classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
            print("The classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
            if classifierResult != datingLabels[i]:
                errorCount += 1.0
        # Compute the error rate
        print("The total error rate is: %f" % (errorCount / float(numTestVecs)))
        print(errorCount)