The k-nearest neighbor (KNN) algorithm for machine learning


One. An overview of the k-nearest neighbor (KNN) algorithm

The simplest conceivable classifier records every training sample together with its class; a test object can then be classified whenever its attributes exactly match those of a training object. But there is no guarantee that every test object will find an exact match in the training set, and a test object may also match more than one training object at the same time, so that it would be assigned to several classes at once. The KNN algorithm arose out of these problems.

KNN classifies by measuring the distance between feature values. The idea is this: if the majority of the k most similar samples to a given sample in feature space (that is, its nearest neighbors) belong to a certain category, then the sample belongs to that category too, where k is usually an integer no greater than 20. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified; the method assigns the sample to a class based solely on the classes of the nearest one or several samples.

Here is a simple example: to which class should the green circle be assigned, the red triangles or the blue squares? If k = 3, the red triangles account for 2/3 of the neighbors, so the green circle is assigned to the red-triangle class; if k = 5, the blue squares account for 3/5, so the green circle is assigned to the blue-square class.
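
The vote itself is just a frequency count. As a toy sketch (the neighbor labels below are invented to match the example above), it can be written with collections.Counter:

from collections import Counter

# hypothetical neighbor labels: 2 red triangles and 1 blue square when k = 3
neighbors_k3 = ['red triangle', 'red triangle', 'blue square']
# with k = 5, two more blue squares join the neighborhood
neighbors_k5 = neighbors_k3 + ['blue square', 'blue square']

print(Counter(neighbors_k3).most_common(1)[0][0])   # red triangle
print(Counter(neighbors_k5).most_common(1)[0][0])   # blue square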

This also shows that the result of the KNN algorithm depends largely on the choice of k.

In KNN, the distance between objects is calculated as a measure of their dissimilarity, which avoids the exact-matching problem between objects. The distance used is generally the Euclidean distance or the Manhattan distance:

Euclidean distance: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

Manhattan distance: d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
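
A minimal sketch of both distances in NumPy (the vectors x and y are arbitrary illustration values, not taken from the article's data):

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([1.2, 0.1])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # square root of the summed squared differences
manhattan = np.sum(np.abs(x - y))           # sum of the absolute differences

print(euclidean, manhattan)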

At the same time, KNN makes its decision based on the dominant category among the k nearest objects rather than on a single nearest object. These two points are the advantages of the KNN algorithm.

To summarize the KNN algorithm: given training data whose labels are known, input the test data, compare the features of the test data with the corresponding features of the training set, and find the k most similar samples in the training set; the category assigned to the test data is the one that occurs most often among those k samples. The algorithm is described as:

1) Calculate the distance between the test data and each training sample;

2) Sort by increasing distance;

3) Select the k points with the smallest distances;

4) Count how often each category occurs among those k points;

5) Return the most frequent category among the k points as the predicted classification of the test data.
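
These five steps map almost one-to-one onto NumPy operations. Here is a minimal sketch (knn_predict is a hypothetical name; it assumes the training data are a NumPy array and the labels a list, as in the code that follows):

import numpy as np
from collections import Counter

def knn_predict(test_point, train_data, train_labels, k):
    dists = np.sqrt(np.sum((train_data - test_point) ** 2, axis=1))   # step 1: distances
    nearest = np.argsort(dists)[:k]                                   # steps 2-3: k smallest
    votes = Counter(train_labels[i] for i in nearest)                 # step 4: count categories
    return votes.most_common(1)[0][0]                                 # step 5: most frequent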

Two. Python implementation

First of all, it should be noted that I am using Python 3.4.3; some of the usage differs slightly from Python 2.7.

Create a knn.py file to verify the feasibility of the algorithm, as follows:

# coding:utf-8
from numpy import *
import operator

## Training data and the corresponding classes
def createDataSet():
    group = array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

## Classify via KNN
def classify(input, dataSet, label, k):
    dataSize = dataSet.shape[0]
    ## Compute the Euclidean distance
    diff = tile(input, (dataSize, 1)) - dataSet
    sqdiff = diff ** 2
    squareDist = sum(sqdiff, axis=1)   # sum each row, giving a vector of squared distances
    dist = squareDist ** 0.5
    ## Sort the distances; argsort() returns the indices in ascending order of distance
    sortedDistIndex = argsort(dist)
    classCount = {}
    for i in range(k):
        voteLabel = label[sortedDistIndex[i]]
        ## Count the classes of the k selected samples
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    ## Select the class that appears most often
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            classes = key
    return classes

Next, enter the following code in a command-line window or a new Python file:

# -*- coding: utf-8 -*-
import sys
sys.path.append("... file path ...")
import knn
from numpy import *

dataSet, labels = knn.createDataSet()
input = array([1.1, 0.3])
k = 3
output = knn.classify(input, dataSet, labels, k)
print("The test data is:", input, "the classification result is:", output)

After pressing Enter, the result is:

The test data is: [1.1 0.3] the classification result is: A

The answer is in line with our expectations. To prove the accuracy of the algorithm, it still needs to be verified on more complex problems, which will be covered separately later.
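
As an extra sanity check, the same toy data can be fed to scikit-learn's KNeighborsClassifier, which should agree (this assumes scikit-learn is installed; the post itself only uses NumPy):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

group = np.array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
labels = ['A', 'A', 'B', 'B']

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(group, labels)
print(clf.predict([[1.1, 0.3]]))   # expected output: ['A']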

Three. KNN in practice

Earlier we gave KNN a simple verification; now we use KNN to improve the matching results of a dating site. As I understand it, this problem can also be translated into others, such as websites making recommendations that cater to customers' preferences, although admittedly the functionality in today's example is quite limited.

In this example, based on the dating data someone has collected, together with the main sample features and the resulting classes, we roughly classify data whose categories are unknown.

I am using Python 3.4.3. First create a file, for example date.py; the specific code is as follows:

# coding:utf-8
from numpy import *
import operator
from collections import Counter
import matplotlib
import matplotlib.pyplot as plt

## Load the feature data from the text file
def file2matrix(filename):
    fr = open(filename)
    contain = fr.readlines()                      # read the whole file
    count = len(contain)
    returnMat = zeros((count, 3))
    classLabelVector = []
    index = 0
    for line in contain:
        line = line.strip()                       # strip the trailing newline characters
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]   # first three fields go into the feature matrix
        classLabelVector.append(listFromLine[-1]) # the last field is the class label
        index += 1
    ## Convert the string labels in the last column to numbers for later computation
    dictClassLabel = Counter(classLabelVector)
    classLabel = []
    kind = list(dictClassLabel)
    for item in classLabelVector:
        if item == kind[0]:
            item = 1
        elif item == kind[1]:
            item = 2
        else:
            item = 3
        classLabel.append(item)
    return returnMat, classLabel

## Plot the data (shows visually how much each feature influences the classification)
datingDataMat, datingLabels = file2matrix('D:\python\Mechine learing in Action\knn\datingtestset.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.show()

## Normalize the data so that all features carry equal weight
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))           # matrix with the same shape as dataSet
    m = dataSet.shape[0]
    for i in range(m):                            # normalize every row
        normDataSet[i, :] = (dataSet[i, :] - minVals) / ranges
    return normDataSet, ranges, minVals

## The KNN algorithm
def classify(input, dataSet, label, k):
    dataSize = dataSet.shape[0]
    ## Compute the Euclidean distance
    diff = tile(input, (dataSize, 1)) - dataSet
    sqdiff = diff ** 2
    squareDist = sum(sqdiff, axis=1)              # sum each row, giving the squared distances
    dist = squareDist ** 0.5
    ## Sort the distances; argsort() returns the indices in ascending order of distance
    sortedDistIndex = argsort(dist)
    classCount = {}
    for i in range(k):
        voteLabel = label[sortedDistIndex[i]]
        ## Count the classes of the k selected samples
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    ## Select the class that appears most often
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            classes = key
    return classes

## Test (hold out 10% of the data)
def datingTest():
    rate = 0.10
    datingDataMat, datingLabels = file2matrix('D:\python\Mechine learing in Action\knn\datingtestset.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    testNum = int(m * rate)
    errorCount = 0.0
    for i in range(testNum):
        classifyResult = classify(normMat[i, :], normMat[testNum:m, :], datingLabels[testNum:m], 3)
        print("the classification result is:", classifyResult)
        print("the original label is:", datingLabels[i])
        if classifyResult != datingLabels[i]:
            errorCount += 1.0
    print("the error rate is:", errorCount / float(testNum))

## Prediction function
def classifyPerson():
    resultList = ['do not like at all', 'like a little', 'like very much']
    percentTats = float(input("percentage of time spent playing video games?"))
    miles = float(input("frequent flyer miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per week?"))
    datingDataMat, datingLabels = file2matrix('D:\python\Mechine learing in Action\knn\datingtestset2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([miles, percentTats, iceCream])
    classifierResult = classify((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("How much you will like this person:", resultList[classifierResult - 1])
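
One detail in the code above deserves a note: autoNorm applies min-max scaling, newValue = (oldValue - min) / (max - min), so that a wide-range feature such as flyer miles does not dominate the Euclidean distance. A tiny standalone illustration (the rows are invented in the style of the dating data):

import numpy as np

data = np.array([[40000.0, 8.0, 0.9],
                 [14000.0, 7.2, 1.6],
                 [26000.0, 1.4, 0.8]])   # invented samples: miles, game time, ice cream

min_vals = data.min(0)                    # column-wise minima
ranges = data.max(0) - min_vals
norm = (data - min_vals) / ranges         # every column now lies in [0, 1]
print(norm)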

Create a new test.py file to see the program's running results; the code:

# coding:utf-8
from numpy import *
import operator
from collections import Counter
import matplotlib
import matplotlib.pyplot as plt
import sys
sys.path.append("d:\python\mechine learing in action\knn")
import date

date.classifyPerson()

When run, the program asks for the three feature values in turn and then prints how much you will probably like the person.
