The KNN algorithm can be summed up as "birds of a feather flock together": a new point is assigned to the class held by the majority of the points around it. Classification works by measuring the distance between feature vectors, and the idea is simple: if most of the K points nearest to a sample in feature space (by Euclidean distance) belong to one class, then the sample belongs to that class as well. That is the "flock of birds" idea.
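As a concrete illustration of the distance measurement, the Euclidean distance from a query point to every sample can be computed in a few lines of NumPy (a minimal sketch; the query point here is an arbitrary example, and the sample array is the toy data used further below):

import numpy as np

# Euclidean distance from one query point to every sample point
samples = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  # toy data from the example below
query = np.array([0.2, 0.3])
distances = np.sqrt(((samples - query) ** 2).sum(axis=1))
print(distances)  # the K smallest entries mark the K nearest neighbors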
Of course, in practice different values of K affect the classification result, and the K nearest points must themselves already be labeled; otherwise the "flock" analogy, and the algorithm itself, loses its meaning.
Drawbacks of the KNN algorithm:
1. When the samples are unbalanced, for example one class has a very small sample size while the other classes are large, most of the K nearest neighbors of a new sample will tend to come from the large classes, which can lead to misclassification. An improvement is to weight the K nearest points, giving nearby points large weights and distant points small weights (a sketch of this weighting appears after this list).
2. The computation is heavy: for every sample to be classified, its distance to all known points must be computed, and the K nearest points found by sorting those distances. An improvement is to edit the known sample points in advance, removing in advance the samples that contribute little to the classification.
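Here is a minimal sketch of the distance-weighted vote mentioned in point 1 (the function name and the 1e-9 epsilon are illustrative choices, not part of the original code):

import numpy as np

def classify_weighted(inX, dataSet, labels, k):
    # distance-weighted KNN vote: nearer neighbors count more
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = {}
    for i in nearest:
        w = 1.0 / (distances[i] + 1e-9)  # inverse-distance weight; epsilon avoids division by zero
        votes[labels[i]] = votes.get(labels[i], 0) + w
    # return the label with the largest accumulated weight
    return max(votes, key=votes.get)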
Applicability:
It is suited to the automatic classification of class domains with large sample sizes, while class domains with small sample sizes are more prone to misclassification with this method.
Algorithm Description:
1. Compute the distance between every point in the known-label data set and the current point.
2. Sort the points in order of increasing distance.
3. Select the K points nearest to the current point.
4. Determine the frequency of each class among those K nearest points.
5. Return the most frequent class among the K nearest points as the predicted class of the current point.
Python implementation
from numpy import *
import operator

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# KNN classification algorithm
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # shape[0] is the number of rows, shape[1] the number of columns
    # Step 1: compute the Euclidean distances
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # tile is similar to repmat in Matlab: it replicates the matrix
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()  # indices sorted by ascending distance
    classCount = {}
    for i in range(k):
        # get the class of the i-th nearest point
        voteILabel = labels[sortedDistIndicies[i]]
        # dict.get returns the stored count for voteILabel if present, otherwise the default 0;
        # this tallies how many times each class appears among the K nearest points
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # sort the counts in the dictionary; classCount stores key-value pairs where key is the
    # label and value is its number of occurrences, so key=operator.itemgetter(1) sorts by count
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # sortedClassCount[0][0] is the label with the largest count
    return sortedClassCount[0][0]
Example call:
import sys
sys.path.append("/home/llp/code/funcdef")
import KNN

group, labels = KNN.createDataSet()
result = KNN.classify0([0, 0], group, labels, 3)
print('the classify result is:', result)
Matlab implementation
This is presented as a complete example with the following code:
function resultLabel = KNN(inx, data, labels, k)
%%
% inx: the input test data; data: the samples; labels: the sample labels
%%
[dataRow, dataCol] = size(data);
diffMat = repmat(inx, [dataRow, 1]) - data;
distanceMat = sqrt(sum(diffMat .^ 2, 2));
[B, IX] = sort(distanceMat, 'ascend');
len = min(k, length(B));
resultLabel = mode(labels(IX(1:len)));
end
As you can see, the core of the KNN algorithm takes only 6 lines of Matlab code, far fewer than in Python, thanks to Matlab's mature matrix operations and its many mature built-in functions.
For the actual call we test with a data set made up of 1000 samples of 3-dimensional coordinates, divided into 3 classes.
First, let's visualize the data set to see how it is distributed.
First and second dimensions: you can clearly see that the data falls broadly into 3 classes.
First and third dimensions: from this angle the distributions of the 3 classes appear to overlap somewhat, but that is only an artifact of the viewing angle.
Drawing all 3 dimensions shows the true picture:
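The original figures are drawn in Matlab; a rough equivalent of those scatter plots can be sketched with matplotlib, assuming the data file used in the test below (datingTestSet2.txt, three feature columns followed by a numeric class label):

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('datingTestSet2.txt')  # assumed: 3 feature columns + 1 label column
X, y = data[:, :3], data[:, 3]

# 2-D view: first vs. second dimension, colored by class
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.xlabel('dim 1'); plt.ylabel('dim 2')

# 3-D view of all three dimensions
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)
plt.show()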
Since we have already written the KNN code, we just need to call it. Anyone who has studied some machine learning will know that many algorithms expect the sample data to be normalized first; here we normalize the data into the [0, 1] range as follows:
newData = (oldData - Min) / (Max - Min)

where Max is the maximum value of oldData and Min is the minimum value of oldData (both taken column by column).
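In Python the same min-max normalization is a few lines of NumPy (a sketch mirroring the normalization step in the Matlab test code below; the function name is illustrative):

import numpy as np

def auto_norm(dataSet):
    # column-wise min-max scaling into [0, 1]
    minV = dataSet.min(axis=0)
    maxV = dataSet.max(axis=0)
    return (dataSet - minV) / (maxV - minV)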
At the same time, since we have 1000 samples, we take 10% of the data for testing and 90% for training. Because the test data should be completely independent, we can randomly extract 10% of the samples as test data. The code is as follows:
function knnDatingTest
%%
clc
clear
close all
%%
data = load('datingTestSet2.txt');
dataMat = data(:, 1:3);
labels = data(:, 4);
len = size(dataMat, 1);
k = 4;
error = 0;
% test data ratio
ratio = 0.1;
numTest = ratio * len;
% normalization
maxV = max(dataMat);
minV = min(dataMat);
range = maxV - minV;
newDataMat = (dataMat - repmat(minV, [len, 1])) ./ repmat(range, [len, 1]);
% test
for i = 1:numTest
    classifyResult = KNN(newDataMat(i, :), newDataMat(numTest:len, :), labels(numTest:len, :), k);
    fprintf('test result: %d    true result: %d\n', [classifyResult labels(i)])
    if (classifyResult ~= labels(i))
        error = error + 1;
    end
end
fprintf('accuracy: %f\n', 1 - error / numTest)
end
When we choose K = 4, the accuracy is: 0.970000
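Since the value of K affects the result (as noted at the start), one simple way to choose it is to sweep a few values on the same 10% test split. Here is a sketch reusing the classify0 and auto_norm functions from above (both assumed to be defined in the same session):

import numpy as np

data = np.loadtxt('datingTestSet2.txt')  # assumed format: 3 features + label
X = auto_norm(data[:, :3])               # min-max normalization, as above
y = data[:, 3]
numTest = int(0.1 * len(X))              # same 10% test / 90% train split as the Matlab code

for k in range(1, 10):
    errors = sum(classify0(X[i], X[numTest:], y[numTest:], k) != y[i]
                 for i in range(numTest))
    print('k = %d  accuracy = %f' % (k, 1 - errors / numTest))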
Machine Learning in Action by Matlab (1): the KNN algorithm