Machine learning (1) -- the k-nearest neighbor (KNN) algorithm

I have recently been reading the book "Machine Learning in Action", because I really want to learn more about machine learning algorithms, and because I also want to learn Python; on a friend's recommendation I chose this book to study.
One. An overview of the k-nearest neighbor (KNN) algorithm
The simplest conceivable classifier just records the class of every training sample: a test object can be classified only when its attributes match the attributes of some training object exactly. But two problems arise: not every test object will find an exact match in the training set, and a test object may match more than one training object at the same time, so that it would be assigned to several classes at once. KNN arose to address these problems.
KNN classifies by measuring the distance between feature vectors. The idea is: if the majority of the k samples most similar to a given sample in feature space (that is, its k nearest neighbors) belong to a certain category, then the sample also belongs to that category. Here k is usually an integer no greater than 20. In the KNN algorithm, the selected neighbors are objects that have already been correctly classified; the method decides the category of the new sample based only on the category of its nearest one or several samples.
The following is a simple example (originally shown as a figure): to which class should the green circle be assigned, the red triangles or the blue squares? If k = 3, the red triangles account for 2/3 of the neighbors, so the green circle is assigned to the red-triangle class; if k = 5, the blue squares account for 3/5 of the neighbors, so the green circle is assigned to the blue-square class.
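As a quick illustration (a minimal sketch; the neighbor labels below are made up to match the figure's scenario, not taken from real data), the majority vote can be reproduced in a few lines of Python:

from collections import Counter

# Labels of the green circle's neighbors, ordered by distance,
# matching the figure: 2 red triangles, then 3 blue squares
neighbors = ['red', 'red', 'blue', 'blue', 'blue']

for k in (3, 5):
    vote = Counter(neighbors[:k]).most_common(1)[0][0]
    print("k =", k, "->", vote)
# k = 3 -> red   (red holds 2/3 of the votes)
# k = 5 -> blue  (blue holds 3/5 of the votes)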
This also shows that the result of the KNN algorithm depends largely on the choice of k.
In KNN, the distance between objects is calculated as a measure of their dissimilarity, which avoids having to match objects exactly; the distance used is generally the Euclidean distance or the Manhattan distance:
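(The formulas, originally shown as an image, can be reconstructed as follows: for feature vectors $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$,

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|.$$

)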
At the same time, KNN makes its decision based on the dominant category among the k nearest objects rather than on a single nearest object. These two points are advantages of the KNN algorithm.
To summarize the KNN algorithm: given training data whose labels are known, an input test sample is compared, feature by feature, against the training set to find the k training samples most similar to it; the category assigned to the test sample is the one that occurs most often among those k samples. The algorithm is described as:
1) Calculate the distance between the test sample and each training sample;
2) Sort the training samples in increasing order of distance;
3) Select the k points with the smallest distances;
4) Count the frequency of each category among the first k points;
5) Return the category with the highest frequency among the first k points as the predicted classification of the test sample.
Two. Python implementation
First of all, I should explain that I am using Python 3.4.3, and some usage differs from Python 2.7.
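For example (a couple of the differences that matter for the script below, not an exhaustive list):

# print is a function in Python 3 (Python 2 allowed the statement form: print "hello")
print("hello")
# dict.items() returns a view in Python 3 (a list in Python 2); iteration works the same
counts = {'A': 2, 'B': 1}
for key, value in counts.items():
    print(key, value)
# true division: 3 / 2 gives 1.5 in Python 3, but 1 in Python 2
print(3 / 2)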
Create a knn.py file to verify the feasibility of the algorithm, as follows:
# coding: utf-8
from numpy import *
import operator

## Create the training data and its corresponding class labels
def createDataSet():
    group = array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

## Classify an input sample with KNN
def classify(input, dataSet, label, k):
    dataSize = dataSet.shape[0]
    ## Compute the Euclidean distance to every training sample
    diff = tile(input, (dataSize, 1)) - dataSet
    sqdiff = diff ** 2
    squareDist = sum(sqdiff, axis=1)   ## sum over each row, giving one squared distance per training sample
    dist = squareDist ** 0.5
    ## Sort the distances
    sortedDistIndex = argsort(dist)    ## argsort() returns the indices that would sort the array in ascending order
    ## Count the classes among the k nearest samples
    classCount = {}
    for i in range(k):
        voteLabel = label[sortedDistIndex[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    ## Select the class that occurs most often
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            classes = key
    return classes
Next, enter the following code in the command-line window:
# -*- coding: utf-8 -*-
import sys
sys.path.append("... file path ...")
import knn
from numpy import *

dataSet, labels = knn.createDataSet()
input = array([1.1, 0.3])
k = 3
output = knn.classify(input, dataSet, labels, k)
print("The test data is:", input, "and the classification result is:", output)
After pressing Enter, the result is:
The test data is: [ 1.1  0.3] and the classification result is: A
The answer is in line with our expectations. To prove the accuracy of the algorithm, it still needs to be verified on more complex problems, which will be covered separately later.
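One quick way to cross-check the hand-written classifier (an optional sketch, assuming the scikit-learn package is installed; it is not used anywhere else in this post) is to compare its output against a library implementation:

from numpy import array
from sklearn.neighbors import KNeighborsClassifier

# Same training data and labels as in createDataSet()
group = array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
labels = ['A', 'A', 'B', 'B']

clf = KNeighborsClassifier(n_neighbors=3)   # k = 3, as above
clf.fit(group, labels)
print(clf.predict([[1.1, 0.3]]))            # expected: ['A'], matching knn.classify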
This was my first small program in Python, so I was bound to run into all kinds of problems. During coding and debugging I encountered the following:
1) There was a problem with the import path of the .py file, so the following code needs to be added at the beginning:
sys.path.append("file path"), after which the wrong-path problem goes away;
2) If Python reports a problem in the code, be sure to correct it promptly, save, and then run the command line again; this is different from MATLAB, so in Python it is best to validate each piece of code in the command line as you write it;
3) The function name must be spelled correctly when calling it from the file, otherwise you will see: 'module' object has no attribute 'creatDataSet';
4) 'int' object has no attribute 'classify': this problem arose because I had earlier saved the file under the name k.py, so after the assignment k = 3 the module name was shadowed by the integer, and the line
output = k.classify(input, dataSet, labels, k) went wrong.
These are some of the problems I encountered during the debugging process.
Three. MATLAB implementation
I have long been using MATLAB to optimize some clustering algorithms, along with some common algorithms; other algorithms I have not really picked up, but the foundation is still there and the ideas are still there. Of course, while learning Python I don't want to gradually grow unfamiliar with MATLAB, so I keep coming back to it; walking and stopping, the stopping matters too.
First, create the knn.m file, as follows:
%% KNN
clear all
clc

%% Data
trainData = [1.0,2.0; 1.2,0.1; 0.1,1.4; 0.3,3.5];
trainClass = [1,1,2,2];
testData = [0.5,2.3];
k = 3;

%% Distance
row = size(trainData,1);
col = size(trainData,2);
test = repmat(testData,row,1);
dis = zeros(1,row);
for i = 1:row
    diff = 0;
    for j = 1:col
        diff = diff + (test(i,j) - trainData(i,j)).^2;
    end
    dis(1,i) = diff.^0.5;
end

%% Sort
jointDis = [dis; trainClass];
sortDis = sortrows(jointDis');
sortDisClass = sortDis';

%% Find
class = sortDisClass(2, 1:k);            % classes of the k nearest samples
member = unique(class);
num = length(member);
max = 0;
for i = 1:num
    count = length(find(class == member(i)));
    if count > max
        max = count;
        label = member(i);
    end
end
disp('The final classification result is:');
fprintf('%d\n', label)
After running, the output is: "The final classification result is: 2", the same as the expected result.
In a word, I have used MATLAB for a relatively long time, so it naturally feels more comfortable, but I still hope to be able to use Python proficiently soon!
So sleepy; I had been meaning to write this article earlier, and I finished it right after dinner without sleeping. Done!
I also hope everyone can offer more valuable comments ~