Preface
The k-nearest neighbor (KNN) algorithm is a very simple idea, yet its classification performance is quite good. More importantly, it underlies many more advanced machine-learning methods: in the ensemble algorithms we will study later, KNN is often used as a base classifier. Its basic idea was introduced in the previous section and will not be repeated here; this section focuses on extensions of the algorithm and on its implementation.
Model: a partition of the whole feature space; a discriminative model.
Strategy: the K nearest neighbors under a chosen distance metric.
Method: majority voting (note that there is no optimization objective to solve here; the decision rule is applied directly).
Training process: use cross-validation to choose the value of K so that the final prediction error is minimized (a minimal sketch follows below).
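As a minimal sketch of choosing K by cross-validation (not part of the original text; it assumes scikit-learn and its built-in digits dataset, which the scikit-learn example later in this section also uses, and the range of K values tried is made up for illustration):

# Sketch: choose K by cross-validation (assumes scikit-learn; the K range is hypothetical)
from sklearn import neighbors, datasets
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
best_k, best_score = None, 0.0
for k in range(1, 21):                      # try K from 1 to 20
    clf = neighbors.KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this K
    score = cross_val_score(clf, digits.data, digits.target, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print("best K:", best_k, "cross-validated accuracy: %.4f" % best_score)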
Extensions
The KNN algorithm is simple and easy to implement, but without any optimization it is a brute-force (linear scan) method: since the classification decision is a majority vote over all distances, the algorithm quickly hits an efficiency bottleneck as the amount of data grows. With N samples and feature dimension D, the time complexity of a query is O(D*N). For this reason, practical KNN implementations usually build a kd-tree (k-dimensional tree) over the training data; construction is fast (no Euclidean distances even need to be computed while building), and the search cost drops to roughly O(D*log N). Building a kd-tree has a drawback, however: when the dimension D is too high, the so-called "curse of dimensionality" sets in and the efficiency degenerates to that of the brute-force method.
The kd-tree is best suited to K-nearest-neighbor search when the number of training instances is much larger than the space dimension. When the space dimension approaches the number of training instances, its efficiency drops rapidly, almost to that of a linear scan.
When the dimension exceeds about d > 20, it is better to use the more efficient ball-tree, whose search complexity is still O(D*log N). A minimal sketch of querying both structures directly follows below.
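As a hedged illustration (not from the original text), scikit-learn exposes both structures directly as sklearn.neighbors.KDTree and sklearn.neighbors.BallTree; the snippet below builds each on synthetic random data and queries the 3 nearest neighbors of one point (the data, dimensions, and leaf_size are assumptions made for the example):

# Sketch: querying a kd-tree and a ball-tree directly (assumes scikit-learn; data is synthetic)
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.RandomState(0)
X = rng.rand(1000, 10)                  # 1000 training points in 10 dimensions

kd = KDTree(X, leaf_size=40)            # fast to build: no full distance matrix needed
bt = BallTree(X, leaf_size=40)          # usually preferred when the dimension is high

query = rng.rand(1, 10)
dist_kd, ind_kd = kd.query(query, k=3)  # distances and indices of the 3 nearest neighbors
dist_bt, ind_bt = bt.query(query, k=3)
print(ind_kd, ind_bt)                   # both structures return the same neighbors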
Long practice has shown that KNN is well suited to samples whose class boundaries are irregular. Because it relies mainly on the nearby samples, rather than on estimating a class region for discrimination, KNN tends to work better than other methods on sample sets whose class regions intersect or overlap heavily.
The main disadvantage of the algorithm for classification is class imbalance: if one class has many samples and the other classes have few, the K nearest neighbors of a new sample may be dominated by the large class simply because of its size. This can be mitigated by weighting the votes, giving neighbors at smaller distances larger weights; a sketch follows below.
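As a minimal sketch of distance-weighted voting (an assumption about how the remedy above might be applied, using scikit-learn's weights='distance' option on made-up imbalanced data rather than any scheme from the original text):

# Sketch: distance-weighted KNN on an imbalanced toy set (assumes scikit-learn; data is synthetic)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
big   = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # majority class around (0, 0)
small = rng.normal(loc=2.5, scale=0.5, size=(20, 2))    # minority class around (2.5, 2.5)
X = np.vstack([big, small])
y = np.array([0] * 200 + [1] * 20)

# 'uniform' counts every neighbor equally; 'distance' gives closer neighbors larger weights
for w in ('uniform', 'distance'):
    clf = KNeighborsClassifier(n_neighbors=15, weights=w).fit(X, y)
    print(w, clf.predict([[1.5, 1.5]]))   # prediction for a point between the two clusters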
Another disadvantage of this approach is the amount of computation: each sample to be classified must compute its distance to every known sample in order to find its K nearest neighbors (although, as mentioned above, kd-tree style optimizations can avoid much of this). A common remedy is to prune the known sample points in advance, removing samples that contribute little to the classification. The algorithm is therefore better suited to automatically classifying class regions with large sample sizes, while class regions with small sample sizes are more prone to misclassification.
Algorithm Description
In general, we already have a labeled database. When new, unlabeled data arrives, each feature of the new data is compared with the features of the data in the sample set, and the algorithm extracts the class labels of the most similar (nearest) samples. Usually only the top K most similar samples in the database are selected; finally, the most frequent class among these K samples is chosen. The algorithm can be described as follows:
1) Compute the distance between each point in the labeled dataset and the current point.
2) Sort the points in order of increasing distance.
3) Select the K points closest to the current point.
4) Count the frequency of each class among these K points.
5) Return the class with the highest frequency among the K points as the predicted class of the current point.
# Method 1: call the implementation in sklearn
# -*- coding: utf-8 -*-
'''
Call KNeighborsClassifier from sklearn, using the digits dataset that ships with sklearn.
@author: Ada
'''
print(__doc__)

import numpy as np
from sklearn import neighbors, datasets

# Load the data: 1797 samples, 64 features each
datas = datasets.load_digits()
totalnum = len(datas.data)   # 1797
# print(totalnum)
# print(len(datas.data[0]))

# Split the dataset: the first 80% of the samples as the training set, the remaining 20% as the test set
trainnum = int(0.8 * totalnum)
trainx = datas.data[0:trainnum]
trainy = datas.target[0:trainnum]
testx = datas.data[trainnum:]
testy = datas.target[trainnum:]

# Set k; in general k < 20
n_neighbors = 10

# Build the classifier
clf = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights='uniform', algorithm='auto')

# Train the classifier
clf.fit(trainx, trainy)

# Predict with the trained classifier
answer = clf.predict(testx)
print("Error rate: %.2f%%" % ((1 - np.sum(answer == testy) / float(len(testy))) * 100))

# Note:
# KNeighborsClassifier supports three search algorithms: 'brute', 'kd_tree' and 'ball_tree'.
# If you are not sure which to use, set algorithm='auto' and let KNeighborsClassifier decide based on the input.
For kd-tree and ball-tree basics, see here.
# Method 2: implement a linear-scan KNN algorithm
from numpy import *

# create a dataset which contains 4 samples with 2 classes
def createDataSet():
    # create a matrix: each row is a sample with two features
    group = array([[1.5, 1.4], [1.6, 1.5], [0.1, 0.2], [0.0, 0.1]])
    labels = ['A', 'A', 'B', 'B']   # four samples with two labels
    return group, labels

# classify using KNN
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]   # shape[0] is the number of rows (samples)

    # Step 1: calculate the Euclidean distance
    # tile(A, reps): construct an array by repeating A reps times, e.g.
    #   A = [[1.2, 1.0]]
    #   diff = tile(A, (4, 3))
    # treats A as a whole and tiles it into 4 rows with 3 repetitions per row:
    #   [[1.2 1.  1.2 1.  1.2 1. ]
    #    [1.2 1.  1.2 1.  1.2 1. ]
    #    [1.2 1.  1.2 1.  1.2 1. ]
    #    [1.2 1.  1.2 1.  1.2 1. ]]
    # here we copy newInput numSamples times and subtract the training set
    diff = tile(newInput, (numSamples, 1)) - dataSet   # difference between the test sample and each training sample
    squareDiff = diff ** 2                             # squared differences
    squareDist = sum(squareDiff, axis=1)               # sum over each row, i.e. per training sample
    distance = squareDist ** 0.5

    # Step 2: sort the distances in ascending order
    # argsort() returns the indices that would sort the array in ascending order
    sortedDistIndices = argsort(distance)

    classCount = {}   # a dictionary of vote counts, one entry per label
    for i in range(k):
        # Step 3: choose the k samples with the smallest distances
        voteLabel = labels[sortedDistIndices[i]]

        # Step 4: count how many times each label occurs;
        # when the key voteLabel is not yet in classCount, get() returns 0
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    # Step 5: return the label with the most votes
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex

if __name__ == '__main__':
    dataSet, labels = createDataSet()

    testData = array([1.2, 1.0])
    k = 3
    predictLabel = kNNClassify(testData, dataSet, labels, k)
    print('Test data is:', testData, 'the predicted category is:', predictLabel)

    testData = array([0.1, 0.3])
    predictLabel = kNNClassify(testData, dataSet, labels, k)
    print('Test data is:', testData, 'the predicted category is:', predictLabel)
The experimental results are:
Test data is: [1.2 1. ] the predicted category is: A
Test data is: [0.1 0.3] the predicted category is: B
In the next section we will introduce kd-tree search for the KNN algorithm.
Finish
So-called extraordinary is just the ordinary raised to the nth power.
----by Ada