KNN algorithm
Suppose we have a training set of n samples, each described by m features. Every training sample is labeled with the class it belongs to. For example:
A sample might have four features, such as weight and wingspan, together with the species it corresponds to.
The KNN algorithm compares each feature of an unknown sample with the corresponding feature of every sample in the training set, then extracts the training samples that are most similar to it (its nearest neighbors). Usually only the k most similar samples are kept, which is where the k in k-nearest neighbors comes from.
Finally, the class that occurs most often among those k most similar samples is assigned to the new data.
(In other words: compute the distance from the new data point to every sample in the training set, pick the k samples with the smallest distances, count how many of those k samples fall into each class, and assign the new data point to the class with the most votes.)
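The steps above can be sketched in a few lines of plain Python. This is only an illustration and not the implementation developed later in this post: the function name `knn_predict`, the two-feature samples, and the species labels are all invented for the example.

```python
from collections import Counter
import math


def knn_predict(query, samples, labels, k):
    """Return the majority label among the k training samples closest to query."""
    # Pair the distance from the query to each training sample with that sample's label.
    distances = [(math.dist(query, s), label) for s, label in zip(samples, labels)]
    # Keep the k pairs with the smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (weight, wingspan) measurements, made up for illustration only.
samples = [(20.0, 25.0), (22.0, 27.0), (900.0, 120.0), (950.0, 130.0)]
labels = ["sparrow", "sparrow", "hawk", "hawk"]

print(knn_predict((25.0, 30.0), samples, labels, k=3))
```

With k=3 the query's neighbors are the two "sparrow" samples and one "hawk" sample, so the vote goes to "sparrow". Note that `math.dist` requires Python 3.8 or later.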
KNN algorithm implementation:
Create a knn.py file.
First, write a small training set (as an example):
# -*- coding: utf-8 -*-
import numpy as np
import operator

def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])  # training sample data
    labels = ['A', 'A', 'B', 'B']  # label for each training sample, in the same order
    return group, labels
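Next, compute the distance between the target data and every training sample using the Euclidean distance formula. For two points a and b with m features each, d = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (am - bm)^2).
If a dataset has four features, the distance between the point (1,0,0,1) and the point (7,6,9,4) is sqrt((7-1)^2 + (6-0)^2 + (9-0)^2 + (4-1)^2) = sqrt(162), which is about 12.73.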
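As a quick check of the four-feature example, the same distance can be computed with NumPy. The variable names `a`, `b`, and `distance` are chosen here for illustration.

```python
import numpy as np

a = np.array([1, 0, 0, 1])
b = np.array([7, 6, 9, 4])

# Square the per-feature differences, sum them, then take the square root.
distance = np.sqrt(np.sum((a - b) ** 2))

print(distance)                # the square root of 162, about 12.73
print(np.linalg.norm(a - b))   # same value from NumPy's built-in vector norm
```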
K-Nearest Neighbor algorithm implementation:
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # shape[0] is the number of rows, shape[1] the number of columns
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # repeat inX dataSetSize times down the rows (once across the columns), then subtract every training sample
    sqDiffMat = diffMat ** 2  # square every value in diffMat
    sqDistances = sqDiffMat.sum(axis=1)  # axis=0 sums down each column, axis=1 sums across each row
    distances = sqDistances ** 0.5  # take the square root
    sortedDistIndicies = distances.argsort()  # indices that sort distances in ascending order; element 0 is the index of the smallest distance
    classCount = {}  # dictionary mapping label -> number of votes
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest training sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count how often each label occurs among the k neighbors
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # sort (label, count) pairs by count, largest first
    return sortedClassCount[0][0]  # return the most frequent label
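To see what the intermediate steps of classify0 produce, the sketch below traces them for the example training set and the query point [0, 0]. It also shows, as an aside, that NumPy broadcasting yields the same differences as np.tile (up to sign, which squaring removes). The names `query`, `tiled`, `broadcast`, and `order` are introduced here for illustration and are not part of knn.py.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = np.array([0, 0])

# The np.tile approach: copy the query once per training row, then subtract.
tiled = np.tile(query, (group.shape[0], 1)) - group
# Broadcasting does the row-wise subtraction without building the copies.
broadcast = group - query

# Squared differences summed across each row, then the square root.
distances = np.sqrt((broadcast ** 2).sum(axis=1))
# Row indices ordered from nearest to farthest.
order = distances.argsort()

print(np.round(distances, 3))          # distance from the query to each training row
print(order)                           # training row indices, nearest first
print([labels[i] for i in order[:3]])  # labels of the 3 nearest rows
```

The two nearest rows are both labeled 'B' and the third is labeled 'A', so a vote with k=3 picks 'B'.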
To test it, create a test.py file:
# -*- coding: utf-8 -*-
import knn

group, labels = knn.createDataSet()
print(knn.classify0([0, 0], group, labels, 3))
The output is:
B
That completes a simple KNN implementation. It has little practical use on its own, but it is our first working classifier.