KNN: the k-nearest neighbor algorithm for data mining


1. The core idea of the algorithm:

The algorithm computes the distance from the sample to be classified to every training sample, finds the K training samples nearest to it, and assigns the sample to the category that holds the majority among those K samples.

When making the classification decision, KNN relies only on a small number of neighboring samples. As a result, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily. The result of the KNN algorithm depends largely on the choice of K.

The value of K is generally taken to be no larger than the square root of the number of training samples.
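As a minimal sketch of this rule of thumb (the sample count and variable names below are illustrative assumptions, not values from the original post):

import math

n_training_samples = 150                        # assumed size of the training set
k = max(1, int(math.sqrt(n_training_samples)))  # rule of thumb: K <= sqrt(n)
if k % 2 == 0:
    k -= 1                                      # an odd K helps avoid ties in two-class problems
print(k)                                        # 11 for 150 training samples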

1.1 Euclidean distance, Manhattan distance and cosine distance:

The formulas for the Euclidean distance, the Manhattan distance and the cosine distance are given below.

1. The Euclidean distance, also called the Euclidean metric (Euclidean Metric), is the straight-line distance between two points in space, i.e. the shortest distance between them:

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

2. The Manhattan distance, also known as the city block distance or, more vividly, the taxicab distance, is the sum of the absolute coordinate differences between two points:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|

3. The cosine distance, also known as cosine similarity, measures the difference between two individuals by the cosine of the angle between their vectors in a vector space.

A vector is a directed line segment in multidimensional space. If the directions of two vectors are consistent, i.e. the angle between them is close to 0, the two vectors are similar. To determine whether two vectors point in the same direction, the cosine of the angle between them is computed from the law of cosines (equivalently, from the dot product):

\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}

Which distance measure is used has a large effect on the final result. For example, if your dataset has many features but the Euclidean distance between every pair of individuals is nearly equal, you cannot compare individuals by Euclidean distance. The Manhattan distance is more stable in some situations, but if some features in the dataset take very large values, those features will mask the proximity contributed by the other features. Finally, the cosine distance works well for high-dimensional feature vectors, but it discards the information contained in the vector lengths, which may be useful in some scenarios.
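To make the three measures concrete, here is a small, hedged NumPy sketch (the example vectors are made up for illustration):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine of the angle
cosine_dist = 1 - cosine_sim                # a common way to turn similarity into a distance

print(euclidean)     # 5.0
print(manhattan)     # 7.0
print(cosine_dist)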

2. Description of the algorithm:

1) Compute the distance between the test sample and every training sample;

2) Sort the training samples in ascending order of distance;

3) Select the K points with the smallest distances;

4) Count the frequency of each category among these K points;

5) Return the most frequent category among the K points as the predicted class of the test sample.
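A hedged, compact NumPy sketch of these five steps is given here for orientation (the function and variable names are illustrative; a fuller, step-by-step implementation follows in section 3):

import numpy as np
from collections import Counter

def knn_predict(test_point, train_data, train_labels, k):
    # train_data is an (n, d) NumPy array, train_labels a sequence of n class labels
    # 1) distances from the test point to every training sample (Euclidean)
    distances = np.sqrt(np.sum((train_data - test_point) ** 2, axis=1))
    # 2) + 3) indices of the k nearest training samples
    nearest = np.argsort(distances)[:k]
    # 4) frequency of each category among the k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    # 5) the most frequent category is the prediction
    return votes.most_common(1)[0][0]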

3. Algorithm implementation
# coding=gbk
import numpy as np
import operator
import matplotlib.pyplot as plt


def create_dataset():
    # create a small sample data set; dataset.shape is (4, 2), so shape[0] = 4 samples
    dataset = np.array([[1.0, 2.0], [1.2, 0.1], [0.1, 1.4], [0.3, 3.5]])
    labels = ['A', 'A', 'B', 'B']
    return dataset, labels


# a short np.tile / sum demo
a = np.array([0, 1, 2])
b = np.tile(a, (2, 2))    # repeat a as a whole into 2 rows and 2 column blocks
print(b)                  # [[0 1 2 0 1 2]
                          #  [0 1 2 0 1 2]]
print(b.sum(axis=1))      # [6 6], the sum of each row


def classify(input_point, dataset, labels, k):
    # KNN classification of input_point against the labelled dataset
    data_size = dataset.shape[0]
    # Euclidean distance: subtract the input point from every sample
    diff_mat = np.tile(input_point, (data_size, 1)) - dataset
    sq_diff = diff_mat ** 2
    sq_distances = sq_diff.sum(axis=1)   # squared distance of every sample to the input
    distances = sq_distances ** 0.5      # take the square root to get the Euclidean distances
    print('distances:', distances)
    # argsort returns the indices that sort the distances in ascending order
    sort_indices = distances.argsort()
    print('sorted indices:', sort_indices)
    class_count = {}
    for i in range(k):
        # category of the i-th nearest neighbor
        vote_label = labels[sort_indices[i]]
        print('category of neighbor %d is: %s' % (i, vote_label))
        # dict.get(key, default) returns the value for key, or default if the key is absent
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # key=operator.itemgetter(1) sorts by the dictionary values (the vote counts);
    # itemgetter(0) would sort by the keys instead
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    print('sorted class count:', sorted_class_count)
    return sorted_class_count[0][0]


if __name__ == '__main__':
    dataset, labels = create_dataset()

    # optional visualisation of the two classes
    # plt.axis([0, 4, 0, 4])
    # plt.scatter(dataset[:2, 0], dataset[:2, 1], color='red', marker='o', label='A')
    # plt.scatter(dataset[2:, 0], dataset[2:, 1], color='green', marker='+', label='B')
    # plt.legend(loc=2)
    # plt.show()

    input_point = [1.1, 2.4]
    test_class = classify(input_point, dataset, labels, 3)
    print(test_class)
    # distances: [0.41231056 2.30217289 1.41421356 1.36014705]
    # sorted indices: [0 3 2 1]
    # sorted class count: [('B', 2), ('A', 1)]
    # B  -- with k=3 the two nearest class-B points outvote the single nearest class-A point
    #       (with k=1 the prediction would be 'A')

    print('---------')
    print('practice with dict.get() and operator.itemgetter()')
    demo_k = ['a', 'b', 'a', 'a']
    d = {}
    for i in demo_k:
        d[i] = d.get(i, 0) + 1
    print(d)                # {'a': 3, 'b': 1}, usable for counting category votes
    sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=False)  # ascending by count
    print(sorted_d)         # [('b', 1), ('a', 3)]
    print(sorted_d[0][0])   # 'b', the least frequent category
4. Advantages and disadvantages of the algorithm:

Advantages: the idea is simple and easy to implement, there is no explicit training phase, and it handles multi-class problems naturally. Disadvantages: prediction is computationally expensive because the distance to every training sample must be computed, the whole training set must be kept in memory, and the result is sensitive to the choice of K, to the scale of the features, and to class imbalance.

The parameters of KNN in Scikit-learn:

neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1)

1. n_neighbors is the K in KNN, i.e. the number of nearest neighbors used when classifying.

2. weights is the weighting applied to the neighbors when making the classification decision. The default 'uniform' gives every neighbor equal weight, the 'distance' option weights each neighbor by the reciprocal of its distance, and a user-defined weighting function can also be supplied.

3. algorithm is the method used to find the neighbors: 'brute', 'kd_tree' or 'ball_tree'. kd_tree builds a KD tree, ball_tree is another tree-based nearest-neighbor structure, and brute is direct brute-force computation. Depending on the sample size and the number of feature dimensions, each has its advantages; the default 'auto' selects the most appropriate algorithm automatically, so 'auto' is usually a good choice.

4. leaf_size is the leaf size of the kd_tree or ball_tree (a leaf is a node of the tree with no children). In the earlier KD tree article, every leaf of the binary tree holds a single data point, but in practice a leaf may hold several points, and the algorithm switches to brute-force computation once a leaf is reached. For many use cases the leaf size does not matter much, and setting leaf_size=1 is fine.

5. metric and p select the distance function. If metric='minkowski' with parameter p, the distance between two points is

D((x_1, \ldots, x_n), (y_1, \ldots, y_n)) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

so p = 2 gives the Euclidean distance and p = 1 the Manhattan distance.

In general, the defaults metric='minkowski' and p=2 satisfy most requirements. Additional metric options are listed in the documentation. metric_params holds extra parameters required by some special metric options and is None by default.

6. n_jobs is the number of parallel jobs used for the computation; the default is 1, and passing -1 uses all CPU cores.
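As a hedged sketch of how these parameters might be combined (the specific values below are illustrative choices, not recommendations from the original post):

from sklearn import neighbors

# a classifier that uses 3 neighbors, distance-based weighting and a KD tree
knn = neighbors.KNeighborsClassifier(
    n_neighbors=3,          # the K in KNN
    weights='distance',     # closer neighbors get larger weight (reciprocal of distance)
    algorithm='kd_tree',    # neighbor search structure
    leaf_size=30,           # leaf size of the KD tree
    p=2,                    # Minkowski exponent: p=2 is the Euclidean distance
    n_jobs=-1,              # use all CPU cores
)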

Function methods:

Fit the model on the training data:

neighbors.KNeighborsClassifier.fit(X, y)

Make predictions on a dataset:

neighbors.KNeighborsClassifier.predict(X)

Output prediction probabilities:

neighbors.KNeighborsClassifier.predict_proba(X)

Compute the mean accuracy score:

neighbors.KNeighborsClassifier.score(X, y, sample_weight=None)
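As a hedged sketch of fit() and score() on made-up toy data (the arrays below are purely illustrative; the movie example that follows shows predict() and predict_proba()):

import numpy as np
from sklearn import neighbors

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
X_test = np.array([[0, 2], [6, 6]])
y_test = np.array(['A', 'B'])

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# score() returns the mean accuracy of the predictions on (X_test, y_test)
print(knn.score(X_test, y_test))   # 1.0 for this toy data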


# coding=gbk
# Use scikit-learn's KNN classifier to classify movie types
import numpy as np
from sklearn import neighbors

knn = neighbors.KNeighborsClassifier()
data = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
knn.fit(data, labels)
c = knn.predict([[18, 90]])   # note the nesting of the brackets: predict expects a 2-D array
print(c)
print(knn.predict_proba([[18, 90]]))
# ['A']  -- predicted to be a romance movie
# [[0.6 0.4]]
