The k-nearest neighbor (KNN) algorithm is one of the simplest methods in data mining classification. "K nearest neighbors" means exactly what it says: each sample can be represented by its K closest neighbors. The core idea of the algorithm is that if the majority of the K samples nearest to a given sample in feature space belong to some category, then the sample belongs to that category as well and shares the characteristics of samples in that category. In making a classification decision, the method relies only on the category of the one or more nearest samples, i.e. on a small number of neighboring samples rather than on discriminating class domains; this makes KNN more suitable than other methods for sample sets whose class domains overlap heavily.

KNN classifies by measuring the distance between feature vectors: if most of a sample's K nearest neighbors in feature space belong to one category, the sample is assigned to that category. K is usually an integer no greater than 20, and the neighbors considered are objects that have already been correctly categorized.
The following is a simple example: to which class should the green circle be assigned, the red triangles or the blue squares? If K = 3, the red triangles make up 2/3 of the neighborhood, so the green circle is assigned to the red triangle class; if K = 5, the blue squares make up 3/5 of the neighborhood, so the green circle is assigned to the blue square class.
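As a minimal sketch of this majority vote, the Python snippet below tallies neighbor labels for K = 3 and K = 5; the labels are hypothetical, ordered nearest first to mirror the example above:

```python
from collections import Counter

# Hypothetical neighbor labels, ordered from nearest to farthest,
# chosen to mirror the green-circle example above.
neighbor_labels = ["triangle", "triangle", "square", "square", "square"]

for k in (3, 5):
    votes = Counter(neighbor_labels[:k])        # count labels among the K nearest
    winner, count = votes.most_common(1)[0]     # majority category wins
    print(f"k={k}: assign '{winner}' ({count}/{k} votes)")
```

With K = 3 the triangles win 2/3 of the votes; with K = 5 the squares win 3/5, reproducing the flip described above.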
This also shows that the result of the KNN algorithm depends largely on the choice of K.
In KNN, the distance between objects is calculated as a measure of their dissimilarity, which avoids the problem of matching objects directly; the distance used is generally the Euclidean distance or the Manhattan distance:
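For two n-dimensional feature vectors $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the standard definitions are:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$$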
At the same time, KNN makes its decision based on the dominant category among the K objects rather than on the category of a single object. These two points are advantages of the KNN algorithm.
The KNN algorithm can be summarized as follows: the training data and their labels are known in advance; given a test sample, its features are compared with the corresponding features of the training set to find the K most similar training samples, and the test sample is assigned the category that occurs most often among those K samples. The algorithm proceeds as follows (a short implementation sketch appears after the list):
1) Calculate the distance between the test sample and each training sample;
2) Sort the training samples by increasing distance;
3) Select the K points with the smallest distances;
4) Determine the frequency of each category among these K points;
5) Return the most frequent category among the K points as the predicted classification of the test sample.
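Below is a minimal sketch of these five steps in Python with NumPy, assuming Euclidean distance; the function and variable names are illustrative rather than taken from the original post:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_x, k=3):
    """Predict the label of one test sample via K-nearest neighbors."""
    # 1) Euclidean distance from the test sample to every training sample
    dists = np.sqrt(((train_X - test_x) ** 2).sum(axis=1))
    # 2) + 3) sort by increasing distance and take the K nearest
    nearest = np.argsort(dists)[:k]
    # 4) count how often each category occurs among the K neighbors
    votes = Counter(train_y[i] for i in nearest)
    # 5) return the most frequent category as the prediction
    return votes.most_common(1)[0][0]

# Toy usage: two clusters of 2-D points
train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.9, 1.2], [6.0, 6.2], [6.1, 5.9]])
train_y = np.array(["A", "A", "A", "B", "B"])
print(knn_predict(train_X, train_y, np.array([1.1, 0.9]), k=3))  # -> "A"
```

Because every prediction scans the entire training set, the cost per query grows linearly with the number of training samples, which leads directly to the first disadvantage listed below.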
Disadvantages:
1) No model is built: each test sample must be compared against all training samples, so when the training and test sets are large, the amount of computation is very large;
2) K-nearest neighbor cannot provide an interpretable model, because no model is generated at all.
Note: classification relies on calculating the similarity (or distance) between samples. Since a sample's category is determined by the most frequent category among its K nearest neighbors, pay attention to the choice of K. In particular, K = 1 is usually not sufficient to determine the category of a test sample, because the data contain noise and outliers; a larger neighborhood is needed to determine the category reliably.
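To illustrate this sensitivity to K, the sketch below uses scikit-learn's KNeighborsClassifier (an assumed dependency, not mentioned in the original post) on a toy dataset containing one mislabeled outlier; with K = 1 the outlier decides the prediction, while K = 5 votes it down:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two clear clusters plus one noisy point mislabeled as class 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],   # class 0 cluster
              [5, 5], [5, 6], [6, 5], [6, 6],   # class 1 cluster
              [0.5, 0.5]])                      # outlier inside the class-0 region
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

test = np.array([[0.4, 0.6]])                   # clearly in the class-0 region
for k in (1, 5):
    pred = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(test)
    print(f"k={k}: predicted class {pred[0]}")  # k=1 -> 1 (fooled), k=5 -> 0
```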
References:
https://www.cnblogs.com/ybjourney/p/4702562.html
"Web Data Mining" (2nd edition), Bing Liu; Chinese translation by Yong Yu et al. (Chapter 3, Supervised Learning, Section 3.9, K-Nearest-Neighbor Learning)