1. Introduction to the K-NN algorithm
The K-NN algorithm (K-Nearest Neighbor) is a classical machine learning algorithm that is simple and easy to understand. It computes the distance between a new data point and the training data in feature space, then selects the K (K >= 1) nearest neighbors to perform classification or regression. If K = 1, the new data point is simply assigned the class of its single nearest neighbor.
K-NN is a supervised learning algorithm. When K-NN is used for classification, every training sample carries an explicit label, so the label of a new data point can be determined directly from its neighbors; when K-NN is used for regression, the predicted value is likewise derived from the values of the neighbors.
2. The process of the K-NN algorithm
- Choose a distance measure and compute, over all features, the distance between the new data point and every point in the labeled data set;
- Sort the points by increasing distance and select the K points closest to the new data point;
- For discrete classification, return the most frequent category among the K points as the predicted class; for regression, return a weighted combination of the K points' values as the predicted value (see the sketch below).
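A minimal Python sketch of this three-step process, assuming purely numeric features; the helper names (knn_predict, euclidean) and the toy data are illustrative, not part of the original text:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Step 1: distance between the new point and a known point over all features
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Step 2: sort the known points by distance and keep the K closest
    neighbors = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], query))[:k]
    # Step 3: return the most frequent label among the K neighbors
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Toy usage
train_X = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (0.9, 0.9), k=3))  # -> "A"
```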
3. Key points of the K-NN algorithm
The theory and process of the K-NN algorithm are simple, but there are several key points that require special attention.
3.1 Quantification of data characteristics
If the data contains non-numeric features, they must be quantified into numeric values. For example, if a sample feature is a color (red, black, blue), there is no natural notion of distance between colors, but a distance can be computed after converting each color to a grayscale value. In addition, a typical sample has several features, each with its own domain and range of values, so they contribute unequally to the distance: a feature with a large numeric range will outweigh features with smaller ranges. To treat the features fairly, the feature values should be rescaled; the simplest approach is to normalize all feature values, as sketched below.
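A possible min-max normalization sketch; the function name min_max_normalize and the toy data are illustrative assumptions:

```python
# Min-max normalization: rescale every feature to [0, 1] so that features
# with large numeric ranges do not dominate the distance calculation.
def min_max_normalize(dataset):
    cols = list(zip(*dataset))                      # one tuple per feature column
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, mins, maxs))
        for row in dataset
    ]

# Toy data: height in cm and income in currency units (very different ranges)
data = [(180, 70000), (165, 30000), (172, 52000)]
print(min_max_normalize(data))
```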
3.2 Method of calculating distances
Distance can be defined in many ways, such as Euclidean distance, cosine distance, Hamming distance, and Manhattan distance. In general, Euclidean distance is chosen as the measure for continuous variables, while Hamming distance is chosen for discrete variables such as those in text classification. The classification accuracy of the K-nearest neighbor algorithm can often be improved substantially by learning a specialized metric, for example with large margin nearest neighbor (LMNN) or neighbourhood components analysis (NCA).
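For illustration, simple Python versions of three of these distance measures might look like the following (the toy inputs are assumptions):

```python
import math

def euclidean(a, b):
    # Straight-line distance for continuous features
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions where the two sequences differ; suited to discrete features
    return sum(x != y for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))        # 5.0
print(manhattan((0, 0), (3, 4)))        # 7
print(hamming("karolin", "kathrin"))    # 3
```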
3.3 Determining the K value
K is a user-chosen constant whose value directly affects the final prediction. One way to choose K is cross-validated error statistics. In cross-validation, one part of the data sample is used as the training set and the other part as the test set: for example, take 95% of the sample as training data, train the model on it, and measure its error rate on the remaining 5%. To select K, compare the average cross-validation error for different values of K and keep the value with the lowest error rate: for example, try K = 1, 2, 3, ..., 100, compute the average error for each K, and choose the K with the smallest error.
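A sketch of this selection, assuming scikit-learn is available and using a synthetic data set in place of a real training sample:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real labeled training sample
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Average cross-validated accuracy for each candidate K; keep the best one
# (equivalently, the K with the lowest error rate 1 - accuracy).
best_k = max(range(1, 101),
             key=lambda k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                           X, y, cv=5).mean())
print(best_k)
```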
4. K-NN classification and K-NN regression
4.1 K-NN classification
In classification, the training samples are vectors in a multidimensional feature space and each carries a category label (like or dislike, preserve or delete). Classification usually follows a "majority vote" rule: the class that occurs most often among the K neighbors becomes the predicted class. One drawback of majority voting is that classes with more samples tend to dominate the prediction, because they are more likely to appear in the K-neighborhood of the test point regardless of how close those samples actually are. One way to address this is to weight each of the K neighbors by its distance to the test point: if a neighbor's distance is d, it votes for its class with weight 1/d; the class label weights of the K neighbors are then summed, and the label with the largest total becomes the predicted class of the new data point.
For example, with K = 5, suppose the distances from a new data point to its five nearest neighbors are (1, 3, 3, 4, 5) and the neighbors' class labels are (yes, no, no, yes, no). Under majority voting, the new point is classified as "no" (3 no vs. 2 yes); with distance weighting, it is classified as "yes" (no: 1/3 + 1/3 + 1/5 ≈ 0.87; yes: 1 + 1/4 = 1.25).
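A small Python sketch that reproduces this weighted vote; the weighted_vote helper and the neighbor list are illustrative:

```python
from collections import defaultdict

def weighted_vote(neighbors):
    # neighbors: list of (distance, label); each neighbor votes with weight 1/distance
    weights = defaultdict(float)
    for d, label in neighbors:
        weights[label] += 1.0 / d
    return max(weights, key=weights.get)

# The worked example above: distances 1, 3, 3, 4, 5 with labels yes, no, no, yes, no
neighbors = [(1, "yes"), (3, "no"), (3, "no"), (4, "yes"), (5, "no")]
print(weighted_vote(neighbors))  # "yes": 1 + 1/4 = 1.25  vs  "no": 1/3 + 1/3 + 1/5 ≈ 0.87
```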
4.2 K-NN regression
When the label of a data point is a continuous value, the K-NN algorithm proceeds just as in the classification case; the difference lies in how the K neighbors are processed. K-NN regression predicts the value of a new data point as a weighted combination of its K neighbors' label values. Common weighting schemes include: the plain average of the K nearest neighbors' values (the crudest), weighting by 1/d (so that closer neighbors count far more than distant ones), and a Gaussian function of distance (or any other suitable decreasing function).
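A possible sketch of the first two weighting schemes; the knn_regress helper and the toy neighbor values are assumptions for illustration:

```python
def knn_regress(neighbors, weighted=True):
    # neighbors: list of (distance, target_value) for the K nearest points
    if not weighted:
        # Plain average of the K neighbors' values
        return sum(v for _, v in neighbors) / len(neighbors)
    # 1/d weighting: closer neighbors contribute more to the prediction
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)

neighbors = [(1.0, 10.0), (2.0, 12.0), (4.0, 20.0)]
print(knn_regress(neighbors, weighted=False))  # 14.0
print(knn_regress(neighbors, weighted=True))   # (10 + 6 + 5) / 1.75 = 12.0
```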
5. Summary
The K-nearest neighbor algorithm is one of the simplest and most effective algorithms for classifying data. Its learning is instance-based, so the training samples must be representative of the actual data the algorithm will encounter. The algorithm must keep the entire data set, which consumes a lot of storage space when the training set is large; moreover, because a distance must be computed to every point in the data set, it can be very time-consuming in practice. Another drawback is that K-NN gives no information about the underlying structure of the data, so we cannot know what an average or typical instance of each class looks like.
"Reprint" K-nn Algorithm learning Summary