-----------------------------------------------------------------------------------
I. Brief Introduction
K-Nearest Neighbor (KNN) is a basic classification and regression method. Its input is the feature vector of an instance, corresponding to a point in the feature space; its output is the class of the instance. The basic idea of the K-Nearest Neighbor algorithm is: given a training dataset in which the class of every instance is already known, a new instance is classified by finding its K nearest training instances and taking a majority vote among their classes. In other words, the K-Nearest Neighbor algorithm uses the training dataset to partition the feature vector space, and this partition serves as its classification model. The K-Nearest Neighbor algorithm therefore involves three basic elements:
1. K value selection, that is, how large K should be for the best classification.
2. Distance measurement, that is, which distance function is used to decide whether a training instance is a neighbor of the target instance.
3. Classification decision rule, that is, how the classes of the neighbors are combined into a prediction (usually by majority voting).
-----------------------------------------------------------------------------------
II. K-Nearest Neighbor Algorithm
Input: a training dataset
T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}
where x_i ∈ X ⊆ R^n is the feature vector of an instance and y_i ∈ Y = {c_1, c_2, ..., c_m} is the class of that instance, i = 1, 2, ..., N; together with the feature vector x of a new instance to be classified.
Output: the class y of instance x.
Algorithm Description:
1. Based on the given distance measure, find the K points in the training set T that are nearest to x; the neighborhood of x covering these K points is denoted N_K(x).
2. Within the neighborhood N_K(x), determine the class y of x according to the classification decision rule (usually majority voting), that is:
y = arg max_{c_j} Σ_{x_i ∈ N_K(x)} I(y_i = c_j),   i = 1, 2, ..., N;  j = 1, 2, ..., m
where I is the indicator function, equal to 1 when y_i = c_j and 0 otherwise.
arg max returns the argument value that maximizes the expression. Here it returns the class c_j for which the sum on the right is largest: when most of the K neighbors of x belong to a certain class, that class makes the right-hand expression largest, and that class is assigned as the result.
When K = 1, the K-Nearest Neighbor algorithm reduces to the nearest neighbor algorithm. Note that the K-Nearest Neighbor algorithm has no explicit learning (training) process.
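The following is a minimal Python sketch of the algorithm described above; the toy dataset, the function name knn_predict, and the choice of Euclidean distance are illustrative assumptions rather than part of the original text:

from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points.

    train_X : list of feature vectors (lists of floats)
    train_y : list of class labels, aligned with train_X
    x       : feature vector to classify
    """
    # Distance from x to every training point (Euclidean, as an example measure)
    dists = [math.dist(x, xi) for xi in train_X]
    # Indices of the K nearest neighbors, i.e. the set N_K(x) above
    nearest = sorted(range(len(train_X)), key=lambda i: dists[i])[:k]
    # Majority vote over the neighbors' labels: arg max_c Σ I(y_i = c)
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage (illustrative data)
train_X = [[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, [1.1, 1.0], k=3))  # expected "A"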
-----------------------------------------------------------------------------------
III. K-Nearest Neighbor Model
3.1 The K-Nearest Neighbor Model
Building the K-Nearest Neighbor model actually amounts to partitioning the feature space. Once the training set, the distance measure, the value of K, and the classification decision rule are fixed, the class of any new input instance is uniquely determined. The whole process is therefore equivalent to dividing the feature space into sub-regions and determining the class of every point in each sub-region.
In the feature space (whose dimension equals the dimension of the input vector x), each training instance point x_i defines a cell consisting of all points that are closer to x_i than to any other training instance point. In this way every instance point has a cell, and the cells of all training instance points together form a partition of the feature space.
3.2 Distance Measurement
The distance between two instance points in the feature space reflects how similar the two instance points are. The feature space of the K-Nearest Neighbor model is an n-dimensional real vector space, and there are many ways to measure distance in it, for example the Euclidean distance or, more generally, the Lp distance.
Let the feature space X be the n-dimensional real vector space R^n, and let x_i = (x_i^(1), x_i^(2), ..., x_i^(n)) and x_j = (x_j^(1), x_j^(2), ..., x_j^(n)) be two instances in X. The Lp distance between them is defined as
L_p(x_i, x_j) = ( Σ_{l=1}^{n} |x_i^(l) − x_j^(l)|^p )^(1/p)
where p ≥ 1. In particular:
When p = 1, it is the Manhattan distance (city block distance): L_1(x_i, x_j) = Σ_{l=1}^{n} |x_i^(l) − x_j^(l)|
When p = 2, it is the Euclidean distance (straight-line distance): L_2(x_i, x_j) = ( Σ_{l=1}^{n} |x_i^(l) − x_j^(l)|^2 )^(1/2)
When p = ∞, it is the maximum of the coordinate-wise distances, that is L_∞(x_i, x_j) = max_l |x_i^(l) − x_j^(l)|
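A small Python sketch of the Lp distance and its three special cases, assuming plain lists of floats as vectors (the function name and sample vectors are illustrative):

def lp_distance(xi, xj, p):
    """L_p(x_i, x_j) = (sum_l |x_i^(l) - x_j^(l)|^p)^(1/p), p >= 1."""
    if p == float("inf"):
        # L_infinity: largest coordinate-wise difference
        return max(abs(a - b) for a, b in zip(xi, xj))
    return sum(abs(a - b) ** p for a, b in zip(xi, xj)) ** (1.0 / p)

xi, xj = [1.0, 1.0], [4.0, 5.0]
print(lp_distance(xi, xj, 1))             # Manhattan distance: 7.0
print(lp_distance(xi, xj, 2))             # Euclidean distance: 5.0
print(lp_distance(xi, xj, float("inf")))  # maximum coordinate distance: 4.0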
3.3 K Value Selection
The selection of the K value has a great impact on the results of the K-Nearest Neighbor method, so choosing K reasonably matters a great deal for the accuracy of the predictions.
First, if the K value is small, it is equivalent to predicting with training instances in a small neighborhood. This reduces the approximation error, because only the training instances closest to the input instance influence the prediction. The disadvantage is that the estimation error increases and the prediction becomes very sensitive to nearby instance points: if a nearest neighbor happens to be a noise point, its weight in the classification grows as K decreases, and the prediction is easily wrong. In other words, as K decreases the overall model becomes more complex and prone to over-fitting.
Secondly, if the K value is large, it is equivalent to predicting with training instances from a large neighborhood, which reduces the estimation error. The disadvantage is that the approximation error increases: training instances far away from the input instance also influence the prediction and can make it wrong. Increasing K means the overall model becomes simpler. An extreme case is K = N, where no matter what the input instance is, the prediction is simply the most common class in the training set; this ignores a large amount of useful information in the training instances, and the model is too simple.
In practice, K is usually taken to be a relatively small value, and cross-validation is often used to select the optimal K.
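A sketch of selecting K by cross-validation, using scikit-learn's KNeighborsClassifier and cross_val_score; the iris dataset, the 5-fold setting, and the candidate range 1..15 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate each candidate K with 5-fold cross-validation
scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

# Pick the K with the highest average validation accuracy
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])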
3.4 Classification Decision Rules
The classification decision rule in the K-Nearest Neighbor method is usually majority voting, that is, the class of the input instance is the class held by the majority of its K nearest training instances. The majority voting rule is equivalent to minimizing the empirical risk.
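A one-step check of that equivalence, using the notation from Section II: under the 0-1 loss, the misclassification rate in the neighborhood N_K(x) when class c_j is predicted is
(1/K) Σ_{x_i ∈ N_K(x)} I(y_i ≠ c_j) = 1 − (1/K) Σ_{x_i ∈ N_K(x)} I(y_i = c_j),
so minimizing this empirical risk is the same as maximizing Σ_{x_i ∈ N_K(x)} I(y_i = c_j), which is exactly the majority voting rule.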