KNN Algorithm Introduction
KNN is short for k-Nearest Neighbors.
Algorithm Description
KNN is a classification algorithm. Its basic idea is to classify a sample by measuring the distance between its feature values and those of samples with known classes.
The algorithm proceeds as follows:
1. Prepare a sample dataset (every sample has already been assigned a class label);
2. Use the sample data as the training set;
3. Input a test sample A;
4. Calculate the distance between A and every sample in the training set;
5. Sort the distances in ascending order;
6. Select the k points with the smallest distances to A;
7. Count how often each class occurs among these k points;
8. Return the most frequent class among the k points as the predicted class of A.
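The steps above can be sketched in a few lines of Python (the function and variable names here are illustrative, not part of any library; `math.dist` requires Python 3.8+):

```python
import math
from collections import Counter

def knn_predict(train_data, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 4: compute the distance from the query to every training sample
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_data, train_labels)
    ]
    # Steps 5-6: sort ascending and keep the k closest
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Steps 7-8: return the most frequent label among the k neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]],
                  ['A', 'A', 'B', 'B'], [0.9, 1.0]))  # A
```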
Main Factors
Training set (or sample data)
If the training set is too small, classification will be inaccurate; if it is too large, the cost of classifying each test sample becomes very high.
Distance (or similarity measure)
What is a proper distance measure? The closer two points are, the more likely they belong to the same class.
Common distance measures include:
1. Euclidean distance
Euclidean distance is a commonly used distance definition: the straight-line distance between two points in m-dimensional space, or equivalently the natural length of a vector (the distance from the point to the origin). In 2D and 3D space, the Euclidean distance is the actual physical distance between two points.
It is suitable for spatial problems.
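As an illustrative sketch, the Euclidean distance between two points can be computed directly from its definition:

```python
import math

def euclidean(p, q):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```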
2. Manhattan distance
Manhattan distance, also known as taxicab geometry, was introduced by Hermann Minkowski in the 19th century. It denotes the sum of the absolute differences of the coordinates of two points in a standard coordinate system; equivalently, it is the sum of the projections onto the axes of the line segment between the two points in a fixed Cartesian coordinate system.
In the accompanying figure, the red line represents the Manhattan distance, green the Euclidean (straight-line) distance, and blue and yellow two equivalent Manhattan paths. The Manhattan distance is the north-south distance plus the east-west distance between the two points: d(i, j) = |xi − xj| + |yi − yj|.
It is suitable for path problems.
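A minimal sketch of the formula above (function name is illustrative):

```python
def manhattan(p, q):
    # sum of absolute coordinate differences: d = |x1 - x2| + |y1 - y2| + ...
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2), (4, 6)))  # 7
```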
3. Chebyshev distance
In mathematics, the Chebyshev distance is a metric on vector spaces: the distance between two points is defined as the maximum of the absolute differences of their coordinate values.
It is used to compute distances between points on a grid, for example on a chessboard or in warehouse logistics.
On a grid, the set of points at Chebyshev distance 1 from a given point is that point's Moore neighborhood.
It is suitable for computing distances on grids.
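A minimal sketch of the Chebyshev distance (function name is illustrative):

```python
def chebyshev(p, q):
    # maximum absolute coordinate difference
    # (the number of moves a chess king needs between two squares)
    return max(abs(a - b) for a, b in zip(p, q))

print(chebyshev((1, 1), (4, 3)))  # 3
```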
4. Minkowski distance
The Minkowski distance is not a single distance but a family of distance definitions.
It represents a whole class of distances through a variable parameter p:
d(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/p)
When p = 1, it is the Manhattan distance;
when p = 2, it is the Euclidean distance;
when p → ∞, it is the Chebyshev distance.
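A sketch showing how the parameter p recovers the three distances above (function name is illustrative; a large finite p approximates the p → ∞ limit):

```python
def minkowski(x, y, p):
    # (sum of |xi - yi|^p) ^ (1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

print(minkowski((0, 0), (3, 4), 1))    # 7.0  (Manhattan)
print(minkowski((0, 0), (3, 4), 2))    # 5.0  (Euclidean)
print(minkowski((0, 0), (3, 4), 100))  # ~4.0 (approaches Chebyshev)
```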
5. Standardized Euclidean distance
The standardized Euclidean distance is an improvement that addresses a shortcoming of the plain Euclidean distance; it can be regarded as a weighted Euclidean distance.
The idea: since the components of the data have different distributions, first standardize each component to zero mean and unit variance.
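A sketch under the assumption that the per-dimension standard deviations are estimated from a reference dataset (function name is illustrative; the population standard deviation is used):

```python
import math

def standardized_euclidean(p, q, data):
    # divide each coordinate difference by that dimension's standard
    # deviation so that large-scale features do not dominate
    n = len(data)
    columns = list(zip(*data))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(columns, means)]
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(p, q, stds)))

# the second feature's range is 100x the first, but after standardization
# both contribute equally to the distance
data = [[1, 100], [2, 200], [3, 300]]
print(standardized_euclidean([1, 100], [2, 200], data))  # ~1.732
```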
6. Mahalanobis distance
The Mahalanobis distance is based on the covariance of the data.
It is an effective way to measure the similarity between two unknown sample sets.
It is independent of measurement scale and eliminates the interference of correlations between variables.
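A sketch for the two-dimensional case, where the 2x2 covariance matrix can be inverted by hand (function name is illustrative; the sample covariance with n − 1 is used, and the data must not be degenerate):

```python
import math

def mahalanobis_2d(p, q, data):
    """Mahalanobis distance between two 2-D points, using the covariance of `data`."""
    n = len(data)
    mx = sum(r[0] for r in data) / n
    my = sum(r[1] for r in data) / n
    # sample covariance matrix entries
    sxx = sum((r[0] - mx) ** 2 for r in data) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in data) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in data) / (n - 1)
    # explicit inverse of the 2x2 covariance matrix
    det = sxx * syy - sxy * sxy
    inv00, inv01, inv11 = syy / det, -sxy / det, sxx / det
    dx, dy = p[0] - q[0], p[1] - q[1]
    # sqrt(d^T * S^-1 * d)
    return math.sqrt(dx * (inv00 * dx + inv01 * dy)
                     + dy * (inv01 * dx + inv11 * dy))
```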
7. Bhattacharyya distance
In statistics, the Bhattacharyya distance measures the similarity of two discrete probability distributions. It is often used in classification to measure the separability of classes.
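For two discrete distributions over the same support, the Bhattacharyya distance is the negative log of the Bhattacharyya coefficient; a minimal sketch (function name is illustrative):

```python
import math

def bhattacharyya(p, q):
    # p and q are discrete probability distributions over the same support
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return -math.log(bc)

print(bhattacharyya([0.5, 0.5], [0.5, 0.5]))  # 0.0 for identical distributions
```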
8. Hamming distance
The Hamming distance between two equal-length strings s1 and s2 is defined as the minimum number of substitutions required to change one into the other.
For example, the Hamming distance between the strings "1111" and "1001" is 2.
Application:
Information coding (to enhance fault tolerance, the minimum Hamming distance between code words should be as large as possible).
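A minimal sketch of the definition above (function name is illustrative):

```python
def hamming(s1, s2):
    # count the positions at which the corresponding symbols differ
    if len(s1) != len(s2):
        raise ValueError("strings must be the same length")
    return sum(a != b for a, b in zip(s1, s2))

print(hamming("1111", "1001"))  # 2
```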
9. Cosine of the included angle (Cosine)
In geometry, the cosine of the included angle measures the difference between the directions of two vectors. In data mining, it is used to measure the difference between sample vectors.
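A minimal sketch: the dot product divided by the product of the vector lengths (function name is illustrative; `math.hypot` with more than two arguments requires Python 3.8+):

```python
import math

def cosine_similarity(p, q):
    # cos(theta) = (p . q) / (|p| * |q|)
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))

print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (orthogonal directions)
print(cosine_similarity((1, 2), (2, 4)))  # 1.0 (same direction)
```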
10. Jaccard similarity coefficient
The difference between two sets is measured by the ratio of the elements the sets do not share to all elements in the two sets.
The Jaccard similarity coefficient can be used to measure the similarity of samples.
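A minimal sketch: the Jaccard similarity is the size of the intersection divided by the size of the union (function name is illustrative):

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    # shared elements divided by all distinct elements
    return len(a & b) / len(a | b)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5
```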
11. Pearson correlation coefficient
The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, is a statistic used to reflect the degree of linear correlation between two variables.
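A minimal sketch computing it as the covariance divided by the product of the standard deviations (function name is illustrative):

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # covariance numerator and standard-deviation terms share the same means
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect linear correlation)
```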
Impact of high dimensionality on distance measures:
The more variables there are, the weaker the discriminating power of the Euclidean distance becomes.
Impact of variable scale on distance:
Variables with larger value ranges tend to dominate the distance computation, so variables should be standardized first.
K size
If K is too small, the classification result is easily affected by noise points and the error increases;
if K is too large, the nearest neighbors may include too many points from other classes (distance weighting can reduce the influence of the choice of K);
if K = N (the number of samples), the prediction is simply the majority class of the whole training set regardless of the input; the model is far too simple and ignores a large amount of useful information in the training instances.
In practice, K usually takes a relatively small value, and cross-validation (in short: use some samples as the training set and some as the test set) is used to select the optimal K.
Rule of thumb: K is generally no larger than the square root of the number of training samples.
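The cross-validation idea can be sketched with leave-one-out validation, where each sample in turn serves as the test set (function names are illustrative; `math.dist` requires Python 3.8+):

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    dists = sorted((math.dist(p, query), l) for p, l in zip(train, labels))
    return Counter(l for _, l in dists[:k]).most_common(1)[0][0]

def best_k_loocv(data, labels, candidates):
    """Pick k by leave-one-out cross-validation over the candidate values."""
    best = None
    for k in candidates:
        # classify each sample using all the others as the training set
        correct = sum(
            knn_predict(data[:i] + data[i + 1:], labels[:i] + labels[i + 1:],
                        data[i], k) == labels[i]
            for i in range(len(data))
        )
        if best is None or correct > best[0]:
            best = (correct, k)
    return best[1]

data = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1], [1.3, 1.1]]
labels = ['A', 'A', 'B', 'B', 'A']
print(best_k_loocv(data, labels, [1, 3]))  # 1
```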
Advantages and disadvantages
1. Advantages
Simple, easy to understand and implement, high accuracy, and insensitive to outliers.
2. Disadvantages
KNN is a lazy algorithm: building the model is cheap, but classifying test data is expensive (heavy computation and high memory overhead), because it must scan all training samples and compute the distance to each of them.
Applicability
Numeric and nominal attributes (a nominal attribute takes a finite number of distinct, unordered values).
Example applications: customer churn prediction and fraud detection.
Algorithm Implementation
The following uses Python as an example to implement the KNN algorithm based on the Euclidean distance.
Euclidean distance formula: d(x, y) = √(Σᵢ (xᵢ − yᵢ)²)
Sample Code using Euclidean distance:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# E-Mail : Mike_Zhang@live.com
import math

class KNN:
    def __init__(self, trainData, trainLabel, k):
        self.trainData = trainData
        self.trainLabel = trainLabel
        self.k = k

    def predict(self, inputPoint):
        arr = []
        # Euclidean distance from the input point to every training sample
        for vector, label in zip(self.trainData, self.trainLabel):
            s = 0
            for i, n in enumerate(vector):
                s += (n - inputPoint[i]) ** 2
            arr.append([math.sqrt(s), label])
        # keep the k nearest neighbors
        arr = sorted(arr, key=lambda x: x[0])[:self.k]
        # count label frequencies among the k neighbors
        dtmp = {}
        for _, v in arr:
            if v not in dtmp:
                dtmp[v] = 0
            dtmp[v] += 1
        retLabel, _ = sorted(dtmp.items(), key=lambda x: x[1], reverse=True)[0]
        return retLabel

data = [
    [1.0, 1.1],
    [1.0, 1.0],
    [0.0, 0.0],
    [0.0, 0.1],
    [1.3, 1.1],
]
labels = ['A', 'A', 'B', 'B', 'A']
knn = KNN(data, labels, 3)
print(knn.predict([1.2, 1.1]))
print(knn.predict([0.2, 0.1]))
The above implementation is relatively simple, and ready-made libraries can be used in development, such as scikit-learn:
https://github.com/mike-zhang/pyExamples/blob/master/algorithm/dataMining_KNN/knn_sklearn_test1.py
Algorithm Application
- Recognizing handwritten digits
http://www.cnblogs.com/chenbjin/p/3869745.html
Okay, that's all. I hope it will help you.