A Detailed Introduction to the KNN Algorithm

KNN stands for K-Nearest Neighbors, that is, the k nearest neighbors of a query point.

Algorithm description

KNN is a classification algorithm. Its basic idea is to classify a sample by measuring the distances between feature vectors.

The algorithm process is as follows:

1. Prepare the sample data set (every sample has already been assigned a classification label).
2. Use the sample data as the training set.
3. Input a test data point A.
4. Calculate the distance between A and every point in the sample set.
5. Sort the distances in increasing order.
6. Select the K points with the smallest distances to A.
7. Count how often each category occurs among those K points.
8. Return the most frequent category among the K points as the predicted classification of A.
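
A minimal sketch of steps 4-8 in Python (assuming numeric feature vectors and Euclidean distance; the function name is illustrative, and math.dist requires Python 3.8+):

from collections import Counter
import math

def knn_predict(samples, labels, a, k):
    # steps 4-5: compute and sort the distances from A to every sample
    dists = sorted((math.dist(s, a), lbl) for s, lbl in zip(samples, labels))
    # steps 6-7: take the K nearest and count category frequencies
    votes = Counter(lbl for _, lbl in dists[:k])
    # step 8: return the most frequent category
    return votes.most_common(1)[0][0]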

Main factors

Training set (or sample data)

If the training set is too small, misclassification is likely; if the training set is too large, the system overhead of classifying each test point becomes very high.

Distance (or similarity measure)

What is the right distance measure? A smaller distance should mean a greater likelihood that the two points belong to the same class.

Distance measurements include:

1. Euclidean distance

The Euclidean metric (also called Euclidean distance) is a commonly used distance definition: it is the true distance between two points in m-dimensional space, or equivalently the natural length of a vector (the distance from the point to the origin). In two- and three-dimensional space, the Euclidean distance is the actual straight-line distance between two points.

Applies to spatial problems.
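
For instance, a quick check with Python's standard library (math.dist requires Python 3.8+; the sample points are made up):

import math

p, q = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
# sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = sqrt(9 + 16 + 0) = 5.0
print(math.dist(p, q))  # 5.0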

2. Manhattan Distance

Taxicab geometry, or Manhattan distance, is a term created in the 19th century by Hermann Minkowski. It is a geometric term used in metric spaces to denote the sum of the absolute differences of the coordinates of two points in a standard coordinate system. Equivalently, in a fixed Cartesian coordinate system, the Manhattan distance between two points is the sum of the lengths of the projections onto the coordinate axes of the line segment joining them.

In the classic illustration, the red line represents the Manhattan distance, the green line the Euclidean (straight-line) distance, and the blue and yellow lines equivalent Manhattan distances. The Manhattan distance between two points is the north-south distance plus the east-west distance, that is, D(i,j) = |xi - xj| + |yi - yj|.

Applies to path problems.
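
A minimal sketch of the formula above (the coordinates are made up):

xi, yi = 1, 2
xj, yj = 4, 6
# D(i,j) = |xi - xj| + |yi - yj| = 3 + 4 = 7
print(abs(xi - xj) + abs(yi - yj))  # 7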

3. Chebyshev distance

In mathematics, the Chebyshev distance is a metric on a vector space: the distance between two points is the maximum of the absolute differences of their coordinates.

The Chebyshev distance is used to calculate the distance between two points on a grid, for example on a chessboard or in warehousing and logistics applications.

On a grid, the points at Chebyshev distance 1 from a given point form that point's Moore neighborhood.

Applies to problems of calculating distances on a grid.
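
A small illustration in the chessboard setting (the squares are made up): the number of king moves between two squares equals their Chebyshev distance.

# a king on a chessboard: moving from square (1, 1) to square (4, 3)
x1, y1 = 1, 1
x2, y2 = 4, 3
# the number of king moves needed is max(|x1 - x2|, |y1 - y2|) = max(3, 2)
print(max(abs(x1 - x2), abs(y1 - y2)))  # 3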

4. Minkowski distance (Minkowski Distance)

The Minkowski distance is not a single distance but the definition of a family of distances.

Depending on a variable parameter, the Minkowski distance can represent a whole class of distances.

Its formula, d(x, y) = (Σ|xi - yi|^p)^(1/p), contains a parameter p:
When p = 1, it is the Manhattan distance;
when p = 2, it is the Euclidean distance;
when p → ∞, it is the Chebyshev distance.
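
A minimal sketch showing how one formula yields all three distances (the points are made up):

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)
print(max(abs(a - b) for a, b in zip(x, y)))  # 4.0 (Chebyshev, the p -> infinity limit)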

5. Standardized Euclidean distance (Standardized Euclidean Distance)

The standardized Euclidean distance is an improvement that addresses a shortcoming of the simple Euclidean distance; it can be regarded as a weighted Euclidean distance.

The idea behind it: since the distributions of the individual components of the data differ, each component is first "standardized" to zero mean and unit variance.
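
A minimal sketch, assuming each component's standard deviation is estimated from the sample set itself (the function and variable names are illustrative):

import math
import statistics

def standardized_euclidean(x, y, data):
    # standard deviation of each component over the whole data set
    sds = [statistics.pstdev(col) for col in zip(*data)]
    return math.sqrt(sum(((a - b) / sd) ** 2 for a, b, sd in zip(x, y, sds)))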

6. Mahalanobis distance (Mahalanobis Distance)

Represents the covariance distance of the data.

It is an effective method for calculating the similarity of two unknown sample sets.

It is scale-invariant and removes the interference caused by correlations between variables.
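
A minimal sketch with NumPy, assuming the covariance matrix of the sample set is invertible (i.e., more samples than dimensions; the names are illustrative):

import numpy as np

def mahalanobis(x, y, data):
    # inverse of the covariance matrix of the sample set (rows = samples)
    vi = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ vi @ d))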

7. Bhattacharyya distance (Bhattacharyya Distance)

In statistics, the Bhattacharyya distance measures the similarity of two discrete probability distributions. It is often used to measure the separability between classes in classification.
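
A minimal sketch for two discrete distributions, using the standard definition D = -ln(BC) with BC = Σ sqrt(p_i * q_i) (the example distributions are made up):

import math

def bhattacharyya(p, q):
    # BC(p, q) = sum over i of sqrt(p_i * q_i); the distance is -ln(BC)
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc)

# similar distributions give a distance close to 0
print(bhattacharyya([0.2, 0.5, 0.3], [0.3, 0.4, 0.3]))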

8. Hamming distance (Hamming Distance)

The Hamming distance between two equal-length strings S1 and S2 is defined as the minimum number of substitutions required to change one into the other.

For example, the Hamming distance between the string "1111" and "1001" is 2.

Application:
Information encoding (to enhance fault tolerance, the minimum Hamming distance between code words should be made as large as possible).
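
A minimal sketch reproducing the example above:

def hamming(s1, s2):
    # count the positions at which the equal-length strings differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("1111", "1001"))  # 2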

9. Cosine of the included angle (Cosine)

In geometry, the cosine of the included angle measures the difference between the directions of two vectors; in machine learning this idea is borrowed to measure the difference between sample vectors.
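
A minimal sketch (the vectors are made up; 1.0 means identical direction, 0 means orthogonal; math.hypot with more than two arguments requires Python 3.8+):

import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

print(cosine([1.0, 0.0], [1.0, 1.0]))  # about 0.7071 (a 45-degree angle)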

10. Jaccard similarity coefficient (Jaccard Similarity Coefficient)

The Jaccard distance measures the distinguishability of two sets by the proportion of differing elements among all elements of the two sets.
The Jaccard similarity coefficient can be used to measure the similarity of samples.
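
A minimal sketch with Python sets (the sets are made up):

a, b = {"x", "y", "z"}, {"y", "z", "w"}
similarity = len(a & b) / len(a | b)  # |A ∩ B| / |A ∪ B| = 2/4 = 0.5
distance = 1 - similarity             # 0.5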

11. Pearson correlation coefficient (Pearson Correlation Coefficient)

The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, is a linear correlation coefficient: a statistic used to reflect the degree of linear correlation between two variables.
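
A minimal sketch using statistics.correlation (requires Python 3.10+; the data are made up):

import statistics

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.1]
print(statistics.correlation(x, y))  # close to 1: a strong linear correlation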

The impact of high dimensionality on distance measurement:
The more variables there are, the weaker the discriminating power of the Euclidean distance becomes.

The effect of variable ranges on distance:
Variables with larger ranges tend to dominate the distance calculation, so variables should be normalized first.
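
A minimal min-max normalization sketch (one common way to normalize; the values are made up):

def min_max_scale(column):
    # rescale a column of values into the range [0, 1]
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

print(min_max_scale([10, 20, 15, 30]))  # [0.0, 0.5, 0.25, 1.0]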

The size of K

If K is too small, the classification result is easily affected by noise points and the error increases;
if K is too large, the nearest neighbors may include too many points from other categories (weighting votes by distance can reduce the influence of the K setting);
if K = n (the number of samples), the prediction is completely worthless, because whatever the input instance is, it is simply predicted to belong to the most common class in the training set; such a model is too simple and ignores a great deal of useful information in the training instances.

In practical applications, K generally takes a relatively small value, and cross-validation (in short, using part of the samples as the training set and the rest as the test set) is used to select the best K.

Rule of thumb: K is generally lower than the square root of the number of training samples.
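
A minimal sketch of selecting K by cross-validation with scikit-learn (assuming scikit-learn is installed; the iris data set is just an example):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# try several K values and keep the one with the best cross-validated accuracy
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 12)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])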

Advantages and Disadvantages

1. Advantages
Simple, easy to understand, easy to implement, high precision, insensitive to outliers.

2. Disadvantages

KNN is a lazy algorithm: building the model is trivial, but classifying test data incurs a large system overhead (heavy computation and high memory usage), because every training sample must be scanned and its distance to the test point calculated.

Scope of application

Numeric values and nominal values (a nominal attribute takes a finite number of distinct values, with no ordering between them).
Examples: customer churn prediction, fraud detection, and so on.

Algorithm implementation

This article takes a Python implementation of the KNN algorithm based on the Euclidean distance as an example.

Euclidean distance formula: d(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)

Example code that takes the Euclidean distance as an example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# e-mail: mike_zhang@live.com

import math

class KNN:
    def __init__(self, traindata, trainlabel, k):
        self.traindata = traindata
        self.trainlabel = trainlabel
        self.k = k

    def predict(self, inputpoint):
        # compute the Euclidean distance from the input point to every training sample
        arr = []
        for vector, label in zip(self.traindata, self.trainlabel):
            s = 0
            for i, n in enumerate(vector):
                s += (n - inputpoint[i]) ** 2
            arr.append([math.sqrt(s), label])
        # keep the K nearest samples
        arr = sorted(arr, key=lambda x: x[0])[:self.k]
        # count the frequency of each label among the K nearest samples
        dtmp = {}
        for _, v in arr:
            if v not in dtmp:
                dtmp[v] = 0
            dtmp[v] += 1
        # return the most frequent label
        retlabel, _ = sorted(dtmp.items(), key=lambda x: x[1], reverse=True)[0]
        return retlabel

data = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1], [1.3, 1.1]]
labels = ['A', 'A', 'B', 'B', 'A']
knn = KNN(data, labels, 3)
print(knn.predict([1.2, 1.1]))  # A
print(knn.predict([0.2, 0.1]))  # B

The above implementation is deliberately simple; in real development you can use ready-made libraries such as scikit-learn.
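
A minimal equivalent sketch using scikit-learn's KNeighborsClassifier (assuming scikit-learn is installed):

from sklearn.neighbors import KNeighborsClassifier

data = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1], [1.3, 1.1]]
labels = ['A', 'A', 'B', 'B', 'A']

clf = KNeighborsClassifier(n_neighbors=3)  # K = 3; Euclidean distance by default
clf.fit(data, labels)

print(clf.predict([[1.2, 1.1], [0.2, 0.1]]))  # ['A' 'B']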


Algorithm Application

    • Recognizing handwritten digits
