A Detailed Introduction to the KNN Algorithm

KNN stands for K-Nearest Neighbors, that is, the k nearest neighbors of a query point.

Algorithm description

KNN is a classification algorithm. Its basic idea is to classify a sample by measuring the distances between feature vectors.

The algorithm process is as follows:

1. Prepare the sample data set (every sample has already been assigned a classification label).
2. Use the sample data as the training set.
3. Input a test data point A.
4. Calculate the distance between A and every point in the sample set.
5. Sort the distances in increasing order.
6. Select the K points with the smallest distances to A.
7. Count how often each category occurs among those K points.
8. Return the most frequent category among the K points as the predicted classification of A.
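
A minimal sketch of steps 4-8 in Python (assuming numeric feature vectors and Euclidean distance; the function name is illustrative, and math.dist requires Python 3.8+):

from collections import Counter
import math

def knn_predict(samples, labels, a, k):
    # steps 4-5: compute and sort the distances from A to every sample
    dists = sorted((math.dist(s, a), lbl) for s, lbl in zip(samples, labels))
    # steps 6-7: take the K nearest and count category frequencies
    votes = Counter(lbl for _, lbl in dists[:k])
    # step 8: return the most frequent category
    return votes.most_common(1)[0][0]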

Main factors

Training set (or sample data)

If the training set is too small, misclassification is likely; if the training set is too large, the system overhead of classifying each test point becomes very high.

Distance (or similarity measure)

What is the right distance measure? A smaller distance should mean a greater likelihood that the two points belong to the same class.

Distance measurements include:

1. Euclidean distance

The Euclidean metric (also called Euclidean distance) is a commonly used distance definition: it is the true distance between two points in m-dimensional space, or equivalently the natural length of a vector (the distance from the point to the origin). In two- and three-dimensional space, the Euclidean distance is the actual straight-line distance between two points.

Applies to spatial problems.
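
For instance, a quick check with Python's standard library (math.dist requires Python 3.8+; the sample points are made up):

import math

p, q = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
# sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = sqrt(9 + 16 + 0) = 5.0
print(math.dist(p, q))  # 5.0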

2. Manhattan Distance

Taxicab geometry, or Manhattan distance, is a term created in the 19th century by Hermann Minkowski. It is a geometric term used in metric spaces to denote the sum of the absolute differences of the coordinates of two points in a standard coordinate system. Equivalently, in a fixed Cartesian coordinate system, the Manhattan distance between two points is the sum of the lengths of the projections onto the coordinate axes of the line segment joining them.

In the classic illustration, the red line represents the Manhattan distance, the green line the Euclidean (straight-line) distance, and the blue and yellow lines equivalent Manhattan distances. The Manhattan distance between two points is the north-south distance plus the east-west distance, that is, D(i,j) = |xi - xj| + |yi - yj|.

Applies to path problems.
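
A minimal sketch of the formula above (the coordinates are made up):

xi, yi = 1, 2
xj, yj = 4, 6
# D(i,j) = |xi - xj| + |yi - yj| = 3 + 4 = 7
print(abs(xi - xj) + abs(yi - yj))  # 7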

3. Chebyshev distance

In mathematics, the Chebyshev distance is a metric on a vector space: the distance between two points is the maximum of the absolute differences of their coordinates.

The Chebyshev distance is used to calculate the distance between two points on a grid, for example on a chessboard or in warehousing and logistics applications.

On a grid, the points at Chebyshev distance 1 from a given point form that point's Moore neighborhood.

Applies to problems of calculating distances on a grid.
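
A small illustration in the chessboard setting (the squares are made up): the number of king moves between two squares equals their Chebyshev distance.

# a king on a chessboard: moving from square (1, 1) to square (4, 3)
x1, y1 = 1, 1
x2, y2 = 4, 3
# the number of king moves needed is max(|x1 - x2|, |y1 - y2|) = max(3, 2)
print(max(abs(x1 - x2), abs(y1 - y2)))  # 3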

4. Minkowski distance (Minkowski Distance)

The Minkowski distance is not a single distance but the definition of a family of distances.

Depending on a variable parameter, the Minkowski distance can represent a whole class of distances.

Its formula, d(x, y) = (Σ|xi - yi|^p)^(1/p), contains a parameter p:
When p = 1, it is the Manhattan distance;
when p = 2, it is the Euclidean distance;
when p → ∞, it is the Chebyshev distance.
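
A minimal sketch showing how one formula yields all three distances (the points are made up):

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)
print(max(abs(a - b) for a, b in zip(x, y)))  # 4.0 (Chebyshev, the p -> infinity limit)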

5. Standardized Euclidean distance (Standardized Euclidean Distance)

The standardized Euclidean distance is an improvement that addresses a shortcoming of the simple Euclidean distance; it can be regarded as a weighted Euclidean distance.

The idea behind it: since the distributions of the individual components of the data differ, each component is first "standardized" to zero mean and unit variance.
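
A minimal sketch, assuming each component's standard deviation is estimated from the sample set itself (the function and variable names are illustrative):

import math
import statistics

def standardized_euclidean(x, y, data):
    # standard deviation of each component over the whole data set
    sds = [statistics.pstdev(col) for col in zip(*data)]
    return math.sqrt(sum(((a - b) / sd) ** 2 for a, b, sd in zip(x, y, sds)))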

6. Mahalanobis distance (Mahalanobis Distance)

Represents the covariance distance of the data.

It is an effective method for calculating the similarity of two unknown sample sets.

It is scale-invariant and removes the interference caused by correlations between variables.
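
A minimal sketch with NumPy, assuming the covariance matrix of the sample set is invertible (i.e., more samples than dimensions; the names are illustrative):

import numpy as np

def mahalanobis(x, y, data):
    # inverse of the covariance matrix of the sample set (rows = samples)
    vi = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ vi @ d))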

7. Bhattacharyya distance (Bhattacharyya Distance)

In statistics, the Bhattacharyya distance measures the similarity of two discrete probability distributions. It is often used to measure the separability between classes in classification.
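
A minimal sketch for two discrete distributions, using the standard definition D = -ln(BC) with BC = Σ sqrt(p_i * q_i) (the example distributions are made up):

import math

def bhattacharyya(p, q):
    # BC(p, q) = sum over i of sqrt(p_i * q_i); the distance is -ln(BC)
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc)

# similar distributions give a distance close to 0
print(bhattacharyya([0.2, 0.5, 0.3], [0.3, 0.4, 0.3]))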

8. Hamming distance (Hamming Distance)

The Hamming distance between two equal-length strings S1 and S2 is defined as the minimum number of substitutions required to change one into the other.

For example, the Hamming distance between the string "1111" and "1001" is 2.

Application:
Information encoding (to enhance fault tolerance, the minimum Hamming distance between code words should be made as large as possible).
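
A minimal sketch reproducing the example above:

def hamming(s1, s2):
    # count the positions at which the equal-length strings differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("1111", "1001"))  # 2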

9. Cosine of the included angle (Cosine)

In geometry, the cosine of the included angle measures the difference between the directions of two vectors; in machine learning this idea is borrowed to measure the difference between sample vectors.
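
A minimal sketch (the vectors are made up; 1.0 means identical direction, 0 means orthogonal; math.hypot with more than two arguments requires Python 3.8+):

import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

print(cosine([1.0, 0.0], [1.0, 1.0]))  # about 0.7071 (a 45-degree angle)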

10. Jaccard similarity coefficient (Jaccard Similarity Coefficient)

The Jaccard distance measures the distinguishability of two sets by the proportion of differing elements among all elements of the two sets.
The Jaccard similarity coefficient can be used to measure the similarity of samples.
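
A minimal sketch with Python sets (the sets are made up):

a, b = {"x", "y", "z"}, {"y", "z", "w"}
similarity = len(a & b) / len(a | b)  # |A ∩ B| / |A ∪ B| = 2/4 = 0.5
distance = 1 - similarity             # 0.5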

11. Pearson correlation coefficient (Pearson Correlation Coefficient)

The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, is a linear correlation coefficient: a statistic used to reflect the degree of linear correlation between two variables.
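
A minimal sketch using statistics.correlation (requires Python 3.10+; the data are made up):

import statistics

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.1]
print(statistics.correlation(x, y))  # close to 1: a strong linear correlation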

The impact of high dimensionality on distance measurement:
The more variables there are, the weaker the discriminating power of the Euclidean distance becomes.

The effect of variable ranges on distance:
Variables with larger ranges tend to dominate the distance calculation, so variables should be normalized first.
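
A minimal min-max normalization sketch (one common way to normalize; the values are made up):

def min_max_scale(column):
    # rescale a column of values into the range [0, 1]
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

print(min_max_scale([10, 20, 15, 30]))  # [0.0, 0.5, 0.25, 1.0]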

The size of K

If K is too small, the classification result is easily affected by noise points and the error increases;
if K is too large, the nearest neighbors may include too many points from other categories (weighting votes by distance can reduce the influence of the K setting);
if K = n (the number of samples), the prediction is completely worthless, because whatever the input instance is, it is simply predicted to belong to the most common class in the training set; such a model is too simple and ignores a great deal of useful information in the training instances.

In practical applications, K generally takes a relatively small value, and cross-validation (in short, using part of the samples as the training set and the rest as the test set) is used to select the best K.

Rule of thumb: K is generally lower than the square root of the number of training samples.
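
A minimal sketch of selecting K by cross-validation with scikit-learn (assuming scikit-learn is installed; the iris data set is just an example):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# try several K values and keep the one with the best cross-validated accuracy
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 12)}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])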

Advantages and Disadvantages

1. Advantages
Simple, easy to understand, easy to implement, high precision, insensitive to outliers.

2. Disadvantages

KNN is a lazy algorithm: building the model is trivial, but classifying test data incurs a large system overhead (heavy computation and high memory usage), because every training sample must be scanned and its distance to the test point calculated.

Scope of application

Numeric values and nominal values (a nominal attribute takes a finite number of distinct values, with no ordering between them).
Examples: customer churn prediction, fraud detection, and so on.

Algorithm implementation

This article takes a Python implementation of the KNN algorithm based on the Euclidean distance as an example.

Euclidean distance formula: d(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)

Example code that takes the Euclidean distance as an example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# e-mail: mike_zhang@live.com

import math

class KNN:
    def __init__(self, traindata, trainlabel, k):
        self.traindata = traindata
        self.trainlabel = trainlabel
        self.k = k

    def predict(self, inputpoint):
        # compute the Euclidean distance from the input point to every training sample
        arr = []
        for vector, label in zip(self.traindata, self.trainlabel):
            s = 0
            for i, n in enumerate(vector):
                s += (n - inputpoint[i]) ** 2
            arr.append([math.sqrt(s), label])
        # keep the K nearest samples
        arr = sorted(arr, key=lambda x: x[0])[:self.k]
        # count the frequency of each label among the K nearest samples
        dtmp = {}
        for _, v in arr:
            if v not in dtmp:
                dtmp[v] = 0
            dtmp[v] += 1
        # return the most frequent label
        retlabel, _ = sorted(dtmp.items(), key=lambda x: x[1], reverse=True)[0]
        return retlabel

data = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1], [1.3, 1.1]]
labels = ['A', 'A', 'B', 'B', 'A']
knn = KNN(data, labels, 3)
print(knn.predict([1.2, 1.1]))  # A
print(knn.predict([0.2, 0.1]))  # B

The above implementation is deliberately simple; in real development you can use ready-made libraries such as scikit-learn.
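
A minimal equivalent sketch using scikit-learn's KNeighborsClassifier (assuming scikit-learn is installed):

from sklearn.neighbors import KNeighborsClassifier

data = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1], [1.3, 1.1]]
labels = ['A', 'A', 'B', 'B', 'A']

clf = KNeighborsClassifier(n_neighbors=3)  # K = 3; Euclidean distance by default
clf.fit(data, labels)

print(clf.predict([[1.2, 1.1], [0.2, 0.1]]))  # ['A' 'B']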


Algorithm Application

    • Recognizing handwritten digits
