KNN algorithm Introduction

KNN stands for k-Nearest Neighbors.

Algorithm Description

KNN is a classification algorithm. Its basic idea is to classify a sample by measuring the distance between its feature values and those of labeled samples.

The algorithm process is as follows:

1. Prepare a sample dataset (each sample has already been assigned a class label);
2. Use the sample data as the training set;
3. Input a test data point A;
4. Calculate the distance between A and every point in the sample set;
5. Sort the distances in ascending order;
6. Select the k points with the smallest distances to A;
7. Count the frequency of each category among these k points;
8. Return the most frequent category among the k points as the predicted classification of A.

Main Factors

Training set (or sample data)

If the training set is too small, classification errors will be frequent; if it is too large, the system overhead of classifying test data becomes very high.

Distance (or similarity measure)

What makes a proper distance measure? The closer two points are, the more likely they belong to the same class.

Common distance measures include:

1. Euclidean distance

Euclidean measurement (also known as Euclidean distance) is a commonly used distance definition. It is the true distance between two points in m-dimensional space, or the natural length of a vector (that is, the distance from the point to the origin). In 2D and 3D space, the Euclidean distance is the actual straight-line distance between two points.

Suitable for spatial problems.
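
As a quick illustration, here is a minimal Python sketch of the Euclidean distance (euclidean_distance is an illustrative helper name, not part of any library):

import math

def euclidean_distance(x, y):
    # Straight-line distance between two points of the same dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0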

2. Manhattan distance

Manhattan distance, also called taxicab distance, was proposed by Hermann Minkowski in the 19th century. It is a geometric term used in geometric measurement to indicate the sum of the absolute axis distances between two points in a standard coordinate system. Equivalently, the Manhattan distance is the sum of the projections onto the axes of the line segment between the two points in a fixed Cartesian coordinate system in Euclidean space.

In the usual illustration, the red line represents the Manhattan distance, the green line represents the Euclidean distance (the straight-line distance), and the blue and yellow lines represent equivalent Manhattan distances. The Manhattan distance between two points is the north-south distance plus the east-west distance, that is, d(i, j) = |xi − xj| + |yi − yj|.

It is applicable to path problems.
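
A minimal sketch of the formula above (manhattan_distance is an illustrative helper):

def manhattan_distance(x, y):
    # Sum of the absolute coordinate differences: |x1 - y1| + |x2 - y2| + ...
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance([0, 0], [3, 4]))  # 7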

3. Chebyshev distance

In mathematics, Chebyshev distance is a metric on a vector space: the distance between two points is defined as the maximum of the absolute differences of their coordinate values.

It is used to compute distances between points on a grid, for example on a chessboard or in warehouse logistics applications.

On a grid, the points at Chebyshev distance 1 from a given point form that point's Moore neighborhood.

It is applicable to grid distance problems.
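
A minimal sketch (chebyshev_distance is an illustrative helper):

def chebyshev_distance(x, y):
    # Maximum absolute coordinate difference (like king moves on a chessboard)
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev_distance([0, 0], [3, 4]))  # 4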

4. Minkowski distance

Minkowski distance is not a single distance but a family of distance definitions.

It represents a class of distances controlled by a variable parameter.

The formula contains a variable parameter p: d(x, y) = (Σ |xi − yi|^p)^(1/p).
When p = 1, it is the Manhattan distance;
When p = 2, it is the Euclidean distance;
When p → ∞, it is the Chebyshev distance.
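
A minimal sketch showing how the parameter p recovers the previous distances (minkowski_distance is an illustrative helper; p = 50 is used to approximate p → ∞):

def minkowski_distance(x, y, p):
    # (sum of |xi - yi|^p) ^ (1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(minkowski_distance([0, 0], [3, 4], 1))   # 7.0 (Manhattan)
print(minkowski_distance([0, 0], [3, 4], 2))   # 5.0 (Euclidean)
print(minkowski_distance([0, 0], [3, 4], 50))  # ~4.0 (approaches Chebyshev)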

5. Standardized Euclidean distance

Standardized Euclidean distance is an improvement that addresses a weakness of the plain Euclidean distance; it can be regarded as a weighted Euclidean distance.

The idea of standardized Euclidean distance: since the components of the data are distributed differently across dimensions, first standardize each component so that every component has the same mean and variance (typically zero mean and unit variance).
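
A minimal sketch, assuming the per-dimension standard deviations have already been estimated from the training data (standardized_euclidean is an illustrative helper):

import math

def standardized_euclidean(x, y, stdevs):
    # Each component difference is divided by that component's standard deviation
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, stdevs)))

# The second dimension has a much larger spread, so it is down-weighted
print(standardized_euclidean([1.0, 100.0], [2.0, 200.0], [0.5, 50.0]))  # ~2.83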

6. Mahalanobis distance

The Mahalanobis distance is based on the covariance of the data.

It is an effective way to compute the similarity between two unknown sample sets.

It is independent of the measurement scale of each dimension, which eliminates interference from correlations between variables.
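
A minimal NumPy sketch, assuming the covariance matrix is estimated from a reference dataset and is invertible (mahalanobis_distance is an illustrative helper):

import numpy as np

def mahalanobis_distance(x, y, data):
    # The inverse covariance matrix accounts for scale and correlations
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ cov_inv @ diff))

data = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
print(mahalanobis_distance([1, 2], [5, 5], data))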

7. Bhattacharyya distance

In statistics, the Bhattacharyya distance measures the similarity of two discrete probability distributions. In classification, it is often used to measure the separability between classes.
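
A minimal sketch for two discrete distributions over the same events (bhattacharyya_distance is an illustrative helper; it uses the common form −ln(BC), where BC is the Bhattacharyya coefficient):

import math

def bhattacharyya_distance(p, q):
    # BC = sum of sqrt(p_i * q_i); distance = -ln(BC)
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(bc)

print(bhattacharyya_distance([0.5, 0.5], [0.9, 0.1]))  # ~0.11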

8. Hamming distance

The Hamming distance between two strings s1 and s2 is defined as the minimum number of single-character substitutions required to change one into the other.

For example, the Hamming distance between the string "1111" and "1001" is 2.

Application:
Information encoding (to enhance fault tolerance, the minimum Hamming distance between codes should be as large as possible ).
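
A minimal sketch reproducing the example above (hamming_distance is an illustrative helper; the inputs are assumed to have equal length):

def hamming_distance(s1, s2):
    # Count the positions at which the two strings differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("1111", "1001"))  # 2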

9. Cosine of the included angle (cosine similarity)

In geometry, the cosine of the included angle can be used to measure the difference between the directions of two vectors. In data mining, it can be used to measure the difference between sample vectors.
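
A minimal sketch (cosine_similarity is an illustrative helper):

import math

def cosine_similarity(x, y):
    # cos(theta) = (x . y) / (|x| * |y|); 1.0 means identical direction
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 1], [2, 2]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)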

10. Jaccard similarity coefficient

The dissimilarity of two sets is measured by the ratio of their differing elements to all of their elements.
The Jaccard similarity coefficient can be used to measure the similarity of samples.
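
A minimal sketch (jaccard_similarity is an illustrative helper):

def jaccard_similarity(a, b):
    # |A intersection B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5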

11. Pearson correlation coefficient

The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, is a statistic that reflects the degree of linear correlation between two variables.
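
A minimal sketch (pearson_correlation is an illustrative helper):

import math

def pearson_correlation(x, y):
    # Covariance of x and y divided by the product of their standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (perfect linear correlation)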

Impact of high dimensionality on distance measurement:
The more variables there are, the weaker the discriminating power of the Euclidean distance becomes.

Impact of variable scale on distance:
Variables with larger value ranges often dominate the distance computation, so variables should be standardized first.

The size of K

If K is too small, the classification result is susceptible to noise points and the error increases;
If K is too large, the nearest neighbors may include too many points from other classes (weighting neighbors by distance can reduce the influence of the K setting);
K = N (the number of samples) is completely inadvisable, because whatever the input instance is, the prediction is simply the most common class in the training set; such a model is too simple and ignores a large amount of useful information in the training instances.

In practical applications, K generally takes a relatively small value; for example, cross-validation (in simple terms, using part of the samples as a training set and part as a test set) is used to select the optimal K.

Empirical rule: k is generally lower than the square root of the number of training samples.
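
A hedged sketch of choosing K by cross-validation with scikit-learn; the Iris dataset is used only as a stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate K by 5-fold cross-validated accuracy
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])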

Advantages and disadvantages

1. Advantages
Simple, easy to understand, and easy to implement; high accuracy; insensitive to outliers.

2. Disadvantages

KNN is a lazy-learning algorithm: constructing the model is easy, but classifying test data has high system overhead (a large amount of computation and high memory usage), because it must scan all training samples and compute the distance to each of them.

Applicability

Numeric and nominal features (a nominal feature takes a finite number of distinct values, and the values are unordered).
For example, customer churn prediction and fraud detection.

Algorithm Implementation

The following uses Python as an example to describe an implementation of the KNN algorithm based on the Euclidean distance.

Euclidean distance formula:

d(x, y) = sqrt((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²)

Sample code using the Euclidean distance:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# E-Mail : Mike_Zhang@live.com
import math

class KNN:
    def __init__(self, trainData, trainLabel, k):
        self.trainData = trainData
        self.trainLabel = trainLabel
        self.k = k

    def predict(self, inputPoint):
        retLabel = "None"
        arr = []
        # Compute the Euclidean distance from inputPoint to every training sample
        for vector, label in zip(self.trainData, self.trainLabel):
            s = 0
            for i, n in enumerate(vector):
                s += (n - inputPoint[i]) ** 2
            arr.append([math.sqrt(s), label])
        # Keep the k nearest samples
        arr = sorted(arr, key=lambda x: x[0])[:self.k]
        # Count the frequency of each label among the k nearest samples
        dtmp = {}
        for k, v in arr:
            if v not in dtmp:
                dtmp[v] = 0
            dtmp[v] += 1
        # Return the most frequent label
        retLabel, _ = sorted(dtmp.items(), key=lambda x: x[1], reverse=True)[0]
        return retLabel

data = [
    [1.0, 1.1],
    [1.0, 1.0],
    [0.0, 0.0],
    [0.0, 0.1],
    [1.3, 1.1],
]
labels = ['A', 'A', 'B', 'B', 'A']
knn = KNN(data, labels, 3)
print(knn.predict([1.2, 1.1]))  # A
print(knn.predict([0.2, 0.1]))  # B

The above implementation is relatively simple; in real development, ready-made libraries such as scikit-learn can be used:

https://github.com/mike-zhang/pyExamples/blob/master/algorithm/dataMining_KNN/knn_sklearn_test1.py
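
For reference, a minimal scikit-learn sketch on the same toy data as the code above (standard KNeighborsClassifier usage):

from sklearn.neighbors import KNeighborsClassifier

data = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1], [1.3, 1.1]]
labels = ['A', 'A', 'B', 'B', 'A']

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(data, labels)
print(clf.predict([[1.2, 1.1], [0.2, 0.1]]))  # expected: ['A' 'B']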

Algorithm Application
  • Recognizing handwritten digits

http://www.cnblogs.com/chenbjin/p/3869745.html

Okay, that's all. I hope it will help you.
