K-Nearest Neighbor Algorithm Summary


1. Basic Introduction

The K-Nearest Neighbor (KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. Its idea is: if the majority of the k samples most similar to a given sample in the feature space (that is, its nearest neighbors) belong to a certain category, then that sample also belongs to this category. In KNN, the selected neighbors are all samples that have already been correctly classified. The method determines the category of a sample to be classified based only on the classes of its one or several nearest neighbors.

Although the KNN method also depends on the limit theorem in principle, its classification decision involves only a small number of adjacent samples. Because KNN relies mainly on a limited set of nearby samples, rather than on discriminating class regions, to determine the category, it is more suitable than other methods for sample sets whose class domains intersect or overlap heavily.

KNN can be used for both classification and regression. For regression, the attribute of a sample can be predicted by finding its k nearest neighbors and assigning it the average of those neighbors' attribute values. A more useful refinement is to give neighbors at different distances different weights on the result, for example weights inversely proportional to distance, so that closer neighbors contribute more.
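
As an illustration, here is a minimal Python sketch of distance-weighted kNN regression; the function name knn_regress, the toy data, and the small eps term that avoids division by zero are illustrative assumptions, not part of the original text.

    import numpy as np

    def knn_regress(X_train, y_train, x, k=3, eps=1e-8):
        """Predict a value for x as the inverse-distance-weighted
        average of its k nearest neighbors' target values."""
        # Euclidean distance from x to every training sample
        dists = np.linalg.norm(X_train - np.asarray(x), axis=1)
        # Indices of the k nearest neighbors
        nn = np.argsort(dists)[:k]
        # Weights inversely proportional to distance (eps avoids division by zero)
        w = 1.0 / (dists[nn] + eps)
        return float(np.dot(w, y_train[nn]) / w.sum())

    # Example: noisy samples of y = 2x; predict at x = 2.5
    X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
    y_train = np.array([0.1, 2.0, 3.9, 6.1, 8.0])
    print(knn_regress(X_train, y_train, [2.5], k=3))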

The main disadvantage of the algorithm in classification is its behavior on imbalanced data: when one class has a large sample size and the other classes are small, the k nearest neighbors of a new input sample may be dominated by the large class simply because of its size. The algorithm considers only the "nearest" neighbor samples; a class with a very large number of samples may or may not happen to lie close to the target sample, and either way sheer sample count should not be what decides the result. This can be improved by weighting the vote so that neighbors at smaller distances from the sample carry larger weights, as sketched below.

Another disadvantage is the heavy computational workload: for each sample to be classified, the distance to every known sample must be computed before its k nearest neighbors can be found. A common remedy is to edit the known sample points in advance, removing those that contribute little to classification. Overall, the algorithm is better suited to automatic classification of class domains with large sample sizes; class domains with small sample sizes are prone to misclassification under this algorithm.
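
A minimal sketch of the distance-weighted vote suggested above; the function name weighted_knn_classify and the 1/distance weighting scheme are illustrative choices, not prescribed by the original text.

    from collections import defaultdict
    import numpy as np

    def weighted_knn_classify(X_train, y_train, x, k=5, eps=1e-8):
        """Classify x by a vote among its k nearest neighbors, where each
        neighbor's vote is weighted by 1/distance, so that close neighbors
        count more than distant ones from a large class."""
        dists = np.linalg.norm(X_train - np.asarray(x), axis=1)
        nearest = np.argsort(dists)[:k]   # indices of the k nearest samples
        votes = defaultdict(float)
        for i in nearest:
            votes[y_train[i]] += 1.0 / (dists[i] + eps)
        return max(votes, key=votes.get)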

 

2. Algorithm Description

The idea of the k-NN algorithm is as follows: first, compute the distance between the new sample and each training sample and find its k nearest neighbors; then decide the class of the new sample from the classes of those neighbors. If the k neighbors all belong to the same class, the new sample belongs to that class as well; otherwise, each candidate class is scored, and the category of the new sample is determined according to some rule, most commonly a majority vote.

Take the k nearest neighbors of an unknown sample x; x is assigned to the class that the majority of those k neighbors belong to. Equivalently, the algorithm grows a region around the test sample x until it contains k training samples, and assigns x to the class with the highest frequency of occurrence among those k samples. For example, in the classic illustration, a green circle must be assigned either to the class of red triangles or to the class of blue squares. If k = 3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red-triangle class; if k = 5, blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue-square class. A small sketch of this example follows.
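
The k = 3 versus k = 5 effect described above can be reproduced with a few lines of Python; the 2-D coordinates and labels below are invented purely for illustration.

    from collections import Counter
    import numpy as np

    def knn_classify(X_train, y_train, x, k):
        """Majority vote among the k nearest training samples."""
        dists = np.linalg.norm(X_train - np.asarray(x), axis=1)
        nn = np.argsort(dists)[:k]
        return Counter(y_train[i] for i in nn).most_common(1)[0][0]

    # Two red triangles close to the query, three blue squares farther out
    X_train = np.array([[1.0, 1.0], [1.2, 0.8],               # triangles
                        [2.0, 2.0], [2.2, 1.8], [1.9, 2.3]])  # squares
    y_train = ["triangle", "triangle", "square", "square", "square"]
    x = [1.4, 1.2]

    print(knn_classify(X_train, y_train, x, k=3))  # triangle (2/3 of neighbors)
    print(knn_classify(X_train, y_train, x, k=5))  # square   (3/5 of neighbors)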

Algorithm pseudocode:

K-nearest-neighbor search algorithm: kNN(A[n], k)

Input: A[1..n], the coordinates of the n training samples in feature space; k, the number of nearest neighbors; x, the sample to be classified.

Output: Category of x

1. Take A[1], ..., A[k] as the initial k nearest neighbors of x. Compute the Euclidean distance d(x, A[i]) for i = 1, ..., k, sort the neighbors in ascending order of d(x, A[i]), and set the distance to the farthest current neighbor, D = max{d(x, A[j]) | j = 1, ..., k}.

2. For i = k+1, ..., n:
   compute the distance d(x, A[i]) between A[i] and x;
   if d(x, A[i]) < D, replace the farthest current neighbor with A[i], re-sort the neighbors in ascending order of distance, and update D to the maximum of d(x, A[j]) over the current k neighbors.

3. Among the final k nearest neighbors, compute the proportion of samples belonging to each category; the category with the highest proportion is the category of sample x.
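
A runnable Python sketch of the search procedure above: it keeps a pool of k candidate neighbors and replaces the farthest one whenever a closer sample is found, then takes a majority vote. The function name knn_search_classify and the toy data are illustrative assumptions.

    from collections import Counter
    import numpy as np

    def knn_search_classify(A, labels, x, k):
        """kNN(A[n], k): classify x by the pseudocode's procedure."""
        n = len(A)
        x = np.asarray(x)

        # Step 1: take A[0..k-1] as the initial neighbors, sorted by distance
        neighbors = sorted(range(k), key=lambda i: np.linalg.norm(A[i] - x))
        D = np.linalg.norm(A[neighbors[-1]] - x)  # distance to farthest neighbor

        # Step 2: scan A[k..n-1], replacing the farthest neighbor when a
        # closer sample appears, and keeping the pool sorted by distance
        for i in range(k, n):
            d = np.linalg.norm(A[i] - x)
            if d < D:
                neighbors[-1] = i
                neighbors.sort(key=lambda j: np.linalg.norm(A[j] - x))
                D = np.linalg.norm(A[neighbors[-1]] - x)

        # Step 3: the most frequent class among the k neighbors wins
        return Counter(labels[j] for j in neighbors).most_common(1)[0][0]

    # Usage on a toy dataset with two well-separated classes
    A = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    labels = ["a", "a", "a", "b", "b", "b"]
    print(knn_search_classify(A, labels, [0.5, 0.5], k=3))  # -> "a"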

 
