I. Overview
The k-nearest neighbor (k-Nearest Neighbor, KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. Its idea: if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space) belong to a certain category, then the sample also belongs to that category. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. The method decides the category of the sample to be classified based only on the category of the nearest one or few samples. Although KNN relies on the limit theorem in principle, the classification decision involves only a small number of adjacent samples. Because KNN relies mainly on the limited number of surrounding samples, rather than on discriminating class domains, to determine the category, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily.
II. Principle
The basic idea: if most of the k samples most similar to a given sample in feature space (that is, its nearest neighbors in feature space) belong to a certain category, then the sample also belongs to that category. The method decides the category of the sample to be classified based only on the category of the nearest one or few samples.
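As a minimal sketch of this idea, assuming scikit-learn is installed (KNeighborsClassifier is not part of the original text; a from-scratch version appears in Section VI):

from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]  # training samples
y = ['A', 'A', 'B', 'B']                        # their known categories
clf = KNeighborsClassifier(n_neighbors=3)       # k = 3
clf.fit(X, y)                                   # lazy learning: fit just stores the data
print(clf.predict([[0, 0.2]]))                  # -> ['B']: two of its three nearest neighbors are 'B'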
III. Advantages and Disadvantages
1. Advantages:
1) High accuracy, insensitive to outliers, no assumptions about the input data;
2) The KNN algorithm itself is simple and effective. It is a lazy-learning algorithm: the classifier needs no training on the training set, so its training time complexity is effectively zero.
3) Since the KNN method relies mainly on the limited number of surrounding neighboring samples, rather than on discriminating class domains, to determine the category, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily.
4) The KNN algorithm can be used not only for classification but also for regression (a regression sketch follows the disadvantages list below).
2. Disadvantages:
1) High computational complexity: the cost of classifying with KNN is proportional to the number of documents in the training set, i.e., if the training set contains N documents, the time complexity of classifying one sample is O(N).
2) High space complexity: the entire training set must be stored and scanned at classification time. A further issue is class imbalance: when one class has a very large sample capacity and the others are very small, the k nearest neighbors of a newly input sample may be dominated by the large class (see Section IV.3).
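Regarding advantage 4) above, here is a minimal sketch of KNN regression, which replaces the majority vote with the mean target value of the k nearest neighbors (the toy data is an assumption for illustration):

import numpy as np

def knnRegress(inX, dataSet, targets, k):
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))  # Euclidean distances
    nearest = distances.argsort()[:k]                        # indices of the k closest points
    return targets[nearest].mean()                           # average instead of majority vote

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.1, 1.9, 3.2])
print(knnRegress(np.array([1.5]), X, y, 2))  # -> 1.5, the mean of 1.1 and 1.9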
IV. Points to Note
1. The choice of k
The choice of k has a significant impact on the results of the algorithm. A smaller k means that only training instances close to the input instance influence the prediction, which keeps the approximation error low but makes the model sensitive to noise and prone to overfitting. A larger k reduces the estimation error of learning, but the approximation error increases, because training instances far from the input instance also influence the prediction. In practice, k is usually chosen to be a fairly small value, and cross-validation is used to select the optimal k. As the number of training instances tends to infinity with k = 1, the error rate of the nearest-neighbor rule does not exceed twice the Bayes error rate; if k also tends to infinity, the error rate tends to the Bayes error rate.
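One hedged illustration of selecting k by cross-validation, assuming scikit-learn and its bundled iris dataset are available (neither is part of the original text):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
bestK, bestScore = 1, 0.0
for k in range(1, 16):                               # try a range of small k values
    clf = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(clf, X, y, cv=5).mean()  # mean 5-fold cross-validation accuracy
    if score > bestScore:
        bestK, bestScore = k, score
print(bestK, bestScore)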
2. Distance metrics
Commonly used metrics include the Euclidean distance, the Mahalanobis distance, the cosine distance, and so on.
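A minimal sketch of these three metrics using NumPy only (the covariance matrix used for the Mahalanobis distance is an illustrative assumption):

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

euclidean = np.sqrt(((x - y) ** 2).sum())                        # straight-line distance
cosine = 1 - x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))  # 1 minus the cosine of the angle
VI = np.linalg.inv(np.array([[2.0, 0.3], [0.3, 1.0]]))           # inverse covariance (assumed values)
mahalanobis = np.sqrt((x - y).dot(VI).dot(x - y))
print(euclidean, cosine, mahalanobis)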
3. Class sample balance
When the classes are unbalanced, e.g., one class has a very large sample capacity while other classes are very small, the k nearest neighbors of a newly input sample may be dominated by the large-capacity class. In that case, one can compress the categories that have more samples, or use weighting coefficients to decide which category the test point belongs to (a sketch follows below).
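A hedged sketch of distance-weighted voting; the 1/d weight is one common choice rather than the only one (scikit-learn exposes the same idea as weights='distance'):

import numpy as np

def weightedClassify(inX, dataSet, labels, k):
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    votes = {}
    for i in distances.argsort()[:k]:
        w = 1.0 / (distances[i] + 1e-8)                   # closer neighbors vote more strongly
        votes[labels[i]] = votes.get(labels[i], 0.0) + w
    return max(votes, key=votes.get)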
V. Algorithm Steps
1) Compute distances: compute the distance from each point in the known-category data set to the current point;
2) Find neighbors: find the k training objects closest to the test object, which serve as its nearest neighbors;
3) Classify: assign the test object to the category that occurs most frequently among its k nearest neighbors.
VI. Sample Code (Python)
from numpy import array, tile
import operator

def classify0(inX, dataSet, labels, k):
    # 1) Compute the Euclidean distance from inX to every point in dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # replicate inX into a matrix and subtract
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # 2) Find the k nearest neighbors: indices sorted by ascending distance
    sortedDistIndicies = distances.argsort()
    # 3) Take a majority vote among the labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))  # -> 'B'
VII. Industry Applications of the Algorithm
VIII. Related Improvements to the Algorithm
k-d trees, sample compression techniques.
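As one sketch of the k-d tree idea, assuming SciPy is available: scipy.spatial.cKDTree replaces the O(N) brute-force distance scan with a tree whose average query cost is O(log N).

import numpy as np
from scipy.spatial import cKDTree

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

tree = cKDTree(group)                    # build the tree once over the training set
dists, idx = tree.query([0, 0], k=3)     # k nearest neighbors of the query point
votes = [labels[i] for i in idx]
print(max(set(votes), key=votes.count))  # -> 'B', matching classify0 in Section VI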
IX. References
http://www.stanford.edu/~hastie/Papers/dann_IEEE.pdf
Machine Learning II: k-Nearest Neighbor (KNN) Algorithm