Machine Learning II: K-Nearest Neighbor (KNN) Algorithm


I. Overview

The k-nearest neighbor (KNN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea is this: if most of the K samples most similar to a given sample in feature space (that is, its nearest neighbors) belong to a certain category, then the sample belongs to that category as well. In KNN, the selected neighbors are objects that have already been correctly classified; the method assigns a sample to a category based only on the categories of the nearest one or few samples. Although KNN depends on the limit theorem in principle, the classification decision involves only a small number of neighboring samples. Because KNN relies mainly on a limited number of surrounding samples rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily.

II. Principle

The basic idea: if most of the K samples most similar to a given sample in feature space (its nearest neighbors) belong to a certain category, then the sample belongs to that category as well. The category a sample is assigned to depends only on the categories of its nearest one or few samples in the classification decision.

III. Advantages and Disadvantages

1. Advantages:

1) High accuracy, insensitive to outliers, and no assumptions about the input data;

2) The KNN algorithm itself is simple and effective. It is a lazy-learning algorithm: the classifier requires no training on the training set, so the training time is essentially zero.

3) Because KNN relies mainly on a limited number of nearby samples rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains intersect or overlap heavily.
4) The KNN algorithm can be used not only for classification but also for regression (see the sketch after this list).
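
To illustrate point 4, here is a minimal sketch of KNN regression; the helper name knn_regress and the toy data are hypothetical, not from the original article. The prediction for a query point is simply the mean target value of its k nearest training points:

# Minimal KNN regression sketch (illustrative): average the targets of
# the k nearest training points.
import numpy as np

def knn_regress(x, X_train, y_train, k=3):
    # Euclidean distances from the query point to every training point
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = distances.argsort()[:k]
    # the prediction is the mean target value of those neighbors
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 0.8, 1.9, 3.1])
print(knn_regress(np.array([1.5]), X_train, y_train, k=2))  # 1.35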

2. Disadvantages:

1) High computational complexity: the cost of classifying a sample is proportional to the size of the training set; if the training set contains N documents, the classification time complexity of KNN is O(N).

2) High space complexity and sensitivity to class imbalance: the classifier must store the entire training set. Moreover, when the classes are imbalanced, e.g., one class has a very large sample size while the others are very small, the K nearest neighbors of a new sample may be dominated by the large class, biasing the prediction.

IV. Practical Considerations

1. The Choice of K

The choice of K can significantly affect the algorithm's results. A smaller K means that only training instances close to the input instance influence the prediction, which makes the model prone to overfitting; a larger K reduces the estimation error of learning, but increases the approximation error, because training instances far from the input instance also influence the prediction. In practice, K is usually set to a small value and chosen by cross-validation, as sketched below. As the number of training instances tends to infinity, the error rate with K=1 does not exceed twice the Bayes error rate; if K also tends to infinity, the error rate converges to the Bayes error rate.
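
One common way to pick K is K-fold cross-validation over a range of candidate values. A minimal sketch, assuming scikit-learn and its bundled iris dataset (both assumptions for illustration, not part of the original article):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# evaluate odd K values (odd K avoids ties in two-class votes)
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())
# choose the K with the highest mean validation accuracy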

2. Distance Metrics

Commonly used metrics include the Euclidean distance, the Mahalanobis distance, the cosine distance, and so on.
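
A sketch of computing these metrics with NumPy/SciPy (the code is an addition for illustration; the original gives no code here):

import numpy as np
from scipy.spatial.distance import cosine, euclidean, mahalanobis

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(euclidean(a, b))   # straight-line distance, sqrt(sum((a-b)^2))
print(cosine(a, b))      # 1 - cos(angle between a and b)

# Mahalanobis additionally requires the inverse covariance matrix of the data
X = np.random.default_rng(0).random((100, 3))
VI = np.linalg.inv(np.cov(X.T))
print(mahalanobis(a, b, VI))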

3. Class Imbalance

When the classes are imbalanced, e.g., one class has a very large sample size while the others are very small, the K nearest neighbors of a new sample may be dominated by the large-capacity class. In that case, you can compress (undersample) the categories with more samples, or apply weighting coefficients when judging which category the test point belongs to, as in the sketch below.
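
A minimal sketch of inverse-distance-weighted voting (the function weighted_knn is hypothetical): nearer neighbors cast larger votes, which reduces the dominance of a large majority class:

import numpy as np

def weighted_knn(x, X_train, y_train, k=3):
    # Euclidean distances from the query point to all training points
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = d.argsort()[:k]
    # each neighbor votes with weight 1/distance instead of weight 1
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + 1e-9)
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0]])
print(weighted_knn(np.array([0.05, 0.0]), X_train, ['B', 'B', 'A'], k=3))  # 'B'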

V. Algorithm Steps

1) Compute distances: calculate the distance from every point in the labeled data set to the current point;
2) Find neighbors: select the K training objects closest to the test object as its nearest neighbors;
3) Classify: assign the test object the category that occurs most frequently among its K nearest neighbors.
These steps are implemented in the sample code below.

VI. Sample Code (Python)

from numpy import array, tile
import operator

def classify0(inX, dataSet, labels, k):
    # Step 1: compute the Euclidean distance from inX to every training point
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    # Step 2: count the labels of the k nearest training points
    sortedDistIndices = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Step 3: return the most frequent label among the k neighbors
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))  # prints 'B'

VII. Industry Applications

KNN is widely applied in text classification, recommender systems, handwritten digit recognition, and similar pattern-recognition tasks.

VIII. Related Improvements

KD-trees (to speed up the nearest-neighbor search) and sample compression techniques (to shrink the stored training set).
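
A brute-force neighbor search costs O(N) per query; a KD-tree cuts that to roughly O(log N) for low-dimensional data. A sketch using SciPy's cKDTree (SciPy is an assumption here; scikit-learn's KNeighborsClassifier(algorithm='kd_tree') provides the same speedup):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((10000, 2))

tree = cKDTree(X_train)                  # built once, O(N log N)
dist, idx = tree.query([0.5, 0.5], k=3)  # ~O(log N) per query point
print(dist, idx)                         # distances and indices of the 3 nearest neighbors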


