I. Overview
The k-nearest neighbor (KNN) algorithm classifies samples by measuring the distance between their feature values.
1. Working principle
We start with a collection of sample data, also called a training set, in which every sample carries a label; that is, we know which category each sample in the set belongs to. When new, unlabeled data arrives, each feature of the new data is compared with the features of the samples in the training set, and the algorithm extracts the category labels of the most similar samples (the nearest neighbors). Finally, the most frequent category among the K most similar samples is chosen as the category of the new data.
Usually K is an integer no larger than 20; to make majority voting straightforward and reduce the chance of ties, K is often chosen to be an odd (or prime) number.
2. Example analysis: film classification
First, we extract two features from action movies and romance movies: the number of fighting scenes and the number of kissing scenes. These two features for six movies of known type, plus one movie of unknown type, are shown below:
Figure 1: statistics of the fighting and kissing features
So we can abstract the 7 movies as 7 points in a two-dimensional coordinate system, with the two features mapped to the x- and y-coordinates of the corresponding points:
Figure 2: feature data after abstraction
Then it can be represented by a scatter plot based on the data obtained from the abstraction:
Figure 3: Movie Classification scatter plot
Then we need to calculate the distance between the unknown movie and each known movie, that is, the distance between the yellow point in Figure 3 and the other points. Here we use the commonly used Euclidean distance (other distance measures could also be used).
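As a quick sketch, the Euclidean distance between two feature points can be computed with a small helper (the name `euclidean_distance` is just for illustration; the movie coordinates used in the call are the ones from the book's version of this example):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# distance from the unknown movie at (18, 90) to "California Man" at (3, 104)
print(euclidean_distance((18, 90), (3, 104)))  # about 20.5
```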
By calculating we get the following data:
Table 1: Distance between each known movie and the unknown movie

| Movie Name | Movie Type | Distance from the unknown movie |
| --- | --- | --- |
| California Man | Romance | 20.5 |
| He's Not Really into Dudes | Romance | 18.7 |
| Beautiful Woman | Romance | 19.2 |
| Kevin Longblade | Action | 115.3 |
| Robo Slayer 3000 | Action | 117.4 |
| Amped II | Action | 118.9 |
If K=3, we take the 3 points with the smallest distances. Among these 3 points there are 3 romance movies and 0 action movies, so romance is the most frequent category, and we conclude that the unknown movie is a romance movie.
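The whole worked example can be reproduced in a few lines. The (fighting, kissing) feature values below are the ones used in the book's version of this example (the lost figures presumably showed them), with the unknown movie at (18, 90):

```python
import math
from collections import Counter

# (fighting scenes, kissing scenes) for the six known movies
movies = [
    ("California Man",             "Romance", (3, 104)),
    ("He's Not Really into Dudes", "Romance", (2, 100)),
    ("Beautiful Woman",            "Romance", (1, 81)),
    ("Kevin Longblade",            "Action",  (101, 10)),
    ("Robo Slayer 3000",           "Action",  (99, 5)),
    ("Amped II",                   "Action",  (98, 2)),
]
unknown = (18, 90)

# distance from the unknown movie to every known movie, sorted ascending
distances = sorted(
    (math.hypot(unknown[0] - x, unknown[1] - y), genre)
    for _, genre, (x, y) in movies
)

# majority vote among the K = 3 nearest neighbours
k = 3
votes = Counter(genre for _, genre in distances[:k])
print(votes.most_common(1)[0][0])  # Romance
```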
3. KNN classification algorithm pseudocode
For each point of unknown category, do the following:
(1) calculate the distance between the current point and every point in the data set of known categories;
(2) sort the distances in ascending order;
(3) select the K points closest to the current point;
(4) determine the frequency of each category among these K points;
(5) return the most frequent category among the K points as the predicted category of the current point.
4. Advantages and disadvantages of the algorithm
Advantages:
The algorithm is simple and easy to implement; it is not sensitive to outliers.
Disadvantages:
High space complexity: a lot of memory is needed to store all known instances.
High computational complexity: every instance to be classified must be compared with all known instances.
II. Example: a handwriting recognition system
The program runs under Python 3.6.
# -*- coding: utf-8 -*-

from numpy import *
import operator
from os import listdir


def classify(inX, dataSet, labels, k):
    """
    :param inX: sample data
    :param dataSet: known data
    :param labels: category labels of the known data
    :param k: the chosen K value
    :return: the category label predicted for the sample data
    """
    dataSetSize = dataSet.shape[0]  # number of rows in the matrix

    # calculate the Euclidean distances
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5

    sortedDistIndicies = distances.argsort()  # indices sorted from small to large
    classCount = {}

    # select the K points with the smallest distances
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1

    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def img2vector(filename):
    """
    :param filename: name of the text file holding the image data
    :return: the text data as a 1 x 1024 array
    """
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # file names in the directory
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        # parse the file name; the format used here is: correctdigit_index.txt
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])

        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)  # load the test data
        # classify the test data
        classifierResult = classify(vectorUnderTest,
                                    trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))
Running result:
We can see that the k-nearest neighbor handwritten digit recognition program achieves an error rate of 1.4%.
III. Summary
KNN is a classification algorithm in machine learning and belongs to supervised learning. It is one of the simplest and most effective algorithms for classifying data, but it is computationally expensive at prediction time and can be very slow to run.
References:
Machine Learning in Action (Peter Harrington)
k-Nearest Neighbor algorithm