The K-Nearest Neighbor Algorithm (KNN)


I. Overview

  The k-nearest neighbor (KNN) algorithm classifies a sample by measuring the distances between its feature values and those of known samples.

1. Working principle

  There is a collection of sample data, also called the training sample set, and each sample in it carries a label: we know which category every sample in the set belongs to. When new, unlabeled data arrives, each feature of the new data is compared with the features of the samples in the set, and the algorithm extracts the category labels of the most similar samples (the nearest neighbors). Finally, the category that occurs most frequently among the K most similar samples is chosen as the category of the new data.

  Usually K is an integer no greater than 20. To keep the majority vote (majority voting) unambiguous, K is generally chosen to be odd, so that in a two-class problem the vote cannot tie.
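The effect of K's parity on the vote can be seen with a tiny sketch using Python's `collections.Counter` (an illustration only, not part of the program later in this article):

```python
from collections import Counter

# With an even K, the vote among the nearest neighbors can tie in a
# two-class problem; an odd K guarantees a strict majority.
even_votes = Counter(["Romance", "Romance", "Action", "Action"])  # K = 4
odd_votes = Counter(["Romance", "Romance", "Action"])             # K = 3

print(even_votes["Romance"] == even_votes["Action"])  # True: a tie
print(odd_votes.most_common(1)[0][0])                 # Romance, a clear winner
```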

2. Example analysis: film classification

First, we pick two features that distinguish action movies from romance movies: the number of fight scenes and the number of kiss scenes. The values of these two features for six movies of known type and one movie of unknown type are as follows:

Figure 1: Statistics of the fighting and kissing features

We can therefore abstract the 7 movies as 7 points in a two-dimensional coordinate system, with the two features serving as the x- and y-coordinates of each point:

Figure 2: Feature data after abstraction

The abstracted data can then be represented as a scatter plot:

Figure 3: Movie Classification scatter plot

Next we need to calculate the distance between the unknown movie (the yellow point in Figure 3) and each of the other points. Here we use the common Euclidean distance formula:

d = √((x₁ − x₂)² + (y₁ − y₂)²)

(Other distance metrics can also be used.) The calculation gives the following data:

Table 1: Distance from each known movie to the unknown movie

Movie Name                     Movie Type    Distance from the unknown movie
California Man                 Romance       20.5
He's Not Really into Dudes     Romance       18.7
Beautiful Woman                Romance       19.2
Kevin Longblade                Action        115.3
Robo Slayer 3000               Action        117.4
Amped II                       Action        118.9

If K = 3, we take the 3 points with the smallest distances. Among these 3 points there are 3 romance movies and 0 action movies, so romance is the most frequent type, and we conclude that the unknown movie is a romance.
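The whole example can be reproduced in a few lines of NumPy. The fight/kiss counts below are taken from the example in "Machine Learning in Action"; since the figures carrying them are not reproduced in this article, treat the raw numbers as illustrative:

```python
import numpy as np

# (fight scenes, kiss scenes, type) for the six known movies
movies = {
    "California Man":             (3, 104, "Romance"),
    "He's Not Really into Dudes": (2, 100, "Romance"),
    "Beautiful Woman":            (1, 81,  "Romance"),
    "Kevin Longblade":            (101, 10, "Action"),
    "Robo Slayer 3000":           (99, 5,  "Action"),
    "Amped II":                   (98, 2,  "Action"),
}
unknown = np.array([18, 90])  # the movie to classify

# Euclidean distance from the unknown movie to every known movie,
# sorted in ascending order
distances = sorted(
    (np.linalg.norm(np.array([f, s]) - unknown), label)
    for f, s, label in movies.values()
)

# Majority vote among the K = 3 nearest neighbors
k = 3
votes = [label for _, label in distances[:k]]
prediction = max(set(votes), key=votes.count)
print(prediction)  # Romance
```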

3. KNN classification pseudocode

For each point whose category is unknown, do the following:
(1) calculate the distance between every point in the known-category dataset and the current point;
(2) sort the distances in ascending order;
(3) select the K points closest to the current point;
(4) count how often each category occurs among these K points;
(5) return the most frequent category among the K points as the predicted category of the current point.
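The five steps above can be sketched in a few lines of pure Python (a minimal illustration with hypothetical names, separate from the NumPy program in the next section):

```python
import math
from collections import Counter

def knn_classify(x, data, labels, k):
    """Minimal sketch of the five pseudocode steps above."""
    # (1) distance from every known point to the current point
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p))) for p in data]
    # (2) + (3) sort by distance (ascending) and keep the K nearest indices
    nearest = sorted(range(len(data)), key=lambda i: dists[i])[:k]
    # (4) count how often each category occurs among the K nearest
    freq = Counter(labels[i] for i in nearest)
    # (5) return the most frequent category
    return freq.most_common(1)[0][0]

print(knn_classify((0, 0), [(1, 1), (2, 2), (9, 9)], ["A", "A", "B"], 2))  # A
```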

4. Advantages and disadvantages of the algorithm

Advantages:
  • The algorithm is simple and easy to implement;
  • It is not sensitive to outliers.

Disadvantages:
  • High space complexity: all known instances must be kept in memory;
  • High computational complexity: every instance to classify must be compared against all known instances.

II. Example: a handwriting recognition system

The program runs under Python 3.6.

# -*- coding: utf-8 -*-
from numpy import *
import operator
from os import listdir


def classify(inX, dataSet, labels, k):
    """
    :param inX: sample to classify
    :param dataSet: known data
    :param labels: category labels of the known data
    :param k: chosen value of K
    :return: predicted category label of the sample
    """
    dataSetSize = dataSet.shape[0]  # number of rows in the matrix

    # compute the Euclidean distances
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5

    sortedDistIndicies = distances.argsort()  # indices sorted from smallest to largest
    classCount = {}

    # select the K points with the smallest distances
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1

    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def img2vector(filename):
    """
    :param filename: name of the text file to read
    :return: the text data as a 1x1024 array
    """
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # list the directory contents (file names)
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        # parse the file name; the format used in this program is:
        # correct-digit_index.txt
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])

        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)  # read the test sample
        # classify the test sample
        classifierResult = classify(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))

Running result:

We can see that the k-nearest neighbor algorithm recognizes the handwritten digits with an error rate of 1.4%.
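As a quick sanity check, a function with the same signature as classify() above can be exercised on a tiny in-memory dataset (the four-point example from "Machine Learning in Action"). The reimplementation below is a sketch equivalent to the listing, not a copy of it:

```python
import numpy as np

def classify(inX, dataSet, labels, k):
    """KNN with the same signature as above, vectorized with NumPy."""
    dists = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))  # Euclidean distances
    nearest = dists.argsort()[:k]                        # indices of the K nearest
    votes = {}
    for i in nearest:                                    # majority vote
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify(np.array([0.0, 0.0]), group, labels, 3))  # B
print(classify(np.array([1.0, 1.0]), group, labels, 3))  # A
```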

III. Summary

  The KNN algorithm is a classification algorithm in machine learning and belongs to supervised learning. It is among the simplest and most effective algorithms for classifying data, but it is inefficient to execute: classification is slow because every query must be compared against all training samples.

References:
"Machine Learning in Action" (Peter Harrington)

