I. Overview
The k-nearest neighbor (KNN) algorithm classifies samples by measuring the distance between their feature values.
1. Working principle
We start with a collection of sample data, also called a training set, in which every sample carries a label; that is, we know which category each sample in the set belongs to. When new, unlabeled data arrives, each feature of the new data is compared with the features of the samples in the training set, and the algorithm extracts the category labels of the most similar samples (the nearest neighbors). Finally, the most frequent category among the K most similar samples is chosen as the category of the new data.
Usually K is an integer no larger than 20; to make majority voting straightforward and reduce the chance of ties, K is often chosen to be an odd (or prime) number.
2. Example analysis: film classification
First, we extract two features from action movies and romance movies: the number of fighting scenes and the number of kissing scenes. These two features for six movies of known type, plus one movie of unknown type, are shown below:
Figure 1: statistics of the fighting and kissing features
So we can abstract the 7 movies as 7 points in a two-dimensional coordinate system, with the two features mapped to the x- and y-coordinates of the corresponding points:
Figure 2: feature data after abstraction
Then it can be represented by a scatter plot based on the data obtained from the abstraction:
Figure 3: Movie Classification scatter plot
Then we need to calculate the distance between the unknown movie and each known movie, that is, the distance between the yellow point in Figure 3 and the other points. Here we use the commonly used Euclidean distance (other distance measures could also be used).
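As a quick sketch, the Euclidean distance between two feature points can be computed with a small helper (the name `euclidean_distance` is just for illustration; the movie coordinates used in the call are the ones from the book's version of this example):

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# distance from the unknown movie at (18, 90) to "California Man" at (3, 104)
print(euclidean_distance((18, 90), (3, 104)))  # about 20.5
```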
By calculating we get the following data:
Table 1: Distance between each known movie and the unknown movie

| Movie Name | Movie Type | Distance from the unknown movie |
| --- | --- | --- |
| California Man | Romance | 20.5 |
| He's Not Really into Dudes | Romance | 18.7 |
| Beautiful Woman | Romance | 19.2 |
| Kevin Longblade | Action | 115.3 |
| Robo Slayer 3000 | Action | 117.4 |
| Amped II | Action | 118.9 |
If K=3, we take the 3 points with the smallest distances. Among these 3 points there are 3 romance movies and 0 action movies, so romance is the most frequent category, and we conclude that the unknown movie is a romance movie.
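The whole worked example can be reproduced in a few lines. The (fighting, kissing) feature values below are the ones used in the book's version of this example (the lost figures presumably showed them), with the unknown movie at (18, 90):

```python
import math
from collections import Counter

# (fighting scenes, kissing scenes) for the six known movies
movies = [
    ("California Man",             "Romance", (3, 104)),
    ("He's Not Really into Dudes", "Romance", (2, 100)),
    ("Beautiful Woman",            "Romance", (1, 81)),
    ("Kevin Longblade",            "Action",  (101, 10)),
    ("Robo Slayer 3000",           "Action",  (99, 5)),
    ("Amped II",                   "Action",  (98, 2)),
]
unknown = (18, 90)

# distance from the unknown movie to every known movie, sorted ascending
distances = sorted(
    (math.hypot(unknown[0] - x, unknown[1] - y), genre)
    for _, genre, (x, y) in movies
)

# majority vote among the K = 3 nearest neighbours
k = 3
votes = Counter(genre for _, genre in distances[:k])
print(votes.most_common(1)[0][0])  # Romance
```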
3. KNN classification algorithm pseudocode
For each point of unknown category, do the following:
(1) calculate the distance between the current point and every point in the data set of known categories;
(2) sort the distances in ascending order;
(3) select the K points closest to the current point;
(4) determine the frequency of each category among these K points;
(5) return the most frequent category among the K points as the predicted category of the current point.
4. Advantages and disadvantages of the algorithm
Advantages:
The algorithm is simple and easy to implement; it is not sensitive to outliers.
Disadvantages:
High space complexity: a lot of memory is needed to store all known instances.
High computational complexity: every instance to be classified must be compared with all known instances.
II. Example: a handwriting recognition system
The program runs under Python 3.6.
# -*- coding: utf-8 -*-

from numpy import *
import operator
from os import listdir


def classify(inX, dataSet, labels, k):
    """
    :param inX: sample data
    :param dataSet: known data
    :param labels: category labels of the known data
    :param k: the chosen K value
    :return: the category label predicted for the sample data
    """
    dataSetSize = dataSet.shape[0]  # number of rows in the matrix

    # calculate the Euclidean distances
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5

    sortedDistIndicies = distances.argsort()  # indices sorted from small to large
    classCount = {}

    # select the K points with the smallest distances
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1

    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def img2vector(filename):
    """
    :param filename: name of the text file holding the image data
    :return: the text data as a 1 x 1024 array
    """
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # file names in the directory
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        # parse the file name; the format used here is: correctdigit_index.txt
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])

        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)  # load the test data
        # classify the test data
        classifierResult = classify(vectorUnderTest,
                                    trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))
Running result:
We can see that the k-nearest neighbor handwritten digit recognition program achieves an error rate of 1.4%.
III. Summary
KNN is a classification algorithm in machine learning and belongs to supervised learning. It is one of the simplest and most effective algorithms for classifying data, but it is computationally expensive at prediction time and can be very slow to run.
References:
Machine Learning in Action (Peter Harrington)
k-Nearest Neighbor algorithm