K-Nearest Neighbor Algorithm (K-nearest Neighbor), k-Nearest
I. Overview
The k-nn algorithm uses the distance measurement method between different feature values for classification.
1. Working principle:
There is a sample data set, also known as a training sample set, and each data in the sample set has tags, that is, we know the correspondence between each data in the sample set and its category. After entering new data without tags, compare each feature of the new data with the feature corresponding to the data in the sample set, and then extract the classification tag of the most similar data (nearest neighbor) in the sample set. Finally, select the k categories with the most frequent occurrences of the most similar data as the classification of the new data.
UsuallyK is an integer not greater than 20. Generally, to facilitate the use of a minority to obey the Majority of voting rules (Majority-voting), k is the prime number.
2. Example Analysis: Movie Classification
First, we extract two features from action movies and love movies-fighting and kissing. Statistics on the two features of the six movies and the unknown movies are as follows:
Figure 1 fighting and kissing feature statistics
In this way, we can abstract seven movies into seven points in a two-dimensional coordinate system, and abstract the two features into the X and Y coordinate values of the corresponding points, for example:
Figure 2: abstracted feature data
Then we can use scatter plots to represent the abstract data:
Figure 3: scatter plot of movie Classification
In this case, we need to calculate the distance between different feature values, that is, the distance between the yellow point and other points in figure 3. Here we use the commonly used Euclidean Distance formula (Euclidean Distance)
(Other algorithms can be used for distance calculation .)
After calculation, we can obtain the following data:
Table 1: distance between a known movie and an unknown movie |
Movie name |
Movie type |
Distance from an unknown movie |
California Man |
Romance |
20.5 |
He's Not Really into Dudes |
Romance |
18.7 |
Beautiful Woman |
Romance |
19.2 |
Kevin Longblade |
Action |
115.3 |
Robo Slayer 3000 |
Action |
117.4 |
Amped II |
Action |
118.9 |
If k is 3, we take the three points with the minimum distance value. Among the three vertices, there are 3 Romance types and 0 Action types, so the highest occurrence frequency of the Romance type is. Therefore, we determine that unknown movies belong to the Romance type.
3. pseudocode of KNN classification algorithm:
Perform the following operations on each vertex in the dataset with an unknown category attribute:
(1) calculate the distance between a point in a dataset of known classes and the current point;
(2) sort by ascending distance;
(3) Select k points with the minimum distance from the current point;
(4) determine the frequency of occurrence of the category of the first k points;
(5) return the category with the highest frequency of occurrence of the first k points as the prediction category of the current point.
4. algorithm advantages and disadvantages
Advantages:
The algorithm is simple and easy to implement. It is not sensitive to abnormal values.
Disadvantages:
High space complexity
Requires a large amount of space to store all known instances
High computing complexity
Compare all known instances and instances to be classified
Ii. Example: Handwriting Recognition System
The program runs on python3.6.
1 #-*-coding: UTF-8-*-2 3 from numpy import * 4 import operator 5 from OS import listdir 6 7 def classify (outputs, dataSet, labels, k ): 8 "9: param labels: Sample Data 10: param dataSet: Known Data 11: param labels: Classification label of known data 12: param k: Selected k value 13: return: returns the classification label 14 "15 dataSetSize = dataSet for sample data. shape [0] # obtain the number of matrix rows 16 17 # Calculate the Euclidean distance 18 diffMat = tile (partition, (dataSetSize, 1)-dataSet19 sqDiffMat = diffMat ** 220 sqDistances = s QDiffMat. sum (axis = 1) 21 distances = sqDistances ** 0.522 23 sortedDistIndicies = distances. argsort () # Sort indexes (from small to large) 24 classCount = {} 25 26 # select the smallest k points 27 for I in range (k ): 28 voteIlabel = labels [sortedDistIndicies [I] 29 classCount [voteIlabel] = classCount. get (voteIlabel, 0) + 130 31 sortedClassCount = sorted (classCount. items (), 32 key = operator. itemgetter (1), reverse = True) 33 34 return sortedClassCount [0] [0] 35 36 37 def img2vector (filename): 38 "" 39: param filename: name of the input file, used to obtain text data 40: return: return text data in array format 41 "42 returnVect = zeros (1, 1024) 43 fr = open (filename) 44 for I in range (32 ): 45 lineStr = fr. readline () 46 for j in range (32): 47 returnVect [0, 32 * I + j] = int (lineStr [j]) 48 return returnVect49 50 def handwritingClassTest (): 51 hwLabels = [] 52 trainingFileList = listdir ('trainingdigits ') # Get the content in the directory (File Name () 53 m = len (trainingFileList) 54 trainingMat = zeros (m, 1024) 55 for I in range (m ): 56 fileNameStr = trainingFileList [I] 57 fileStr = fileNameStr. split ('. ') [0] 58 classNumStr = int (fileStr. split ('_') [0]) 59 hwLabels. append (classNumStr) 60 trainingMat [I,:] = img2vector ('trainingdigits/% s' % fileNameStr) 61 testFileList = listdir ('testdigits ') 62 errorCount = 0.063 mTest = len (testFileList) 64 for I in ran Ge (mTest): 65''' 66 parses the file name 67 the file name format used in this program is: 68 bytes number _ number .txt 69 '''70 fileNameStr = testFileList [I] 71 fileStr = fileNameStr. split ('. ') [0] 72 classNumStr = int (fileStr. split ('_') [0]) 73 74 vectorUnderTest = img2vector ('testdigits/% s' % fileNameStr) # input test data 75 # classify test data 76 classifierResult = classify (vectorUnderTest, 77 trainingMat, hwLabels, 3) 78 print ("the classifier came back with: % d, the real answe R is: % d "\ 79% (classifierResult, classNumStr) 80 if (classifierResult! = ClassNumStr): errorCount + = 1.081 print ("the total number of errors is: % d" % errorCount) 82 print ("the total error rate is: % f "% (errorCount/float (mTest )))
Running result:
We can see that the k-Nearest Neighbor algorithm is used to recognize handwritten numbers, with an error rate of 1.4%.
Iii. Summary
KNN is a classification algorithm in machine learning and belongs to supervised learning. It is the simplest and most effective algorithm for data classification. However, the execution efficiency is low and the operation is time-consuming.
References:
Machine learning practices