K-Nearest Neighbor Algorithm (k-NN)

I. Overview

  The k-NN algorithm classifies a new sample by measuring the distances between its feature values and those of known samples.

1. Working principle:

There is a sample data set, also called the training sample set, in which every sample carries a label; that is, we know which category each sample in the set belongs to. When new, unlabeled data arrives, we compare each of its features with the corresponding features of the samples in the set and find the k most similar samples (the nearest neighbors). The category that occurs most frequently among those k neighbors becomes the classification of the new data.

  Usually k is an integer not greater than 20. To make the majority-voting rule ("the minority obeys the majority") less likely to tie, k is generally chosen to be an odd number, as the small sketch below illustrates.
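
As a small illustration (the neighbor labels here are made up, not from the article), an even k can produce a tie under majority voting in a two-class problem, while an odd k cannot:

from collections import Counter

# Hypothetical labels of the nearest neighbors for some query point
neighbors_k4 = ['Romance', 'Romance', 'Action', 'Action']  # k = 4: a tie
neighbors_k3 = ['Romance', 'Romance', 'Action']            # k = 3: clear winner

print(Counter(neighbors_k4).most_common())   # [('Romance', 2), ('Action', 2)]
print(Counter(neighbors_k3).most_common(1))  # [('Romance', 2)]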

2. Example Analysis: Movie Classification

First, we extract two features from action movies and romance movies: the number of fight scenes and the number of kiss scenes. The statistics of these two features for the six known movies and the unknown movie are as follows:

 

[Figure 1: fighting and kissing feature statistics]

In this way, we can abstract the seven movies into seven points in a two-dimensional coordinate system, taking the two features as the X and Y coordinates of each point, for example:

 

[Figure 2: the abstracted feature data]

Then we can use a scatter plot to represent the abstracted data:

 

[Figure 3: scatter plot of the movie classification]
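
If you want to reproduce such a plot, a minimal matplotlib sketch might look like this. The feature counts below are assumptions modeled on the Machine Learning in Action example, since the article's figures are not reproduced here:

import matplotlib.pyplot as plt

# Assumed (fight scenes, kiss scenes) counts for the six known movies
romance = [(3, 104), (2, 100), (1, 81)]   # Romance movies
action = [(101, 10), (99, 5), (98, 2)]    # Action movies
unknown = (18, 90)                        # the movie to classify

plt.scatter(*zip(*romance), c='blue', label='Romance')
plt.scatter(*zip(*action), c='red', label='Action')
plt.scatter(*unknown, c='yellow', label='Unknown')
plt.xlabel('Number of fight scenes')
plt.ylabel('Number of kiss scenes')
plt.legend()
plt.show()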

In this case, we need to calculate the distance between the unknown movie and each of the known movies, that is, the distance between the yellow point and the other points in Figure 3. Here we use the commonly used Euclidean distance formula:

d = √((x₁ − x₂)² + (y₁ − y₂)²)

(Other distance metrics can also be used.)
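
As a sanity check, a few lines of NumPy compute these distances from the assumed feature counts used in the scatter-plot sketch above (again, values borrowed from the Machine Learning in Action example, not the article's missing figures):

import numpy as np

# Assumed (fight scenes, kiss scenes) counts, as in the sketch above
movies = {
    'California Man': (3, 104),
    "He's Not Really into Dudes": (2, 100),
    'Beautiful Woman': (1, 81),
    'Kevin Longblade': (101, 10),
    'Robo Slayer 3000': (99, 5),
    'Amped II': (98, 2),
}
unknown = np.array([18, 90])  # the movie to classify

for name, features in movies.items():
    # Euclidean distance between the unknown movie and this known movie
    d = np.sqrt(np.sum((np.array(features) - unknown) ** 2))
    print('%-28s %6.1f' % (name, d))

The printed values match Table 1 up to small rounding differences.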
After calculation, we can obtain the following data:

Table 1: distances between the known movies and the unknown movie

Movie name                    Movie type    Distance from the unknown movie
California Man                Romance       20.5
He's Not Really into Dudes    Romance       18.7
Beautiful Woman               Romance       19.2
Kevin Longblade               Action        115.3
Robo Slayer 3000              Action        117.4
Amped II                      Action        118.9

If k is 3, we take the three points with the smallest distances. Among these three nearest neighbors there are 3 Romance movies and 0 Action movies, so Romance is the most frequent type, and we classify the unknown movie as Romance.

3. Pseudocode of the kNN classification algorithm:

For each point in the dataset with an unknown category, perform the following operations:
(1) calculate the distance between each point of known category and the current point;
(2) sort the distances in ascending order;
(3) select the k points closest to the current point;
(4) count the frequency of each category among these k points;
(5) return the most frequent category as the predicted category of the current point, as in the sketch below.
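
The five steps translate almost line for line into NumPy. Here is a minimal sketch (the name knn_classify is just for illustration; the fuller classify function in section II below follows the same steps):

import numpy as np
from collections import Counter

def knn_classify(query, data, labels, k):
    # (1) distances from the query point to every known point
    distances = np.sqrt(np.sum((data - query) ** 2, axis=1))
    # (2)(3) indices of the k smallest distances (ascending sort)
    nearest = distances.argsort()[:k]
    # (4)(5) most frequent label among those k neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Example with the movie data of Table 1 (feature counts assumed as above)
data = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = ['Romance', 'Romance', 'Romance', 'Action', 'Action', 'Action']
print(knn_classify(np.array([18, 90]), data, labels, 3))  # -> 'Romance'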

4. Advantages and disadvantages

Advantages:

The algorithm is simple and easy to implement, and it is insensitive to outliers.

Disadvantages:

High space complexity: all known instances must be stored, which requires a large amount of space.
High computational complexity: every instance to be classified must be compared against all known instances.

II. Example: Handwriting Recognition System

The program runs on Python 3.6.

# -*- coding: UTF-8 -*-

from numpy import *
import operator
from os import listdir


def classify(inX, dataSet, labels, k):
    """
    :param inX: sample data to classify
    :param dataSet: known data
    :param labels: classification labels of the known data
    :param k: the chosen k value
    :return: the predicted classification label for inX
    """
    dataSetSize = dataSet.shape[0]  # number of rows in the matrix

    # Calculate the Euclidean distances
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5

    sortedDistIndicies = distances.argsort()  # indices sorted by distance, ascending
    classCount = {}

    # Select the k nearest points and tally their labels
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)

    return sortedClassCount[0][0]


def img2vector(filename):
    """
    :param filename: name of the input file holding the text data
    :return: the 32x32 text image as a 1x1024 array
    """
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # get the file names in the directory
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        # Parse the file name; the format used in this program is:
        # <digit>_<sample number>.txt
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])

        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)  # read the test data
        # Classify the test data
        classifierResult = classify(vectorUnderTest,
                                    trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))
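
If you want to run the test yourself, you need the trainingDigits and testDigits directories from the book's accompanying dataset in the working directory (an assumption here; adjust the paths if your copies live elsewhere). Then a minimal entry point suffices:

if __name__ == '__main__':
    handwritingClassTest()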

Running result:

Using the k-Nearest Neighbor algorithm to recognize the handwritten digits in this way gives an error rate of 1.4%.

III. Summary

kNN is a classification algorithm in machine learning and belongs to supervised learning. It is one of the simplest and most effective algorithms for classifying data. However, its execution efficiency is low and prediction is time-consuming, because every query must be compared against all stored training instances.

 

References:
Machine Learning in Action, Peter Harrington

 
