K-Nearest Neighbor Algorithm (k-NN)

Last Update:2018-02-05 Source: Internet

Author: User

I. Overview

The k-NN algorithm classifies data by measuring the distance between feature values.

1. Working principle:

We start with a sample data set, also called the training sample set, in which every sample has a label; that is, we know which category each sample belongs to. When new, unlabeled data arrives, we compare each of its features with the corresponding features of the samples in the training set and extract the class labels of the most similar samples (the nearest neighbors). In general we look only at the k most similar samples, which is where the k in the name comes from, and assign the new data the category that occurs most often among those k samples.

Usually k is an integer not greater than 20. Because the result is decided by majority voting (the minority yields to the majority), k is generally chosen to be an odd number so that a vote between two categories cannot end in a tie.
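A minimal sketch of majority voting, showing how an even k can leave a two-class vote tied while an odd k cannot. The helper name `vote` and the neighbor labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def vote(neighbor_labels):
    """Return (winning_label, is_tie) for a list of neighbor labels."""
    counts = Counter(neighbor_labels).most_common()
    top_label, top_count = counts[0]
    # A tie means a second label received as many votes as the first.
    is_tie = len(counts) > 1 and counts[1][1] == top_count
    return top_label, is_tie


# Hypothetical labels of the nearest neighbors, ordered by distance.
nearest = ["Romance", "Action", "Romance", "Action", "Romance"]

print(vote(nearest[:4]))  # k = 4: two votes each, so the vote is tied
print(vote(nearest[:3]))  # k = 3: Romance wins 2 to 1
```

With more than two categories an odd k no longer rules out ties, so a real implementation still needs some tie-breaking rule.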

2. Example Analysis: Movie Classification

First, we extract two features from action movies and romance movies: fighting and kissing. The statistics for these two features across six known movies and one unknown movie are as follows:

Figure 1: fighting and kissing feature statistics

In this way, each of the seven movies becomes a point in a two-dimensional coordinate system, with the two features as the X and Y coordinates of that point, for example:

Figure 2: abstracted feature data

Then we can use scatter plots to represent the abstract data:

Figure 3: scatter plot of movie classification

Next, we need to calculate the distance between feature values, that is, the distance between the yellow point and each of the other points in Figure 3. Here we use the common Euclidean distance: for two points (x1, y1) and (x2, y2), d = sqrt((x1 - x2)^2 + (y1 - y2)^2).

(Other distance measures can also be used.)
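As a small sketch, the Euclidean distance between two points with any number of features can be computed as follows. The function name `euclidean_distance` and the sample points are assumptions for illustration; they are not the movie data from the figures.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of features."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```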
After calculation, we can obtain the following data:

Table 1: distance between a known movie and an unknown movie
Movie name	Movie type	Distance from an unknown movie
California Man	Romance	20.5
He's Not Really into Dudes	Romance	18.7
Beautiful Woman	Romance	19.2
Kevin Longblade	Action	115.3
Robo Slayer 3000	Action	117.4
Amped II	Action	118.9

If k is 3, we take the three points with the smallest distances. Among these three points there are 3 Romance movies and 0 Action movies, so Romance is the most frequent type. Therefore, we classify the unknown movie as Romance.
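The same vote can be reproduced with a few lines of Python. This is only a sketch: the distances are copied from Table 1, while the variable names are chosen here for illustration.

```python
from collections import Counter

# (movie name, movie type, distance from the unknown movie), copied from Table 1.
movies = [
    ("California Man", "Romance", 20.5),
    ("He's Not Really into Dudes", "Romance", 18.7),
    ("Beautiful Woman", "Romance", 19.2),
    ("Kevin Longblade", "Action", 115.3),
    ("Robo Slayer 3000", "Action", 117.4),
    ("Amped II", "Action", 118.9),
]

k = 3
# Keep the k movies with the smallest distance.
nearest = sorted(movies, key=lambda m: m[2])[:k]
# Count how often each movie type occurs among them.
votes = Counter(movie_type for _, movie_type, _ in nearest)
predicted = votes.most_common(1)[0][0]

print([name for name, _, _ in nearest])
print(predicted)  # Romance
```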

3. Pseudocode of the KNN classification algorithm:

Perform the following operations on each point in the dataset whose category is unknown:
(1) calculate the distance between every point in the dataset of known categories and the current point;
(2) sort the points by ascending distance;
(3) select the k points closest to the current point;
(4) count how often each category occurs among these k points;
(5) return the most frequent category among these k points as the predicted category of the current point.
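The five steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a tuned implementation; the function name `knn_classify` and the sample points and labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(query, points, labels, k):
    """Classify `query` by majority vote among its k nearest labeled points."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of known points")
    # (1) distance from every known point to the query point
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p))) for p in points
    ]
    # (2) indices of the known points, sorted by ascending distance
    order = sorted(range(len(points)), key=lambda i: distances[i])
    # (3) the k closest points
    nearest = order[:k]
    # (4) frequency of each category among those k points
    counts = Counter(labels[i] for i in nearest)
    # (5) the most frequent category is the prediction
    return counts.most_common(1)[0][0]


# Hypothetical training data: two small clusters labeled "A" and "B".
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]

print(knn_classify((0.1, 0.2), points, labels, k=3))  # B
print(knn_classify((0.9, 0.9), points, labels, k=3))  # A
```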

4. Algorithm advantages and disadvantages

Advantages:

The algorithm is simple and easy to implement. It is not sensitive to outliers.

Disadvantages:

High space complexity: a large amount of space is required to store all known instances.
High computational complexity: every instance to be classified must be compared with all known instances.

II. Example: Handwriting Recognition System

The program runs on Python 3.6.

# -*- coding: UTF-8 -*-

import operator
from os import listdir

import numpy as np


def classify(inX, dataSet, labels, k):
    """
    :param inX: sample data to classify
    :param dataSet: known data
    :param labels: classification labels of the known data
    :param k: selected k value
    :return: the predicted classification label
    """
    dataSetSize = dataSet.shape[0]  # number of rows in the matrix

    # Calculate the Euclidean distance to every known sample
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5

    sortedDistIndicies = distances.argsort()  # indexes sorted from small to large
    classCount = {}

    # Select the k points with the smallest distance and count their labels
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)

    return sortedClassCount[0][0]


def img2vector(filename):
    """
    :param filename: name of the input file, used to obtain text data
    :return: the text data as a 1 x 1024 array
    """
    returnVect = np.zeros((1, 1024))  # the shape must be passed as a tuple
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingdigits')  # file names in the directory
    m = len(trainingFileList)
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingdigits/%s' % fileNameStr)
    testFileList = listdir('testdigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        # Parse the file name. The file name format used in this program is:
        # number_number.txt (the first number is the digit the image shows)
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])

        vectorUnderTest = img2vector('testdigits/%s' % fileNameStr)  # input test data
        # Classify the test data
        classifierResult = classify(vectorUnderTest,
                                    trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))


if __name__ == '__main__':
    handwritingClassTest()

Running result:

We can see that when the k-Nearest Neighbor algorithm is used to recognize handwritten digits, the error rate is 1.4%.
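To make the input format concrete, the following sketch writes one 32 x 32 text image to a temporary directory, flattens it into 1024 values, and reads the digit label from the file name. The helper names `parse_label` and `flatten_image`, the file name `3_12.txt`, and the image content are hypothetical; only the 32 x 32 layout and the number_number.txt naming come from the program above.

```python
import os
import tempfile


def parse_label(file_name):
    """Extract the digit label from a name such as '3_12.txt'."""
    return int(file_name.split('.')[0].split('_')[0])


def flatten_image(path, size=32):
    """Read a size x size grid of '0'/'1' characters into a flat list."""
    pixels = []
    with open(path) as f:
        for _ in range(size):
            line = f.readline()
            pixels.extend(int(ch) for ch in line[:size])
    return pixels


# Build a hypothetical image: all zeros except one row of ones.
rows = ['0' * 32 for _ in range(32)]
rows[5] = '1' * 32

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, '3_12.txt')
    with open(path, 'w') as f:
        f.write('\n'.join(rows) + '\n')
    vector = flatten_image(path)
    label = parse_label(os.path.basename(path))

print(len(vector), sum(vector), label)  # 1024 32 3
```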

III. Summary

KNN is a classification algorithm in machine learning and belongs to supervised learning. It is one of the simplest algorithms for data classification and is often effective. However, its execution efficiency is low: every prediction must compare the new data with all stored samples, which is time-consuming.

References:
Machine Learning in Action

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
