A brief introduction to the k-nearest-neighbor (KNN) algorithm

Source: Internet
Author: User
First, the algorithm flow

(1) Collect data: any method can be used;

(2) Prepare data: the numeric values required for the distance calculation, preferably in a structured data format;

(3) Analyze the data: any method can be used;

(4) Train the algorithm: this step does not apply to the k-nearest-neighbor algorithm;

(5) Test the algorithm: compute the error rate;

(6) Use the algorithm: first input sample data and structured output values, then run the k-nearest-neighbor algorithm to determine which class the input data belongs to, and finally perform whatever follow-up processing the computed classification calls for.
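Step (5), computing the error rate, is simply the fraction of test points whose predicted class disagrees with the true class. A minimal sketch, using made-up predicted and actual labels for illustration:

```python
import numpy as np

# Illustrative predictions and ground-truth labels for six test points.
predicted = np.array([1, 3, 2, 2, 1, 3])
actual = np.array([1, 3, 2, 1, 1, 2])

# Error rate = fraction of mismatches (here 2 of 6 points are wrong).
error_rate = np.mean(predicted != actual)
print(error_rate)  # -> 0.3333...
```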

Second, the implementation of the algorithm

For each point in the data set whose class label is unknown, do the following:

(1) compute the distance between the current point and every point in the data set with known class labels;

(2) sort the distances in increasing order;

(3) select the k points with the smallest distances from the current point;

(4) determine the frequency of each class among those k points;

(5) return the most frequent class among the k points as the predicted class of the current point.
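The five steps above can be condensed into a short NumPy sketch (the function and variable names here are illustrative, not part of the original code):

```python
import numpy as np
from collections import Counter

def knn_classify(x, data, labels, k):
    """Classify point x against the labeled rows of data using kNN."""
    # (1) Euclidean distance from x to every known point
    dists = np.sqrt(((data - x) ** 2).sum(axis=1))
    # (2)-(3) sort by distance and keep the k nearest points
    nearest = dists.argsort()[:k]
    # (4)-(5) count class frequencies and return the most common class
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.9, 1.0]), data, labels, 3))  # -> 'A'
```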

Third, the code in detail

(Setting up a Python development environment, including the installation of scientific-computing libraries such as numpy, scipy and matplotlib, is not covered here; instructions are easy to find online.)

(1) Enter the Python development environment, then write and save the following code; in this article the code is saved as "KNN.py";

import operator
from os import listdir

import numpy as np


# k-nearest-neighbor classifier
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Compute the Euclidean distance from inX to every point in dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # Select the k points with the smallest distances
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # Sort the classes by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# Basic helper function that creates a small sample data set
def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


# Parse a text file of records into a NumPy matrix and a label list
def file2matrix(filename):
    with open(filename) as fr:
        numberOfLines = len(fr.readlines())  # get the number of lines in the file
    returnMat = np.zeros((numberOfLines, 3))  # prepare the matrix to return
    classLabelVector = []  # prepare the labels to return
    with open(filename) as fr:
        for index, line in enumerate(fr.readlines()):
            listFromLine = line.strip().split('\t')
            returnMat[index, :] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector


# Normalize each feature value to the range [0, 1]
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals


# Test the classifier on the dating data set
def datingClassTest():
    hoRatio = 0.50  # hold out 50% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)


# Convert a 32x32 binary image file into a 1x1024 vector
def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect


# Test code for the handwritten-digit recognition system
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])  # parse the class number from the file name
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))
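As a quick check of the parsing and normalization logic in file2matrix and autoNorm, here is a self-contained sketch: it builds a small tab-separated file in the datingTestSet2.txt layout (three numeric features and an integer label per line; the values are made up for illustration) and then scales each feature to [0, 1] the same way autoNorm does.

```python
import numpy as np

# A small tab-separated sample in the datingTestSet2.txt layout
# (three numeric features and an integer class label per line; made-up values).
sample = "40920\t8.3\t0.95\t3\n14488\t7.15\t1.67\t2\n26052\t1.44\t0.80\t1\n"
with open("sample.txt", "w") as f:
    f.write(sample)

# Parse the file into a feature matrix and a label list (as file2matrix does).
rows = [line.strip().split("\t") for line in open("sample.txt")]
mat = np.array([[float(x) for x in r[:3]] for r in rows])
labels = [int(r[-1]) for r in rows]

# Scale each feature column to [0, 1] (as autoNorm does).
min_vals = mat.min(0)
ranges = mat.max(0) - min_vals
norm = (mat - min_vals) / ranges
print(norm.min(0), norm.max(0))  # each column now spans [0, 1]
```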
(2) In the Python interactive interface, enter the following commands to import the module written above:
>>> import KNN
>>> group, labels = KNN.createDataSet()
(3) Analyze the data: use matplotlib to create a scatter plot (this assumes datingDataMat and datingLabels have already been loaded with KNN.file2matrix('datingTestSet2.txt')):
>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> from numpy import array
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0*array(datingLabels), 15.0*array(datingLabels))
>>> plt.show()
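The same plot can also be produced non-interactively. A self-contained sketch, using synthetic stand-ins for datingDataMat and datingLabels (the real ones come from file2matrix) and the off-screen Agg backend so no display window is needed:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

# Synthetic stand-ins for datingDataMat / datingLabels (illustrative values).
rng = np.random.default_rng(0)
datingDataMat = rng.random((30, 3))
datingLabels = rng.integers(1, 4, size=30)

fig = plt.figure()
ax = fig.add_subplot(111)
# Scale marker size and color by class label, as in the interactive session.
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * np.array(datingLabels), 15.0 * np.array(datingLabels))
fig.savefig("scatter.png")
```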
(4) Test the output results:
>>> KNN.handwritingClassTest()
Fourth, advantages and disadvantages of the algorithm. Advantages: simple and effective, high accuracy, insensitive to outliers, and no assumptions about the input data. Disadvantages: high computational and space complexity; since a distance must be computed to every point in the data set, classification can be very time-consuming; and the algorithm gives no information about the underlying structure of the data, so we cannot know what a typical example of each class looks like.

Note: data files referenced in the code, such as datingTestSet2.txt, need to be placed in the same directory as the code.
