"Machine learning" KNN algorithm


While learning the basics of machine learning, I write down notes on what I read; the code in this post is based on the book Machine Learning in Action.

I. Overview

The KNN algorithm is also called the k-nearest neighbor classification algorithm.

The KNN algorithm finds the k records in the training set that are closest to a new data point, then assigns the new point the majority category among those records. The algorithm involves three main factors: the training set, the distance or similarity measure, and the size of k.

II. Key points of the algorithm

1. Guiding idea
The guiding idea of KNN is the proverb "one who stays near vermilion turns red, one who stays near ink turns black": infer an object's category from its neighbors.

The calculation steps are as follows:
1) Compute distances: given a test object, calculate its distance to every object in the training set.
2) Find neighbors: take the k nearest training objects as the neighbors of the test object.
3) Classify: assign the test object to the majority category among its k nearest neighbors.
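The three steps above can be sketched in plain Python. This is a minimal illustration (the `knn_classify` helper and toy data are my own, not from the book); the book's `classify0` further below does the same thing with NumPy:

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_classify(test_point, training_set, labels, k):
    """Classify test_point by majority vote among its k nearest neighbors."""
    # 1) compute the distance from the test object to every training object
    distances = [dist(test_point, p) for p in training_set]
    # 2) take the k nearest training objects as neighbors
    neighbors = sorted(range(len(training_set)), key=lambda i: distances[i])[:k]
    # 3) classify by the majority category among the k neighbors
    votes = Counter(labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]

# toy data: two small clusters with labels 'A' and 'B'
train = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']
print(knn_classify((0.1, 0.2), train, labels, 3))  # 'B'
```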

2. Distance or similarity measure
What is a suitable distance measure? A smaller distance should mean a greater likelihood that the two points belong to the same category.
Common distance measures include Euclidean distance, the cosine of the angle between vectors, and so on.
For text classification, cosine similarity is usually more appropriate than Euclidean distance.
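A small sketch of the two measures (my own illustration, not from the book) shows why cosine can suit text better: two term-count vectors pointing in the same direction have cosine similarity 1 even when their Euclidean distance is large, so a long document and a short document about the same topic still look similar:

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# two "documents" as term-count vectors: same direction, different length
doc1 = [1, 2, 0]
doc2 = [2, 4, 0]
print(euclidean(doc1, doc2))          # sqrt(5): nonzero, looks "far"
print(cosine_similarity(doc1, doc2))  # 1.0: identical direction
```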

3. Category decision
Majority voting: the minority yields to the majority; the test object is assigned to whichever category holds the most of its nearest neighbors. The criterion is frequency.
Weighted voting: each nearest neighbor's vote is weighted by its distance, with closer neighbors carrying more weight (e.g., weight = the inverse of the squared distance). The criterion is a quantitative score.
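Weighted voting can be sketched as follows (a hypothetical `weighted_vote` helper of my own, using the inverse-squared-distance weight mentioned above); note how one very close neighbor can outvote two farther ones:

```python
def weighted_vote(neighbor_labels, neighbor_distances):
    """Weighted voting: each neighbor's vote counts 1 / distance**2."""
    eps = 1e-9  # avoid division by zero when a neighbor matches exactly
    scores = {}
    for label, d in zip(neighbor_labels, neighbor_distances):
        scores[label] = scores.get(label, 0.0) + 1.0 / (d * d + eps)
    return max(scores, key=scores.get)

# two 'B' neighbors farther away lose to one very close 'A' neighbor
print(weighted_vote(['A', 'B', 'B'], [0.1, 1.0, 1.2]))  # 'A'
```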

III. Advantages and disadvantages

1. Advantages
Simple, easy to understand and implement; no parameters to estimate and no training phase.
Suitable for classifying rare events (e.g., building a churn prediction model when the churn rate is low, say below 0.5%).
Especially suitable for multi-class problems (multi-modal: an object can carry more than one category label); for example, when assigning genes to functional classes based on their characteristics, KNN performs better than SVM.

2. Shortcomings
KNN is a lazy algorithm: classifying test samples is computationally heavy and memory-intensive, so scoring is slow.
Unlike a decision tree, it cannot produce interpretable decision rules.

IV. Using KNN for handwriting recognition

The training data are binary-valued gray-scale images collected from a handwriting panel, stored as 32x32 text files representing the digits 0~9. Each file in the folder is named a_b.txt, where a is the true digit and b is the index of the sample (having more data generally brings predictions closer to the true value).
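Given that naming convention, the true digit can be recovered from the file name alone; a quick sketch (the file name here is a made-up example):

```python
# hypothetical file name following the a_b.txt convention described above:
# 'a' is the true digit, 'b' is the sample index
filename = "7_45.txt"
stem = filename.split('.')[0]           # drop the .txt extension -> "7_45"
true_digit = int(stem.split('_')[0])    # take the part before the underscore
print(true_digit)  # 7
```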


Another folder contains similarly named data files used to verify the accuracy of the trained classifier; we call these the test data.

The code needs three functions:

def classify0(inX, dataSet, labels, k) -- classifies a single input sample inX; dataSet is the training data, labels the training categories, k the number of neighbors

def img2vector(filename) -- converts the 32x32 data in the file filename into a 1x1024 vector

def handwritingClassTest() -- runs the classifier on the test data and reports the error rate

To improve speed, I use only 20 training samples for each of the digits 0~9. The full source code is available in the Machine Learning in Action companion code: http://vdisk.weibo.com/s/uEZesAafcjQgx?sudaref=www.baidu.com

The code uses the NumPy library, which computes efficiently over large amounts of data.


NumPy usage cheat sheet:

>>> tile([0, 0], (1, 2))
array([[0, 0, 0, 0]])
>>> tile([0, 0], (2, 1))
array([[0, 0],
       [0, 0]])

The first argument is the array a.
When the second argument is a single number, it gives how many times the elements of a are repeated.
When it is a pair (x, y), y gives how many times the elements of a are repeated, and x repeats the result of that operation x times.

>>> b = np.arange(12).reshape(3, 4)
>>> b
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> b.sum(axis=0)     # sum of each column; for the meaning of axis, see the earlier notes on arrays
array([12, 15, 18, 21])
>>> b.min(axis=1)     # minimum of each row
array([0, 4, 8])
>>> b.cumsum(axis=1)  # cumulative sum along each row
array([[ 0,  1,  3,  6],
       [ 4,  9, 15, 22],
       [ 8, 17, 27, 38]])


More NumPy notes will be added in a later post.

knn.py

#!/usr/bin/env python
# coding=utf-8
from numpy import *
import operator
from os import listdir

def classify0(inX, dataSet, labels, k):
    # inX ------ [x, x, x, x]
    # dataSet -- array([[x, x, x, x], [x, x, x, x]])
    # labels --- [x, x]
    # k -------- n
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the dict by value in descending order; the result is a list of pairs
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename, 'r')
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

trainFile = 'F:\\python\\pyproject\\ml\\codes\\machinelearninginaction\\Ch02\\training20\\'
testFile = 'F:\\python\\pyproject\\ml\\codes\\machinelearninginaction\\Ch02\\testDigits\\'

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir(trainFile)        # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]      # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector(trainFile + fileNameStr)
    testFileList = listdir(testFile)             # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]      # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector(testFile + fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if classifierResult != classNumStr:
            errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount / float(mTest))

Then call knn.handwritingClassTest() from test.py and the program runs:

test.py

#!/usr/bin/env python
# coding=utf-8
import knn

knn.handwritingClassTest()


The error rate comes out to 10.68%, which is quite high; increasing the amount of training data should bring it down somewhat.




"Machine learning" KNN algorithm
