A Simple Introduction to the kNN (k-Nearest Neighbors) Algorithm

Machine Learning in Action (Chapter 2: The k-Nearest Neighbors Algorithm)

Today I studied the second chapter. Here is a short summary of what I understood, to deepen my own understanding and to describe the algorithm in my own words.


Distance Calculation

kNN is usually based on the Euclidean (L2) distance between vectors in feature space.

More generally, any Lp (Minkowski) distance can be used; the L1 case is the Manhattan distance.
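As a quick sketch (using numpy, which the code later in this post also relies on), the L2 and L1 distances between two feature vectors can be computed like this; the two vectors reuse the height/weight/age example below:

```python
import numpy as np

a = np.array([170, 140, 22])  # instance 1: height, weight, age
b = np.array([160, 100, 21])  # instance 2

# L2 (Euclidean) distance: square root of the sum of squared differences
l2 = np.sqrt(((a - b) ** 2).sum())

# L1 (Manhattan) distance: sum of absolute differences
l1 = np.abs(a - b).sum()

print(l2)  # sqrt(10^2 + 40^2 + 1^2) = sqrt(1701) ≈ 41.24
print(l1)  # 10 + 40 + 1 = 51
```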


Put simply: in a large sample set, each instance has several attributes. One of them is the class (categorical) attribute; the rest are numeric feature attributes (even nominal attributes can be converted to numbers by some means). Each instance's feature values form a vector, and the sample set is the collection of these vectors.

For example, consider the following data:

Height  Weight  Age  Gender
170     140     22   Man
160     100     21   Woman

"Gender" can be viewed as a categorical attribute, and then others look at feature attributes, forming an instance vector for [170,140,22] and [160,100,21]


First, the algorithm steps:
1. Calculate the distance between each point in the known-category dataset and the current point;
2. Sort the points by increasing distance;
3. Select the k points closest to the current point;
4. Count the frequency of each category among those k points.
(k is the number of nearest neighbors to use, and the choice of k is very sensitive. A smaller k means a more complex model, which easily overfits; a larger k means a simpler overall model, with a larger approximation error. In practice, a relatively small k is generally used, chosen by cross-validation.)
5. Return the most frequent category among the k points as the predicted class of the current point.
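The book passage only mentions that cross-validation is used to pick k. As an illustrative sketch (not from the book, with a made-up toy dataset), leave-one-out cross-validation for selecting k could look like this:

```python
import numpy as np

def knn_predict(x, data, labels, k):
    # brute-force kNN vote: sort by Euclidean distance, tally the k nearest labels
    dists = np.sqrt(((data - x) ** 2).sum(axis=1))
    nearest = dists.argsort()[:k]
    votes = {}
    for i in nearest:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

# toy two-class dataset (hypothetical numbers, for illustration only)
data = np.array([[1.0, 1.1], [1.0, 1.0], [1.2, 0.9],
                 [0.0, 0.0], [0.1, 0.1], [0.0, 0.2]])
labels = ['A', 'A', 'A', 'B', 'B', 'B']

def loo_error(k):
    # leave-one-out: classify each sample against all the others, count errors
    errors = 0
    for i in range(len(data)):
        rest = np.delete(data, i, axis=0)
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(data[i], rest, rest_labels, k) != labels[i]:
            errors += 1
    return errors / len(data)

best_k = min(range(1, 5), key=loo_error)
print(best_k, loo_error(best_k))
```

On this well-separated toy data every small k already gives zero leave-one-out error; on real data the error curve over k is what guides the choice.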

Second, a worked example (Python). First, write the kNN classifier function according to the steps above:

import operator
from numpy import tile

'''
inX: input vector to classify
dataSet: matrix of training samples
labels: label vector
k: number of nearest neighbors to use
'''
def classify0(inX, dataSet, labels, k):
    # distance calculation
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()  # indices sorted by increasing distance
    # select the k points with the smallest distances
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # sort the classes by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
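A quick sanity check on a toy dataset (the classifier is repeated inside the snippet so it runs on its own; the four points and labels are made up for illustration):

```python
import operator
from numpy import array, tile

def classify0(inX, dataSet, labels, k):
    # brute-force kNN: Euclidean distance, then majority vote among the k nearest
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0, 0], group, labels, 3))  # 'B': two of the three nearest points are 'B'
```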

Now a concrete example: handwritten digit recognition.

Data Description:



Each sample has to be converted into vector form. Each image is stored as a 32x32 grid in a txt file, so we turn it into a 1x1024 vector and store all the training samples as a matrix, one row per instance.

from numpy import zeros

# convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])
    fr.close()
    return returnVec
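To check the converter, one can write a fake 32x32 text "image" to a temporary file and convert it (the function is repeated here, and the file contents are made up, so the snippet runs standalone):

```python
import os
import tempfile
from numpy import zeros

def img2vector(filename):
    # same 32x32-text-to-1x1024-vector converter as above
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])
    fr.close()
    return returnVec

# fake digit file: all zeros except the first character of the second row
lines = ['0' * 32] * 32
lines[1] = '1' + '0' * 31
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(lines) + '\n')
    path = f.name

vec = img2vector(path)
os.remove(path)
print(vec.shape)   # (1, 1024)
print(vec[0, 32])  # 1.0 -- row 1, column 0 maps to index 32*1 + 0
```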


Using the classify0 function written above, we load the formatted data and compute the error rate.

import os

def handwritingClassTest():
    hwLabels = []
    # load the training set
    trainingFileList = os.listdir('digits/trainingDigits')  # all file names in the directory, as a list
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]         # e.g. '7_173.txt'
        fileStr = fileNameStr.split('.')[0]       # '7_173'
        classNumStr = int(fileStr.split('_')[0])  # 7
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('digits/trainingDigits/' + fileNameStr)
    # load the test set and classify each sample
    testFileList = os.listdir('digits/testDigits')
    errorCount = 0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/' + fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))
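For comparison (not part of the book's code, and assuming scikit-learn is installed), the same pipeline can be reproduced with scikit-learn's built-in KNeighborsClassifier on its bundled 8x8 digits dataset, which stands in for the 32x32 txt files here:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()  # 8x8 digit images, already flattened to 64-dim vectors
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

# k=3 neighbors, matching the classify0 call above
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
error_rate = 1.0 - clf.score(X_test, y_test)
print("the total error rate is: %f" % error_rate)
```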

Summary:

Advantages:
1. High accuracy;
2. Not sensitive to outliers;
3. No assumptions about the data distribution.

Disadvantages:
1. Unlike most other algorithms, kNN has no real training process;
2. kNN can produce large errors when the training samples are unevenly distributed across classes;
3. The amount of computation is large: every test sample must be compared against all training samples to compute distances;
4. It gives no idea of what a typical sample of each class looks like, i.e. no information about the underlying structure of the data.














