Machine Learning in Action (Chapter 2: k-Nearest Neighbors algorithm)
Today I studied the second chapter. Here is a brief summary to deepen my understanding and to describe the algorithm in my own words.
Distance Calculation
The distance between instances is computed as the Euclidean (L2) distance in vector space.
More generally, any Lp distance can be used, for example the L1 (Manhattan) distance.
Put simply: in a large sample set, each instance has 3 or more attributes. One of them is necessarily the class attribute, and the rest are numeric feature attributes (even a nominal attribute can be converted to numeric by some means). Each instance is thus a vector of feature values, and the sample set is a collection of such vectors.
For example:

| Height | Weight | Age | Gender |
|--------|--------|-----|--------|
| 170    | 140    | 22  | Male   |
| 160    | 100    | 21  | Female |
"Gender" can be viewed as a categorical attribute, and then others look at feature attributes, forming an instance vector for [170,140,22] and [160,100,21]
First, the algorithm steps:
1. Compute the distance between the current point and every point in the dataset of known categories;
2. Sort the points in order of increasing distance;
3. Select the k points closest to the current point;
4. Count the frequency of each category among those k points. (k is the number of nearest neighbors to use, and the choice of k is quite sensitive: a smaller k means a more complex model that overfits easily, while a larger k means a simpler model with a larger approximation error. In practice a relatively small k is usually chosen, and cross-validation is used to pick the best value.)
5. Return the most frequent category among the k points as the predicted class of the current point.
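Step 4 mentions choosing k by cross-validation. A minimal leave-one-out sketch on a made-up toy dataset (the data, `knn_predict`, and `loo_error` are all illustrative, not from the book):

```python
import numpy as np

def knn_predict(x, X, y, k):
    # L2 distance from x to every training point, then majority vote among the k nearest
    d = np.sqrt(((X - x) ** 2).sum(axis=1))
    nearest = y[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

def loo_error(X, y, k):
    # leave-one-out cross-validation: classify each point using all the others
    errs = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        if knn_predict(X[i], X[mask], y[mask], k) != y[i]:
            errs += 1
    return errs / len(X)

# toy 2-D data: two well-separated clusters
X = np.array([[1.0, 1.1], [1.0, 1.0], [0.9, 1.2],
              [0.0, 0.0], [0.1, 0.1], [0.0, 0.2]])
y = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

# pick the k with the lowest leave-one-out error
best_k = min([1, 3, 5], key=lambda k: loo_error(X, y, k))
```

On this toy data a small k already classifies every held-out point correctly; on real data the error curve over k is what guides the choice.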
Second, a worked example (Python). First, implement a kNN classifier function following the steps above:
import operator
from numpy import tile

'''
inX:     input vector to classify
dataSet: training sample set
labels:  label vector
k:       number of nearest neighbors to use
'''
def classify0(inX, dataSet, labels, k):
    # distance calculation
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()  # indices sorted by increasing distance
    # select the k points with the smallest distance
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # sort categories by vote count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
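A quick usage check of classify0 on the book's toy dataset from createDataSet (the function body is repeated here, condensed, so the snippet runs on its own):

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # same logic as above: L2 distances, then majority vote among the k nearest
    distances = np.sqrt(((np.tile(inX, (dataSet.shape[0], 1)) - dataSet) ** 2).sum(axis=1))
    classCount = {}
    for i in distances.argsort()[:k]:
        classCount[labels[i]] = classCount.get(labels[i], 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# the book's toy dataset: two points per class
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, labels, 3))  # a point sitting on the B cluster -> 'B'
```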
Next, a concrete example: handwritten digit recognition.
Data description:
Each sample is a 32x32 image stored as a txt text file. We need to convert each sample into vector form: each 32x32 image becomes a 1x1024 vector, and all training samples are stacked into a matrix, one row per instance.
# convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])
    return returnVec
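To see img2vector in action without the book's digits dataset, a throwaway 32x32 text file can be generated (the fake file and its path are purely illustrative; the real files live under digits/trainingDigits). The function is repeated so the snippet is self-contained:

```python
import os
import tempfile
import numpy as np

def img2vector(filename):
    # same as above: flatten a 32x32 text image into a 1x1024 row vector
    returnVec = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVec[0, 32 * i + j] = int(lineStr[j])
    return returnVec

# build a fake 32x32 "image": a single diagonal of 1s
lines = []
for i in range(32):
    row = ['0'] * 32
    row[i] = '1'
    lines.append(''.join(row))
path = os.path.join(tempfile.mkdtemp(), 'fake_digit.txt')
with open(path, 'w') as f:
    f.write('\n'.join(lines) + '\n')

vec = img2vector(path)
print(vec.shape, int(vec.sum()))  # (1, 1024) 32
```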
Using the classify0 function written above, we load the formatted training data, classify the test set, and compute the error rate.
import os

def handwritingClassTest():
    hwLabels = []
    # load the training set: list all file names in the directory
    trainingFileList = os.listdir('digits/trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]         # e.g. '7_173.txt'
        fileStr = fileNameStr.split('.')[0]       # '7_173'
        classNumStr = int(fileStr.split('_')[0])  # 7
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('digits/trainingDigits/' + fileNameStr)
    # load the test set
    testFileList = os.listdir('digits/testDigits')
    errorCount = 0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]       # '7_173'
        classNumStr = int(fileStr.split('_')[0])  # 7
        vectorUnderTest = img2vector('digits/testDigits/' + fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1
    print("the total number of errors is: %d" % errorCount)
    print("the total error rate is: %f" % (errorCount / float(mTest)))
Summary:
Advantages:
1. High accuracy.
2. Insensitive to outliers.
3. Makes no assumptions about the distribution of the data.
Disadvantages:
1. kNN has no explicit training phase the way other algorithms do; all the work is deferred to prediction time.
2. kNN can have large errors on class-imbalanced training sets.
3. The computational cost is high: every test sample must be compared against every training sample to compute distances.
4. It tells us nothing about what a typical instance of each class looks like, and gives no information about the underlying structure of the data.