Using Python code examples to show the practical application of the KNN algorithm


The proximity algorithm, or k-nearest neighbor (KNN, k-NearestNeighbor) classification algorithm, is one of the simplest methods in data mining classification. "K nearest neighbors" means exactly that: each sample can be represented by its k closest neighbors.
The core idea of the KNN algorithm is that if the majority of the k samples nearest to a given sample in the feature space belong to a certain category, then the sample belongs to that category as well and shares the characteristics of that category's samples. In making its classification decision, the method relies only on the category of the nearest one or few samples, so only a very small number of adjacent samples are involved. Because KNN depends mainly on the surrounding neighbors rather than on discriminating class domains, it is better suited than other methods to sample sets whose class domains cross or overlap heavily.

[Figure: a green circle to be classified, surrounded by red triangles and blue squares]

In the figure, which class should the green circle be assigned to, the red triangles or the blue squares? If k = 3, the red triangles account for 2/3 of the nearest neighbors, so the green circle is assigned to the red-triangle class; if k = 5, the blue squares account for 3/5, so the green circle is assigned to the blue-square class.
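As a minimal illustration of this voting rule (a sketch added here, not from the original article; the neighbor labels and their ordering are hypothetical, chosen to match the classic figure):

from collections import Counter

# Hypothetical neighbor labels, ordered from nearest to farthest,
# chosen to reproduce the figure's scenario.
neighbors = ['red triangle', 'red triangle', 'blue square',
             'blue square', 'blue square']

for k in (3, 5):
    winner = Counter(neighbors[:k]).most_common(1)[0][0]
    print('k=%d -> %s' % (k, winner))
# prints: k=3 -> red triangle, then k=5 -> blue square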
The k-nearest neighbor (k-NN) classification algorithm is a theoretically mature method and one of the simplest machine learning algorithms. The idea: if most of the k samples most similar to a given sample in the feature space (that is, its nearest neighbors there) belong to a certain category, then the sample belongs to that category as well. In the KNN algorithm, the selected neighbors are all objects that have already been correctly classified. Although the KNN method depends on the limit theorem in principle, the classification decision involves only a very small number of adjacent samples.
The KNN algorithm can be used not only for classification but also for regression: find the k nearest neighbors of a sample and assign the average of those neighbors' attribute values to the sample. A more useful variant gives the neighbors different weights according to their distance from the sample, for example weights inversely proportional to distance, so that closer neighbors contribute more, as sketched below.
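A minimal sketch of such inverse-distance-weighted KNN regression (an illustration added here; the function name and toy data are hypothetical):

import numpy as np

def knn_regress(x, train_x, train_y, k=3, eps=1e-8):
    # Euclidean distance from x to every training sample
    dists = np.sqrt(((train_x - x) ** 2).sum(axis=1))
    nearest = dists.argsort()[:k]          # indices of the k nearest neighbors
    w = 1.0 / (dists[nearest] + eps)       # weights inversely proportional to distance
    return (w * train_y[nearest]).sum() / w.sum()  # weighted average of neighbor values

train_x = np.array([[0.0], [1.0], [2.0], [3.0]])
train_y = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_regress(np.array([1.2]), train_x, train_y, k=2))  # 1.6, pulled toward the nearer neighbor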

Predicting the gender of Douban movie users with the KNN algorithm
Summary

The premise of this experiment is that people of different genders prefer different types of films. Taking the 100 most recently watched movies of each of 274 fairly active Douban users, the movie types were tallied, yielding 37 movie types that serve as attribute features, with the user's gender as the label, to build the sample set. A KNN classifier for the gender of Douban movie users was then constructed, using 90% of the samples for training and 10% for testing; the accuracy can reach 81.48%.

Experimental data

The experiment uses Douban users' movie-viewing data: the 100 most recently watched movies of each of 274 selected Douban users. For every user, the number of movies of each type is counted. The data used in this experiment contain 37 movie types, so these 37 types serve as the user's attributes, and the value of each feature is the number of movies of that type among the user's 100 movies. The user's label is their gender; since Douban publishes no gender information, the labels were annotated manually.

The data format looks like this:

x_{1,1}, x_{1,2}, x_{1,3}, x_{1,4}, ..., x_{1,36}, x_{1,37}, y_1
x_{2,1}, x_{2,2}, x_{2,3}, x_{2,4}, ..., x_{2,36}, x_{2,37}, y_2
...
x_{274,1}, x_{274,2}, x_{274,3}, x_{274,4}, ..., x_{274,36}, x_{274,37}, y_274

Example:

0,0,0,3,1,34,5,0,0,0,11,31,0,0,38,40,0,0,15,8,3,9,14,2,3,0,4,1,1,15,0,0,1,13,0,0,1,1
0,1,0,2,2,24,8,0,0,0,10,37,0,0,44,34,0,0,3,0,4,10,15,5,3,0,0,7,2,13,0,0,2,12,0,0,0,0

There are 274 such lines of data, one per sample. The first 37 values of each line are the sample's 37 feature values; the last value is the label, i.e., the gender: 0 for male and 1 for female.

In this test, the first 10% of the samples were taken as test samples and the remainder as training samples.

First, all data are normalized. For each column j of the matrix, the maximum value max_j and the minimum value min_j are found, and every entry x_j in that column is replaced by

x_j' = (x_j - min_j) / (max_j - min_j).
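A minimal NumPy sketch of this column-wise min-max normalization (names are illustrative; the article's autoNorm function below implements the same step):

import numpy as np

def min_max_normalize(mat):
    col_min = mat.min(axis=0)   # min_j for each column
    col_max = mat.max(axis=0)   # max_j for each column
    # assumes max_j > min_j in every column, otherwise this divides by zero
    return (mat - col_min) / (col_max - col_min)

data = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
print(min_max_normalize(data))
# [[0.   0. ]
#  [0.5  0.5]
#  [1.   1. ]]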

Then, for each test sample, its Euclidean distance to every training sample is computed. The distance between test sample i and training sample j is

distance_{i,j} = sqrt((x_{i,1} - x_{j,1})^2 + (x_{i,2} - x_{j,2})^2 + ... + (x_{i,37} - x_{j,37})^2).

All the distances from sample i are sorted in ascending order, and the most frequent label among the first k is taken as the predicted value for sample i.
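Put together, the prediction step for one test sample can be sketched as follows (an illustrative rewrite with assumed names, not the article's code; the full implementation appears in the "Python code" section below):

import numpy as np
from collections import Counter

def predict_one(test_x, train_x, train_labels, k=3):
    # Euclidean distance from the test sample to every training sample
    dists = np.sqrt(((train_x - test_x) ** 2).sum(axis=1))
    nearest = dists.argsort()[:k]                  # indices of the k smallest distances
    votes = [train_labels[j] for j in nearest]
    return Counter(votes).most_common(1)[0][0]     # most frequent label wins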

Experimental results

First, a suitable value of k must be selected. For k = 1, 3, 5, 7, the accuracy was measured on the same test and training samples; the results are shown in the following table.

[Table: accuracy under different values of k]

From these results, the average test accuracy is highest at k = 3, at 74.07%, with a maximum of 81.48%.

The different test sets are derived from the same sample set and are randomly selected.
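The k-selection loop can be sketched as follows (an assumption about how the experiment was scripted, not the original code; norm_mat and labels are the normalized matrix and label list described above):

import numpy as np
from collections import Counter

def accuracy_for_k(norm_mat, labels, k, ho_ratio=0.10):
    # First 10% of the samples as the test set, the rest for training
    m = norm_mat.shape[0]
    n_test = int(m * ho_ratio)
    correct = 0
    for i in range(n_test):
        dists = np.sqrt(((norm_mat[n_test:] - norm_mat[i]) ** 2).sum(axis=1))
        votes = [labels[n_test + j] for j in dists.argsort()[:k]]
        if Counter(votes).most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / float(n_test)

# for k in (1, 3, 5, 7):
#     print(k, accuracy_for_k(norm_mat, labels, k))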

Python code

This code is not original; it is adapted, with changes, from Machine Learning in Action (Peter Harrington, 2013).

# coding: utf-8
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    # Euclidean distance from inX to every row of dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # Vote among the labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def autoNorm(dataSet):
    # Column-wise min-max normalization
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals

def file2matrix(filename):
    # Read the comma-separated data file into a feature matrix and a label list
    fr = open(filename)
    numberOfLines = len(fr.readlines())  # get the number of lines in the file
    fr.close()
    returnMat = zeros((numberOfLines, 37))  # prepare matrix to return
    classLabelVector = []                   # prepare labels to return
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split(',')
        returnMat[index, :] = [float(v) for v in listFromLine[0:37]]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    fr.close()
    return returnMat, classLabelVector

def genderClassTest():
    hoRatio = 0.10  # hold out 10% as the test set
    datingDataMat, datingLabels = file2matrix('doubanMovieDataSet.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    testMat = normMat[0:numTestVecs, :]
    trainMat = normMat[numTestVecs:m, :]
    trainLabels = datingLabels[numTestVecs:m]
    k = 3
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(testMat[i, :], trainMat, trainLabels, k)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("total errors: %d" % errorCount)
    print("the total accuracy rate is %f" % (1.0 - errorCount / float(numTestVecs)))
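To run the experiment, an entry point like the following can be appended (assuming the data file doubanMovieDataSet.txt, one comma-separated sample per line as shown above, is in the working directory):

if __name__ == '__main__':
    genderClassTest()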
