Demonstrating the practical use of the kNN algorithm with Python code examples

This article describes how to use the kNN algorithm, illustrated with Python code that predicts the gender of Douban movie users. The k-Nearest Neighbor (kNN) classification algorithm is one of the simplest methods in data-mining classification. "K nearest neighbors" means that each sample can be represented by its k closest samples in the feature space.
The core idea of the kNN algorithm is that if most of the k nearest samples of a given sample in the feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of samples in that category. In making a classification decision, the method determines the category of a sample to be classified based only on the classes of the nearest sample or samples. Because kNN relies on a limited number of surrounding neighbors rather than on discriminating class regions, it is better suited than other methods to class sets whose domains overlap heavily.

In the figure (not reproduced here), to which class should the green circle be assigned: the red triangles or the blue squares? If k = 3, the red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red-triangle class. If k = 5, the blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue-square class.
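The figure's majority vote can be sketched in a few lines. The neighbor list below is made up to match the description (nearest to farthest, 2 red then 3 blue); only the voting logic is the point.

```python
from collections import Counter

# Toy illustration of the figure described above: the green circle's
# neighbors, listed from nearest to farthest (distances assumed).
neighbors_by_distance = ["red", "red", "blue", "blue", "blue"]

def knn_vote(sorted_labels, k):
    """Majority vote among the k nearest neighbors."""
    return Counter(sorted_labels[:k]).most_common(1)[0][0]

print(knn_vote(neighbors_by_distance, 3))  # red: 2 of the 3 nearest are red
print(knn_vote(neighbors_by_distance, 5))  # blue: 3 of the 5 nearest are blue
```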
kNN is a theoretically mature method and one of the simplest machine-learning algorithms. The neighbors it consults are samples whose classes are already known, i.e., correctly labeled training samples.
kNN can be used for both classification and regression. For regression, a sample's attribute can be predicted by finding its k nearest neighbors and assigning it the average of those neighbors' attribute values. A more refined approach gives neighbors at different distances different influence, for example weights inversely proportional to distance.
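The distance-weighted regression variant can be sketched as follows. The tiny training set and the query point are made up for illustration; the weights follow the inverse-distance scheme mentioned above.

```python
import numpy as np

# Sketch of distance-weighted kNN regression: predict a value as the
# weighted mean of the k nearest neighbors' values, weight ~ 1/distance.
# Training data here is synthetic (y = x on a 1-D grid).
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

def knn_regress(x, X, y, k=3, eps=1e-8):
    dists = np.sqrt(((X - x) ** 2).sum(axis=1))
    idx = np.argsort(dists)[:k]           # indices of the k nearest neighbors
    w = 1.0 / (dists[idx] + eps)          # weight inversely proportional to distance
    return np.average(y[idx], weights=w)  # weighted mean of neighbor values

print(knn_regress(np.array([1.9]), X_train, y_train))  # close to 2.0
```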

Use kNN algorithm to predict the gender of Douban movie users
Summary

This article assumes that people of different genders prefer different types of movies, which motivated this experiment. We took the 100 movies most recently watched by each of the 274 most active Douban users, collected statistics on their genres to obtain 37 movie types as attribute features, and built a sample set with user gender as the label. The kNN algorithm is then used to construct a gender classifier for Douban movie users. With 90% of the samples used for training and 10% for testing, the accuracy reaches 81.48%.

Lab data

This experiment uses the "watched" movie records of Douban users. We selected the 274 most active Douban users and collected genre statistics on each user's 100 most recently watched movies. The data contain 37 movie genres in total, so these 37 genres are used as the user's attribute features; the value of each feature is the number of movies of that genre among the user's 100 movies. The user's label is his or her gender. Because Douban does not publish users' gender, all labels were assigned manually.

The data format is as follows:

X1,1, X1,2, X1,3, X1,4, …, X1,36, X1,37, Y1
X2,1, X2,2, X2,3, X2,4, …, X2,36, X2,37, Y2
…
X274,1, X274,2, X274,3, X274,4, …, X274,36, X274,37, Y274

Example:

0,0,0,3,1,34,5,0,0,0,11,31,0,0,38,40,0,0,15,8,3,9,14,2,3,0,4,1,1,15,0,0,1,13,0,0,1,1
0,1,0,2,2,24,8,0,0,0,10,37,0,0,44,34,0,0,3,0,4,10,15,5,3,0,0,7,2,13,0,0,2,12,0,0,0,0

There are 274 rows of data in total, i.e., 274 samples. The first 37 values in each row are the sample's 37 feature values; the last value is the label, i.e., gender: 0 indicates male and 1 indicates female.

In this test, the first 10% of samples were taken as test samples, and the remaining samples were used as training samples.
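The hold-out split described above can be sketched directly with array slicing. The matrix here is a placeholder with the stated shape (274 rows, 37 features plus a label); the real data would be loaded from the file.

```python
import numpy as np

# Sketch of the hold-out split: first 10% of rows as the test set,
# the rest as the training set. Data is a dummy array of the right shape.
data = np.zeros((274, 38))              # 274 samples, 37 features + 1 label
ho_ratio = 0.10
n_test = int(data.shape[0] * ho_ratio)  # 27 test samples
test_set, train_set = data[:n_test], data[n_test:]
print(test_set.shape, train_set.shape)  # (27, 38) (247, 38)
```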

First, normalize all the data. For each column j of the matrix, compute the maximum value max_j and minimum value min_j, then transform each entry:

X'_{i,j} = (X_{i,j} - min_j) / (max_j - min_j)
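The column-wise min-max normalization above can be written in a few NumPy lines (the small sample matrix is made up for illustration):

```python
import numpy as np

# Min-max normalization, column by column:
# X'_{i,j} = (X_{i,j} - min_j) / (max_j - min_j)
X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 40.0]])

min_j = X.min(axis=0)
max_j = X.max(axis=0)
X_norm = (X - min_j) / (max_j - min_j)  # every column now spans [0, 1]
print(X_norm)
```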

Then, calculate the Euclidean distance between each test sample and all training samples. The distance between test sample i and training sample j is:

Distance_{i,j} = sqrt((X_{i,1} - X_{j,1})^2 + (X_{i,2} - X_{j,2})^2 + … + (X_{i,37} - X_{j,37})^2)
Sort all distances for test sample i in ascending order; the label that appears most often among the top k is the predicted value for sample i.
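The distance-sort-vote prediction step above can be sketched as follows. The small 2-feature matrix and labels are made up for illustration; the real data would have 37 features.

```python
import numpy as np
from collections import Counter

# One prediction step: Euclidean distance from a test sample to every
# training sample, sort, then majority vote over the k nearest labels.
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = [0, 0, 1, 1]
test = np.array([0.8, 0.2])

dists = np.sqrt(((train - test) ** 2).sum(axis=1))
nearest = np.argsort(dists)[:3]  # indices of the 3 nearest training samples
prediction = Counter(labels[i] for i in nearest).most_common(1)[0][0]
print(prediction)  # 0: two of the three nearest neighbors have label 0
```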

Lab results

To select an appropriate value of k, accuracy was measured for several values of k using the same test and training samples. The results are shown in the following table.

Accuracy table with different K values

According to the results above, the average test accuracy is highest when k = 3, at 74.07%, with a maximum of 81.48%.
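The k-selection scan behind that table can be sketched like this. The feature matrix and labels here are randomly generated stand-ins; with the real Douban data the loop would be identical.

```python
import numpy as np
from collections import Counter

# Sketch of scanning candidate k values on a hold-out split.
# Random data stands in for the real 37-feature user matrix.
rng = np.random.default_rng(0)
X = rng.random((100, 37))
y = rng.integers(0, 2, 100)
X_test, y_test = X[:10], y[:10]      # first 10% held out for testing
X_train, y_train = X[10:], y[10:]

def predict(x, k):
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    top = np.argsort(d)[:k]
    return Counter(y_train[top]).most_common(1)[0][0]

for k in (1, 3, 5, 7):
    acc = np.mean([predict(x, k) == t for x, t in zip(X_test, y_test)])
    print("k=%d accuracy=%.2f" % (k, acc))
```

On real data one would also average over several random splits, since a single 10% hold-out is noisy.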

The test sets above were drawn at random from the same sample set.

Python code

This code is not original; it is adapted, with modifications, from Machine Learning in Action (Peter Harrington, 2012).

# coding:utf-8
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())      # get the number of lines in the file
    returnMat = zeros((numberOfLines, 37))   # prepare matrix to return
    classLabelVector = []                    # prepare labels to return
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split(',')
        returnMat[index, :] = listFromLine[0:37]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    fr.close()
    return returnMat, classLabelVector

def genderClassTest():
    hoRatio = 0.10  # hold out 10%
    datingDataMat, datingLabels = file2matrix('doubanMovieDataSet.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    testMat = normMat[0:numTestVecs, :]
    trainMat = normMat[numTestVecs:m, :]
    trainLabels = datingLabels[numTestVecs:m]
    k = 3
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(testMat[i, :], trainMat, trainLabels, k)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("Total errors: %d" % errorCount)
    print("The total accuracy rate is %f" % (1.0 - errorCount / float(numTestVecs)))