Using Python code examples to demonstrate the practical use of the kNN algorithm
The K-Nearest Neighbor (kNN) classification algorithm is one of the simplest methods in data-mining classification. "K nearest neighbors" means exactly that: each sample can be represented by its k closest neighbors.
The core idea of kNN is this: if most of the k nearest neighbors of a sample in feature space belong to a certain category, then the sample also belongs to that category and shares the characteristics of the samples in it. When making a classification decision, the method considers only the class of the nearest sample or samples, so it depends on a very small number of neighbors. Because kNN relies mainly on a limited set of nearby samples rather than on discriminating class regions, it is better suited than other methods to class sets whose boundaries overlap heavily.
Consider the classic example: to which class should the green circle be assigned, the red triangles or the blue squares? If k = 3, the red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red-triangle class. If k = 5, the blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue-square class.
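The k = 3 versus k = 5 outcome above can be reproduced with a short vote count. This is a minimal sketch; the neighbor labels are made up to match the example (two red triangles and one blue square among the three nearest, three blue squares among the five nearest):

```python
from collections import Counter

def majority_vote(neighbor_labels, k):
    # Count the labels of the k nearest neighbors and return the most common one.
    return Counter(neighbor_labels[:k]).most_common(1)[0][0]

# Labels of the 5 nearest neighbors, ordered from closest to farthest.
neighbors = ["triangle", "triangle", "square", "square", "square"]

print(majority_vote(neighbors, 3))  # triangle
print(majority_vote(neighbors, 5))  # square
```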
The kNN classification algorithm is a theoretically mature method and one of the simplest machine-learning algorithms. Note that in kNN the selected neighbors are objects that have already been correctly classified. Although the method's theoretical justification rests on limit theorems, each classification decision involves only a small number of adjacent samples, which again makes kNN well suited to class sets with heavily overlapping boundaries.
kNN can be used for both classification and regression. For regression, you find a sample's k nearest neighbors and assign the average of their attribute values to the sample. A more useful variant weights each neighbor's contribution by its distance, for example with weights inversely proportional to the distance.
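The distance-weighted regression idea can be sketched in a few lines of NumPy. This is an illustrative example, not part of the article's experiment; the function name, the toy data, and the small `eps` term (added so an exact-match neighbor does not divide by zero) are all assumptions:

```python
import numpy as np

def knn_regress(query, X, y, k=3, eps=1e-9):
    # Predict a numeric value as the distance-weighted mean of the
    # k nearest neighbors' values (weights inversely proportional to distance).
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    weights = 1.0 / (dists[nearest] + eps)           # inverse-distance weights
    return float(np.average(y[nearest], weights=weights))

# Toy data: the target value grows with the single feature.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
print(knn_regress(np.array([1.1]), X, y, k=2))
```

With k = 2 the neighbors at 1.0 and 2.0 are used, and the much closer neighbor at 1.0 dominates, so the prediction lands near 1.1.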
Using the kNN algorithm to predict the gender of Douban movie users
Summary
This article assumes that people of different genders prefer different types of movies, which motivated this experiment. We took the 100 most recently watched movies of each of the 274 most active Douban users, counted their genres to obtain 37 movie genres as attribute features, and built a sample set labeled with each user's gender. The kNN algorithm was then used to construct a gender classifier for Douban movie users. With 90% of the samples used for training and 10% for testing, accuracy reaches 81.48%.
Lab data
In this experiment we used the "watched" movie records of Douban users. We selected the 274 most active users and counted the genres of each user's 100 most recently watched movies. The data contain 37 movie genres in total, so these 37 genres serve as the user's attribute features; the value of each feature is the number of movies of that genre among the user's 100 movies. The user's label is his or her gender. Because Douban does not expose users' gender, all labels were assigned manually.
The data format is as follows:
X1,1, X1,2, X1,3, X1,4, …, X1,36, X1,37, Y1
X2,1, X2,2, X2,3, X2,4, …, X2,36, X2,37, Y2
……
X274,1, X274,2, X274,3, X274,4, …, X274,36, X274,37, Y274
Example:
0,0,0,3,1,34,5,0,0,0,11,31,0,0,38,40,0,0,15,8,3,9,14,2,3,0,4,1,1,15,0,0,1,13,0,0,1,1
0,1,0,2,2,24,8,0,0,0,10,37,0,0,44,34,0,0,3,0,4,10,15,5,3,0,0,7,2,13,0,0,2,12,0,0,0,0
There are 274 rows of data in total, one per sample. The first 37 values in each row are the sample's 37 feature values; the last value is the label, i.e. gender: 0 indicates male and 1 indicates female.
In this test, the first 10% of samples were taken as test samples, and the remaining samples were used as training samples.
First, normalize all the data. For each column j of the matrix, compute the maximum max_j and minimum min_j, and transform each entry X_j of that column as

X_j = (X_j - min_j) / (max_j - min_j)
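The column-wise min-max normalization just described can be sketched with NumPy broadcasting. This is an illustrative snippet (the function name and toy data are my own), and it assumes every feature column actually varies, since a constant column would make max_j - min_j zero:

```python
import numpy as np

def min_max_normalize(X):
    # Per-column min-max scaling: each feature is mapped into [0, 1].
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    return (X - mins) / (maxs - mins)

data = np.array([[1.0, 10.0],
                 [2.0, 30.0],
                 [3.0, 50.0]])
print(min_max_normalize(data))
```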
Then, calculate the Euclidean distance between each test sample and every training sample. The distance between test sample i and training sample j is:

Distance_i,j = sqrt((X_i,1 - X_j,1)^2 + (X_i,2 - X_j,2)^2 + … + (X_i,37 - X_j,37)^2)
Sort all of sample i's distances in ascending order; the label that appears most often among the top k is the predicted label for sample i.
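The distance-and-vote procedure just described can be sketched as follows. This is a toy example with hypothetical two-feature data, not the article's 37-feature dataset (the full listing appears in the Python code section below):

```python
import numpy as np
from collections import Counter

def predict(test_sample, train_X, train_y, k=3):
    # Euclidean distance from the test sample to every training sample.
    dists = np.sqrt(((train_X - test_sample) ** 2).sum(axis=1))
    # Sort ascending and take the indices of the k nearest neighbors.
    top_k = np.argsort(dists)[:k]
    # The most frequent label among them is the prediction.
    return Counter(train_y[i] for i in top_k).most_common(1)[0][0]

train_X = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0]])
train_y = [0, 0, 1, 1]
print(predict(np.array([0.2, 0.2]), train_X, train_y, k=3))  # 0
```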
Lab results
To select an appropriate value of k, accuracy was measured with the same test and training samples for different values of k. The results are shown in the following table.
Accuracy for different values of k
According to these results, the average test accuracy is highest when k = 3, at 74.07%, with a maximum of 81.48%.
The test sets above were drawn randomly from the same sample set.
Python code
This code is not original; it is adapted from Machine Learning in Action (Peter Harrington, 2013).
# coding: utf-8
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    # Distance from inX to every training sample (Euclidean).
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # Majority vote among the k nearest neighbors.
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def autoNorm(dataSet):
    # Column-wise min-max normalization into [0, 1].
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals

def file2matrix(filename):
    # Parse the CSV file: 37 feature columns, last column is the gender label.
    with open(filename) as fr:
        lines = fr.readlines()
    returnMat = zeros((len(lines), 37))
    classLabelVector = []
    for index, line in enumerate(lines):
        listFromLine = line.strip().split(',')
        returnMat[index, :] = listFromLine[0:37]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector

def genderClassTest():
    hoRatio = 0.10  # hold out 10% as the test set
    datingDataMat, datingLabels = file2matrix('doubanMovieDataSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    testMat = normMat[0:numTestVecs, :]
    trainMat = normMat[numTestVecs:m, :]
    trainLabels = datingLabels[numTestVecs:m]
    k = 3
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(testMat[i, :], trainMat, trainLabels, k)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("Total errors: %d" % errorCount)
    print("The total accuracy rate is %f" % (1.0 - errorCount / float(numTestVecs)))