[Python] Using the kNN algorithm to predict the gender of Douban movie users
Summary
This experiment assumes that people of different genders prefer different types of movies. We took the 274 most active Douban users, collected the 100 movies each of them had most recently watched, and counted the movies of each of 37 genres. These 37 counts serve as attribute features, and the user's gender serves as the label, forming the sample set. A kNN classifier for the gender of Douban movie users was built from this set, with 90% of the samples used for training and 10% for testing; the accuracy reached 81.48%.
Lab data
In this experiment we used data on the movies that Douban users marked as watched. We selected the 274 most active Douban users and, for each user, counted the types of his or her 100 most recently watched movies. The data contain 37 movie types in total, so these 37 types are used as the user's attribute features; the value of each feature is the number of movies of that type among the user's 100 movies. The user's label is his or her gender. Because Douban does not expose users' gender, all labels were assigned manually.
The data format is as follows:
X1,1, X1,2, X1,3, X1,4, ..., X1,36, X1,37, Y1
X2,1, X2,2, X2,3, X2,4, ..., X2,36, X2,37, Y2
...
X274,1, X274,2, X274,3, X274,4, ..., X274,36, X274,37, Y274
There are 274 rows of data in total, one per sample. The first 37 values in each row are the sample's 37 feature values; the last value is the label, i.e. gender: 0 means male and 1 means female.
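A row in this format can be split into features and label with a small helper (a sketch; `parse_row` is a hypothetical name, not part of the code given later):

```python
def parse_row(line):
    """Split one comma-separated row into (features, gender_label)."""
    fields = line.strip().split(',')
    features = [int(v) for v in fields[:37]]  # count of movies per type
    gender = int(fields[-1])                  # 0 = male, 1 = female
    return features, gender
```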
KNN algorithm
k-Nearest Neighbors (kNN) is one of the most basic classification algorithms. Its basic idea is to classify by measuring the distance between feature vectors.
Algorithm principle: there is a sample data set (the training set) in which every sample is labeled, i.e. the relationship between each sample and its category is known. When a new, unlabeled sample arrives, each of its features is compared with the corresponding features of the samples in the training set (by computing the Euclidean distance), and the most similar samples (nearest neighbors) are extracted. Usually the k most similar samples are taken, and the new sample is assigned the category that appears most frequently among those k labels.
In this test, the first 10% of samples were taken as test samples, and the remaining samples were used as training samples.
First, normalize all data. For each column j of the matrix, compute the maximum max_j and minimum min_j, and rescale every value X_j in that column:
X_j = (X_j - min_j) / (max_j - min_j).
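This column-wise min-max normalization can be sketched with NumPy broadcasting (a sketch; it assumes no feature column is constant, which would make max_j - min_j zero):

```python
import numpy as np

def min_max_normalize(data):
    """Rescale every column of `data` to [0, 1] via (x - min_j) / (max_j - min_j)."""
    col_min = data.min(axis=0)  # min_j for each feature column
    col_max = data.max(axis=0)  # max_j for each feature column
    return (data - col_min) / (col_max - col_min)
```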
Then compute the Euclidean distance between each test sample and all training samples. The distance between test sample i and training sample j is:
distance_i,j = sqrt((X_i,1 - X_j,1)^2 + (X_i,2 - X_j,2)^2 + ... + (X_i,37 - X_j,37)^2).
All the distances for test sample i are sorted from smallest to largest, and the label that appears most often among the top k is the predicted value for sample i.
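The two steps just described (distance to every training sample, then a majority vote among the k nearest) can be sketched as follows; `train_X` and `train_y` are assumed names for the normalized training matrix and its gender labels:

```python
import numpy as np
from collections import Counter

def classify(test_x, train_X, train_y, k=3):
    """Return the majority label among the k training samples
    closest to test_x (Euclidean distance)."""
    dists = np.sqrt(((train_X - test_x) ** 2).sum(axis=1))  # one distance per training row
    k_nearest = np.argsort(dists)[:k]                       # indices of the k smallest distances
    return Counter(train_y[i] for i in k_nearest).most_common(1)[0][0]
```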
Lab results
To select an appropriate value of k, the same test and training samples were used to measure accuracy for k = 1, 3, 5, 7. The results are shown in the following table.
Table 1. Accuracy for different values of k

| k | 1 | 3 | 5 | 7 |
|---|---|---|---|---|
| Test set 1 | 62.96% | 81.48% | 70.37% | 77.78% |
| Test set 2 | 66.67% | 66.67% | 59.26% | 62.96% |
| Test set 3 | 62.96% | 74.07% | 70.37% | 74.07% |
| Average | 64.20% | 74.07% | 66.67% | 71.60% |
According to the above results, k = 3 gives the highest average accuracy, 74.07%, with a maximum of 81.48%.
The test sets above are all drawn randomly from the same sample set.
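The k sweep behind Table 1 can be reproduced with a small helper like the following (a sketch; `data` and `labels` are assumed names for the normalized feature matrix and the gender labels, and the split matches the 10% hold-out described above):

```python
import numpy as np

def holdout_accuracy(data, labels, k, ho_ratio=0.10):
    """kNN accuracy on the first ho_ratio of the samples,
    training on the remaining samples (same split as the experiment)."""
    n_test = int(len(data) * ho_ratio)
    train_X, train_y = data[n_test:], labels[n_test:]
    correct = 0
    for i in range(n_test):
        dists = np.sqrt(((train_X - data[i]) ** 2).sum(axis=1))
        votes = [train_y[j] for j in np.argsort(dists)[:k]]  # labels of the k nearest
        if max(set(votes), key=votes.count) == labels[i]:
            correct += 1
    return correct / n_test

# Trying each candidate k on the same split, as in Table 1:
# for k in (1, 3, 5, 7):
#     print(k, holdout_accuracy(data, labels, k))
```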
Python code
This code is not original; it is adapted, with modifications, from Machine Learning in Action (Peter Harrington, 2013).
# coding: utf-8
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    # distance from inX to every training sample
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # majority vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def autoNorm(dataSet):
    # rescale each column to [0, 1]
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals

def file2matrix(filename):
    # read the comma-separated sample file into a feature matrix and a label list
    with open(filename) as fr:
        lines = fr.readlines()
    returnMat = zeros((len(lines), 37))
    classLabelVector = []
    for index, line in enumerate(lines):
        listFromLine = line.strip().split(',')
        returnMat[index, :] = listFromLine[0:37]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector

def genderClassTest():
    hoRatio = 0.10  # hold out 10% as the test set
    datingDataMat, datingLabels = file2matrix('doubanMovieDataSet.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    testMat = normMat[0:numTestVecs, :]
    trainMat = normMat[numTestVecs:m, :]
    trainLabels = datingLabels[numTestVecs:m]
    k = 3
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(testMat[i, :], trainMat, trainLabels, k)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("Total errors: %d" % errorCount)
    print("The total accuracy rate is %f" % (1.0 - errorCount / float(numTestVecs)))
References
Peter Harrington. Machine Learning in Action. Translated by Li Rui, Li Peng, Qu Yadong, and Wang Bin. Beijing: People's Posts and Telecommunications Press, June 2013.