[Python] uses kNN algorithm to predict the gender of Douban movie users.

Source: Internet
Author: User

Summary: predicting the gender of Douban movie users with the kNN algorithm

This article assumes that people of different genders prefer different types of movies, which is the premise of this experiment. We took the 274 most active Douban users, collected the 100 movies each of them had most recently watched, and counted those movies by genre, yielding 37 genres that serve as attribute features; each user's gender serves as the label. A kNN classifier for the gender of Douban movie users was then built, using 90% of the samples for training and 10% for testing. The accuracy reaches 81.48%.

Lab data

The experiment uses the "watched" movie records of Douban users. We selected the 274 most active Douban users and, for each user, counted his or her 100 most recently watched movies by genre. The data contains 37 movie genres in total, so these 37 genres are used as the user's attribute features; the value of each feature is the number of movies of that genre among the user's 100 movies. The user's label is his or her gender. Because Douban does not expose users' gender, all labels were assigned manually.

The data format is as follows:
X_{1,1}, X_{1,2}, X_{1,3}, X_{1,4}, ..., X_{1,36}, X_{1,37}, Y_1
X_{2,1}, X_{2,2}, X_{2,3}, X_{2,4}, ..., X_{2,36}, X_{2,37}, Y_2
...
X_{274,1}, X_{274,2}, X_{274,3}, X_{274,4}, ..., X_{274,36}, X_{274,37}, Y_274


There are 274 rows of data in total, one per sample. In each row, the first 37 values are the sample's feature values, and the last value is the label, i.e. gender: 0 indicates male, 1 indicates female.
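As a quick illustration, a row in this format can be parsed into a feature vector and a label as follows (the row here is synthetic, not real data from the experiment):

```python
# Build a synthetic row: 37 genre counts followed by a gender label.
line = ",".join(["2"] * 37) + ",1"

fields = line.strip().split(",")
features = [int(x) for x in fields[:37]]   # 37 genre counts
label = int(fields[-1])                    # 0 = male, 1 = female

print(len(features), label)  # → 37 1
```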

KNN algorithm

k-Nearest Neighbors (kNN) is one of the most basic classification algorithms. Its core idea is to classify a sample by measuring its distance, in feature space, to samples whose labels are known.

Algorithm principle: we have a sample data set (the training set) in which every sample carries a label, i.e. the category of every sample is known. When a new, unlabeled sample arrives, we compare each of its features with the corresponding features of the samples in the training set (by computing the Euclidean distance) and find the training samples most similar to it (its nearest neighbors). Usually the k most similar samples are taken, and the label that occurs most frequently among those k neighbors is assigned to the new sample as its classification.
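The steps above can be sketched in a few lines of Python. The toy clusters and the helper `knn_predict` are illustrative, not the article's code:

```python
import math
from collections import Counter

def knn_predict(x, train_X, train_y, k=3):
    """Minimal kNN: Euclidean distance, then majority vote among the k nearest."""
    dists = [math.dist(x, xi) for xi in train_X]              # math.dist: Python 3.8+
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    votes = Counter(train_y[i] for i in nearest)              # count labels of neighbors
    return votes.most_common(1)[0][0]                         # most frequent label wins

# Two toy clusters with labels 0 and 1.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict((0.5, 0.5), train_X, train_y, k=3))  # → 0
```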

In this test, the first 10% of the samples are used as test samples and the remaining 90% as training samples.

First, normalize all data: for each column j of the matrix, compute the maximum max_j and minimum min_j, then rescale each value X_{i,j}:

X'_{i,j} = (X_{i,j} - min_j) / (max_j - min_j)
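A minimal NumPy sketch of this column-wise min-max scaling (the small matrix is made up; a column with zero range would need special handling, which is omitted here):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column to [0, 1]: (X - min) / (max - min), per the formula above."""
    mins = X.min(axis=0)
    ranges = X.max(axis=0) - mins   # assumes no constant column (range of 0)
    return (X - mins) / ranges

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
print(min_max_normalize(X))  # each column scaled to [0, 1]
```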

Then, compute the Euclidean distance between each test sample and every training sample. The distance between test sample i and training sample j is:

Distance_{i,j} = sqrt((X_{i,1} - X_{j,1})^2 + (X_{i,2} - X_{j,2})^2 + ... + (X_{i,37} - X_{j,37})^2)

Sort all of sample i's distances in ascending order; the label that appears most often among the top k is the prediction for sample i.

Lab results

Select an appropriate value of k. For k = 1, 3, 5, and 7, accuracy was measured on the same test and training samples. The results are shown in the following table.

Table 1. Accuracy for different values of k

k             1        3        5        7
Test Set 1    62.96%   81.48%   70.37%   77.78%
Test Set 2    66.67%   66.67%   59.26%   62.96%
Test Set 3    62.96%   74.07%   70.37%   74.07%
Average       64.20%   74.07%   66.67%   71.60%

According to the above results, k = 3 gives the highest average accuracy, 74.07%, with a best single-split accuracy of 81.48%.
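This model-selection step can be sketched as follows. The data here is synthetic and well separated, standing in for the Douban samples, and `knn_predict`/`accuracy` are illustrative helpers rather than the article's code:

```python
import math
import random
from collections import Counter

def knn_predict(x, train_X, train_y, k):
    """Predict by majority vote among the k nearest training samples."""
    dists = [math.dist(x, xi) for xi in train_X]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def accuracy(test_X, test_y, train_X, train_y, k):
    hits = sum(knn_predict(x, train_X, train_y, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)

# Synthetic, clearly separated two-class data (3 features per sample).
random.seed(0)
labels = [0, 1] * 50
X = [tuple(random.random() + 2 * y for _ in range(3)) for y in labels]
test_X, test_y = X[:10], labels[:10]          # hold out the first 10%
train_X, train_y = X[10:], labels[10:]

for k in (1, 3, 5, 7):
    print(k, accuracy(test_X, test_y, train_X, train_y, k))
```

On the real Douban data the accuracies differ by split, as Table 1 shows, which is why the article averages over several random test sets before picking k.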

The three test sets above are drawn randomly from the same sample set.

Python code

This code is not original; it is adapted from Machine Learning in Action (Peter Harrington, 2013).

# coding: utf-8
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))    # element-wise divide
    return normDataSet, ranges, minVals

def file2matrix(filename):
    with open(filename) as fr:
        lines = fr.readlines()                  # one sample per line
    returnMat = zeros((len(lines), 37))         # prepare matrix to return
    classLabelVector = []                       # prepare labels to return
    for index, line in enumerate(lines):
        listFromLine = line.strip().split(',')
        returnMat[index, :] = listFromLine[0:37]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector

def genderClassTest():
    hoRatio = 0.10      # hold out 10% as the test set
    datingDataMat, datingLabels = file2matrix('doubanMovieDataSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    testMat = normMat[0:numTestVecs, :]
    trainMat = normMat[numTestVecs:m, :]
    trainLabels = datingLabels[numTestVecs:m]
    k = 3
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(testMat[i, :], trainMat, trainLabels, k)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("Total errors: %d" % errorCount)
    print("The total accuracy rate is %f" % (1.0 - errorCount / float(numTestVecs)))


References

Peter Harrington. Machine Learning in Action. Translated by Li Rui, Li Peng, Qu Yadong, and Wang Bin. Beijing: People's Posts and Telecommunications Press, June 2013.
