I. Case background

My friend Helen has been using an online dating site to find a suitable date. Although the site recommends different candidates, she does not like all of them. Looking back over her history, she found that the people she has dated fall into three types:
(1) people she does not like;
(2) people of average charm;
(3) people of great charm.

Despite discovering this pattern, Helen still cannot sort the matches recommended by the dating site into the right categories. She would prefer to date the moderately charming people on weekdays and spend her weekends with the most charming ones. Helen wants our classification software to help her assign matches to the correct categories. In addition, Helen has collected data that the dating site does not record, and she believes this data is more useful for sorting her matches.

II. Case analysis

(1) Collect data: provided as a text document.
(2) Prepare data: parse the text file with Python.
(3) Analyze data: draw a two-dimensional scatter plot with Matplotlib.
(4) Train the algorithm: this step does not apply to the k-nearest-neighbor algorithm.
(5) Test the algorithm: use part of the data Helen provides as test samples. The difference between a test sample and a non-test sample is that a test sample is already labeled; if the predicted class differs from the actual class, it is counted as an error.
(6) Use the algorithm: build a simple command-line program so Helen can enter some feature data and find out whether the other person is a type she likes.

III. Preparing data: parsing data from a text file

Helen has been collecting dating data for some time. She keeps the data in the text file datingTestSet.txt, with one sample per row and 1000 rows in total. Each sample consists of three features:

1. number of frequent flyer miles earned per year;
2. percentage of time spent playing video games;
3. liters of ice cream consumed per week.

Before this feature data can be fed into a classifier, it must be converted into a format the classifier accepts.
Create a function named file2matrix in kNN.py to handle the input format. The function takes a text file name string as input and returns a training sample matrix and a class label vector. Add the following code to kNN.py:
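The function body is not reproduced in this note; here is a minimal sketch of file2matrix following the description above, assuming each line holds three tab-separated feature values followed by an integer class label:

```python
from numpy import zeros

def file2matrix(filename):
    """Parse the dating data file into a feature matrix and a label vector."""
    with open(filename) as f:
        lines = f.readlines()
    num_lines = len(lines)
    return_mat = zeros((num_lines, 3))        # one row per sample, three features
    class_labels = []
    for index, line in enumerate(lines):
        fields = line.strip().split('\t')     # tab-separated fields
        return_mat[index, :] = fields[0:3]    # first three columns are the features
        class_labels.append(int(fields[-1]))  # last column is the class label
    return return_mat, class_labels
```
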
IV. Analyzing data: creating a scatter plot with Matplotlib

First we use Matplotlib to make a scatter plot of the raw data. In the Python command-line environment, enter the following commands:
#!/usr/bin/python
# _*_ coding:utf-8 _*_
import kNN
reload(kNN)
datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')

import matplotlib
import matplotlib.pyplot as plt
from numpy import *

zhfont = matplotlib.font_manager.FontProperties(fname='C:\WINDOWS\FONTS\UKAI.TTC')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
plt.xlabel(u'percentage of time spent playing video games', fontproperties=zhfont)
plt.ylabel(u'liters of ice cream consumed per week', fontproperties=zhfont)
plt.show()
The result is a scatter plot of the dating data without class labels; it is difficult to tell which class each point in the figure belongs to. We can use the color and size arguments of Matplotlib's scatter function to mark the points by class. Re-enter the code above, this time passing the labels to scatter:
#!/usr/bin/python
# _*_ coding:utf-8 _*_
import kNN
reload(kNN)
datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')

import matplotlib
import matplotlib.pyplot as plt
from numpy import *

zhfont = matplotlib.font_manager.FontProperties(fname='C:\WINDOWS\FONTS\UKAI.TTC')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.xlabel(u'percentage of time spent playing video games', fontproperties=zhfont)
plt.ylabel(u'liters of ice cream consumed per week', fontproperties=zhfont)
plt.show()
The result is a scatter plot of the dating data with class labels. Although it is now easier to see the regions belonging to each class, it is still hard to draw conclusions from this plot. It displays columns 2 and 3 of the datingDataMat matrix; the classes can be distinguished, but using the values in columns 1 and 2 gives better results:
#!/usr/bin/env python
# _*_ coding:utf-8 _*_
import kNN
reload(kNN)
import matplotlib
import matplotlib.pyplot as plt

matrix, labels = kNN.file2matrix('datingTestSet2.txt')
print matrix
print labels

zhfont = matplotlib.font_manager.FontProperties(fname='C:\WINDOWS\FONTS\UKAI.TTC')
plt.figure(figsize=(8, 5), dpi=80)
axes = plt.subplot(111)

# Separate the three classes
# x axis: frequent flyer miles earned per year
# y axis: percentage of time spent playing video games
type1_x = []
type1_y = []
type2_x = []
type2_y = []
type3_x = []
type3_y = []

for i in range(len(labels)):
    if labels[i] == 1:  # does not like
        type1_x.append(matrix[i][0])
        type1_y.append(matrix[i][1])
    if labels[i] == 2:  # average charm
        type2_x.append(matrix[i][0])
        type2_y.append(matrix[i][1])
    if labels[i] == 3:  # very charming
        type3_x.append(matrix[i][0])
        type3_y.append(matrix[i][1])

type1 = axes.scatter(type1_x, type1_y, s=20, c='red')
type2 = axes.scatter(type2_x, type2_y, s=40, c='green')
type3 = axes.scatter(type3_x, type3_y, s=50, c='blue')

plt.xlabel(u'frequent flyer miles earned per year', fontproperties=zhfont)
plt.ylabel(u'percentage of time spent playing video games', fontproperties=zhfont)
axes.legend((type1, type2, type3),
            (u'does not like', u'average charm', u'very charming'),
            loc=2, prop=zhfont)
plt.show()
The figure clearly shows three different sample classes, with people of different appeal falling into distinct regions. You can see that the two features plotted, "frequent flyer miles earned per year" and "percentage of time spent playing video games", make it much easier to tell which class a data point belongs to.

V. Preparing data: normalizing values

To prevent differences in the magnitude of feature values from skewing the prediction (when computing distances, the feature with the largest numeric range would dominate the result), we normalize all feature values to the range [0, 1] as a preprocessing step.
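The normalization used below is the standard min-max rule, newValue = (oldValue - min) / (max - min), applied column by column. A quick sanity check with plain NumPy (the values here are made up for illustration):

```python
import numpy as np

# Min-max normalization applied per column so each feature lands in [0, 1].
data = np.array([[1000.0, 2.0],
                 [2000.0, 4.0],
                 [3000.0, 10.0]])   # made-up feature values
col_min = data.min(axis=0)          # per-column minimum
col_max = data.max(axis=0)          # per-column maximum
normed = (data - col_min) / (col_max - col_min)
print(normed)   # each column now spans exactly [0, 1]
```
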
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals
Code explanation: in autoNorm(), the minimum of each column is stored in the variable minVals and the maximum in maxVals. The argument 0 in dataSet.min(0) makes the function take the minimum along each column rather than along each row. Because the feature matrix dataSet is 1000x3 while minVals and ranges are 1x3, the tile() function is used to replicate minVals and ranges into matrices of the same size as the input matrix before subtracting and dividing.
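A quick illustration of what tile() does here (the vector values are made up):

```python
from numpy import array, tile

min_vals = array([0.0, 1.0, 2.0])     # shape (3,), like the per-column minimums
replicated = tile(min_vals, (4, 1))   # stack 4 copies row-wise -> shape (4, 3)
print(replicated.shape)               # (4, 3)
```

Each of the 4 rows of `replicated` is a copy of `min_vals`, so subtracting it from a 4x3 matrix removes each column's minimum from every row.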
>>> import kNN
>>> reload(kNN)
<module 'kNN' from 'kNN.pyc'>
>>> datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
array([[ 0.44832535,  0.39805139,  0.56233353],
       [ 0.15873259,  0.34195467,  0.98724416],
       [ 0.28542943,  0.06892523,  0.47449629],
       ...,
       [ 0.29115949,  0.50910294,  0.51079493],
       [ 0.52711097,  0.43665451,  0.4290048 ],
       [ 0.47940793,  0.3768091 ,  0.78571804]])
>>> ranges
array([  9.12730000e+04,   2.09193490e+01,   1.69436100e+00])
>>> minVals
array([ 0.      ,  0.      ,  0.001156])
VI. Testing the algorithm

One of the most important tasks in machine learning is evaluating an algorithm's accuracy. Usually we train the classifier on 90% of the available data and use the remaining 10% to test it and measure its error rate.
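The test code below calls classify0(), the basic kNN classifier defined earlier in kNN.py but not reproduced in this note. For reference, a minimal sketch of such a classifier (Euclidean distance plus majority vote among the k nearest training samples):

```python
import operator
from numpy import array, tile

def classify0(inX, dataSet, labels, k):
    """Classify inX by majority vote among its k nearest training samples."""
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of the training matrix
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndices = distances.argsort()   # indices of samples, nearest first
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Return the label with the most votes among the k neighbors
    return max(classCount.items(), key=operator.itemgetter(1))[0]
```
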
1. Classifier test code for the dating site:
def datingClassTest():
    hoRatio = 0.50  # hold out 50% of the data as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print "the total error rate is: %f" % (errorCount / float(numTestVecs))
    print errorCount
>>> kNN.datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the total error rate is: 0.064000
VII. Using the algorithm

Enter a person's information and predict how much Helen will like them:
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(raw_input("percentage of time spent playing video games?"))
    ffMiles = float(raw_input("frequent flier miles earned per year?"))
    iceCream = float(raw_input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print "You will probably like this person:", resultList[classifierResult - 1]
Code explanation: raw_input() in Python reads a line of text entered by the user and returns it as a string.
>>> import kNN
>>> reload(kNN)
<module 'kNN' from 'kNN.py'>
>>> kNN.classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses
Machine Learning in Action notes: using the kNN algorithm to improve dating-site matching