Objective
If you use an online dating site to look for a date, you will probably sort the site's users into three categories:
1. People you do not like
2. People with a little charm
3. People with a lot of charm
How do you decide which category a user belongs to? Presumably you analyze information about the user, such as "frequent flyer miles earned per year", "percentage of time spent playing online games", and "litres of ice cream consumed per week".
The k-nearest neighbor algorithm from machine learning can take these three pieces of information about a user and classify the user automatically, which is very convenient.
This article shows how to implement such an automatic classification program.
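As a rough orientation before the individual steps, the core idea of k-nearest neighbor classification can be sketched in a few lines. This is a minimal sketch and not necessarily the implementation this article builds: the function name knn_classify, the use of Euclidean distance, and the toy values are all assumptions made for illustration.

```python
from collections import Counter

import numpy


def knn_classify(sample, features, labels, k):
    """Return the majority label among the k training rows nearest to sample."""
    # Euclidean distance from the sample to every training row
    distances = numpy.sqrt(((features - sample) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = distances.argsort()[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up toy data for illustration only: three features per user and an
# integer label from 1 to 3. These values are not from the dating data set.
toy_features = numpy.array([
    [1.0, 1.0, 1.0],
    [1.2, 0.9, 1.1],
    [5.0, 5.0, 5.0],
    [5.1, 4.8, 5.2],
    [9.0, 9.0, 9.0],
    [9.2, 8.9, 9.1],
])
toy_labels = [1, 1, 2, 2, 3, 3]

print(knn_classify(numpy.array([5.0, 5.1, 4.9]), toy_features, toy_labels, k=3))
```

With k=3 the two closest rows carry label 2 and the third carries label 1, so the vote returns 2.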
Step one: Collect and prepare the data
First, gather some dating data, as much as you can.
Then store the collected data in a TXT file, one sample per row.
The three features mentioned in the introduction and the classification result (represented as an integer) each take one column, separated by tabs, as follows:
The following function then reads the data from the file into in-memory data structures:
# Import the NumPy math library
import numpy

# ==============================================
# Input:
#     training set file name (with path)
# Output:
#     feature matrix and label vector
# ==============================================
def file2matrix(filename):
    'Load the training set data'

    # Open the training set file
    fr = open(filename)
    # Get the number of lines in the file
    numberOfLines = len(fr.readlines())
    # Move the file pointer back to the start of the file
    fr.seek(0)
    # Initialize the feature matrix
    returnMat = numpy.zeros((numberOfLines, 3))
    # Initialize the label vector
    classLabelVector = []
    # The row index of the feature matrix is also the sample number
    index = 0

    for line in fr:    # Iterate over every line of the training set file
        # Strip the newline and surrounding whitespace from the line
        line = line.strip()
        # Split the line on tabs
        listFromLine = line.split('\t')
        # Store the feature part of the row in the feature matrix
        returnMat[index, :] = listFromLine[0:3]
        # Store the label part of the row in the label vector
        classLabelVector.append(int(listFromLine[-1]))
        # Move on to the next sample
        index += 1

    # Close the file once every line has been read
    fr.close()

    return returnMat, classLabelVector
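For comparison, the same tab-separated layout can probably be loaded more compactly with numpy.loadtxt. The sketch below is an alternative, not the article's method: the helper name load_dating_file and the three sample rows are made up for illustration, and it assumes every line has three numeric features followed by an integer label.

```python
import os
import tempfile

import numpy


def load_dating_file(filename):
    """Hypothetical helper: split a tab-separated file into features and labels."""
    # ndmin=2 keeps the result two-dimensional even for a one-line file
    table = numpy.loadtxt(filename, delimiter='\t', ndmin=2)
    features = table[:, 0:3]
    labels = table[:, -1].astype(int).tolist()
    return features, labels


# Made-up rows for illustration only (not real dating data)
sample_text = "40000\t8.5\t0.9\t3\n15000\t7.0\t1.6\t2\n26000\t1.5\t0.8\t1\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as handle:
    handle.write(sample_text)
    sample_path = handle.name

sample_features, sample_labels = load_dating_file(sample_path)
os.remove(sample_path)

print(sample_features.shape)
print(sample_labels)
```

The explicit loop in file2matrix is easier to adapt when some rows need special handling, while loadtxt is shorter when the file is clean.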
Step two: Analyze the data
Once you have the data, you can print it to see what you got, as follows:
Obviously, output like this is hard to read, so you should use Python's matplotlib library to visualize the data.
If you develop with the PyDev plugin for Eclipse under Ubuntu, installing matplotlib is a pitfall.
After installing the package, you also have to add a new library path in the plugin settings, because matplotlib, unlike NumPy, is not automatically installed into the Python 2.7 library directory.
The following is the correct library path:
You can then write the following code to analyze the data:
# Create a new figure object
fig = plt.figure()
# Set up a plot area of 1 row and 1 column and select the 1st area to display the data
ax = fig.add_subplot(111)
# Column index 1 of the training set (time spent playing online games) is the horizontal axis,
# column index 2 (litres of ice cream consumed per week) is the vertical axis
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
# Display the figure
plt.show()
Also remember to import the required matplotlib library at the top of the code:
# Import the matplotlib library
import matplotlib.pyplot as plt
import matplotlib
After running the code, the resulting plot is as follows:
There is a problem here: the plot above does not show the classification results.
The plot can be improved further; the relevant part of the code is:
# Create a new figure object
fig = plt.figure()
# Set up a plot area of 1 row and 1 column and select the 1st area to display the data
ax = fig.add_subplot(111)
# Column index 1 of the training set (time spent playing online games) is the horizontal axis,
# column index 2 (litres of ice cream consumed per week) is the vertical axis.
# The labels set both the size and the colour of each point.
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0 * numpy.array(datingLabels), 15.0 * numpy.array(datingLabels))
# Axis limits
ax.axis([-2, 25, -0.2, 2.0])
# Axis labels (configuring matplotlib to display Chinese is a little troublesome, so English is used here)
plt.xlabel('Percentage of Time Spent Playing Online Games')
plt.ylabel('Liters of Ice Cream Consumed Per Week')
# Display the figure
plt.show()
This gives the following plot:
In the same way, you can plot "frequent flyer miles per year" against "time spent playing online games":
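A plot of those two features might be produced along the following lines. This is only a sketch under stated assumptions: it takes column 0 to hold the frequent flyer miles and column 1 the game-playing time, the function name plot_miles_vs_games is hypothetical, and the data is randomly generated placeholder data so that the example runs on its own. It draws one scatter series per class so that a legend can name the classes.

```python
import matplotlib.pyplot as plt
import numpy


def plot_miles_vs_games(features, labels):
    """Scatter column 0 (miles) against column 1 (game time), one series per class."""
    features = numpy.asarray(features)
    labels = numpy.asarray(labels)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    for value in numpy.unique(labels):
        # Boolean mask selecting the rows that belong to this class
        rows = labels == value
        ax.scatter(features[rows, 0], features[rows, 1],
                   s=15.0 * value, label='class %d' % value)
    ax.set_xlabel('Frequent Flyer Miles Earned Per Year')
    ax.set_ylabel('Percentage of Time Spent Playing Online Games')
    ax.legend()
    return ax


# Randomly generated placeholder values, only so the sketch runs on its own
rng = numpy.random.default_rng(0)
placeholder_features = rng.random((30, 3)) * [50000.0, 20.0, 2.0]
placeholder_labels = rng.integers(1, 4, size=30)

axes = plot_miles_vs_games(placeholder_features, placeholder_labels)
```

Calling plt.show() afterwards displays the figure. With the real data, the feature matrix and label vector returned by the loading function would replace the placeholder arrays.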
Step three: Normalize the data
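The body of this step is elided here. Assuming it is about putting the three features on a common scale, one common approach is min-max scaling, which maps every column to the range 0 to 1 so that a feature with large numbers (frequent flyer miles are likely to be far larger than litres of ice cream) does not dominate the distance calculation. The function name min_max_normalize and the sample values below are made up for illustration; this is a sketch of the general technique, not necessarily the code the article uses.

```python
import numpy


def min_max_normalize(features):
    """Scale each column to [0, 1]; return scaled data, column ranges, column minimums."""
    features = numpy.asarray(features, dtype=float)
    minimums = features.min(axis=0)
    ranges = features.max(axis=0) - minimums
    # Avoid dividing by zero for a column whose values are all identical
    safe_ranges = numpy.where(ranges == 0, 1.0, ranges)
    scaled = (features - minimums) / safe_ranges
    return scaled, ranges, minimums


# Made-up values for illustration only
example = numpy.array([
    [40000.0, 8.0, 1.0],
    [10000.0, 2.0, 0.5],
    [25000.0, 5.0, 1.5],
])
scaled, ranges, minimums = min_max_normalize(example)
print(scaled)
```

The ranges and minimums are returned as well because a new sample has to be scaled with the same values before it is compared with the training data.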
。。。。。。。。
Improving the matching results of a dating site with the k-nearest neighbor algorithm