Using the k-nearest neighbor algorithm to improve matching on dating sites

I. Theoretical study

1. Required reading

Please be sure to read chapters 1 and 2 of the book "Machine Learning in Action". This experiment puts the k-nearest neighbor algorithm (k-Nearest Neighbour, KNN) into practice by solving a dating-site matching problem.

2. Extended Reading

The recommended reading below supplements the theory in the book. It is easier to follow than the book itself and will deepen your understanding, so please read it carefully:

- CoolShell: K Nearest Neighbor algorithm
- Top ten data mining algorithms: the k-nearest neighbor algorithm

II. Online experiment

1. Analyze requirements

The people I see on dating sites fall into three categories:

- People I don't like
- People I like somewhat
- People I like very much

I want a classification algorithm that can sort people into these three categories for me.

I now have the following data for a group of people:

- Annual flight mileage
- Percentage of time spent playing video games
- Litres of ice cream eaten weekly

How can this group of people be classified based on these data? That is the requirement.

2. Prepare data

Download data

Download the required information for your experiment:

```
$ cd /home/shiyanlou
$ wget http://labfile.oss.aliyuncs.com/courses/499/lab2.zip
$ unzip lab2.zip
```

Copy the test data to our own directory:

```
$ cd /home/shiyanlou
$ mkdir mylab2
$ cd mylab2/
$ cp /home/shiyanlou/lab2/datingTestSet2.txt ./
```

Open the data file with gedit or vim to view its contents:

Each row in the file holds one person's data in 4 columns:

- Annual flight mileage
- Percentage of time spent playing video games
- Litres of ice cream eaten weekly
- Category (1: dislike, 2: like somewhat, 3: like very much)

Based on these data we begin to implement the KNN algorithm.

Open an XFCE terminal and start the `ipython` interactive mode, so that we can write code and test it as we go.

Parsing data

For the KNN algorithm to process our data, we first need to read the data into a matrix.

We implement the function `file2matrix()` to parse the data:

- Input: the data file, read row by row
- Output: a feature matrix and a class label vector

The function is implemented as follows. Note that NumPy is used to construct the matrix:

```python
from numpy import *

def file2matrix(filename):
    # Open the data file and read every line
    fr = open(filename)
    arrayOLines = fr.readlines()
    # Initialize the feature matrix
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    # Initialize the class label vector
    classLabelVector = []
    # Loop over each row of data
    index = 0
    for line in arrayOLines:
        # Strip the trailing newline
        line = line.strip()
        # Split the row into its 4 data items
        listFromLine = line.split('\t')
        # Store the first three items in the matrix
        returnMat[index, :] = listFromLine[0:3]
        # Store the fourth item in the label vector
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
```

Enter the code above into IPython, paying attention to the indentation of the Python code.
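To see what the parsing step does to a single row, here is a minimal sketch. The sample line is made up for illustration (it is not taken from `datingTestSet2.txt`), and the variable names are hypothetical.

```python
# Hypothetical sample row: mileage, game-time %, ice cream litres, class label.
sample_line = "40000\t8.5\t0.95\t3\n"

# Remove the trailing newline, then split on tabs
fields = sample_line.strip().split("\t")
# The first three items are the features
features = [float(x) for x in fields[0:3]]
# The last item is the class label
label = int(fields[-1])

print(features, label)
```

`file2matrix()` applies the same strip, split and convert steps to every row of the file.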

Once that is done, we test the function and look at the values it returns:

We can see that the first value in each row of the matrix is in the thousands or even tens of thousands, while the second and third values are much smaller. To keep the first feature from dominating the KNN distance calculation, the data needs a second processing step, normalization, which rescales every value into the range `0~1` or `-1~1`.

In this experiment we use the following formula to rescale the values into the `0~1` range:

`newValue = (oldValue-min)/(max-min)`

We implement an `autoNorm()` function to normalize the data. NumPy's `tile()` function expands `minVals` into a matrix the same size as `dataSet` so that matrix subtraction can be performed. An element-wise division then yields the normalized matrix:

```python
def autoNorm(dataSet):
    # Column-wise minimum and maximum of the matrix
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # Difference between maximum and minimum
    ranges = maxVals - minVals
    # Initialize the output
    normDataSet = zeros(shape(dataSet))
    # Number of rows in the matrix
    m = dataSet.shape[0]
    # Matrix subtraction: the oldValue - min step of the formula
    normDataSet = dataSet - tile(minVals, (m, 1))
    # Element-wise division: the division step of the formula
    normDataSet = normDataSet / tile(ranges, (m, 1))
    # Return the normalized data, the ranges and the minimums
    return normDataSet, ranges, minVals
```
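As a side note, NumPy broadcasting can do the same job without `tile()`. The sketch below is an alternative under that assumption, not the tutorial's implementation. The function name `auto_norm_broadcast` and the toy rows are made up for illustration.

```python
import numpy as np

def auto_norm_broadcast(data_set):
    """Min-max scale every column of data_set into the range 0~1."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting stretches min_vals and ranges across all rows,
    # which is what tile() does explicitly.
    norm = (data_set - min_vals) / ranges
    return norm, ranges, min_vals

# Made-up rows: mileage, game-time %, ice cream litres
toy = np.array([[40000.0, 8.0, 1.0],
                [10000.0, 2.0, 0.5],
                [70000.0, 5.0, 1.5]])
norm, ranges, min_vals = auto_norm_broadcast(toy)
print(norm)
```

Either version divides by zero if a column holds a single constant value, so real data may need a guard for that case.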

Once that is done, normalize the data we have just read:

We can use `matplotlib` to create a scatter plot of this data.
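A plot of this kind could be produced along the following lines. This is only a sketch: the random arrays stand in for the normalized matrix and the labels (they are not the real dating data), and the output file name `scatter.png` is an assumption.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt

# Synthetic stand-ins for the normalized matrix and labels (NOT the real data)
rng = np.random.RandomState(0)
norm_mat = rng.rand(30, 3)
labels = rng.randint(1, 4, size=30)

fig = plt.figure()
ax = fig.add_subplot(111)
# Column 1: game-time percentage, column 2: ice cream; colour by class label
ax.scatter(norm_mat[:, 1], norm_mat[:, 2], c=labels)
ax.set_xlabel("Percentage of time spent playing video games (normalized)")
ax.set_ylabel("Litres of ice cream eaten weekly (normalized)")
fig.savefig("scatter.png")
```

With the real data, `norm_mat` and `labels` would be replaced by the outputs of `autoNorm()` and `file2matrix()`.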

In the scatter plot created by the steps above, the horizontal axis is the normalized percentage of time spent playing video games and the vertical axis is the normalized amount of ice cream eaten per week.

The final picture is shown as follows:

3. Analyze data

We already know something about the KNN algorithm from the theory section. This section implements the core of the algorithm: calculating the "distance".

How do we calculate the distance between two people from their three features? Suppose the normalized data for the two people are `(0.3,0.5,0.3)` and `(0.5,0.2,0.3)`. A simple formula gives the distance between them:

$$d = \sqrt{(A_1-B_1)^2 + (A_2-B_2)^2 + (A_3-B_3)^2}$$

The distance between these two points is approximately 0.3606.
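The arithmetic for these two points can be checked with a few lines of plain Python. The variable names `a` and `b` are chosen here for illustration only.

```python
import math

a = (0.3, 0.5, 0.3)
b = (0.5, 0.2, 0.3)

# Sum the squared differences feature by feature, then take the square root
distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(round(distance, 4))  # 0.3606
```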

Given sample data with known classifications, we can feed in a piece of test data and let the algorithm decide which category it belongs to.

KNN algorithm implementation process:

- Calculate the distance between the test data and each piece of sample data
- Sort the samples by increasing distance
- Select the k samples nearest to the test data (note that the value of k affects the final classification error rate)
- Count how often each of the three categories occurs among these k samples; the most frequent category is the result
- Output: the test data is assigned to the category determined in the previous step
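The steps above can be sketched in plain Python before turning to the NumPy version. This is an illustrative sketch only: the function name `knn_classify` and the toy samples are made up, and `collections.Counter` is used for the vote as a convenience.

```python
import math
from collections import Counter

def knn_classify(test_point, samples, sample_labels, k):
    # Step 1: distance from the test point to every sample
    distances = [math.sqrt(sum((t - s) ** 2 for t, s in zip(test_point, row)))
                 for row in samples]
    # Step 2: sample indices sorted by increasing distance
    order = sorted(range(len(samples)), key=lambda i: distances[i])
    # Step 3: labels of the k nearest samples
    nearest = [sample_labels[i] for i in order[:k]]
    # Steps 4-5: the most frequent label among them is the answer
    return Counter(nearest).most_common(1)[0][0]

# Made-up normalized samples with class labels 1, 2 and 3
samples = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.2), (0.8, 0.9, 0.8),
           (0.9, 0.8, 0.9), (0.5, 0.5, 0.1)]
sample_labels = [1, 1, 3, 3, 2]
print(knn_classify((0.85, 0.85, 0.85), samples, sample_labels, k=3))  # 3
```

Here the two class 3 samples are the closest to the test point, so they outvote the single class 2 sample among the three nearest neighbors.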

The algorithm is implemented as the function `classify0()`, which takes the following parameters:

- `inX`: the test data vector
- `dataSet`: the sample data matrix
- `labels`: the class label vector for the sample data
- `k`: the number of nearest neighbors to use

```python
import operator

def classify0(inX, dataSet, labels, k):
    # Number of sample rows
    dataSetSize = dataSet.shape[0]
    # Difference between the test data and every sample row
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # Square the differences and sum them per row
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    # Take the square root to get the distance vector
    distances = sqDistances ** 0.5
    # Indices of the samples sorted by increasing distance
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Take the k nearest samples in turn
    for i in range(k):
        # Record the category the sample belongs to
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the categories by frequency, from high to low
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    # Return the most frequent category
    return sortedClassCount[0][0]
```

4. Test algorithm

After analyzing the data, we test the accuracy of the algorithm. The test procedure calls the functions implemented above.

First, note that the sample data and the test data come from the same data file: we use `10%` of the file as test data and the remaining `90%` as sample data.
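The split itself is just slicing. In the minimal sketch below the row count of 1000 is an assumption, and a list of row indices stands in for the real matrix.

```python
ho_ratio = 0.10
m = 1000                              # assumed total number of rows
num_test_vecs = int(m * ho_ratio)

rows = list(range(m))                 # stand-in for the rows of the data matrix
test_rows = rows[:num_test_vecs]      # first 10%: test data
sample_rows = rows[num_test_vecs:m]   # remaining 90%: sample data

print(len(test_rows), len(sample_rows))  # 100 900
```

Taking the first 10% in file order only gives a fair test if the rows are not sorted by category. If they were, shuffling before the split would be one option.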

Steps to test:

- Read the data file into a feature matrix and a class label vector
- Normalize the data to get the normalized feature matrix
- Run the KNN algorithm on the test data to get the classification results
- Compare them with the actual classifications and record the error rate
- Print each person's classification and the error rate as the final result

Test function Implementation:

```python
def datingClassTest():
    # Proportion of the data held out as test data
    hoRatio = 0.10
    # Read the data
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    # Normalize the data
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Total number of rows
    m = normMat.shape[0]
    # Number of test rows
    numTestVecs = int(m * hoRatio)
    # Initialize the error count
    errorCount = 0.0
    # Loop over each test row
    for i in range(numTestVecs):
        # Classify the test row against the remaining sample rows
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        # Print the KNN result and the real category
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        # Count the misclassifications
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    # Print the error rate
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
```

Part of the data used in the test, which has 900 rows of sample data and 100 rows of test data:

Run the program to test:

In the end we get an error rate of 0.05. Consider what could reduce the error rate, for example which values affect the final result. You are welcome to discuss this with teachers and classmates in the Shiyanlou Q&A section.
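One way to explore this is to rerun the test with different values of k. The sketch below does so on synthetic data: three noisy clusters generated here, not the dating data. The helper names `knn_predict` and `error_rate` are hypothetical, and the printed numbers say nothing about the real data set.

```python
import numpy as np

def knn_predict(x, samples, labels, k):
    # Euclidean distance from x to every sample row
    distances = np.sqrt(((samples - x) ** 2).sum(axis=1))
    nearest = labels[distances.argsort()[:k]]
    # Majority vote among the k nearest labels
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]

def error_rate(data, labels, k, ho_ratio=0.10):
    # The first ho_ratio of the rows is test data, the rest is sample data
    num_test = int(data.shape[0] * ho_ratio)
    errors = 0
    for i in range(num_test):
        guess = knn_predict(data[i], data[num_test:], labels[num_test:], k)
        errors += int(guess != labels[i])
    return errors / float(num_test)

# Synthetic data (NOT the dating data): three noisy clusters labelled 1 to 3
rng = np.random.RandomState(0)
centers = np.array([[0.2, 0.2, 0.2], [0.5, 0.8, 0.5], [0.8, 0.3, 0.7]])
labels = rng.randint(1, 4, size=300)
data = centers[labels - 1] + rng.normal(scale=0.15, size=(300, 3))

for k in (1, 3, 5, 9):
    print(k, error_rate(data, labels, k))
```

The same loop could vary the hold-out ratio instead, or be pointed at the normalized dating matrix in place of the synthetic clusters.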

III. Classic questions and answers

This section is updated continually and lists valuable questions raised by students on Shiyanlou, for reference.

IV. Complete data and code

Complete data and reference codes can be downloaded via wget:

```
$ cd /home/shiyanlou/
$ wget http://labfile.oss.aliyuncs.com/courses/499/lab2.zip
$ unzip lab2.zip
```

Full code:

```python
# -*- coding: utf-8 -*-
from numpy import *
import operator

# Read the data into a matrix
def file2matrix(filename):
    # Open the data file and read every line
    fr = open(filename)
    arrayOLines = fr.readlines()
    # Initialize the feature matrix
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))
    # Initialize the class label vector
    classLabelVector = []
    # Loop over each row of data
    index = 0
    for line in arrayOLines:
        # Strip the trailing newline
        line = line.strip()
        # Split the row into its 4 data items
        listFromLine = line.split('\t')
        # Store the first three items in the matrix
        returnMat[index, :] = listFromLine[0:3]
        # Store the fourth item in the label vector
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

# Normalize the data
def autoNorm(dataSet):
    # Column-wise minimum and maximum of the matrix
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # Difference between maximum and minimum
    ranges = maxVals - minVals
    # Initialize the output
    normDataSet = zeros(shape(dataSet))
    # Number of rows in the matrix
    m = dataSet.shape[0]
    # Matrix subtraction: the oldValue - min step of the formula
    normDataSet = dataSet - tile(minVals, (m, 1))
    # Element-wise division: the division step of the formula
    normDataSet = normDataSet / tile(ranges, (m, 1))
    # Return the normalized data, the ranges and the minimums
    return normDataSet, ranges, minVals

# KNN algorithm implementation
def classify0(inX, dataSet, labels, k):
    # Number of sample rows
    dataSetSize = dataSet.shape[0]
    # Difference between the test data and every sample row
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # Square the differences and sum them per row
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    # Take the square root to get the distance vector
    distances = sqDistances ** 0.5
    # Indices of the samples sorted by increasing distance
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Take the k nearest samples in turn
    for i in range(k):
        # Record the category the sample belongs to
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the categories by frequency, from high to low
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    # Return the most frequent category
    return sortedClassCount[0][0]

# Algorithm test
def datingClassTest():
    # Proportion of the data held out as test data
    hoRatio = 0.10
    # Read the data
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    # Normalize the data
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Total number of rows
    m = normMat.shape[0]
    # Number of test rows
    numTestVecs = int(m * hoRatio)
    # Initialize the error count
    errorCount = 0.0
    # Loop over each test row
    for i in range(numTestVecs):
        # Classify the test row against the remaining sample rows
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        # Print the KNN result and the real category
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        # Count the misclassifications
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    # Print the error rate
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))

# Run the algorithm test
datingClassTest()
```
