Case two: using the k-nearest neighbor algorithm to improve matching on a dating site
Case Analysis:
Helen has collected a data set with three features per person: the number of frequent flyer miles earned each year, the percentage of time spent playing video games, and the number of liters of ice cream consumed per week. To classify a new data point, we compare each of its features with the corresponding feature of every sample in the data set, and then take the class labels of the most similar (nearest) samples. In general we consider only the k most similar samples in the data set, which is where the k in "k-nearest neighbor" comes from; k is usually an integer no greater than 20. Finally, the class that occurs most frequently among those k neighbors is chosen as the class of the new data point.
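The classification step described above can be sketched as a small function. This is a minimal sketch of the classify0() routine used later in this note (part of the knn.py module), assuming the samples are rows of a NumPy array and labels is a parallel list:

```python
from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    # Euclidean distance from the input vector inX to every sample
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndices = distances.argsort()
    # Vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    # Return the most frequent label among the k neighbors
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```

Ties are broken by whichever label sorts first; with small k and well-separated classes this rarely matters.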
Process: applying the k-nearest neighbor algorithm to the dating site
(1) Collect data: provided as a text file.
(2) Prepare data: parse the text file with Python.
(3) Analyze data: draw a two-dimensional scatter plot with Matplotlib.
(4) Train algorithm: this step does not apply to the k-nearest neighbor algorithm.
(5) Test algorithm: use part of the data Helen provided as test samples. The difference between test samples and non-test samples is that test samples are already labeled; if the predicted class differs from the actual class, it is counted as an error.
(6) Use algorithm: build a simple command-line program into which Helen can enter some feature data to determine whether the other person is a type she would like.
Data set format
Converting the data set read from a file into a two-dimensional array and a class label vector
Code:
return returnMat,classLabelVector
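The listing ends with the return statement above; a minimal sketch of the full file2matrix() function, assuming each line of datingTestSet2.txt holds three tab-separated feature values followed by an integer class label (1, 2, or 3):

```python
from numpy import zeros

def file2matrix(filename):
    # Read all lines; each line: feat1 \t feat2 \t feat3 \t label
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines, 3))  # feature matrix, one row per sample
    classLabelVector = []                  # parallel list of class labels
    for index, line in enumerate(arrayOLines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = [float(x) for x in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector
```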
After reading the data file, display:
Plot a scatter plot. It uses the first and second columns of the datingMat matrix, which hold the feature values "frequent flyer miles earned per year" and "percentage of time spent playing video games".
Code:
from numpy import *
from importlib import reload  # handy for reloading KNN after edits
import matplotlib
import matplotlib.pyplot as plt
import KNN

datingMat, labelVector = KNN.file2matrix('datingTestSet2.txt')
plt.figure(figsize=(8, 5), dpi=80)
axes = plt.subplot(111)
type1_x = []
type1_y = []
type2_x = []
type2_y = []
type3_x = []
type3_y = []
for i in range(len(labelVector)):
    if labelVector[i] == 1:      # did not like
        type1_x.append(datingMat[i][0])
        type1_y.append(datingMat[i][1])
    elif labelVector[i] == 2:    # average charm
        type2_x.append(datingMat[i][0])
        type2_y.append(datingMat[i][1])
    else:                        # exceptional charm
        type3_x.append(datingMat[i][0])
        type3_y.append(datingMat[i][1])
typeFirst = axes.scatter(type1_x, type1_y, s=20, c='red')
typeSecond = axes.scatter(type2_x, type2_y, s=40, c='green')
typeThird = axes.scatter(type3_x, type3_y, s=60, c='blue')
plt.xlabel('frequent flyer miles earned per year')
plt.ylabel('percentage of time spent playing video games')
axes.legend((typeFirst, typeSecond, typeThird),
            ('did not like', 'average charm', 'exceptional charm'), loc=2)
plt.show()
Resulting plot:
Normalization of data
If we look closely at the data set, the number of frequent flyer miles earned per year will have a much greater impact on the result than the other two features, the percentage of time spent playing video games and the number of liters of ice cream consumed per week. The only reason for this is that the frequent flyer mileage values are much larger than the other feature values. But Helen believes these three features are equally important, so as one of three equally weighted features, frequent flyer mileage should not affect the result so heavily.
When dealing with features whose values fall in different ranges, we usually normalize them, for example rescaling the values to the range 0 to 1 or -1 to 1. The following formula converts feature values from an arbitrary range into values in the interval from 0 to 1:
newValue = (oldValue - min) / (max - min)
where min and max are the smallest and largest values of that feature in the data set. Although changing the range of the values increases the complexity of the classifier, we have to do it to get accurate results. We add a new function autoNorm() to the file knn.py that automatically converts numeric feature values to the range 0 to 1.
Code:
return normDataSet,ranges,minVals
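The listing ends with the return statement above; a minimal sketch of the full autoNorm() function, assuming dataSet is a two-dimensional NumPy array of feature values:

```python
from numpy import tile

def autoNorm(dataSet):
    # Column-wise minimum and maximum over the whole data set
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # newValue = (oldValue - min) / (max - min), applied element-wise
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
```

ranges and minVals are returned as well because the same scaling must later be applied to any new input vector before classifying it.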
Normalization effect:
Test algorithm
An important task in machine learning is evaluating an algorithm's correctness. Usually we use only 90% of the available data as training samples to train the classifier, and the remaining 10% to test it and measure its error rate. Note that the 10% of test data should be chosen at random; since the data Helen provided is not sorted for any particular purpose, we can simply take the first 10% without affecting randomness.
For a classifier, the error rate is the number of times the classifier gives an incorrect result divided by the total number of test samples. A perfect classifier has an error rate of 0, and a classifier with an error rate of 1.0 never gives a correct result. In the code we define a counter variable that is incremented each time the classifier misclassifies a sample; after the program finishes, the counter divided by the total number of test data points is the error rate.
Code:
print("the total error rate is %f" % (errorCount/float(numTestVecs)))
Test:
Dating site Prediction function
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("you will probably like this person: %s" % resultList[classifierResult - 1])
Effect:
Machine Learning in Action notes - k-nearest neighbor algorithm 2 (improving matching on a dating site)