Learning notes for "Machine Learning Practice": two application scenarios of k-Nearest Neighbor algorithms, and "Machine Learning Practice" k-
After learning how to implement the k-Nearest Neighbor algorithm, I tested it on the examples from "Machine Learning in Action": classifying data for a dating website, and a handwritten-digit recognition system. Both tests use the datasets that accompany the book.
Before writing any functions, add the following imports at the top of the .py file:
from numpy import *
import numpy as np
import operator
from os import listdir
The first part is data classification for a dating website, used to improve the site's matching results. The example is described as follows:
Helen has been using an online dating site to find a suitable partner. Although the site recommends different candidates, she has not found anyone she likes. After some reflection, she realized that the people she has dated fall into three types:
1. People she did not like (hereinafter category 1);
2. People of average attractiveness (hereinafter category 2);
3. Very attractive people (hereinafter category 3).
Despite discovering these patterns, Helen still cannot sort the site's recommendations into the right categories herself. She feels she can date people of average attractiveness on weekdays, while on weekends she prefers the company of the very attractive ones. Helen hopes our classification software can help her place candidates into the correct categories. She has also collected some data that the dating site does not record, which she believes will help with the classification.
The purpose of this case is to classify a given candidate (as 1, 2, or 3) from the available information. What do we need to achieve this with kNN? As mentioned above, we need sample data; reading the description carefully, the sample data is exactly the "data Helen collected that had not been recorded on the dating site".
The preceding description translates into the following steps:
1. Collect data
The data Helen collected records three features per person: the number of frequent-flyer miles earned per year, the percentage of time spent playing video games, and the liters of ice cream consumed per week. The data is a txt file: the first three columns are the three features in order, and the fourth column is the class label (1: did not like, 2: average attractiveness, 3: very attractive). Each row of data represents one person.
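For example, the record quoted later in these notes ([40920, 8.326976, 0.953952], class 3) appears in the file as one tab-separated line:

40920	8.326976	0.953952	3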
2. Prepare data
The computer needs to read the data from the txt file, so the data must be parsed into a format suited to mathematical operations; the natural choice is to store it in matrices. In the implementation code below, the function file2matrix(filename) completes this task: given the name of the data file (a string), it returns the training sample matrix and the class label vector. That is, the process returns two structures: a matrix storing the three feature values of each person, and a vector storing each person's class.
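As a quick sanity check of the two return values (a sketch; file2matrix itself appears in the implementation code below, and the 1000-record count comes from the dataset as described later):

mat, labels = file2matrix('datingTestSet2.txt')
print(mat.shape)    # (1000, 3): 1000 people, 3 features each
print(len(labels))  # 1000 class labels, each one of 1, 2, 3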
3. Design algorithms to analyze data
The idea of the k-Nearest Neighbor algorithm is to find the k samples closest to the test data and then decide the test data's class from the classes of those k samples, following the majority-rule principle. How to find the closest samples therefore becomes the key question. In signal processing and pattern recognition, "distance" is commonly used to measure the similarity of signals or features. Here we assume that the three feature values can stand in for each person; for example, the first person's attributes are [40920, 8.326976, 0.953952] and her class is 3. The distance is then simply the distance between points.
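As a minimal sketch of that distance computation (plain numpy; the second vector is a made-up example, not a record from the dataset):

import numpy as np

a = np.array([40920, 8.326976, 0.953952])  # first person's features
b = np.array([14488, 7.153469, 1.673904])  # hypothetical second sample
dist = np.sqrt(((a - b) ** 2).sum())       # Euclidean distance between the two points
print(dist)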
Compute the distance between the test sample and every point in the training samples, sort the distances from small to large, and take the first k as the k nearest neighbors; the class that holds the majority among those k neighbors is the final answer. This is the core of the k-Nearest Neighbor algorithm, implemented by the classify() function below.
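The majority vote at the end can be illustrated on its own (a toy sketch with made-up labels, independent of the dataset):

votes = [3, 1, 3, 2, 3, 1, 3]           # classes of the 7 nearest neighbours
counts = {}
for v in votes:
    counts[v] = counts.get(v, 0) + 1    # tally each class
winner = max(counts, key=counts.get)    # class 3 wins with four votes
print(winner)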
One further step improves the algorithm's behavior. Opening the data file, we find that the values in the first column are much larger than those of the other two features, so that feature dominates the distance formula: the distance between two samples depends almost entirely on it, and the other features become all but irrelevant, which clearly does not match reality. We therefore preprocess the data with normalization, after which the features keep their relative ordering within each column but contribute fairly to the distance. The Normalize(data) function below implements this.
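In formula form, this is min-max normalization, which rescales each feature column into [0, 1]:

x_norm = (x - min) / (max - min)

where min and max are taken per column, matching what Normalize(data) computes below.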
4. Test the algorithm
After data preprocessing and normalization, the kNN algorithm can be verified. The WebClassTest() function below runs the test.
Since there are 1000 data records, we set a ratio: with ratio = 0.1, the first 1000 * ratio = 100 records serve as test samples and the remaining 900 as training samples. The value of ratio can of course be changed, which affects the algorithm's results.
Implementation Code:
def classify(data, sample, label, k):
    SampleSize = sample.shape[0]
    DataMat = tile(data, (SampleSize, 1))      # repeat the test point once per training sample
    delta = (DataMat - sample) ** 2
    distance = (delta.sum(axis=1)) ** 0.5      # Euclidean distance to every training sample
    sortedDist = distance.argsort()            # indices sorted from nearest to farthest
    classCount = {}
    for i in range(k):
        votedLabel = label[sortedDist[i]]
        classCount[votedLabel] = classCount.get(votedLabel, 0) + 1
    result = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return result[0][0]                        # the majority class among the k neighbours

def file2matrix(filename):
    fil = open(filename)
    fileLines = fil.readlines()                # read the file into a list of lines
    fil.close()
    lenOfLines = len(fileLines)
    Mat = zeros((lenOfLines, 3))
    classLabel = []
    index = 0
    for line in fileLines:
        listFromLine = line.strip().split('\t')
        Mat[index, :] = [float(x) for x in listFromLine[0:3]]  # first three columns are features
        classLabel.append(int(listFromLine[-1]))               # last column is the class label
        index += 1
    return Mat, classLabel

mat, label = file2matrix('datingTestSet2.txt')

# draw a scatter plot of the data, one colour/marker size per class
import matplotlib
import matplotlib.pyplot as plt

figure = plt.figure()
axis = figure.add_subplot(111)
lab = ['didntLike', 'smallDoses', 'largeDoses']
for i in range(3):
    n = []
    l = []
    for j in range(len(label)):
        if label[j] == i + 1:                  # collect the points of class i + 1
            n.append(list(mat[j, 0:3]))
            l.append(label[j])
    n = np.array(n)                            # list -> numpy.ndarray
    axis.scatter(n[:, 0], n[:, 1], s=15.0 * array(l), c=15.0 * array(l), label=lab[i])
print(type(mat))
print(type(n))
plt.legend()
plt.show()

def Normalize(data):
    minValue = data.min(0)                     # column-wise minimum
    maxValue = data.max(0)                     # column-wise maximum
    ValueRange = maxValue - minValue
    k = data.shape[0]
    norm_data = data - tile(minValue, (k, 1))
    norm_data = norm_data / tile(ValueRange, (k, 1))  # rescale each column into [0, 1]
    return norm_data, ValueRange, minValue

def WebClassTest():
    ratio = 0.1
    dataMat, dataLabels = file2matrix('datingTestSet2.txt')
    normMat, ValueRange, minValue = Normalize(dataMat)
    k = normMat.shape[0]
    num = int(k * ratio)                       # first 10% of the records are the test set
    errorCount = 0.0
    for i in range(num):
        result = classify(normMat[i, :], normMat[num:k, :],
                          dataLabels[num:k], 7)  # k = 7 nearest neighbours
        print("The classifier came back with: %d, the real answer is %d"
              % (result, dataLabels[i]))
        if result != dataLabels[i]:
            errorCount += 1
    print("The total error rate is %f" % (errorCount / float(num)))

WebClassTest()
During the design process, pay attention to the differences among data structures such as list, array, and ndarray: numpy.ndarray behaves differently from array.array in the standard Python library. The calls print(type(mat)) and print(type(n)) in the code above are there to inspect each variable's type. Running the code draws the following scatter plot:
The scatter plot is drawn from two of the feature columns (the code above plots the first and second features). You can of course plot other pairs of dimensions as a two-dimensional scatter chart, or use all the dimensions for a higher-dimensional plot (not implemented here).
Now test the dating-website classification. Because the classification quality depends on the parameter k and on the proportion of test samples, the test starts from the parameters used in the book: k = 3, with the test set making up 0.1 of the total samples. The test results are as follows:
In theory, increasing k can improve accuracy; however, if k is too large, the accuracy drops again and the computational cost grows.
k = 7:
k = 17:
On the other hand, reducing ratio (that is, enlarging the training set) can also improve the algorithm's accuracy; however, each classification must then compare against more samples, so the cost of the algorithm grows as well.
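To explore both parameters without editing WebClassTest() by hand, the test loop can be parameterized. A minimal sketch, assuming the classify, file2matrix, and Normalize functions above are already defined (the k values are the ones tried in these notes):

def web_class_test(k_neighbors, ratio=0.1):
    dataMat, dataLabels = file2matrix('datingTestSet2.txt')
    normMat, _, _ = Normalize(dataMat)
    n = normMat.shape[0]
    num = int(n * ratio)                  # size of the test set
    errors = sum(
        classify(normMat[i, :], normMat[num:n, :], dataLabels[num:n], k_neighbors)
        != dataLabels[i]
        for i in range(num)
    )
    return errors / float(num)

for k_neighbors in (3, 7, 17):
    print("k = %d, error rate = %f" % (k_neighbors, web_class_test(k_neighbors)))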
The second part is handwritten digit recognition.
First, let's look at the dataset provided with the book: each digit is stored as a 32 x 32 text image of 0s and 1s in the trainingDigits and testDigits directories, with file names of the form <digit>_<index>.txt, so the class label can be read straight from the file name.
def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])   # flatten the 32x32 grid into one row
    fr.close()
    return returnVect

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')          # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]               # take off .txt
        classNumStr = int(fileStr.split('_')[0])          # the digit is encoded in the file name
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')                  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]               # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify(vectorUnderTest, trainingMat, hwLabels, 3)  # k = 3
        print("The classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nThe total number of errors is: %d" % errorCount)
    print("\nThe total error rate is: %f" % (errorCount / float(mTest)))

handwritingClassTest()
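As a quick check of the image format (a sketch; 0_13.txt is assumed to exist in the book's testDigits directory, but any file from it would do):

vec = img2vector('testDigits/0_13.txt')
print(vec.shape)       # (1, 1024): the 32x32 grid flattened into one row
print(int(vec.sum()))  # number of 'on' pixels in this digit image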
One result (k = 3):

With k = 7, the correct rate is not as good as with k = 3:
In handwritten digit recognition, the accuracy drops as k increases: a larger k is not always better.
This completes the study of the k-Nearest Neighbor algorithm and its verification on examples. Compared with other machine learning methods, kNN is among the simplest and most effective algorithms for data classification. To use it, training samples close to the actual data must be available. As mentioned earlier, however, the algorithm has notable drawbacks, the biggest being that it cannot reveal any of the data's intrinsic structure. The k-decision tree can be seen as an optimized version of the k-Nearest Neighbor algorithm: compared with kNN, a decision tree greatly reduces the storage and computation overhead. More on that in later study!