I. Overview
The k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between feature values.
How it works: we start with a collection of sample data (the training set) in which every record carries a label (a classification), so we know which category each sample belongs to. When a new record without a label arrives, we compare each of its features with the features of every sample in the training set (using the Euclidean distance) and find the training samples most similar to it (its nearest neighbors). Typically we take the k most similar samples from the training set, and the category that appears most often among those k samples becomes the classification of the new record. For example, with k = 3, if the three nearest neighbors are labeled A, B, B, the new record is classified as B.
II. Advantages and Disadvantages
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational cost and high memory requirements.
Scope of application: numeric and nominal values.
III. The Mathematical Formula
Euclidean distance is the most intuitive distance measure; it comes from the formula for the distance between two points in Euclidean space.
(1) Euclidean distance between two points A(x1, y1) and B(x2, y2) on a two-dimensional plane:

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}

(2) Euclidean distance between two points A(x1, y1, z1) and B(x2, y2, z2) in three-dimensional space:

d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}

(3) Euclidean distance between two n-dimensional vectors A(x11, x12, ..., x1n) and B(x21, x22, ..., x2n):

d = \sqrt{\sum_{k=1}^{n}(x_{1k} - x_{2k})^2}
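As a quick check of formula (3), the distance can be computed directly with NumPy (a minimal sketch; the two sample vectors are invented for illustration):

from numpy import array, sqrt

a = array([1.0, 1.1])
b = array([0.0, 0.1])
d = sqrt(((a - b) ** 2).sum())   # square the differences, sum them, take the root
print(d)                         # 1.4142..., i.e. sqrt(1.0^2 + 1.0^2)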
IV. Implementation of the Algorithm
Pseudo-code for the k-nearest neighbors algorithm:
For each point in the dataset with an unknown class label, perform the following steps in turn:
(1) compute the distance between the current point and every point in the dataset of known categories;
(2) sort the distances in increasing order;
(3) select the k points nearest to the current point;
(4) count the frequency of each category among those k points;
(5) return the most frequent category among the k points as the predicted class of the current point.
1. Constructing the data

from numpy import *

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
There are four samples. Each column holds the value of one feature, and the vector labels contains the class label of each data point (its classification). There are two classes here, A and B.
2. Implementing the algorithm
tile repeats an array: tile(a, reps) tiles the array a according to reps to form a new, larger array. For example:
>>> tile([1, 2], 4)
array([1, 2, 1, 2, 1, 2, 1, 2])
>>> tile([1, 2], (4, 1))
array([[1, 2],
       [1, 2],
       [1, 2],
       [1, 2]])
>>> tile([1, 2], (4, 2))
array([[1, 2, 1, 2],
       [1, 2, 1, 2],
       [1, 2, 1, 2],
       [1, 2, 1, 2]])
Euclidean distance algorithm implementation:
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # subtract each sample row from the new data: [[x-x1, y-y1], [x-x2, y-y2], ...]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    # square every element: [[(x-x1)^2, (y-y1)^2], ...]
    sqDiffMat = diffMat ** 2
    # sum the features of each row: [(x-x1)^2 + (y-y1)^2, ...]
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root of each value, completing the Euclidean distance formula
    distances = sqDistances ** 0.5
    # argsort returns the indices that would sort the array in ascending order
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # take the k smallest distances and count the category of each of those points
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the most frequent category among the k nearest points
    return sortedClassCount[0][0]
Here inX is the new data to be classified, dataSet holds the feature values of the samples, labels holds the sample classes, and k is the number of nearest neighbors to select.
Test algorithm:
>>> group, labels = kNN.createDataSet()
>>> group, labels
(array([[ 1. ,  1.1],
        [ 1. ,  1. ],
        [ 0. ,  0. ],
        [ 0. ,  0.1]]), ['A', 'A', 'B', 'B'])
>>> kNN.classify0([0, 0], group, labels, 3)
'B'

Test result: [0, 0] belongs to category B.
3. How to test the classifier
To evaluate a classifier we measure its error rate: the number of misclassified samples divided by the total number of tests. Section V.4 below does exactly this on real data.
V. Example: Using the k-Nearest Neighbors Algorithm to Improve Matches on a Dating Site
My friend Helen has been using online dating sites to find a suitable date. Although the sites recommend different candidates, she does not like everyone. After thinking it over, she realized that the people she had dated fell into three types:
- people she did not like
- people with average charm
- people with great charm
Helen wants our classification software to do a better job of sorting candidate matches into the right category. She has also collected some data not recorded by the dating sites, which she believes is more useful for classifying matches.
1. Preparing data: Parsing data from a text file
The data is stored in the text file datingTestSet.txt. Each sample occupies one line, and there are 1000 lines in total.
Each of Helen's samples consists of the following three features:
- Number of frequent flyer miles earned per year
- Percentage of time spent playing video games
- Number of ice cream litres consumed per week
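The code in the following sections relies on a helper, file2matrix, to load this file. The function is not listed in this post; a minimal sketch, assuming three tab-separated numeric feature columns followed by an integer class label per line (the format of the datingTestSet2.txt file used below):

from numpy import zeros

def file2matrix(filename):
    with open(filename) as fr:
        lines = fr.readlines()
    returnMat = zeros((len(lines), 3))   # one row per sample, three feature columns
    classLabelVector = []                # integer class label of each sample
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = [float(x) for x in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector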
2. Analyzing data: creating scatter plots with matplotlib
The scatter plot uses the first and second columns of the datingDataMat matrix, i.e. the features "frequent flyer miles earned per year" and "percentage of time spent playing video games".
Figure: scatter plot of the dating data, frequent flyer miles earned per year vs. percentage of time spent playing video games.
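One way to produce such a plot (a sketch, assuming the file2matrix helper above; scaling the marker size and color by the class label makes the three types easier to tell apart):

import matplotlib.pyplot as plt
from numpy import array

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# columns 0 and 1 are the miles and game-time features
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
ax.set_xlabel('frequent flyer miles earned per year')
ax.set_ylabel('percentage of time spent playing video games')
plt.show()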
3. Preparing data: normalizing values
Different features have different means and ranges of values. If distances are computed directly on raw feature values, the feature with the largest range dominates the result and the features with smaller ranges contribute almost nothing. For example, for the two feature vectors {0, 20000, 1.1} and {67, 32000, 0.1}, the distance is:

d = \sqrt{(0 - 67)^2 + (20000 - 32000)^2 + (1.1 - 0.1)^2}

Clearly the second feature dominates the result, while the first and third features hardly matter.
For our purposes, however, the three features are equally important, so as one of three equal-weight features, frequent flyer mileage should not affect the computed distance so heavily.
When features have such different ranges of values, we usually normalize them, for example rescaling every feature to the range 0 to 1 or -1 to 1. The following formula converts a feature of any range into a value in the interval [0, 1]:

newValue = (oldValue - min) / (max - min)

where min and max are the smallest and largest values of that feature in the dataset. For example, with min = 0 and max = 20000, the value 8000 maps to 0.4.
Add an autoNorm() function to normalize the numeric feature values:

def autoNorm(dataSet):
    minVals = dataSet.min(0)     # column-wise minimum of each feature
    maxVals = dataSet.max(0)     # column-wise maximum of each feature
    ranges = maxVals - minVals   # range of values of each feature
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))     # oldValue - min
    normDataSet = normDataSet / tile(ranges, (m, 1))  # (oldValue - min) / (max - min)
    return normDataSet, ranges, minVals
Note that besides the normalized data, this function also returns the normalization parameters: the range values ranges and the minimum values minVals, which are needed later to normalize the test data.
The test set must be normalized with the same parameters as the training set (ranges and minVals). Computing separate ranges and minVals from the test data would map identical feature values inconsistently between the training set and the test set.
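For instance, a new sample must be rescaled with the training set's parameters rather than its own (a sketch; datingDataMat is assumed loaded via file2matrix, and the raw feature values are invented):

from numpy import array

normMat, ranges, minVals = autoNorm(datingDataMat)
newSample = array([40920, 8.3, 0.95])            # hypothetical raw feature vector
normNewSample = (newSample - minVals) / ranges   # reuse the training-set parameters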
4. Testing the algorithm: verifying the classifier as a complete program
An important task in machine learning is evaluating an algorithm's accuracy. Usually we use only 90% of the available data as training samples and hold out the remaining 10% to test the classifier and measure its error rate. The 10% of test data should be chosen at random; since Helen's data is not sorted in any particular order, we can simply take the first 10% without affecting randomness.
Classifier test code for the dating site, which tests the algorithm on the sample data:
def datingClassTest():
    hoRatio = 0.50    # hold out 50% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
To execute the classifier test program:
>>> kNN.datingClassTest()
the classifier came back with: 2, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
...
the total error rate is: 0.064000
32.0
The classifier's error rate on the dating dataset is 6.4%, which is a fairly good result. We can change the values of the variables hoRatio and k in datingClassTest and see whether the error rate changes as these values change.
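One way to run that experiment is to parameterize the test (a sketch; datingClassTest2 is a hypothetical variant of the function above):

def datingClassTest2(hoRatio=0.10, k=3):
    # same as datingClassTest, but with the hold-out ratio and k as parameters
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        result = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                           datingLabels[numTestVecs:m], k)
        if result != datingLabels[i]:
            errorCount += 1.0
    return errorCount / float(numTestVecs)

for k in (1, 3, 5, 7):
    print("k = %d, error rate = %f" % (k, datingClassTest2(0.10, k)))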
This example shows that we can usually predict the classification correctly; the error rate is only 6.4%. Helen can enter the attributes of an unknown person and have the classification software estimate how much she is likely to like them: not at all, in small doses, or in large doses.
5. Using the algorithm: building a complete usable system
Combining the code above, we can build a complete prediction function for the dating site. Note that the input data must be normalized first:
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games? "))
    ffMiles = float(input("frequent flier miles earned per year? "))
    iceCream = float(input("liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # numerically labeled data file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])   # new data must be normalized
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])
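A hypothetical interactive session (the input values are invented for illustration, and the predicted category depends on the data file):

>>> kNN.classifyPerson()
percentage of time spent playing video games? 10
frequent flier miles earned per year? 10000
liters of ice cream consumed per year? 0.5
You will probably like this person: in small doses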
So far, we've seen how to build classifiers on data.