An overview of the k-nearest neighbor algorithm
To put it simply, the k-nearest neighbor algorithm classifies a new sample by measuring the distances between its feature values and those of labeled samples.
Advantages: High accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: High computational complexity and high space complexity.
Applicable data range: Numeric values and nominal values.
It works as follows: we have a collection of sample data, known as the training sample set, and a label for each sample in the set; that is, we know which category each sample in the set belongs to. When new data without a label is input, each feature of the new data is compared with the features of the samples in the set, and the algorithm extracts the classification labels of the most similar samples (the nearest neighbors). In general, we select only the k most similar samples in the dataset, which is where the k in k-nearest neighbor comes from; k is usually an integer not greater than 20. Finally, the most frequent classification among those k most similar samples is chosen as the classification of the new data. For example, if k = 3 and the three closest training samples carry the labels A, A, and B, the new sample is assigned to class A.
General flow of the k-nearest neighbor algorithm
(1) Collect data: Any method can be used.
(2) Prepare data: The numeric values required for the distance calculation, preferably in a structured data format.
(3) Analyze data: Any method can be used.
(4) Train the algorithm: This step does not apply to the k-nearest neighbor algorithm.
(5) Test the algorithm: Calculate the error rate.
(6) Use the algorithm: First input sample data and structured output results, then run the k-nearest neighbor algorithm to determine which classification the input data belongs to, and finally perform subsequent processing on the computed classification.
Pseudocode for the k-nearest neighbor algorithm
For each point in the dataset with an unknown category label, do the following:
(1) Calculate the distance between each point in the dataset of known categories and the current point (the Euclidean distance formula is commonly used);
(2) Sort the distances in ascending order;
(3) Select the k points with the smallest distances from the current point;
(4) Determine the frequency of each category among these k points;
(5) Return the most frequent category among the k points as the predicted classification of the current point.
from numpy import tile
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of the training set
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndices = distances.argsort()
    # vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndices[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
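As a quick sanity check, the following sketch runs classify0 on a tiny made-up dataset (the four points and their labels are illustrative only, not part of the original example):

from numpy import array

group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, labels, 3))   # prints 'B': two of the three nearest neighbors are 'B'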
Using the k-nearest neighbor algorithm to improve matching on a dating site
(1) Collect data: A text file is provided.
(2) Prepare data: Parse the text file with Python.
(3) Analyze data: Use Matplotlib to draw two-dimensional scatter plots.
(4) Train the algorithm: This step does not apply to the k-nearest neighbor algorithm.
(5) Test the algorithm: Use part of the data provided by Helen as test samples. The difference between a test sample and a non-test sample is that a test sample is already classified; if the predicted classification differs from the actual category, it is marked as an error.
(6) Use the algorithm: Build a simple command-line program into which Helen can enter some feature data to determine whether the other person is a type she would like.
Parsing data from a text file
Helen has been collecting dating data for some time. She keeps the data in a text file, with each sample occupying one line, 1000 lines in total. Each of Helen's samples consists of the following 3 features:
-Number of frequent flyer miles earned per year
-Percentage of time spent playing video games
-Litres of ice cream consumed per week
We read the data with the file2matrix function, which takes a filename string as input and outputs a matrix of training samples and a vector of class labels.
from numpy import zeros

def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())      # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))    # prepare matrix to return
    classLabelVector = []                    # prepare labels to return
    fr = open(filename)                      # reopen, since readlines() consumed the file
    index = 0
    for line in fr.readlines():
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
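A quick check at the Python prompt might look like this; it assumes the data file datingTestSet2.txt (the one loaded by the test code later in this section) is in the working directory:

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
print(datingDataMat[:3])    # first three rows of the feature matrix
print(datingLabels[:3])     # the corresponding class labels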
Analyze data: Create scatter plots with matplotlib
In the Python command environment, enter the following commands:
import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
plt.show()
It is hard to see any useful pattern in this plot because the sample class labels are not used.
Re-enter the code above, but use the following call to the scatter function:
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
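Putting the pieces together, a complete session might look like the sketch below; it assumes file2matrix is defined as above and datingTestSet2.txt is in the working directory (the axis labels simply restate the feature order given earlier):

from numpy import array
import matplotlib.pyplot as plt

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')

fig = plt.figure()
ax = fig.add_subplot(111)
# marker size and color vary with the class label, so the three
# categories become visually separable
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
ax.set_xlabel('percentage of time spent playing video games')
ax.set_ylabel('litres of ice cream consumed per week')
plt.show()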
Normalization of data
It is easy to see that the feature with the largest numeric values has the greatest influence on the result; that is, the number of frequent flyer miles earned per year will affect the result far more than the other two features in Table 2-3, the time spent playing video games and the litres of ice cream consumed per week. The only reason is that the frequent flyer mileage values are much larger than the other feature values. But Helen believes these three features are equally important, so as one of three equally weighted features, frequent flyer mileage should not affect the computed result so heavily.
When dealing with features that have such different ranges of values, we usually normalize them, for example rescaling the values to the range 0 to 1 or -1 to 1. The following formula converts a feature value from any range into the interval 0 to 1:
newValue = (oldValue - min) / (max - min)
where min and max are the smallest and largest values of that feature in the dataset, respectively. Although changing the value range adds complexity to the classifier, we must do so to obtain accurate results.
from numpy import zeros, shape, tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)         # column-wise minimums
    maxVals = dataSet.max(0)         # column-wise maximums
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals
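Assuming datingDataMat was loaded with file2matrix as above, a quick check of the normalized values:

normMat, ranges, minVals = autoNorm(datingDataMat)
print(normMat[:3])   # all features now lie in the 0-to-1 interval
print(ranges)        # the per-feature ranges used for scaling
print(minVals)       # the per-feature minimums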
Test Algorithm
We mentioned earlier that the error rate can be used to measure the performance of a classifier: the error rate is the number of times the classifier gives an incorrect result divided by the total number of test samples. A perfect classifier has an error rate of 0, and a classifier with an error rate of 1.0 gives no correct results at all. In the code we define a counter variable; each time the classifier misclassifies a sample, the counter is incremented by 1, and at the end of the run the counter divided by the total number of data points gives the error rate.
def datingClassTest():
    hoRatio = 0.50       # hold out 50% of the data as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
The classifier's error rate on the dating dataset is 2.4%, which is a fairly good result. Depending on the classification algorithm, the dataset, and the program settings, the output of the classifier can vary considerably.
This example shows that we can predict the classification correctly, with an error rate of only 2.4%. Helen can simply enter the attribute information of an unknown person and have the classification software estimate how much she would like them: dislike, somewhat like, or like very much.
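A rough sketch of such a command-line program follows (the function name classifyPerson and the prompt wording are illustrative, not from the original text; it assumes classify0, file2matrix, and autoNorm are defined as above, that datingTestSet2.txt is present, and that the class labels in the file are the integers 1 to 3):

from numpy import array

def classifyPerson():
    resultList = ['dislike', 'somewhat like', 'like very much']
    percentGames = float(input('percentage of time spent playing video games? '))
    ffMiles = float(input('frequent flyer miles earned per year? '))
    iceCream = float(input('litres of ice cream consumed per week? '))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentGames, iceCream])
    # normalize the query point with the same ranges as the training data
    result = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print('you will probably find this person:', resultList[result - 1])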
Handwriting recognition system
In this section, we construct a handwriting recognition system that uses the k-nearest neighbor classifier step by step. For simplicity, the system constructed here can only recognize the digits 0 to 9 (see Figure 2.6). The digits to be recognized have already been processed with graphics software into the same color and size: black-and-white images 32 pixels wide by 32 pixels high. Although storing images in text format does not use memory efficiently, we convert the images to text format for ease of understanding.
(1) Collect data: Text files are provided.
(2) Prepare data: Write a function img2vector to convert the image format into the vector format used by the classifier.
(3) Analyze data: Check the data at the Python command prompt to make sure it meets the requirements.
(4) Train the algorithm: This step does not apply to the k-nearest neighbor algorithm.
(5) Test the algorithm: Write a function that uses part of the provided dataset as test samples. The difference between a test sample and a non-test sample is that a test sample is already classified; if the predicted classification differs from the actual category, it is marked as an error.
(6) Use the algorithm: This step is not completed in this example. If you are interested, you can build a complete application that extracts digits from an image and recognizes them; the U.S. mail sorting system is a similar system in actual operation.
To use the classifier from the previous two examples, we must format the images as vectors. We will convert each 32x32 binary image matrix into a 1x1024 vector so that the classifier used in the first two sections can process the digital image information.
We first write a function img2vector to convert an image to a vector: the function creates a 1x1024 NumPy array, opens the given file, loops over the first 32 lines of the file, stores the first 32 character values of each line in the NumPy array, and returns the array.
from numpy import zeros

def img2vector(filename):
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
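A quick check at the Python prompt, assuming a digit file such as testDigits/0_13.txt exists (the filename here is illustrative):

testVector = img2vector('testDigits/0_13.txt')
print(testVector[0, 0:32])    # the first image row, now part of the 1x1024 vector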
Test Algorithm
from os import listdir
from numpy import zeros

def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
The k-nearest neighbor algorithm recognizes the handwritten digit dataset with an error rate of 1.2%.
When the algorithm is actually used, however, its efficiency is not high. The algorithm needs to perform 2000 distance calculations for each test vector, each involving 1024 floating-point dimensions, and this must be repeated for roughly 900 test vectors; in addition, we need about 2MB of storage for the training vectors. Is there an algorithm that reduces the cost in storage space and computing time? The kd-tree is an optimized version of k-nearest neighbor search that can save a great deal of this computational overhead.
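As a rough illustration of the idea only (not the implementation this text refers to), SciPy's cKDTree builds the tree once and then answers nearest-neighbor queries without computing the distance to every training vector; the snippet below assumes SciPy is installed and uses random stand-in data:

import numpy as np
from scipy.spatial import cKDTree

# stand-in data: 2000 training vectors of 1024 dimensions, matching the
# sizes quoted for the handwriting example above
rng = np.random.default_rng(0)
trainingMat = rng.random((2000, 1024))

tree = cKDTree(trainingMat)                      # built once
dist, idx = tree.query(rng.random(1024), k=3)    # 3 nearest neighbors
print(idx)   # indices of the nearest training vectors

Note that kd-trees pay off mainly in low-dimensional spaces; for vectors with 1024 dimensions the speedup over a brute-force scan is limited.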
Summary
The k-nearest neighbor algorithm is the simplest and most effective algorithm for classifying data, and this chapter showed how to construct classifiers with it in two examples. The k-nearest neighbor algorithm is instance-based learning: to use it, we must have training sample data close to the actual data. The algorithm must hold the entire dataset, so if the training set is large, a large amount of storage space is required. In addition, because a distance must be calculated to every sample in the dataset, it can be very time-consuming in actual use.