The content here mainly comes from the book Machine Learning in Action, with some of my own understanding added.
1. A simple description of the KNN algorithm
The k-nearest neighbor (k-Nearest Neighbor, KNN) classification algorithm may be the simplest machine learning algorithm. It classifies by measuring the distance between different feature values. The idea is simple: if most of the k samples most similar to a sample (that is, its nearest neighbors in the feature space) belong to a certain category, then the sample also belongs to that category. The figure below is one of the most frequently quoted classic example diagrams.
In the example above, we have two types of data, blue squares and red triangles, distributed in a two-dimensional plane. Now suppose we get a new data point, the green circle, and we need to decide whether it belongs to the blue-square class or the red-triangle class. How do we do it? We look at the points nearest to the green circle, on the idea that the points closest to it are the best basis for judging its category. How many points should we use for the judgment? That number is k. If k = 3, we pick the 3 points nearest to the green circle; since red triangles make up 2/3 of them, we assign the green circle to the red-triangle class. If k = 5, blue squares make up 3/5, so the green circle is assigned to the blue-square class. As you can see, the choice of k is important.
In the KNN algorithm, the selected neighbors are all samples that have already been correctly classified. The method decides the category of the sample to be classified based only on the category of the nearest sample or samples. Because KNN relies mainly on a limited number of nearby samples, rather than on a discriminant over class domains, it is better suited than other methods to sample sets whose class domains cross or overlap heavily.
The main disadvantage of the algorithm shows up when the samples are unbalanced, e.g. one class has a very large sample size while another is very small: when a new sample comes in, the large-capacity class may dominate its k nearest neighbors. Weighting can be used to improve this (neighbors at a small distance from the sample get a larger vote), as in the sketch below. Another disadvantage is the amount of computation: for every point to be classified, the distance to all known samples must be computed in order to find its k nearest neighbors. A common remedy is to edit the known sample points in advance, removing samples that contribute little to classification. The algorithm is well suited to automatic classification for class domains with large sample sizes; class domains with small sample sizes are more prone to misclassification.
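For illustration, here is a minimal sketch of the weighting idea (my own illustration, not code from the book): each neighbor's vote is weighted by the inverse of its distance, so closer neighbors count for more.

import numpy as np

def weighted_vote(distances, neighbor_labels):
    # weight each of the k neighbors by the inverse of its distance;
    # the small epsilon avoids division by zero on an exact match
    weights = 1.0 / (np.asarray(distances, dtype=float) + 1e-8)
    totals = {}
    for w, label in zip(weights, neighbor_labels):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)  # class with the largest weighted vote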
In summary: we already have a labeled reference data set. When new, unlabeled data comes in, we compare each feature of the new data with the corresponding features of the samples in the set, and the algorithm extracts the class labels of the most similar (nearest-neighbor) samples. In general, only the first k most similar samples in the data set are used. Finally, the category that occurs most frequently among those k samples is chosen. The algorithm is described as follows (a small code sketch follows the list):
1) Calculate the distance between each point in the data set of known categories and the current point;
2) Sort in increasing order of distance;
3) Select the k points with the smallest distance from the current point;
4) Determine the frequency of each category among the first k points;
5) Return the category with the highest frequency among the first k points as the predicted classification of the current point.
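For concreteness, a minimal sketch of steps 1)-3) in NumPy (the point and samples here are made up for illustration):

import numpy as np

point = np.array([0.5, 0.5])                   # the point to classify
dataSet = np.array([[1.0, 1.1], [1.0, 1.0],
                    [0.0, 0.0], [0.0, 0.1]])   # known samples
diff = dataSet - point                         # step 1: coordinate differences
distances = np.sqrt((diff ** 2).sum(axis=1))   # Euclidean distance to each sample
nearest = distances.argsort()[:3]              # steps 2-3: indices of the k = 3 closest points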
2. Python program section
2.1 Importing data in Python
from numpy import *

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
This creates the data set and its labels.
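A quick check in the interpreter (assuming the code is saved as kNN.py, as in the book):

>>> import kNN
>>> group, labels = kNN.createDataSet()
>>> labels
['A', 'A', 'B', 'B']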
Following the five steps described above, here is the core part of the k-nearest neighbor algorithm:
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # tile: construct array by repeating inX dataSetSize times
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                    # get distance
    sortedDistIndicies = distances.argsort()          # return the sorted array's indices
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
I am not sure whether it is an encoding setup problem, but comments could not be written in Chinese, only in English.
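A quick sanity check, which is also the example the book itself uses: the point [0, 0] is closest to the two 'B' samples, so with k = 3 the vote is 2 to 1 for 'B'.

>>> group, labels = createDataSet()
>>> classify0([0, 0], group, labels, 3)
'B'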
The book then applies the k-nearest neighbor algorithm to improving the match suggestions of a dating site. The specific process:
Prepare the data: parse the data from a text file with 3 features per line: frequent flyer miles flown, time spent playing games, and consumption of ice cream. I don't know why the authors chose these three features; they seem to have nothing to do with dating matches.
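Each line of datingTestSet2.txt holds the three feature values followed by the class label (1, 2, or 3), separated by tabs; a line looks roughly like this (values illustrative):

40920	8.326976	0.953952	3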
This section uses a number of functions that deal with matrices in NumPy.
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())     # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))   # prepare matrix to return
    classLabelVector = []                   # prepare labels to return
    fr = open(filename)
    index = 0
    for line in fr.readlines():
        line = line.strip()                 # strip whitespace such as the trailing newline
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]          # get 3 features
        classLabelVector.append(int(listFromLine[-1]))   # get classification result
        index += 1
    return returnMat, classLabelVector
Processing the data also involves normalizing the feature values. There are three features in the dating example, but the flight mileage values are much larger than those of the other two features; to give all three features equal influence, the data are normalized to the range [0, 1] with newValue = (oldValue - min) / (max - min).
def autoNorm(dataSet):
    minVals = dataSet.min(0)    # smallest value in each column
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals
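Usage, assuming the dating data has already been loaded with file2matrix (the relative path here is an assumption; adjust it to where the file actually lives):

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
normMat, ranges, minVals = autoNorm(datingDataMat)
# every column of normMat is now scaled into the interval [0, 1]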
Another application is a handwriting recognition system. As with the dating-site application, preparing the data requires converting each image to a vector, after which the k-nearest neighbor core algorithm is called.
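Each digit is stored as a 32x32 text image of 0s and 1s, and img2vector (shown in the full listing below) flattens it into a 1x1024 vector. A quick check, using a file name from the book's sample data:

testVector = img2vector('testDigits/0_13.txt')   # file from the book's sample data
print testVector.shape       # (1, 1024)
print testVector[0, 0:31]    # the first 31 pixels of the first row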
Below is all of the code put together, along with test code; the main function adds some matplotlib plotting test code.
"' Knn:k Nearest NeighborsInput:inX:vector to compare to existing dataset (1xN) dataset:size m data se T of known vectors (NxM) labels:data set labels (1xM vector) K:number of neighbors to use for comp Arison (should be a odd number) Output:the Most popular class label "from NumPy import *import Operatorf Rom OS import listdirimport matplotlibimport matplotlib.pyplot as Pltdef classify0 (InX, DataSet, labels, k): Datasetsiz E = dataset.shape[0] Diffmat = Tile (InX, (datasetsize,1))-DataSet # tile:construct array by repeating InX Datasetsi Ze times sqdiffmat = diffmat**2 sqdistances = sqdiffmat.sum (axis=1) distances = sqdistances**0.5 # get distance Sorteddistindicies = Distances.argsort () # return ordered array ' s index classcount={} for I in R Ange (k): Voteilabel = Labels[sorteddistindicies[i]] Classcount[voteilabel] = Classcount.get (voteilabel,0) + 1 Sortedclasscount =Sorted (Classcount.iteritems (), Key=operator.itemgetter (1), reverse=true) return sortedclasscount[0][0] def CreateDataSet (): group = Array ([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]]) labels = [' A ', ' a ', ' B ', ' B '] return group, lab Elsdef File2matrix (filename): FR = open (filename) numberoflines = Len (Fr.readlines ()) #get the number of Lin Es in the file Returnmat = Zeros ((numberoflines,3)) #prepare matrix to return classlabelvector = [] #prepare Labels Return fr = open (filename) index = 0 for line in Fr.readlines (): line = Li Ne.strip () # Delete character like tab or backspace listfromline = line.split (' \ t ') Returnmat[index,:] = Li Stfromline[0:3] # get 3 features Classlabelvector.append (int (listfromline[-1])) # Get classify result index + = 1 return returnmat,classlabelvector def autonorm (dataSet): minvals = dataset.min (0) # Select least value in Column Maxvals = Dataset.max (0) Ranges = Maxvals-minvals Normdataset = zeros (Shape (dataSet)) m = dataset.shape[0] Normdataset = Dataset-til E (Minvals, (m,1)) Normdataset = Normdataset/tile (ranges, (m,1)) #element wise divide return normdataset, ranges, M Invals def datingclasstest (): HoRatio = 0.50 #hold out 10% datingdatamat,datinglabels = File2matrix (' E:\Pytho Nmachine learning in Action\datingtestset2.txt ') #load data setfrom file Normmat, ranges, minvals = Autonorm (dati Ngdatamat) m = normmat.shape[0] print m numtestvecs = Int (m*horatio) Errorcount = 0.0 for i in range (numtes TVECS): Classifierresult = classify0 (normmat[i,:],normmat[numtestvecs:m,:],datinglabels[numtestvecs:m],3) PR int "The classifier came back with:%d, the real answer is:%d"% (Classifierresult, datinglabels[i]) if (classifie Rresult! = datinglabels[i]): Errorcount + = 1.0 print "The total error rate is:%f"% (Errorcount/float (numtestvecs)) Print Errorcountdef CLASsifyperson (): resultlist = [' Not @ all ', ' in small doses ', ' large doses '] percenttats = float (raw_input (' Percenta GE time spent on games? ')) Ffmiles = float (raw_input (' Frequent flier miles per year? ')) Icecream = float (raw_input (' liters of ice cream consumed each year? 
')) Datingdatamat,datinglabels = File2matrix (' E:\PythonMachine Learning in Action\datingtestset2.txt ') #load data Setfro M file Normmat, ranges, minvals = Autonorm (datingdatamat) Inarr = Array ([Ffmiles,percenttats,icecream]) Classifie Rresult = Classify0 ((inarr-minvals)/ranges,normmat,datinglabels,3) print "Your probably like this person:", re Sultlist[classifierresult-1]def img2vector (filename): Returnvect = Zeros ((1,1024)) FR = open (filename) for I in R Ange (+): Linestr = Fr.readline () for J in Range (+): returnvect[0,32*i+j] = Int (linestr[j]) r Eturn returnvectdef handwritingclasstest (): Hwlabels = [] trainingfilelist = Listdir (' E:/pythOnmachine learning in Action/trainingdigits ') #load the training set m = Len (trainingfilelist) Trainingmat = Zeros ((m,1024)) for I in Range (m): Filenamestr = trainingfilelist[i] Filestr = Filenamestr.split ('. ') [0] #take off. txt classnumstr = int (Filestr.split ('_') [0]) Hwlabels.append (CLASSNUMSTR) Training Mat[i,:] = Img2vector (' E:/pythonmachine learning in action/trainingdigits/%s '% filenamestr) testfilelist = Listdir (' E: /pythonmachine learning in Action/testdigits ') #iterate through the test set errorcount = 0.0 mtest = Len (tes Tfilelist) for I in Range (mtest): Filenamestr = testfilelist[i] Filestr = Filenamestr.split ('. ') [0] #take off. txt classnumstr = int (Filestr.split ('_') [0]) Vectorundertest = Img2vector (' E:/pythonmachi NE learning in action/testdigits/%s '% filenamestr) Classifierresult = Classify0 (Vectorundertest, Trainingmat, HwLa BELs, 3) print "The ClassifIer came back with:%d, the real answer is:%d "% (Classifierresult, classnumstr) if (classifierresult! = Classnums TR): Errorcount + = 1.0 print "\nthe total number of errors are:%d"% errorcount print "\nthe total error rate is:%f "% (Errorcount/float (mtest)) if __name__== ' __main__ ': #classifyperson () datingclasstest () dataSet, labels = creat Edataset () Testx = Array ([1.2, 1.0]) K = 3 Outputlabel = classify0 (TESTX, DataSet, labels, 3) print "Your in Put is: ", Testx," and classified to class: ", Outputlabel testx = Array ([0.1, 0.3]) Outputlabel = classify0 (test X, DataSet, labels, 3) print "Your input is:", Testx, "and classified to class:", Outputlabel handwritingclasstest ( ) Datingdatamat,datinglabels = File2matrix (' E:\PythonMachine Learning in Action\datingtestset2.txt ') print Datingdat AMat print datinglabels[0:20] Fig = plt.figure () ax = Fig.add_subplot (111) Ax.scatter (datingdatamat[:,1],datin Gdatamat[:,2],15.0*array(Datinglabels), 15.0*array (Datinglabels)) Plt.show ()
One thing to note:
trainingFileList = listdir('E:/PythonMachine Learning in Action/trainingDigits')
This call uses a hard-coded path. If you do not want to complicate things with paths, simply put the data folders in the same directory as kNN.py.
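For example, with the trainingDigits folder sitting next to kNN.py, a relative path works:

trainingFileList = listdir('trainingDigits')   # relative path; no drive letter needed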