Overview
The k-nearest neighbor (kNN) algorithm classifies a sample by measuring the distances between its feature values and those of samples with known labels.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity — each test sample must have its distance to every training sample computed, which is slow and inefficient — and high space complexity, since the entire training set must be kept in memory.
Applicable data types: numeric and nominal (nominal data must first be converted to numeric form).
How it works: The algorithm requires a set of training samples, each with a class label. When a new sample is input, we compare each of its feature values against the features of the training samples and extract the class labels of the k most similar (nearest) training samples. The value k is user-defined; an optimal k can be chosen by measuring the resulting error rate. Finally, the class that occurs most frequently among those k nearest samples is taken as the class of the new sample.
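For intuition, here is a tiny hand-worked sketch of the neighbor-voting step described above. The distances and labels are made up for illustration; the full `classify0` implementation appears later.

```python
from collections import Counter

# hypothetical (distance, label) pairs for six training samples
neighbors = [(0.1, 'B'), (0.2, 'B'), (0.3, 'A'), (0.9, 'A'), (1.1, 'A'), (1.5, 'B')]
k = 3
k_nearest = sorted(neighbors)[:k]  # the k samples closest to the query point
# majority vote among the k nearest labels decides the classification
vote = Counter(label for _, label in k_nearest).most_common(1)[0][0]
print(vote)  # 'B': two of the three nearest neighbors are labeled B
```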
Algorithm General Flow
(1) Collect data: any method.
(2) Prepare data: organize the collected data into the structured format the algorithm requires.
(3) Analyze data: any method.
(4) Train the algorithm: this step does not apply to the k-nearest neighbor algorithm.
(5) Test the algorithm: calculate the error rate.
(6) Use the algorithm: e.g., classifying data for a dating site, or handwritten digit recognition.
Prepare data: Import data using Python
```python
from numpy import *
import operator  # standard-library module used later for sorting


def createDataSet():
    # feature matrix: each row is one sample with two feature values
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']  # class labels for the samples above
    return group, labels
```
I am using Python 3.6 here; I personally prefer the newer versions. Note: the training sample features are stored as a matrix in which each row vector holds the feature data of one sample and each column vector holds all the sample values of one feature (remember this).
Implementing K-Nearest neighbor algorithm
```python
def classify0(inX, dataSet, labels, k):
    # inX: input vector to classify; dataSet: training sample set;
    # labels: label vector; k: number of neighbors that vote
    dataSetSize = dataSet.shape[0]
    # tile() repeats inX into dataSetSize rows (1 copy per column),
    # so we can subtract the whole training set at once
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2             # square each element of the matrix
    sqDistances = sqDiffMat.sum(axis=1)  # axis=1 sums over each row
    distances = sqDistances ** 0.5       # Euclidean distances, as a row vector
    # argsort() sorts from smallest to largest and returns the
    # indices of the sorted points in the original distances array
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1  # 0 is the default count
    # sort by vote count in descending order; sortedClassCount is a list of (label, count)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
The function returns the predicted class of a new input sample; the similarity between the new sample and the training samples is measured here by the Euclidean distance formula, i.e., the distance between feature vectors. To predict the class of a data point, you can run:
```python
group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))  # k=3 avoids a 2-2 tie on this 4-sample set
```
Classifier performance (classification quality) is affected by many factors, such as the classifier's settings and the dataset. To evaluate a classifier, we can compare its predictions against the true answers and compute the error rate: the number of errors divided by the total number of test cases.
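As a concrete illustration of that error-rate calculation (with made-up predictions and true labels):

```python
predictions = ['A', 'B', 'A', 'A', 'B']
truth = ['A', 'B', 'B', 'A', 'B']
# an error is any prediction that disagrees with the true label
errors = sum(p != t for p, t in zip(predictions, truth))
error_rate = errors / len(truth)
print(error_rate)  # 1 error out of 5 tests -> 0.2
```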
Example: improving the matching results of a dating site
Data Source Link: https://www.manning.com/books/machine-learning-in-action
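The code below loads the dataset with a `file2matrix` helper that is called later but never shown here. A minimal sketch of such a loader, assuming each line of the file holds three tab-separated feature values followed by an integer label (as in `DatingTestSet2.txt`):

```python
from numpy import zeros

def file2matrix(filename):
    # sketch: parse each line into 3 feature values plus 1 integer label
    with open(filename) as f:
        lines = f.readlines()
    returnMat = zeros((len(lines), 3))   # one row of features per sample
    classLabelVector = []
    for i, line in enumerate(lines):
        parts = line.strip().split('\t')
        returnMat[i, :] = [float(x) for x in parts[0:3]]
        classLabelVector.append(int(parts[-1]))
    return returnMat, classLabelVector
```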
If we want to see the distribution of the data, we can create a scatter plot with matplotlib.
```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
# use the class labels to vary marker size and color
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           20 * array(datingLabels), 15 * array(datingLabels))
plt.show()
```
Prepare data: Numerical normalization
If the value ranges of two features differ greatly, we can use normalization to rescale every feature value to the interval between 0 and 1:

```
newValue = (oldValue - min) / (max - min)
```
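A quick numeric check of the formula, with hypothetical values:

```python
old_value, min_val, max_val = 75.0, 20.0, 120.0
new_value = (old_value - min_val) / (max_val - min_val)
print(new_value)  # (75 - 20) / (120 - 20) = 0.55
```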
Here is the code for normalized eigenvalues:
```python
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # minimum of each column
    maxVals = dataSet.max(0)  # maximum of each column
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    # returns the normalized data matrix, the value range of each
    # feature, and the per-feature minimums
    return normDataSet, ranges, minVals
```
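As a self-contained sanity check, the same normalization can be written with NumPy broadcasting, which removes the need for `tile()` (a stylistic alternative to the version above, shown here with made-up data):

```python
import numpy as np

data = np.array([[1000.0, 2.0], [500.0, 8.0], [0.0, 5.0]])
min_vals = data.min(0)           # column-wise minimums
ranges = data.max(0) - min_vals  # per-feature value ranges
norm = (data - min_vals) / ranges  # broadcasting applies the formula element-wise
print(norm.min(0))  # each column now starts at 0
print(norm.max(0))  # and ends at 1
```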
Test algorithm: Verify the classifier
```python
def datingClassTest():
    hoRatio = 0.10  # hold out the first 10% of samples as the test set
    datingDataMat, datingLabels = file2matrix('DatingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print(errorCount / float(numTestVecs))  # print the error rate
```
Using algorithms: Building a complete and usable system
The following function allows the user to enter three eigenvalues and the program will automatically give the predicted value
```python
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('DatingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # normalize the new input the same way as the training data, then classify
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print(resultList[classifierResult - 1])
```
The above covers the k-nearest neighbor algorithm and its use. The algorithm requires training samples close to the actual data. Besides the disadvantages listed earlier, it has one more flaw: it gives no information about the underlying structure of the data, so we cannot learn what an average sample or a typical sample of each class looks like.