Overview
The k-nearest neighbor (kNN) algorithm classifies a sample by measuring the distances between its feature values and those of samples with known labels.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity — each test sample must have its distance to every training sample computed, which is slow and inefficient — and high space complexity, since the entire training set must be kept in memory.
Applicable data types: numeric and nominal (nominal data must first be converted to numeric form).
How it works: The algorithm requires a set of training samples, each with a class label. When a new sample is input, we compare each of its feature values against the features of the training samples and extract the class labels of the k most similar (nearest) training samples. The value k is user-defined; an optimal k can be chosen by measuring the resulting error rate. Finally, the class that occurs most frequently among those k nearest samples is taken as the class of the new sample.
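For intuition, here is a tiny hand-worked sketch of the neighbor-voting step described above. The distances and labels are made up for illustration; the full `classify0` implementation appears later.

```python
from collections import Counter

# hypothetical (distance, label) pairs for six training samples
neighbors = [(0.1, 'B'), (0.2, 'B'), (0.3, 'A'), (0.9, 'A'), (1.1, 'A'), (1.5, 'B')]
k = 3
k_nearest = sorted(neighbors)[:k]  # the k samples closest to the query point
# majority vote among the k nearest labels decides the classification
vote = Counter(label for _, label in k_nearest).most_common(1)[0][0]
print(vote)  # 'B': two of the three nearest neighbors are labeled B
```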
Algorithm General Flow
(1) Collect data: any method.
(2) Prepare data: organize the collected data into the structured format the algorithm requires.
(3) Analyze data: any method.
(4) Train the algorithm: this step does not apply to the k-nearest neighbor algorithm.
(5) Test the algorithm: calculate the error rate.
(6) Use the algorithm: e.g., classifying data for a dating site, or handwritten digit recognition.
Prepare data: Import data using Python
```python
from numpy import *
import operator  # standard-library module used later for sorting


def createDataSet():
    # feature matrix: each row is one sample with two feature values
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']  # class labels for the samples above
    return group, labels
```
I am using Python 3.6 here; I personally prefer the newer versions. Note: the training sample features are stored as a matrix in which each row vector holds the feature data of one sample and each column vector holds all the sample values of one feature (remember this).
Implementing K-Nearest neighbor algorithm
```python
def classify0(inX, dataSet, labels, k):
    # inX: input vector to classify; dataSet: training sample set;
    # labels: label vector; k: number of neighbors that vote
    dataSetSize = dataSet.shape[0]
    # tile() repeats inX into dataSetSize rows (1 copy per column),
    # so we can subtract the whole training set at once
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2             # square each element of the matrix
    sqDistances = sqDiffMat.sum(axis=1)  # axis=1 sums over each row
    distances = sqDistances ** 0.5       # Euclidean distances, as a row vector
    # argsort() sorts from smallest to largest and returns the
    # indices of the sorted points in the original distances array
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndicies[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1  # 0 is the default count
    # sort by vote count in descending order; sortedClassCount is a list of (label, count)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
The function returns the predicted class of a new input sample; the similarity between the new sample and the training samples is measured here by the Euclidean distance formula, i.e., the distance between feature vectors. To predict the class of a data point, you can run:
```python
group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))  # k=3 avoids a 2-2 tie on this 4-sample set
```
Classifier performance (classification quality) is affected by many factors, such as the classifier's settings and the dataset. To evaluate a classifier, we can compare its predictions against the true answers and compute the error rate: the number of errors divided by the total number of test cases.
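As a concrete illustration of that error-rate calculation (with made-up predictions and true labels):

```python
predictions = ['A', 'B', 'A', 'A', 'B']
truth = ['A', 'B', 'B', 'A', 'B']
# an error is any prediction that disagrees with the true label
errors = sum(p != t for p, t in zip(predictions, truth))
error_rate = errors / len(truth)
print(error_rate)  # 1 error out of 5 tests -> 0.2
```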
Example: improving the matching results of a dating site
Data Source Link: https://www.manning.com/books/machine-learning-in-action
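The code below loads the dataset with a `file2matrix` helper that is called later but never shown here. A minimal sketch of such a loader, assuming each line of the file holds three tab-separated feature values followed by an integer label (as in `DatingTestSet2.txt`):

```python
from numpy import zeros

def file2matrix(filename):
    # sketch: parse each line into 3 feature values plus 1 integer label
    with open(filename) as f:
        lines = f.readlines()
    returnMat = zeros((len(lines), 3))   # one row of features per sample
    classLabelVector = []
    for i, line in enumerate(lines):
        parts = line.strip().split('\t')
        returnMat[i, :] = [float(x) for x in parts[0:3]]
        classLabelVector.append(int(parts[-1]))
    return returnMat, classLabelVector
```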
If we want to see the distribution of the data, we can create a scatter plot with matplotlib.
```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
# use the class labels to vary marker size and color
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           20 * array(datingLabels), 15 * array(datingLabels))
plt.show()
```
Prepare data: Numerical normalization
If the value ranges of two features differ greatly, we can use normalization to rescale every feature value to the interval between 0 and 1:

```
newValue = (oldValue - min) / (max - min)
```
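A quick numeric check of the formula, with hypothetical values:

```python
old_value, min_val, max_val = 75.0, 20.0, 120.0
new_value = (old_value - min_val) / (max_val - min_val)
print(new_value)  # (75 - 20) / (120 - 20) = 0.55
```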
Here is the code for normalized eigenvalues:
```python
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # minimum of each column
    maxVals = dataSet.max(0)  # maximum of each column
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    # returns the normalized data matrix, the value range of each
    # feature, and the per-feature minimums
    return normDataSet, ranges, minVals
```
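As a self-contained sanity check, the same normalization can be written with NumPy broadcasting, which removes the need for `tile()` (a stylistic alternative to the version above, shown here with made-up data):

```python
import numpy as np

data = np.array([[1000.0, 2.0], [500.0, 8.0], [0.0, 5.0]])
min_vals = data.min(0)           # column-wise minimums
ranges = data.max(0) - min_vals  # per-feature value ranges
norm = (data - min_vals) / ranges  # broadcasting applies the formula element-wise
print(norm.min(0))  # each column now starts at 0
print(norm.max(0))  # and ends at 1
```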
Test algorithm: Verify the classifier
```python
def datingClassTest():
    hoRatio = 0.10  # hold out the first 10% of samples as the test set
    datingDataMat, datingLabels = file2matrix('DatingTestSet.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print(errorCount / float(numTestVecs))  # print the error rate
```
Using algorithms: Building a complete and usable system
The following function allows the user to enter three eigenvalues and the program will automatically give the predicted value
```python
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('DatingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # normalize the new input the same way as the training data, then classify
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print(resultList[classifierResult - 1])
```
The above covers the k-nearest neighbor algorithm and its use. The algorithm requires training samples close to the actual data. Besides the disadvantages listed earlier, it has one more flaw: it gives no information about the underlying structure of the data, so we cannot learn what an average sample or a typical sample of each class looks like.