Machine Learning Classical Algorithms Explained with Python Implementation: k-Nearest Neighbor (KNN) Algorithm


(i) KNN is a supervised learning algorithm

The KNN (k-Nearest Neighbors) algorithm is theoretically the simplest of all machine learning algorithms and is easy to understand. KNN is an instance-based learning method: it computes the distance between the new data point and the feature values of the training data, then selects the K (K >= 1) nearest neighbors for classification (by voting) or for regression. If K = 1, the new data point is simply assigned to the class of its single nearest neighbor.

Is KNN supervised or unsupervised learning? First consider the definitions. In supervised learning, the data carries explicit labels (discrete labels for classification, continuous values for regression), and the model learned by the machine assigns new data to a definite class or predicts a definite value. In unsupervised learning, the data has no labels; the model the machine learns is a pattern extracted from the data (decisive features, clusters, and so on). Clustering, for example, lets the learned model infer which group of the original data a new point is "most like". When KNN is used for classification, every training point has an explicit label and the label of a new point is inferred from its neighbors; when KNN is used for regression, a definite value is predicted from the neighbors' values. KNN therefore belongs to supervised learning.


The KNN algorithm process is:

    1. Choose a distance metric and compute the distance from the new data point to every point in the known-label data set, using all of the data's features.
    2. Sort by distance in ascending order and select the K points closest to the new data point.
    3. For classification, return the most frequent class label among the K points as the predicted class; for regression, return a weighted value of the K points as the predicted value. A minimal code sketch of these steps is given below.
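As a concrete illustration of the three steps above, here is a minimal NumPy sketch; the toy training set (X_train, y_train) and the query point are made up for illustration and are not part of the article's code.

import numpy as np
from collections import Counter

# Made-up toy training set: 2 features per sample, labels 'A'/'B'
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = ['A', 'A', 'B', 'B']
new_point = np.array([0.1, 0.2])
k = 3

# Step 1: Euclidean distance from the new point to every training point
distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))

# Step 2: sort ascending and take the k nearest neighbors
nearest = distances.argsort()[:k]

# Step 3: majority vote among the k neighbors' labels
votes = Counter(y_train[i] for i in nearest)
print(votes.most_common(1)[0][0])   # -> 'B'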
(ii) Key points of the KNN algorithm

The theory and process of the KNN algorithm are simple, but to obtain good results there are a few points that deserve attention.
1. Quantify all features of the data so that they are comparable.


If the data contains non-numeric features, they must be quantified into numeric values.

For example, if the sample features include color (red, black, blue), there is no natural distance between colors; converting each color to a grayscale value makes the distance calculation possible.

In addition, a sample usually has several features, each with its own domain and range of values, and each affects the distance calculation differently: a feature with a large value range will drown out features with smaller ranges. To treat the features fairly, the sample features should be rescaled; the simplest approach is to normalize all feature values, as shown in the sketch below.
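A minimal min-max normalization sketch with made-up sample values; the article's own autoNorm function later in the program does the same thing.

import numpy as np

# Made-up samples: column 0 has a much larger range than column 1
data = np.array([[40000.0, 0.8],
                 [14000.0, 1.6],
                 [75000.0, 0.5]])

min_vals = data.min(axis=0)
ranges = data.max(axis=0) - min_vals
norm = (data - min_vals) / ranges      # every feature now lies in [0, 1]
print(norm)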
2. A distance function is needed to compute the distance between two samples.

There are many definitions of distance, such as Euclidean distance, cosine distance, Hamming distance, Manhattan distance, and so on; for similarity measures, refer to "Random Talk: Methods of distance and similarity measurement in machine learning". Normally the Euclidean distance is used, but it only applies to continuous variables.

For discrete variables such as text classification, the Hamming distance can be used as the measure. If special algorithms are used to learn the metric, the accuracy of K-nearest-neighbor classification can be significantly improved, for example by large margin nearest neighbor or neighbourhood components analysis. A small sketch of the two basic distance functions follows.
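A short sketch of the two distances named above, computed on made-up vectors.

import numpy as np

# Euclidean distance for continuous feature vectors (made-up values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(np.sqrt(((a - b) ** 2).sum()))        # 5.0

# Hamming distance for discrete vectors, e.g. binary text features (made-up values)
s = np.array([1, 0, 1, 1, 0])
t = np.array([1, 1, 1, 0, 0])
print((s != t).sum())                        # 2 positions differ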
3. Determine the value of K.
K is a user-defined constant, and its value directly affects the final prediction. One way to choose K is the cross-validation error statistic selection method. The idea of cross-validation was mentioned earlier: part of the data is used as training samples and the rest as test samples, for example 95% for training and the remainder for testing.

Train a machine learning model on the training data, then measure its error rate on the test data. The cross-validation error statistic selection method compares the average cross-validation error rate for different values of K and selects the K with the lowest error rate.

For example, choose K = 1, 2, 3, ..., 100; run cross-validation for each K = i, compute the average error, and then pick the K with the smallest one. A sketch of this procedure is given below.
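A hedged sketch of selecting K by cross-validation; scikit-learn and its built-in Iris data set are assumed here purely for brevity and are not part of the article's code.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)   # stand-in data set, just for illustration

best_k, best_acc = None, 0.0
for k in range(1, 31):
    # average accuracy over 5 cross-validation folds for this K
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print('best k = %d, accuracy = %.3f' % (best_k, best_acc))   # the K with the lowest error rate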

(iii) KNN classification

The training samples are vectors in a multidimensional feature space, and each carries a category label (like or dislike, keep or delete). Classification is usually decided by "majority vote": the class that occurs most frequently among the K neighbors is taken as the predicted class of the test point. One drawback of majority voting is that classes with more frequent samples tend to dominate the prediction, simply because they are more likely to appear in the K-neighborhood of the test point, from whose samples the prediction is computed. One way to overcome this drawback is to take the distance from each of the K neighbors to the test point into account when classifying. For example, if a neighbor lies at distance d from the test point, use 1/d as that neighbor's weight (i.e., the weight of that neighbor's class), then sum the weights per class label over all K neighbors; the label with the largest total becomes the predicted class of the new data point.
For example, with K = 5, suppose the distances from a new data point to its five nearest neighbors are (1, 3, 3, 4, 5) and the class labels of those five neighbors are (yes, no, no, yes, no).

With a simple majority vote, the new data point's category is no (3 no vs. 2 yes); with distance-weighted voting it is yes (no: 1/3 + 1/3 + 1/5 ≈ 0.87, yes: 1/1 + 1/4 = 1.25).
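The arithmetic of this small example can be checked directly in Python; this is only a verification sketch of the numbers above.

# Distances to the 5 nearest neighbors and their class labels (the example above)
neighbors = [(1, 'yes'), (3, 'no'), (3, 'no'), (4, 'yes'), (5, 'no')]

# Simple majority vote: each neighbor counts once
counts = {}
for d, label in neighbors:
    counts[label] = counts.get(label, 0) + 1
print('majority vote -> %s' % max(counts, key=counts.get))      # 'no'  (3 no vs. 2 yes)

# Distance-weighted vote: each neighbor contributes 1/d to its label
weights = {}
for d, label in neighbors:
    weights[label] = weights.get(label, 0.0) + 1.0 / d
print('weighted vote -> %s' % max(weights, key=weights.get))    # 'yes' (yes: 1.25, no: ~0.87)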


The following Python program gives two examples of using the KNN algorithm (with Euclidean distance and majority-vote decisions): one uses KNN to improve the matching effect of a dating site, the other uses KNN for handwritten digit recognition.


The dating-site example infers whether a man is the type Helen likes (categories: like very much, average, dislike) from three features: annual flight mileage, the proportion of time spent playing video games, and weekly ice cream consumption.

Since the three features have different value ranges, the scaling strategy used here is normalization.
The handwriting recognition system using the KNN classifier can only recognize the digits 0 to 9.

The digits to be recognized have already been processed with graphics software into images of uniform color and size: 32 x 32 pixel black-and-white images. Storing images in text format does not use memory efficiently, but the images are converted to text format here for ease of understanding. The training data contains about 200 samples per digit. The program formats each image sample as a vector, converting the 32 x 32 binary image matrix into a 1 x 1024 vector.

from numpy import *
import operator
from os import listdir
import matplotlib
import matplotlib.pyplot as plt
import pdb


def classify0(inX, dataSet, labels, k=3):
    # pdb.set_trace()
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()  # ascending sort returns indices, used to choose the k nearest items
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # dict with label as key and occurrence count as value
    # descending sort by occurrence count
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def file2matrix(filename):
    fr = open(filename)
    lines = fr.readlines()
    numberOfLines = len(lines)              # number of lines in the file
    returnMat = zeros((numberOfLines, 3))   # matrix to return
    classLabelVector = []                   # labels to return
    index = 0
    for line in lines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    fr.close()
    return returnMat, classLabelVector


def plotScatter():
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
               15.0 * array(datingLabels), 15.0 * array(datingLabels))
    plt.show()


def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals


def datingClassTest(hoRatio=0.20):  # hold out 20% as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print "the total error rate is: %.2f%%" % (100 * errorCount / float(numTestVecs))
    print 'testCount is %s, errorCount is %s' % (numTestVecs, errorCount)


def classifyPerson():
    '''Input a person, decide like or not, then update the DB.'''
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(raw_input("percentage of time spent playing video games: "))
    ffMiles = float(raw_input("frequent flier miles earned per year: "))
    iceCream = float(raw_input("amount of ice cream consumed per year: "))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    normPerson = (array([ffMiles, percentTats, iceCream]) - minVals) / ranges
    result = classify0(normPerson, normMat, datingLabels, 3)
    print 'You will probably like this person: ', resultList[result - 1]
    # update the dating data set
    print 'update dating DB'
    tmp = '\t'.join([repr(ffMiles), repr(percentTats), repr(iceCream), repr(result)]) + '\n'
    with open('datingTestSet2.txt', 'a') as fr:
        fr.write(tmp)


def img2vector(filename):
    # convert a 32x32 text image into a 1x1024 vector
    with open(filename) as fr:
        lines = fr.readlines()
        vector = [int(lines[i][j]) for i in range(32) for j in range(32)]
    return array(vector, dtype=float)


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')           # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]        # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if classifierResult != classNumStr:
            errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount / float(mTest))


if __name__ == '__main__':
    datingClassTest()
    # handwritingClassTest()

The KNN algorithm learning package: Machine Learning K-Nearest Neighbor Algorithm.

(iv) KNN regression

The KNN algorithm performs regression when the class label of a data point is a continuous value. The process is the same as for KNN classification; the difference lies in how the K neighbors are handled. KNN regression takes a weighted combination of the K neighbors' label values as the predicted value of the new data point. Possible weighting methods are: the simple average of the K nearest neighbors' values; 1/d as the weight (distance-based weighting, so a closer neighbor carries more weight than a farther one); or a Gaussian function (or another suitable decreasing function) of the distance, weight = gaussian(distance), so that the farther away a neighbor is, the smaller its weight, which gives a more accurate estimate. A minimal sketch follows.
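A minimal KNN-regression sketch comparing the simple average with 1/d weighting; the one-dimensional training data is made up for illustration and is not from the article.

import numpy as np

# Made-up 1-D training data: y is roughly 2*x
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
x_new = np.array([3.4])
k = 3

distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
nearest = distances.argsort()[:k]

# Simple average of the k neighbors' values
pred_mean = y_train[nearest].mean()

# Inverse-distance weighted average: closer neighbors count more
# (assumes no neighbor sits at distance exactly 0)
w = 1.0 / distances[nearest]
pred_weighted = (w * y_train[nearest]).sum() / w.sum()
print('mean: %.2f, weighted: %.2f' % (pred_mean, pred_weighted))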
(v) Summary

The K-nearest-neighbor algorithm is the simplest and most effective algorithm for classifying data, and its learning is instance-based: when using the algorithm, we must have training samples that are close to the actual data. The K-nearest-neighbor algorithm must keep all of the data, so if the training dataset is very large, a large amount of storage space is required. In addition, because a distance must be computed to every sample in the dataset, it can be very time-consuming in practice.

Another drawback of the K-nearest-neighbor algorithm is that it gives no information about the underlying structure of the data, so we cannot know what an average or typical instance sample looks like.

The author of this article is Adan. Source: Machine Learning Classical Algorithms Explained with Python Implementation: k-Nearest Neighbor (KNN) Algorithm. Please indicate the source when reprinting.


