ML (5): KNN algorithm


The k-nearest neighbor algorithm, or KNN for short, can be understood simply as letting the k points nearest to a sample vote on which class that sample belongs to. It is a classic machine learning algorithm and, on the whole, one of the easier ones to understand. Here k is the number of nearest training samples taken into account. KNN should not be confused with the K-means algorithm: K-means is a clustering algorithm that groups similar items together, whereas KNN is a classifier. Given a sample space whose samples are already divided into several classes, and a new data point to be classified, KNN decides which class the new point belongs to by examining its k nearest samples.
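Because KNN (supervised classification) and K-means (unsupervised clustering) are easy to confuse, here is a minimal R sketch contrasting the two calls on the iris data; kmeans() is in base R and knn() is in the class package, and the choice of 3 centers and k = 13 is only for illustration:

    # K-means: unsupervised clustering, the species labels are never used
    library(class)                         # provides knn()
    km <- kmeans(iris[, 1:4], centers = 3)
    table(km$cluster, iris$Species)        # how the discovered clusters line up with the labels

    # KNN: supervised classification, the labels of known samples drive the vote
    pred <- knn(train = iris[, 1:4], test = iris[, 1:4], cl = iris$Species, k = 13)
    table(pred, iris$Species)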

Directory:

    • Algorithm overview
    • Working principle
    • Selection of K-values
    • Normalization of processing
    • KNN R Example
    • Guess the Model Code

Algorithm overview

    • As shown in the figure (not reproduced here), there are two kinds of labeled sample data, drawn as small blue squares and small red triangles, and the green circle in the middle is the point to be classified. In other words, we do not yet know which category the green point belongs to (blue square or red triangle), and the problem to solve is: classify this green circle.
    • From the figure we can see that if k = 3, the 3 nearest neighbors of the green point are 2 red triangles and 1 blue square. By majority vote, the green point is assigned to the red-triangle category.

    • If k = 5, the 5 nearest neighbors of the green point are 2 red triangles and 3 blue squares. Again by majority vote, the green point is assigned to the blue-square category.
    • In other words, when we cannot directly tell which known category a new point belongs to, we look at where it sits relative to the labeled data, let its neighbors vote (weigh them), and assign it to the category with the larger weight. This is the core idea of the k-nearest neighbor algorithm; a small R sketch after this list reproduces the k = 3 versus k = 5 vote.
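    • The following is a minimal sketch of that vote using the class package; the coordinates of the squares, the triangles and the green point are made up purely so that k = 3 and k = 5 give different answers, as in the figure:

      library(class)

      # Labeled sample data: 3 blue squares and 2 red triangles (toy coordinates)
      train  <- rbind(c(1, 1), c(1.1, 2), c(2, 1.2),   # blue squares
                      c(3, 3), c(3.2, 2.8))            # red triangles
      labels <- factor(c("blue_square", "blue_square", "blue_square",
                         "red_triangle", "red_triangle"))

      # The green point to be classified
      green <- matrix(c(2.5, 2.5), nrow = 1)

      knn(train, green, cl = labels, k = 3)   # vote among the 3 nearest -> red_triangle
      knn(train, green, cl = labels, k = 5)   # vote among all 5 points  -> blue_square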

Working principle

    • We know the category of every sample in the training set. When new, unlabeled data come in, we compare them with the training samples, find the k training samples that are nearest in "distance", and take the category that occurs most often among those k samples as the category of the new data.
    • Algorithm description (a from-scratch sketch follows this list)
      1. Compute the distance between every point in the labeled data set and the current point
      2. Sort the points in order of increasing distance
      3. Select the k points nearest to the current point
      4. Count how often each category appears among these k points
      5. Return the most frequent category as the prediction for the current point
    • Common distance measures include "euclidean" (Euclidean distance), "minkowski" (Minkowski distance), "maximum" (Chebyshev distance), "manhattan" (absolute, or city-block, distance), "canberra" (Lance/Canberra distance), and the Mahalanobis distance.
    • In the KNN algorithm, the similarity of two records is usually measured by the Euclidean distance.
    • Algorithm disadvantages:
      1. The value of k must be set in advance; it does not adapt automatically
      2. Sample imbalance is a problem: when one class has far more samples than the others, the k neighbors of a new sample tend to be dominated by that large class simply because of its size
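    • The five steps above translate almost directly into code. Below is a minimal from-scratch sketch using the Euclidean distance (the function name knn_predict and the single-row iris query are only illustrative); base R's dist() also covers several of the distance measures listed above via its method argument ("euclidean", "maximum", "manhattan", "canberra", "minkowski").

      # Minimal from-scratch KNN classifier following the five steps above
      knn_predict <- function(train, labels, new_point, k) {
        # 1. Euclidean distance from every training point to the new point
        dists <- sqrt(rowSums(sweep(as.matrix(train), 2, unlist(new_point))^2))
        # 2./3. Sort by increasing distance and keep the k nearest
        nearest <- order(dists)[1:k]
        # 4. Count how often each category appears among those k points
        votes <- table(labels[nearest])
        # 5. Return the most frequent category as the prediction
        names(votes)[which.max(votes)]
      }

      # Example call on iris (columns 1-4 are the features, Species is the label)
      knn_predict(iris[, 1:4], iris$Species, iris[100, 1:4], k = 13)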

Selection of K-values

    • Besides the question of how to define a "neighbor", there is the question of how many neighbors to use, that is, how large k should be. Do not underestimate this choice of k: it has a significant impact on the results of the k-nearest neighbor algorithm.
    • If a small value of k is chosen, prediction is in effect made from training instances in a small neighborhood. The approximation error of "learning" decreases, because only training instances close to (similar to) the input instance influence the prediction, but the estimation error increases. In other words, decreasing k makes the overall model more complex and prone to overfitting.
    • If a large value of k is chosen, prediction is made from training instances in a larger neighborhood. The advantage is that the estimation error of learning decreases; the disadvantage is that the approximation error increases, because training instances far from the input instance also influence the prediction and can make it wrong. Increasing k makes the overall model simpler.
    • In practical applications, k usually takes a relatively small value, and cross-validation (in short: use part of the samples as a training set and part as a validation set) is used to select the best k; a sketch of this follows the list.
    • As a rule of thumb, k is often taken near the square root of the number of samples in the data set, preferably an odd number. The iris example below has 150 observations, so k = 13 is used there.
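    • As an illustration, the sketch below picks k by leave-one-out cross-validation using knn.cv from the class package; the range of candidate k values is arbitrary:

      library(class)

      features <- scale(iris[, 1:4])        # standardize the features first
      labels   <- iris$Species

      round(sqrt(nrow(iris)))               # rule-of-thumb starting point: sqrt(150) is about 12

      ks  <- seq(1, 25, by = 2)             # try odd values of k only
      acc <- sapply(ks, function(k) {
        pred <- knn.cv(train = features, cl = labels, k = k)
        mean(pred == labels)                # leave-one-out accuracy for this k
      })

      data.frame(k = ks, accuracy = acc)
      ks[which.max(acc)]                    # the k with the highest accuracy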

Normalization of processing

    • Data standardization (normalization) is a basic step in data mining. Different evaluation indicators often have different dimensions and units, and this affects the results of the analysis; to eliminate the influence of differing scales, the data must be standardized so that the indicators become comparable. After standardization, all indicators are on the same order of magnitude and can be compared and evaluated together. Two common normalization methods are described below, followed by a short R sketch.
    • Min-max normalization: also known as dispersion standardization, this is a linear transformation of the original data that maps every value into the interval [0, 1]. The conversion function is x' = (x - min) / (max - min).
      1. Here max is the maximum and min is the minimum of the sample data. One drawback of this approach is that when new data are added, max and min may change and the transformation has to be recomputed.
    • Z-score standardization: this method standardizes the data using the mean and standard deviation of the original data. The processed data follow a standard normal distribution, i.e. mean 0 and standard deviation 1. The conversion function is x' = (x - μ) / σ.
      1. Here μ is the mean and σ is the standard deviation of all sample data.
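    • A quick sketch of both transformations on the iris features (the helper function min_max below is only for illustration):

      min_max <- function(x) (x - min(x)) / (max(x) - min(x))   # maps each column into [0, 1]

      iris_minmax <- as.data.frame(lapply(iris[, 1:4], min_max))
      iris_zscore <- scale(iris[, 1:4])                          # (x - mean) / sd, per column

      summary(iris_minmax$Sepal.Length)   # now runs from 0 to 1
      round(colMeans(iris_zscore), 10)    # means are (numerically) 0
      apply(iris_zscore, 2, sd)           # standard deviations are 1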

KNN R Example

  • In R, either the class package or the kknn package can be used for the computation.
  • Example code for the iris data set is shown below:
    #--------------------- R: KNN algorithm (class package) --------------------------------
    head(iris)
    a <- iris[-5]                  # remove the label column
    head(a)
    a <- scale(a)                  # Z-score standardization
    str(a)
    head(a)
    train <- a[c(1:25, 50:75, 100:125), ]    # training set
    head(train)
    test <- a[c(26:49, 76:99, 126:150), ]    # test set
    # Save the class labels of the training set and the test set
    train_lab <- iris[c(1:25, 50:75, 100:125), 5]
    test_lab  <- iris[c(26:49, 76:99, 126:150), 5]
    # The packages used in this classification example are the "class" and "gmodels" packages
    # install.packages("class")
    library(class)
    # Call the knn function to build the model: data frames, k-nearest-neighbor vote, Euclidean distance
    pre_result <- knn(train = train, test = test, cl = train_lab, k = 13)
    table(pre_result, test_lab)

    #--------------------- R: kknn package --------------------------------
    # install.packages("kknn")
    library(kknn)
    data("iris")
    dim(iris)
    m <- (dim(iris))[1]
    ind <- sample(2, m, replace = TRUE, prob = c(0.7, 0.3))
    iris.train <- iris[ind == 1, ]
    iris.test  <- iris[ind == 2, ]
    # Define a formula before calling kknn
    # myformula: Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
    iris.kknn <- kknn(Species ~ ., iris.train, iris.test, distance = 1, kernel = "triangular")
    summary(iris.kknn)
    # Get the fitted values
    fit <- fitted(iris.kknn)
    # Build a table to check the accuracy
    table(fit, iris.test$Species)
    # Draw a scatter-plot matrix with misclassified points highlighted in red
    pcol <- as.character(as.numeric(iris.test$Species))
    pairs(iris.test[1:4], pch = pcol,
          col = c("green3", "red")[(iris.test$Species != fit) + 1])

Guess the Model Code

  • The complete code is as follows:
    SETWD ("E:\\RML") Cars<-Read.csv ("Bus01.csv", header=true,stringsasfactors=TRUE)#Library (KKNN) m<-(Dim (Cars)) [1]ind<-sample (2, M, Replace=true, Prob=c (0.7, 0.3)) Car.train<-cars[ind==1,]car.test<-cars[ind==2,]#first define a formula before calling KknnMyformula <-Type ~ V + A + SOC + MINV + maxv + maxt +MINTCAR.KKNN&LT;-KKNN (myformula,car.train,car.test,distance=1,kernel="Triangular")#Get Car.valuesFit <-fitted (CAR.KKNN)#establish a form to verify the accuracy of the sentenceTable (FIT,CAR.TEST$TYPE,DNN = C ("predict","actual"))#drawing scatter plot, k-nearest neighbor highlighted in redPcol <-As.character (As.numeric (Car.test$type)) pairs (car.test[-8], pch = pcol, col = C ("Green3","Red") [(Car.test$type! = Fit) +1])
    • The resulting confusion table and the scatter-plot matrix (misclassified points in red) are not reproduced here.
