ML (5): KNN algorithm


The k-nearest neighbor algorithm, or KNN for short, can be understood simply as letting the k points nearest to a sample vote on which class that sample belongs to. It is a classic machine learning algorithm and, on the whole, one of the easier ones to understand. Here k is the number of nearest training samples taken into account. KNN should not be confused with the K-means algorithm: K-means is a clustering algorithm that groups similar items together, whereas KNN is a classifier. Given a sample space whose samples are already divided into several classes, and a new data point to be classified, KNN decides which class the new point belongs to by examining its k nearest samples.
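Because KNN (supervised classification) and K-means (unsupervised clustering) are easy to confuse, here is a minimal R sketch contrasting the two calls on the iris data; kmeans() is in base R and knn() is in the class package, and the choice of 3 centers and k = 13 is only for illustration:

    # K-means: unsupervised clustering, the species labels are never used
    library(class)                         # provides knn()
    km <- kmeans(iris[, 1:4], centers = 3)
    table(km$cluster, iris$Species)        # how the discovered clusters line up with the labels

    # KNN: supervised classification, the labels of known samples drive the vote
    pred <- knn(train = iris[, 1:4], test = iris[, 1:4], cl = iris$Species, k = 13)
    table(pred, iris$Species)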

Directory:

    • Algorithm overview
    • Working principle
    • Selection of K-values
    • Normalization of processing
    • KNN R Example
    • Guess the Model Code

Algorithm overview

    • As shown in the figure (not reproduced here), there are two kinds of labeled sample data, drawn as small blue squares and small red triangles, and the green circle in the middle is the point to be classified. In other words, we do not yet know which category the green point belongs to (blue square or red triangle), and the problem to solve is: classify this green circle.
    • From the figure we can see that if k = 3, the 3 nearest neighbors of the green point are 2 red triangles and 1 blue square. By majority vote, the green point is assigned to the red-triangle category.

    • If k = 5, the 5 nearest neighbors of the green point are 2 red triangles and 3 blue squares. Again by majority vote, the green point is assigned to the blue-square category.
    • In other words, when we cannot directly tell which known category a new point belongs to, we look at where it sits relative to the labeled data, let its neighbors vote (weigh them), and assign it to the category with the larger weight. This is the core idea of the k-nearest neighbor algorithm; a small R sketch after this list reproduces the k = 3 versus k = 5 vote.
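    • The following is a minimal sketch of that vote using the class package; the coordinates of the squares, the triangles and the green point are made up purely so that k = 3 and k = 5 give different answers, as in the figure:

      library(class)

      # Labeled sample data: 3 blue squares and 2 red triangles (toy coordinates)
      train  <- rbind(c(1, 1), c(1.1, 2), c(2, 1.2),   # blue squares
                      c(3, 3), c(3.2, 2.8))            # red triangles
      labels <- factor(c("blue_square", "blue_square", "blue_square",
                         "red_triangle", "red_triangle"))

      # The green point to be classified
      green <- matrix(c(2.5, 2.5), nrow = 1)

      knn(train, green, cl = labels, k = 3)   # vote among the 3 nearest -> red_triangle
      knn(train, green, cl = labels, k = 5)   # vote among all 5 points  -> blue_square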

Working principle

    • We know the category of every sample in the training set. When new, unlabeled data come in, we compare them with the training samples, find the k training samples that are nearest in "distance", and take the category that occurs most often among those k samples as the category of the new data.
    • Algorithm description (a from-scratch sketch follows this list)
      1. Compute the distance between every point in the labeled data set and the current point
      2. Sort the points in order of increasing distance
      3. Select the k points nearest to the current point
      4. Count how often each category appears among these k points
      5. Return the most frequent category as the prediction for the current point
    • Common distance measures include "euclidean" (Euclidean distance), "minkowski" (Minkowski distance), "maximum" (Chebyshev distance), "manhattan" (absolute, or city-block, distance), "canberra" (Lance/Canberra distance), and the Mahalanobis distance.
    • In the KNN algorithm, the similarity of two records is usually measured by the Euclidean distance.
    • Algorithm disadvantages:
      1. The value of k must be set in advance; it does not adapt automatically
      2. Sample imbalance is a problem: when one class has far more samples than the others, the k neighbors of a new sample tend to be dominated by that large class simply because of its size
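    • The five steps above translate almost directly into code. Below is a minimal from-scratch sketch using the Euclidean distance (the function name knn_predict and the single-row iris query are only illustrative); base R's dist() also covers several of the distance measures listed above via its method argument ("euclidean", "maximum", "manhattan", "canberra", "minkowski").

      # Minimal from-scratch KNN classifier following the five steps above
      knn_predict <- function(train, labels, new_point, k) {
        # 1. Euclidean distance from every training point to the new point
        dists <- sqrt(rowSums(sweep(as.matrix(train), 2, unlist(new_point))^2))
        # 2./3. Sort by increasing distance and keep the k nearest
        nearest <- order(dists)[1:k]
        # 4. Count how often each category appears among those k points
        votes <- table(labels[nearest])
        # 5. Return the most frequent category as the prediction
        names(votes)[which.max(votes)]
      }

      # Example call on iris (columns 1-4 are the features, Species is the label)
      knn_predict(iris[, 1:4], iris$Species, iris[100, 1:4], k = 13)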

Selection of K-values

    • Besides the question of how to define a "neighbor", there is the question of how many neighbors to use, that is, how large k should be. Do not underestimate this choice of k: it has a significant impact on the results of the k-nearest neighbor algorithm.
    • If a small value of k is chosen, prediction is in effect made from training instances in a small neighborhood. The approximation error of "learning" decreases, because only training instances close to (similar to) the input instance influence the prediction, but the estimation error increases. In other words, decreasing k makes the overall model more complex and prone to overfitting.
    • If a large value of k is chosen, prediction is made from training instances in a larger neighborhood. The advantage is that the estimation error of learning decreases; the disadvantage is that the approximation error increases, because training instances far from the input instance also influence the prediction and can make it wrong. Increasing k makes the overall model simpler.
    • In practical applications, k usually takes a relatively small value, and cross-validation (in short: use part of the samples as a training set and part as a validation set) is used to select the best k; a sketch of this follows the list.
    • As a rule of thumb, k is often taken near the square root of the number of samples in the data set, preferably an odd number. The iris example below has 150 observations, so k = 13 is used there.
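    • As an illustration, the sketch below picks k by leave-one-out cross-validation using knn.cv from the class package; the range of candidate k values is arbitrary:

      library(class)

      features <- scale(iris[, 1:4])        # standardize the features first
      labels   <- iris$Species

      round(sqrt(nrow(iris)))               # rule-of-thumb starting point: sqrt(150) is about 12

      ks  <- seq(1, 25, by = 2)             # try odd values of k only
      acc <- sapply(ks, function(k) {
        pred <- knn.cv(train = features, cl = labels, k = k)
        mean(pred == labels)                # leave-one-out accuracy for this k
      })

      data.frame(k = ks, accuracy = acc)
      ks[which.max(acc)]                    # the k with the highest accuracy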

Normalization of processing

    • Data standardization (normalization) is a basic step in data mining. Different evaluation indicators often have different dimensions and units, and this affects the results of the analysis; to eliminate the influence of differing scales, the data must be standardized so that the indicators become comparable. After standardization, all indicators are on the same order of magnitude and can be compared and evaluated together. Two common normalization methods are described below, followed by a short R sketch.
    • Min-max normalization: also known as dispersion standardization, this is a linear transformation of the original data that maps every value into the interval [0, 1]. The conversion function is x' = (x - min) / (max - min).
      1. Here max is the maximum and min is the minimum of the sample data. One drawback of this approach is that when new data are added, max and min may change and the transformation has to be recomputed.
    • Z-score standardization: this method standardizes the data using the mean and standard deviation of the original data. The processed data follow a standard normal distribution, i.e. mean 0 and standard deviation 1. The conversion function is x' = (x - μ) / σ.
      1. Here μ is the mean and σ is the standard deviation of all sample data.
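    • A quick sketch of both transformations on the iris features (the helper function min_max below is only for illustration):

      min_max <- function(x) (x - min(x)) / (max(x) - min(x))   # maps each column into [0, 1]

      iris_minmax <- as.data.frame(lapply(iris[, 1:4], min_max))
      iris_zscore <- scale(iris[, 1:4])                          # (x - mean) / sd, per column

      summary(iris_minmax$Sepal.Length)   # now runs from 0 to 1
      round(colMeans(iris_zscore), 10)    # means are (numerically) 0
      apply(iris_zscore, 2, sd)           # standard deviations are 1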

KNN R Example

  • In R, either the class package or the kknn package can be used for the computation.
  • Example code for the iris data set is shown below:
    #--------------------- R: KNN algorithm (class package) --------------------------------
    head(iris)
    a <- iris[-5]                  # remove the label column
    head(a)
    a <- scale(a)                  # Z-score standardization
    str(a)
    head(a)
    train <- a[c(1:25, 50:75, 100:125), ]    # training set
    head(train)
    test <- a[c(26:49, 76:99, 126:150), ]    # test set
    # Save the class labels of the training set and the test set
    train_lab <- iris[c(1:25, 50:75, 100:125), 5]
    test_lab  <- iris[c(26:49, 76:99, 126:150), 5]
    # The packages used in this classification example are the "class" and "gmodels" packages
    # install.packages("class")
    library(class)
    # Call the knn function to build the model: data frames, k-nearest-neighbor vote, Euclidean distance
    pre_result <- knn(train = train, test = test, cl = train_lab, k = 13)
    table(pre_result, test_lab)

    #--------------------- R: kknn package --------------------------------
    # install.packages("kknn")
    library(kknn)
    data("iris")
    dim(iris)
    m <- (dim(iris))[1]
    ind <- sample(2, m, replace = TRUE, prob = c(0.7, 0.3))
    iris.train <- iris[ind == 1, ]
    iris.test  <- iris[ind == 2, ]
    # Define a formula before calling kknn
    # myformula: Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
    iris.kknn <- kknn(Species ~ ., iris.train, iris.test, distance = 1, kernel = "triangular")
    summary(iris.kknn)
    # Get the fitted values
    fit <- fitted(iris.kknn)
    # Build a table to check the accuracy
    table(fit, iris.test$Species)
    # Draw a scatter-plot matrix with misclassified points highlighted in red
    pcol <- as.character(as.numeric(iris.test$Species))
    pairs(iris.test[1:4], pch = pcol,
          col = c("green3", "red")[(iris.test$Species != fit) + 1])

Guess the Model Code

  • The complete code is as follows:
    SETWD ("E:\\RML") Cars<-Read.csv ("Bus01.csv", header=true,stringsasfactors=TRUE)#Library (KKNN) m<-(Dim (Cars)) [1]ind<-sample (2, M, Replace=true, Prob=c (0.7, 0.3)) Car.train<-cars[ind==1,]car.test<-cars[ind==2,]#first define a formula before calling KknnMyformula <-Type ~ V + A + SOC + MINV + maxv + maxt +MINTCAR.KKNN&LT;-KKNN (myformula,car.train,car.test,distance=1,kernel="Triangular")#Get Car.valuesFit <-fitted (CAR.KKNN)#establish a form to verify the accuracy of the sentenceTable (FIT,CAR.TEST$TYPE,DNN = C ("predict","actual"))#drawing scatter plot, k-nearest neighbor highlighted in redPcol <-As.character (As.numeric (Car.test$type)) pairs (car.test[-8], pch = pcol, col = C ("Green3","Red") [(Car.test$type! = Fit) +1])
    • The resulting confusion table and the scatter-plot matrix (misclassified points in red) are not reproduced here.
