R Language Learning Notes: K-Nearest Neighbor Algorithm


The K-Nearest Neighbor (KNN) algorithm rests on a simple rule: if most of the K samples nearest to a given sample in feature space belong to a certain category, then that sample also belongs to this category and shares the characteristics of the samples in it. In other words, each sample can be represented by its K nearest neighbors. KNN is suitable for both classification and regression, and is widely used in recommender systems, semantic search, and anomaly detection.

Schematic of KNN classification:

Does the green dot in the figure belong with the red triangles or the blue squares? If K = 5 (the 5 neighbors nearest the green dot, inside the dashed circle), 3 of the green dot's "nearest neighbors" are blue squares, a proportion of 3/5, so the green dot is classified as a blue square. If K = 3 (the 3 neighbors nearest the green dot, inside the solid circle), 2 of the green dot's "nearest neighbors" are red triangles, a proportion of 2/3, so the green dot is classified as a red triangle.

As this example shows, the method determines the category of the sample to be classified solely from the categories of its nearest sample or samples.
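To make the vote concrete in R (the label vector below simply mirrors the schematic's example; it is not data from the article):

# Neighbor labels sorted by distance to the green dot, nearest first
neighbors <- c("red", "red", "blue", "blue", "blue")

# K = 3: two red triangles against one blue square
votes3 <- table(neighbors[1:3])
names(which.max(votes3))    # "red"

# K = 5: three blue squares against two red triangles
votes5 <- table(neighbors[1:5])
names(which.max(votes5))    # "blue"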

KNN algorithm implementation steps:

1. Data preprocessing

2. Build training set and test set data

3. Set parameters such as the K value (K is generally chosen near the square root of the number of samples, typically 3 to 10)

4. Maintain a priority queue of size K, ordered from largest to smallest distance, to store the nearest-neighbor training tuples. Randomly select K tuples from the training set as the initial nearest neighbors, compute the distance from the test tuple to each of them, and store the training tuple labels and distances in the priority queue.

5. Traverse the training tuple set: compute the distance L from each current training tuple to the test tuple, and compare L with the maximum distance Lmax in the priority queue

6. If L >= Lmax, discard the tuple and move on to the next one. If L < Lmax, delete the tuple with the maximum distance from the priority queue and insert the current training tuple

7. After the traversal, take the majority class among the K tuples in the priority queue as the class of the test tuple (a from-scratch sketch of steps 5-7 follows this list)

8. After classifying the test set, compute the error rate; keep retraining with different K values, and finally take the K value with the minimum error rate.
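A minimal from-scratch sketch of steps 5-7 in R (the function and variable names are illustrative; for brevity it sorts all distances rather than maintaining a priority queue):

# Minimal KNN classifier: Euclidean distance plus majority vote
# train: numeric matrix of training samples; labels: their classes
# x: a single test sample; k: number of neighbors
knn_predict <- function(train, labels, x, k) {
  # Step 5: distance from the test tuple to every training tuple
  d <- sqrt(rowSums((train - matrix(x, nrow(train), length(x), byrow = TRUE))^2))
  # Steps 4 and 6: keep the k training tuples with the smallest distances
  nearest <- order(d)[1:k]
  # Step 7: the majority class among the k neighbors is the prediction
  names(which.max(table(labels[nearest])))
}

# Toy usage with made-up 2-D points
train  <- rbind(c(1, 1), c(1, 2), c(5, 5), c(6, 5))
labels <- c("A", "A", "B", "B")
knn_predict(train, labels, x = c(1.5, 1.5), k = 3)    # "A"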

R language implementation:

R functions for K-nearest neighbor analysis include the knn function in the class package, the train function in the caret package, and the kknn function in the kknn package.

knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
Parameter meanings:
train: a matrix or data frame containing the training set
test: a matrix or data frame containing the test set
cl: a factor giving the classifications of the training set
k: the number of neighbors
l: the minimum number of votes required for a definite decision
prob: whether to return the proportion of votes for the winning class
use.all: controls the handling of ties; by default (TRUE), all points whose distance to the sample equals that of the Kth-nearest point are used as discriminant sample points; when set to FALSE, a single point among them is randomly selected as the nearest discriminant point.
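A minimal usage sketch of these parameters on R's built-in iris data (the dataset and split are illustrative assumptions; the article's own example follows):

library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)                  # 100 training rows, 50 test rows
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 3, prob = TRUE)
head(attr(pred, "prob"))                         # share of the k votes won by the winner
table(iris$Species[-idx], pred)                  # actual vs. predicted classes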

(Sample data description: a woman classifies her dates into three preference categories according to the number of frequent-flyer miles they earn per year, the percentage of time they spend playing video games, and the number of liters of ice cream they consume per week.)

Code:

# Import the data
mydata <- read.table("c:/users/cindy/desktop/marriage/datingtestset.txt")
str(mydata)
colnames(mydata) <- c("FlightMiles", "GameTimeRatio", "IceCreamLiters", "Preference")
head(mydata)

# Data preprocessing: min-max normalization
norfun <- function(x) {
  z <- (x - min(x)) / (max(x) - min(x))
  return(z)
}
data <- as.data.frame(apply(mydata[, 1:3], 2, norfun))
data$Preference <- mydata[, 4]

# Set up test set and training set samples
library(caret)
set.seed(123)
ind <- createDataPartition(y = data$Preference, times = 1, p = 0.5, list = FALSE)
testdata  <- data[-ind, ]
traindata <- data[ind, ]

# KNN algorithm
library(class)
kresult <- knn(train = traindata[, 1:3], test = testdata[, 1:3],
               cl = traindata$Preference, k = 3)

# Cross-table of actual vs. predicted classes, and prediction accuracy
table(testdata$Preference, kresult)
sum(diag(table(testdata$Preference, kresult))) / sum(table(testdata$Preference, kresult))

Run result:

According to the results, the classification accuracy is 95%.

Advantages and disadvantages of the KNN algorithm:

Advantages:

1. Easy to understand and implement

2. Suitable for classification of rare events

3. Especially suitable for multi-class problems (multi-modal data, objects with multiple category labels), where KNN performs better than SVM

Disadvantages:

Computationally expensive: the "distance" between each new data point and every sample in the training set must be calculated to determine its K nearest neighbors

Improvements:

To improve classification efficiency, delete attributes that have little effect on the classification result. The weighted K-nearest-neighbor algorithm assigns different weights to sample points according to their distance; the kknn function in the kknn package implements this weighted KNN algorithm, as sketched below.
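A short sketch of weighted KNN with the kknn package (the iris split and the triangular kernel are illustrative choices, not from the article):

library(kknn)
set.seed(1)
idx   <- sample(nrow(iris), 100)
learn <- iris[idx, ]
valid <- iris[-idx, ]

# A non-rectangular kernel weights closer neighbors more heavily;
# distance = 2 selects the Euclidean (Minkowski p = 2) metric
fit <- kknn(Species ~ ., train = learn, test = valid,
            k = 7, distance = 2, kernel = "triangular")
table(valid$Species, fitted(fit))    # actual vs. predicted classes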

                                                                                                                                                                                                                    
