KNN algorithm and R language implementation (1)


Key points of the algorithm:

KNN (k-nearest neighbor)

1: k := number of nearest neighbors, D := training set

2: for each test point z

3:     compute the distance between z and every training sample (x, y) in D

4:     select the set of the k training samples nearest to z

5:     assign z the class held by the majority of the k samples from step 4

6: end for
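The steps above can be sketched directly in R. The function `knn_predict` and the toy data below are illustrative inventions for this sketch, not part of any package (the actual analysis later uses `knn()` from the `class` package):

```r
# Minimal KNN sketch following steps 1-6 above.
knn_predict <- function(train_x, train_y, z, k = 3) {
  # Step 3: Euclidean distance from z to every training sample
  d <- sqrt(rowSums((train_x - matrix(z, nrow(train_x), ncol(train_x),
                                      byrow = TRUE))^2))
  # Step 4: indices of the k nearest training samples
  nn <- order(d)[1:k]
  # Step 5: majority class among those k neighbors
  names(which.max(table(train_y[nn])))
}

# Made-up two-class toy data for illustration
train_x <- rbind(c(0, 0), c(0, 1), c(5, 5), c(6, 5))
train_y <- c("Down", "Down", "Up", "Up")
knn_predict(train_x, train_y, z = c(5, 6), k = 3)  # "Up"
```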

Data:

library(ISLR)

names(Smarket)   # Smarket is a data set shipped with the ISLR package

KNN Code:

library(class)                        # provides the knn() function

attach(Smarket)

train <- (Year < 2005)                # logical index for observations before 2005

train.X <- cbind(Lag1, Lag2)[train, ]   # Lag1/Lag2 before 2005, as the training matrix

test.X <- cbind(Lag1, Lag2)[!train, ]   # Lag1/Lag2 from 2005 on, as the test matrix

train.Direction <- Direction[train]     # class labels of the training data

Direction.2005 <- Direction[!train]     # class labels of the test data

set.seed(1)   # fix the random seed so tie-breaking inside knn() is reproducible

knn.pred <- knn(train.X, test.X, train.Direction, k = 3)

table(knn.pred, Direction.2005)         # confusion matrix

mean(knn.pred == Direction.2005)        # accuracy

Experimental results:

> table(knn.pred, Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 55
    Up     63 86
> mean(knn.pred == Direction.2005)
[1] 0.531746
Algorithm Analysis:

Advantages: (i) KNN is instance-based (lazy) learning: it does not need to build a model, i.e. it maintains no abstraction derived from the data. (ii) It can produce decision boundaries of arbitrary shape, whereas decision trees and rule-based classifiers are confined to straight-line decision boundaries (hyperplanes in higher dimensions).

Disadvantages: (i) Classifying a test sample is expensive: each prediction costs O(n), where n is the number of training samples. (ii) Predictions are based only on local information, so when k is small the method is very sensitive to noisy data; model-based classification algorithms, by contrast, fit the data globally.

Note: (i) The choice of proximity measure and of data preprocessing matters. For example, when classifying people by (height, weight), height varies over a small range (0.2-2.5 m) while weight varies over a much larger one (5-150 kg); if the attribute units are not accounted for, the proximity measure is determined almost entirely by weight. (ii) If k is too small, the class of z is decided by just the few points nearest to it, so the classifier draws on too little of the training set and overfits (it is highly sensitive to noise); if k is too large, then in the extreme case k > n every point is assigned to the majority class.
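Note (i) can be demonstrated numerically. The (height, weight) values below are made up for illustration; `dist()` computes Euclidean distances between rows, and `scale()` standardizes each column:

```r
# With raw (height in m, weight in kg) data, the distance is dominated by
# weight; after standardizing each column, both attributes contribute.
people <- rbind(a = c(1.60, 60),   # reference person
                b = c(1.90, 61),   # very different height, similar weight
                c = c(1.61, 90))   # similar height, very different weight

raw    <- as.matrix(dist(people))         # distances on the raw units
scaled <- as.matrix(dist(scale(people)))  # distances after standardizing

raw["a", "b"]      # ~1.04: b looks much "closer" to a ...
raw["a", "c"]      # ~30.0: ... only because weight dominates the distance
scaled["a", "b"]   # ~1.76: after scaling, b and c are
scaled["a", "c"]   # ~1.76: equally far from a
```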
