KNN algorithm and R language implementation (1)

Key points of the algorithm:
KNN (k-nearest neighbors)
1: k := the number of nearest neighbors, D := the training set
2: for each point z to be classified do
3:     compute the distance between z and every training sample (x, y) in D
4:     select the set of the k training samples nearest to z
5:     assign z to the class that occurs most often among the k samples from step 4 (majority vote; a from-scratch R sketch follows)
6: end for
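As a complement to the library-based code further below, the following is a minimal from-scratch sketch of this pseudocode in R (the function name knn.simple and the choice of Euclidean distance are assumptions for illustration, not from the original):

From-scratch sketch:
knn.simple = function(train.X, train.y, test.X, k = 3) {
  train.X = as.matrix(train.X); test.X = as.matrix(test.X)
  pred = character(nrow(test.X))
  for (i in 1:nrow(test.X)) {
    z = test.X[i, ]
    d = sqrt(rowSums(sweep(train.X, 2, z)^2))  # step 3: distance from z to every training sample
    nn = order(d)[1:k]                         # step 4: indices of the k nearest samples
    votes = table(train.y[nn])                 # step 5: majority vote among the k neighbors
    pred[i] = names(votes)[which.max(votes)]
  }
  factor(pred)
}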
Data:
library(ISLR)
names(Smarket)  # the Smarket data set ships with the ISLR package
KNN Code:
library(class)  # provides the knn() function
attach(Smarket)
train = (Year < 2005)  # logical index for the observations before 2005
train.X = cbind(Lag1, Lag2)[train, ]  # Lag1, Lag2 before 2005 form the training matrix
test.X = cbind(Lag1, Lag2)[!train, ]  # Lag1, Lag2 from 2005 onward form the test matrix
train.Direction = Direction[train]  # class labels of the training data
Direction.2005 = Direction[!train]  # class labels of the test data
set.seed(1)  # fix the random seed so the results are reproducible (knn() breaks ties at random)
knn.pred = knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)  # confusion matrix
mean(knn.pred == Direction.2005)  # accuracy
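To probe the effect of k, one can loop over several values and compare test accuracy (a sketch reusing the objects defined above; the range 1:5 is an arbitrary choice):

acc = sapply(1:5, function(k) {
  set.seed(1)  # reset the seed so each run is comparable
  pred = knn(train.X, test.X, train.Direction, k = k)
  mean(pred == Direction.2005)
})
acc  # test accuracy for k = 1, ..., 5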
Experimental results:
        Direction.2005
knn.pred Down  Up
    Down   48  55
    Up     63  86
> mean(knn.pred == Direction.2005)
[1] 0.531746
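As a check, the accuracy can be read directly off the confusion matrix: (48 + 86) / (48 + 55 + 63 + 86) = 134 / 252 ≈ 0.5317, matching the output above.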
Algorithm Analysis:
Advantages: (i) KNN is instance-based (lazy) learning: it does not need to build a model and never maintains an abstraction (model) derived from the data; (ii) it can produce decision boundaries of arbitrary shape, whereas decision trees and rule-based classifiers are confined to rectilinear decision boundaries (hyperplanes in high dimensions).
Disadvantages: (i) classifying a test sample is expensive, O(n) per query where n is the size of the training set, since the distance to every training point must be computed; (ii) predictions are based only on local information, so when k is small the classifier is very sensitive to noisy data, whereas model-based classification algorithms fit the data as a whole.
Notes: (i) the choice of proximity measure and of data preprocessing matters. For example, when classifying people by (height, weight), height varies over a small range (0.2-2.5 m) while weight varies over a much larger one (5-150 kg); if the attribute scales are ignored, proximity is determined almost entirely by weight (a scaling sketch follows below). (ii) If k is too small, the class of z is decided by just the few points nearest to it; with so little supporting evidence from the training set, the classifier overfits and is sensitive to noise. If k is too large, too many points from other classes take part in the vote; in the extreme case k > n, every point is assigned to the majority class (underfitting).
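Following note (i), attributes can be standardized before distances are computed. A minimal sketch of this idea in R (the people data frame and its values are made up for illustration):

Scaling sketch:
people = data.frame(height = c(1.60, 1.75, 1.85), weight = c(55, 80, 110))
scaled = scale(people)  # rescale each column to zero mean and unit variance
dist(people)  # raw distances: dominated by the weight column
dist(scaled)  # scaled distances: both attributes contribute comparably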