Overview
Simply put, the k-nearest neighbors algorithm (k-NN classification) classifies a sample by measuring the distances between the feature values of different samples.
- Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
- Disadvantages: high computational complexity, high space complexity
- Applicable data: numeric and nominal values
- How it works: to determine which category a test sample belongs to, find the K training samples that are closest in "distance" to the test sample, then see which category most of those K samples belong to; that category is taken as the test sample's class. Put simply, the K most similar samples vote on the decision (see the short sketch after this list).
- The distance most commonly used is the Euclidean distance in multi-dimensional space. "Dimension" here refers to the feature dimension: a sample with several features lives in that many dimensions.
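To make the voting rule concrete, here is a minimal sketch (not the full implementation shown later in Case one) that classifies a single query point with NumPy; the names train_X, train_y, query and the sample values are illustrative only.

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Euclidean distance from the query point to every training sample
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                 # indices of the k closest samples
    votes = Counter(train_y[i] for i in nearest)    # majority vote among their labels
    return votes.most_common(1)[0][0]

train_X = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = ['A', 'A', 'B', 'B']
print(knn_predict(train_X, train_y, np.array([0.5, 0.5]), k=3))   # prints B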
Distance metrics for nearest neighbors
- Euclidean distance
- Manhattan Distance
- Chebyshev distance
- Minkowski distance
- Mahalanobis distance
- Bhattacharyya distance (the first four in this list are sketched in code below)
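As a quick illustration, the first four distances in this list can be computed with plain NumPy. This is a minimal sketch; the vectors x and y are arbitrary example feature vectors.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

euclidean = np.sqrt(np.sum((x - y) ** 2))              # straight-line distance
manhattan = np.sum(np.abs(x - y))                      # sum of absolute differences
chebyshev = np.max(np.abs(x - y))                      # largest difference along any single dimension
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1.0 / p)    # generalizes Manhattan (p=1) and Euclidean (p=2)

print(euclidean, manhattan, chebyshev, minkowski)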
Selection of K-values
Do not underestimate the problem of selecting the K value, because it has a significant impact on the results of the K-nearest neighbor algorithm. As Dr. Hang Li's book Statistical Learning Methods puts it:
- If you choose a smaller K value, it is equivalent to making the prediction with training instances in a smaller neighborhood. The approximation error of "learning" decreases, because only training instances close or similar to the input instance affect the prediction. The problem is that the estimation error of "learning" increases; in other words, a smaller K means the overall model becomes more complex and prone to overfitting;
- If a larger K value is chosen, it is equivalent to making the prediction with training instances in a larger neighborhood. The advantage is that the estimation error of learning decreases, but the disadvantage is that the approximation error increases. Training instances far from the input instance now also affect the prediction, making it less accurate; a larger K means the overall model becomes simpler. K = N is completely undesirable, because no matter what the input instance is, it will simply be predicted to belong to the most frequent class in the training set; the model is then too simple and ignores a great deal of useful information in the training instances.
- In practical applications, K generally takes a relatively small value, and cross-validation (in short, using part of the samples as a training set and part as a validation set) is commonly used to select the best K, as sketched below.
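For example, one common approach with scikit-learn is to score a range of candidate K values by cross-validation and keep the best one. The following is a minimal sketch assuming the Iris dataset and an arbitrary candidate range of 1 to 25:

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

best_k, best_score = None, 0.0
for k in range(1, 26):                        # candidate K values (illustrative range)
    knn = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross-validation accuracy for this K
    score = cross_val_score(knn, iris.data, iris.target, cv=10).mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)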
Advantages and disadvantages of the KNN algorithm
Advantages
- Mature theory and a simple idea; it can be used for classification as well as regression;
- Can be used for nonlinear classification;
- Training time complexity is O(n);
- No assumptions about the data, high accuracy, insensitive to outliers.
Disadvantages
- Large amount of computation;
- Sensitive to the sample imbalance problem (i.e., some categories have a large number of samples while others have very few);
- Requires a lot of memory (some mitigations are sketched below).
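Some of these drawbacks can be mitigated in practice. For example, scikit-learn's KNeighborsClassifier can weight votes by inverse distance (which softens the effect of imbalanced classes) and can use a KD-tree instead of brute-force search to reduce the computation; the parameter choices below are illustrative rather than prescriptive.

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

knn = KNeighborsClassifier(
    n_neighbors=5,
    weights='distance',    # closer neighbors carry more weight, reducing the impact of large classes
    algorithm='kd_tree',   # tree-based neighbor search instead of brute force
)
knn.fit(iris.data, iris.target)
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))   # predict the class of one new sample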
Specific application cases: Case one
There are 4 data points: (1.0, 1.1) and (1.0, 1.0) are defined as Class A, and (0, 0) and (0, 0.1) as Class B. We then classify the point (0.5, 0.5), deciding whether it belongs to Class A or Class B.
Algorithm process:
(1) Calculate the distance between each point in the known-category data set and the current point;
(2) Sort by distance in ascending order;
(3) Select the K points closest to the current point;
(4) Count the frequency of each category among these K points;
(5) Return the most frequent category among the K points as the predicted class of the current point.
Specific code:
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference between the input and every training point
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                    # compute the Euclidean distances
    sortedDistIndicies = distances.argsort()          # sort by distance
    classCount = {}
    for i in range(k):                                # select the k points with the smallest distance
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)   # sort by vote count
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
print(classify0([0.5, 0.5], group, labels, 3))
Output:
'B'
Case two:
- Data Set Introduction
The Iris dataset is a classic dataset that is often used as a machine learning example. It contains 150 instances, each described by four measurements of the flower: sepal length, sepal width, petal length, and petal width. The flowers fall into 3 classes: Iris setosa, Iris versicolor, and Iris virginica.
- Using Python's machine learning library sklearn: SKLearnExample.py
The library contains many algorithms for machine learning, such as KNN. Next we describe how to invoke the KNN algorithm:
from sklearn import neighbors   # import the module containing the KNN algorithm
from sklearn import datasets    # import the datasets module

knn = neighbors.KNeighborsClassifier()   # create the classifier
iris = datasets.load_iris()              # load the data
print(iris)                              # classes Iris setosa, Iris versicolor, Iris virginica are encoded as 0, 1, 2
knn.fit(iris.data, iris.target)          # build the model
predictedLabel = knn.predict([[0.1, 0.2, 0.3, 0.4]])   # predict which class the new sample belongs to
print(predictedLabel)
Results:
[0]
The above shows how to call the KNN algorithm with Python's sklearn library; the output [0] corresponds to the class Iris setosa.
Next, we implement the KNN algorithm ourselves by writing the program from scratch.
Case Three
Basic steps:
- Load the data set;
- Calculate the distances;
- Return the nearest K neighbors;
- Classify by majority vote ("the minority obeys the majority");
- Calculate the accuracy of the predictions.
import csv       # used to read the data
import random
import math
import operator

# load the data set and randomly split it into a training set and a test set
def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:      # open filename as a csv file
        lines = csv.reader(csvfile)           # read the file line by line
        dataset = list(lines)                 # convert to a list data structure
        for x in range(len(dataset) - 1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:       # split the data into two parts, adding to the training set and the test set respectively
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

# compute the distance between two instances over the given number of dimensions
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):                   # sum of squared differences over all dimensions
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

# return the K nearest neighbors of one test instance
def getNeighbors(trainingSet, testInstance, k):
    distances = []                            # an empty container for (instance, distance) pairs
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):         # distance from the test instance to every training instance
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))   # put all the distances into the container
    distances.sort(key=operator.itemgetter(1))     # sort distances from smallest to largest
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors                          # return the k nearest neighbors

# classify by the neighbors: find the most frequent category
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

# compute the accuracy of the predictions
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []                          # create an empty training set and test set
    testSet = []
    split = 0.67                              # about 2/3 of the data goes to the training set, 1/3 to the test set
    loadDataset(r'/home/duxu/exercise/iris.csv', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

main()
Output:
Train set: 100
Test set: 50
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-setosa', actual='Iris-setosa'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-virginica', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-versicolor', actual='Iris-versicolor'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
Accuracy: 97.95918367346938%
The results show that the training set contains 100 instances and the test set 50. The predicted and actual class of each test instance is printed, and the final prediction accuracy is about 98%, which is quite good.
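As a sanity check, a similar train/test evaluation can be reproduced with scikit-learn in a few lines. This sketch uses train_test_split with roughly the same 2/3 : 1/3 split; the exact accuracy will vary with the random split.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33)    # hold out roughly 1/3 of the data for testing

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
print('Accuracy: ' + repr(accuracy_score(y_test, predictions) * 100) + '%')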