KNN: the k-nearest neighbors algorithm
1. kNN is a non-parametric classifier that makes no distributional assumptions and estimates the probability density directly from the data;
2. kNN is poorly suited to high-dimensional data, where distances between points become less informative.
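To illustrate the second point, here is a minimal sketch using only the standard library. It draws random points from the unit cube and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. The function name `relative_contrast`, the sample size, and the seed are assumptions made for this illustration; they do not come from the original text.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query to n_points random points drawn uniformly from [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print(f"relative contrast in 2 dimensions:    {low:.2f}")
print(f"relative contrast in 1000 dimensions: {high:.2f}")
```

In low dimensions the nearest point is much closer than the farthest one. In high dimensions all distances are nearly equal, so the "nearest" neighbors carry little information.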
Advantages:
1. No need to estimate parameters, no training required;
2. Especially suitable for multi-class problems (objects that can carry one of several category labels).
Disadvantages:
1. When class sizes are imbalanced, the K nearest neighbors of a new input sample tend to be dominated by samples from the larger class, which biases the classification;
2. The computational cost is high, because the distance from the sample to be classified to every training sample must be computed.
Improvement measures:
1. Prune the sample attributes in advance, removing those that have little effect on the result;
2. Weight the votes by distance, so that training samples closer to the sample to be classified receive a larger weight.
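The distance-weighting idea can be sketched as follows. This is one possible reading, assuming inverse-distance weights 1 / (d + eps); the original text does not specify a weighting formula. The function name `knn_vote`, the `eps` parameter, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def knn_vote(query, samples, labels, k, weighted=False, eps=1e-9):
    # Keep the k training samples closest to the query (Euclidean distance).
    nearest = sorted(
        (math.dist(query, s), lab) for s, lab in zip(samples, labels)
    )[:k]
    scores = defaultdict(float)
    for dist, lab in nearest:
        # Plain vote: every neighbor counts 1.
        # Weighted vote (assumed scheme): closer neighbors count more.
        scores[lab] += (1.0 / (dist + eps)) if weighted else 1.0
    return max(scores, key=scores.get)

# Imbalanced toy data: two samples of class "B", six of class "A".
samples = [(1.0, 1.0), (1.2, 0.9),
           (3.0, 3.0), (3.2, 2.9), (2.9, 3.1),
           (3.1, 3.3), (2.8, 2.8), (3.3, 3.0)]
labels = ["B", "B", "A", "A", "A", "A", "A", "A"]
query = (1.1, 1.0)  # lies right next to the two "B" samples

plain = knn_vote(query, samples, labels, k=5)
weighted = knn_vote(query, samples, labels, k=5, weighted=True)
print(plain, weighted)
```

With k = 5 the plain vote is outnumbered by the larger class even though the query sits beside the two "B" samples. The weighted vote favors the two very close neighbors instead.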
Algorithm pseudo-code:
1. Calculate the distance between the current point and every point in the data set of known categories;
2. Sort the points by distance in ascending order;
3. Select the K points with the smallest distance from the current point;
4. Count how often each category occurs among these K points;
5. Return the category with the highest frequency among these K points as the predicted class of the current point.
Python implementation of the KNN algorithm:
import operator
import numpy as np

def KNN(InX, DataSet, labels, k):
    # shape[0] is the length of the first dimension, i.e. the number of training samples
    datasetsize = DataSet.shape[0]
    # Euclidean distance from InX to every training sample
    diffmat = np.tile(InX, (datasetsize, 1)) - DataSet
    sqdiffmat = diffmat ** 2
    sqdistances = sqdiffmat.sum(axis=1)
    distances = sqdistances ** 0.5
    # argsort returns the indices that sort the distances in ascending order
    sorteddistindicies = distances.argsort()
    classcount = {}
    for i in range(k):
        # look up the label of the i-th nearest sample and count its vote
        voteilabel = labels[sorteddistindicies[i]]
        classcount[voteilabel] = classcount.get(voteilabel, 0) + 1
    # sort the labels by vote count in descending order; the first entry is the most frequent
    sortedclasscount = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedclasscount[0][0]
InX: the input vector to be classified.
DataSet: the training sample set, one sample per row.
labels: the label vector, with one label for each row of DataSet.
k: the number of nearest neighbors that take part in the vote.
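As a usage illustration of these four arguments, here is a minimal standard-library sketch of the same voting procedure applied to a four-point toy data set. The function name `knn_classify` and the data are hypothetical and chosen only for this example.

```python
import math
from collections import Counter

def knn_classify(in_x, data_set, labels, k):
    # Indices of the training samples, ordered by ascending distance to in_x.
    order = sorted(range(len(data_set)),
                   key=lambda i: math.dist(in_x, data_set[i]))
    # Count the labels of the k nearest samples.
    votes = Counter(labels[i] for i in order[:k])
    # Return the most frequent label among them.
    return votes.most_common(1)[0][0]

# Toy training set: two samples near (1, 1) and two near (0, 0).
data_set = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ["A", "A", "B", "B"]

near_origin = knn_classify([0.0, 0.2], data_set, labels, k=3)
near_one = knn_classify([0.9, 1.0], data_set, labels, k=3)
print(near_origin, near_one)
```

The first query lies next to the two "B" samples and the second next to the two "A" samples, so with k = 3 each query is outvoted 2 to 1 in favor of its nearby class.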