A summary of the KNN algorithm
The KNN (k-nearest neighbors) algorithm is simple and effective, and can be used for both classification and regression.
Core principle: given a sample dataset in which every data point has features and a class label, compare the features of a new data point with those of the samples, find the k most similar samples (the nearest neighbors, usually with k <= 20), and take the class that occurs most often among those k samples as the class of the new data point.
In short: birds of a feather flock together.

II. For example:
As shown in the following illustration:
The blue squares and red triangles are known categories; the green circle is the data point to be classified.
If k = 3, the 3 nearest neighbors of the green dot are 2 red triangles and 1 blue square, so by majority vote the green dot belongs to the triangle category.
If k = 5, the 5 nearest neighbors of the green dot are 2 red triangles and 3 blue squares, so the green dot belongs to the blue-square category.
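The majority vote in this example can be sketched in a few lines of Python (the neighbor labels below are read off the illustration, not computed):

```python
from collections import Counter

# Labels of the nearest neighbors in the illustration above
neighbors_k3 = ["triangle", "triangle", "square"]
neighbors_k5 = ["triangle", "triangle", "square", "square", "square"]

# The most common label among the k nearest neighbors wins
print(Counter(neighbors_k3).most_common(1)[0][0])  # triangle
print(Counter(neighbors_k5).most_common(1)[0][0])  # square
```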
Distance calculation:
The most common distance measure is the Euclidean distance. For a sample x = (x1, ..., xn) and a sample y = (y1, ..., yn), the Euclidean distance between them is:

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

III. The algorithm flow:
1. Calculate the distance between the current point and every point in the dataset of known classes.
2. Sort the points by increasing distance.
3. Take the k points nearest to the current point.
4. Count the frequency of each class among those k points.
5. Return the most frequent class among the k points as the predicted class of the current point.

IV. Code implementation:
# Calculate the distance between the point to be classified and the sample points
import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Tile the input point into a matrix the same shape as the sample data, then subtract
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2                    # square the per-feature differences
    sqDistances = sqDiffMat.sum(axis=1)         # sum the squares for each sample
    distances = sqDistances ** 0.5              # square root gives the Euclidean distance
    sortedDistIndicies = distances.argsort()    # indices sorted by ascending distance
    classCount = {}
    for i in range(k):
        # Tally the labels of the k nearest points
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
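A quick usage sketch (the tiny dataset below is made up for illustration; a condensed copy of the classifier is included so the snippet runs on its own):

```python
import operator
from numpy import array, tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    distances = (diffMat ** 2).sum(axis=1) ** 0.5   # Euclidean distances to all samples
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

# Made-up toy data: two 'A' points near (1, 1), two 'B' points near (0, 0)
group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.2, 0.1], group, labels, 3))  # B
```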
V. Summary
Advantages:
The KNN algorithm itself is simple and effective. It is a lazy-learning algorithm: it does not need a training phase, so its training time complexity is O(1).
Defects:
High computational cost: the distance to every sample must be computed, so the classification time complexity of KNN is O(n), proportional to the total number of samples.
Choice of k: the value of k has a large influence on the result. If k is set too small, the classifier becomes sensitive to noise and accuracy drops; if k is set too large and the test sample's true class is under-represented in the training set, the extra neighbors add noise and degrade the classification. In general k is chosen by cross-validation (starting from k = 1, with k <= 20); a rule of thumb is that k should be no larger than the square root of the number of training samples.
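One way to run such a cross-validation search is a leave-one-out loop (a sketch with synthetic data; the cluster centers and the candidate range of k are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up Gaussian clusters, 20 points each
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of KNN for a given k."""
    hits = 0
    for i in range(len(X)):
        d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        d[i] = np.inf                        # a point may not vote for itself
        nearest = np.argsort(d)[:k]
        hits += np.bincount(y[nearest]).argmax() == y[i]
    return hits / len(X)

# Search k = 1..6 (at most sqrt(40)) and keep the best-scoring value
best_k = max(range(1, 7), key=lambda k: loo_accuracy(X, y, k))
```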
Unbalanced data leads to larger errors: when the classes are unbalanced, e.g. one class has many samples and the others very few, the k nearest neighbors of a new sample may be dominated by the large class simply because it has more points. Resolution: give different weights to the neighbors' votes.
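A minimal sketch of such weighting (the 1/distance scheme and the toy numbers are illustrative assumptions, not the only choice):

```python
def weighted_vote(distances, labels, eps=1e-8):
    """Each neighbor votes with weight 1/(distance + eps) instead of a flat 1."""
    weights = {}
    for d, label in zip(distances, labels):
        weights[label] = weights.get(label, 0.0) + 1.0 / (d + eps)
    return max(weights, key=weights.get)

# Two very close minority-class neighbors outweigh three distant majority ones
print(weighted_vote([0.1, 0.2, 2.0, 2.1, 2.2],
                    ['rare', 'rare', 'common', 'common', 'common']))  # rare
```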