1. KNN algorithm

Feature: a property of the sample. For example, the color of a watermelon, the shape of the melon, and the sound it makes when tapped are all features.

Label: the category of the sample. For example, the judgments "good melon" and "bad melon" are labels.

I. Introduction

KNN (k-nearest neighbors) is one of the most theoretically mature and simplest machine learning algorithms; it can be used for classification and also for regression.

Core idea: find the k samples nearest to a given sample in feature space; if the majority of those k samples belong to some category, the sample is assigned to that category as well and takes on the characteristics of that category. In making the classification decision, the method depends only on the category of the one or more nearest samples.

II. Example and Explanation

As shown in the figure below:
The blue squares and red triangles are already-classified data; the current task is to classify the green block, that is, to decide whether it belongs with the blue squares or the red triangles.

If k = 3 (solid circle), red triangles account for 2/3 of the neighbors, so the green block is judged to be a red triangle;

If k = 5 (dashed circle), blue squares account for 3/5, so it is judged to be a blue square.
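This vote can be reproduced in a few lines of Python. The list of neighbor labels below is hypothetical, ordered nearest-first to match the figure:

    from collections import Counter

    # training-point labels ordered by distance to the green block, nearest first
    # (a made-up ordering matching the figure: 2 of the nearest 3 are red
    # triangles, 3 of the nearest 5 are blue squares)
    neighbors = ["red triangle", "red triangle", "blue square", "blue square", "blue square"]

    for k in (3, 5):
        winner = Counter(neighbors[:k]).most_common(1)[0][0]
        print(f"k={k}: {winner}")   # k=3: red triangle; k=5: blue square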

1. Distance: the Euclidean distance or the Manhattan distance is generally used. For two n-dimensional samples x and y:

Euclidean distance: d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

Manhattan distance: d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|

(A small code sketch of both metrics follows the step list below.)

2. Algorithm execution process:

1) Compute the distance between the test sample and each training sample;

2) Sort in order of increasing distance;

3) Select the k points with the smallest distances;

4) Count how often each category occurs among these k points;

5) Return the most frequent category among these k points as the predicted class of the test sample.
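As promised above, a minimal sketch of the two distance metrics, assuming NumPy; the points x and y are made up for illustration:

    import numpy as np

    def euclidean_distance(x, y):
        # square root of the sum of squared coordinate differences
        return np.sqrt(np.sum((x - y) ** 2))

    def manhattan_distance(x, y):
        # sum of absolute coordinate differences
        return np.sum(np.abs(x - y))

    x = np.array([1.0, 2.0])
    y = np.array([4.0, 6.0])
    print(euclidean_distance(x, y))   # 5.0
    print(manhattan_distance(x, y))   # 7.0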

III. Code Implementation
    import numpy as np
    import operator

    # kNN: k-nearest-neighbors classification
    # inX: the vector to be classified; dataSet: the training data set;
    # labels: the classes corresponding to the training samples; k: the number of neighbors
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]                      # number of rows in the training set
        diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # coordinate-wise differences
        sqDiffMat = diffMat ** 2                            # squared differences
        sqDistances = sqDiffMat.sum(axis=1)                 # sum of the squared differences
        distances = sqDistances ** 0.5                      # Euclidean distance to every training sample
        sortedDistIndicies = distances.argsort()            # indices of the samples, nearest first
        classCount = {}                                     # vote count per class
        for i in range(k):                                  # tally the labels of the k nearest samples
            voteILabel = labels[sortedDistIndicies[i]]
            classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
        # sort the classes by vote count, descending (items() replaces Python 2's iteritems())
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]                       # the most common class among the k neighbors
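A minimal usage sketch; the four training points and their labels are made up for illustration:

    trainingSet = np.array([[1.0, 1.1],
                            [1.0, 1.0],
                            [0.0, 0.0],
                            [0.0, 0.1]])
    trainingLabels = ['A', 'A', 'B', 'B']

    print(classify0(np.array([0.9, 0.8]), trainingSet, trainingLabels, 3))  # 'A'
    print(classify0(np.array([0.1, 0.2]), trainingSet, trainingLabels, 3))  # 'B'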

IV. Analysis

1. Advantages:

1) Simple and intuitive: there is no training phase (lazy learning) and no parameters to fit.

2. Disadvantages:

1) The choice of k has a large effect on the result. If k is small, only the training samples closest to the test sample influence the prediction, so overfitting is likely. If k is large, the learning estimation error is reduced, but the approximation error grows, because training samples far from the test sample also influence the prediction and can make it wrong. In practice k is usually a small value, typically chosen by cross-validation (see the sketch after this list); it is usually an integer no greater than 20, with the square root of N (the number of training samples) as a rule-of-thumb upper limit. As the number of training samples tends to infinity with k = 1, the error rate is at most twice the Bayes error rate; if k also tends to infinity (while k/N tends to 0), the error rate tends to the Bayes error rate.

2) Imbalanced training classes can cause large errors: when the samples are unbalanced, so that one class has a very large sample size and another a very small one, the k nearest neighbors of a new sample may be dominated by the large class.

3) High computational cost: classifying each test sample requires traversing the whole training set to compute distances; in other words, KNN's time complexity is O(n) per query. KNN is therefore generally suited to datasets with a small number of samples.

4) It gives no information about the underlying structure of the samples, i.e. it cannot identify the features that characterize a sample most accurately.
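The cross-validation selection of k mentioned in point 1 can be sketched with scikit-learn (a library not used elsewhere in this post; the bundled iris dataset stands in for real data):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # score every candidate k (no greater than 20, per the rule of thumb above)
    # with 5-fold cross-validation and keep the best
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in range(1, 21)}
    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])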
