1. KNN algorithm

Feature: a property of the sample. For example, the color of a watermelon, the shape of the melon, and the sound it makes when tapped are all features.

Label: the category of the sample. For example, the judgments "good melon" and "bad melon" are labels.

I. Introduction

KNN (k-nearest neighbors) is one of the most theoretically mature and simplest machine learning algorithms; it can be used for classification and also for regression.

Core idea: find the k samples nearest to a given sample in feature space; if the majority of those k samples belong to some category, the sample is assigned to that category as well and takes on the characteristics of that category. In making the classification decision, the method depends only on the category of the one or more nearest samples.

II. Example and Explanation

As shown in the figure below:
The blue squares and red triangles are already-classified data; the current task is to classify the green block, that is, to decide whether it belongs with the blue squares or the red triangles.

If k = 3 (solid circle), red triangles account for 2/3 of the neighbors, so the green block is judged to be a red triangle;

If k = 5 (dashed circle), blue squares account for 3/5, so it is judged to be a blue square.
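This vote can be reproduced in a few lines of Python. The list of neighbor labels below is hypothetical, ordered nearest-first to match the figure:

    from collections import Counter

    # training-point labels ordered by distance to the green block, nearest first
    # (a made-up ordering matching the figure: 2 of the nearest 3 are red
    # triangles, 3 of the nearest 5 are blue squares)
    neighbors = ["red triangle", "red triangle", "blue square", "blue square", "blue square"]

    for k in (3, 5):
        winner = Counter(neighbors[:k]).most_common(1)[0][0]
        print(f"k={k}: {winner}")   # k=3: red triangle; k=5: blue square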

1. Distance: the Euclidean distance or the Manhattan distance is generally used. For two n-dimensional samples x and y:

Euclidean distance: d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

Manhattan distance: d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|

(A small code sketch of both metrics follows the step list below.)

2. Algorithm execution process:

1) Compute the distance between the test sample and each training sample;

2) Sort in order of increasing distance;

3) Select the k points with the smallest distances;

4) Count how often each category occurs among these k points;

5) Return the most frequent category among these k points as the predicted class of the test sample.
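As promised above, a minimal sketch of the two distance metrics, assuming NumPy; the points x and y are made up for illustration:

    import numpy as np

    def euclidean_distance(x, y):
        # square root of the sum of squared coordinate differences
        return np.sqrt(np.sum((x - y) ** 2))

    def manhattan_distance(x, y):
        # sum of absolute coordinate differences
        return np.sum(np.abs(x - y))

    x = np.array([1.0, 2.0])
    y = np.array([4.0, 6.0])
    print(euclidean_distance(x, y))   # 5.0
    print(manhattan_distance(x, y))   # 7.0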

III. Code Implementation
    import numpy as np
    import operator

    # kNN: k-nearest-neighbors classification
    # inX: the vector to be classified; dataSet: the training data set;
    # labels: the classes corresponding to the training samples; k: the number of neighbors
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]                      # number of rows in the training set
        diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet  # coordinate-wise differences
        sqDiffMat = diffMat ** 2                            # squared differences
        sqDistances = sqDiffMat.sum(axis=1)                 # sum of the squared differences
        distances = sqDistances ** 0.5                      # Euclidean distance to every training sample
        sortedDistIndicies = distances.argsort()            # indices of the samples, nearest first
        classCount = {}                                     # vote count per class
        for i in range(k):                                  # tally the labels of the k nearest samples
            voteILabel = labels[sortedDistIndicies[i]]
            classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
        # sort the classes by vote count, descending (items() replaces Python 2's iteritems())
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]                       # the most common class among the k neighbors
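A minimal usage sketch; the four training points and their labels are made up for illustration:

    trainingSet = np.array([[1.0, 1.1],
                            [1.0, 1.0],
                            [0.0, 0.0],
                            [0.0, 0.1]])
    trainingLabels = ['A', 'A', 'B', 'B']

    print(classify0(np.array([0.9, 0.8]), trainingSet, trainingLabels, 3))  # 'A'
    print(classify0(np.array([0.1, 0.2]), trainingSet, trainingLabels, 3))  # 'B'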

IV. Analysis

1. Advantages:

1) Simple and intuitive: there is no training phase (lazy learning) and no parameters to fit.

2. Disadvantages:

1) The choice of k has a large effect on the result. If k is small, only the training samples closest to the test sample influence the prediction, so overfitting is likely. If k is large, the learning estimation error is reduced, but the approximation error grows, because training samples far from the test sample also influence the prediction and can make it wrong. In practice k is usually a small value, typically chosen by cross-validation (see the sketch after this list); it is usually an integer no greater than 20, with the square root of N (the number of training samples) as a rule-of-thumb upper limit. As the number of training samples tends to infinity with k = 1, the error rate is at most twice the Bayes error rate; if k also tends to infinity (while k/N tends to 0), the error rate tends to the Bayes error rate.

2) Imbalanced training classes can cause large errors: when the samples are unbalanced, so that one class has a very large sample size and another a very small one, the k nearest neighbors of a new sample may be dominated by the large class.

3) High computational cost: classifying each test sample requires traversing the whole training set to compute distances; in other words, KNN's time complexity is O(n) per query. KNN is therefore generally suited to datasets with a small number of samples.

4) It gives no information about the underlying structure of the samples, i.e. it cannot identify the features that characterize a sample most accurately.
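The cross-validation selection of k mentioned in point 1 can be sketched with scikit-learn (a library not used elsewhere in this post; the bundled iris dataset stands in for real data):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # score every candidate k (no greater than 20, per the rule of thumb above)
    # with 5-fold cross-validation and keep the best
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in range(1, 21)}
    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])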
