Introduction to the K-Nearest Neighbors (KNN) Algorithm

Brief introduction

Among all machine learning algorithms, K-Nearest Neighbors (KNN) is one of the simplest. Despite its simplicity, it turns out to be very effective and can even be the better choice on certain tasks. It can be used for both classification and regression problems, although it is more commonly used for classification. In this article, we will first understand the principle behind the KNN algorithm, study the different methods of calculating the distance between points, and finally implement the algorithm in Python.

Directory

    1. A simple example to understand the principle behind KNN
    2. How the KNN algorithm works
    3. Methods for calculating distances between points
    4. How to choose the K value
    5. KNN instance (Python)

1. A simple example to understand the principle behind KNN

Let's start with a simple example. Suppose we have two types of data, blue squares and red triangles, distributed in a two-dimensional plane. Now a new data point arrives, shown as a green circle: does it belong to the blue square category or to the red triangle category?

We first find the points nearest to this green circle; by similarity, the points closest to the green circle help us judge its category. How many neighbors do we need for this judgment? That number is K.

    • If k = 3, we choose the 3 points nearest to the green circle. Because the red triangles make up 2/3 of them, we conclude that the green circle belongs to the red triangle class.
    • If k = 5, the blue squares make up 3/5 of the nearest points, so the green circle is assigned to the blue square class.

The idea is simple: if most of a sample's K nearest neighbors in the feature space belong to a certain category, then the sample also belongs to that category.

2. How the KNN algorithm works

Algorithm Flow:

    1. Calculate the distance between every point in the known-category dataset and the current point;
    2. Sort the points in ascending order of distance;
    3. Select the K points with the smallest distance to the current point;
    4. Determine the frequency of each category among those K points;
    5. Return the most frequent category among those K points as the predicted classification of the current point.
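As a rough sketch of this flow (a minimal from-scratch example using NumPy with made-up sample data, not code from the original article):

import numpy as np
from collections import Counter

def knn_classify(new_point, X_train, y_train, k=3):
    # 1. Calculate the distance from every training point to the current point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # 2./3. Sort by distance and select the K closest points
    nearest_idx = np.argsort(distances)[:k]
    # 4. Count how often each category appears among those K points
    votes = Counter(y_train[i] for i in nearest_idx)
    # 5. Return the most frequent category as the prediction
    return votes.most_common(1)[0][0]

# Tiny usage example with made-up data
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [3.1, 2.9]])
y_train = np.array(['blue', 'blue', 'red', 'red'])
print(knn_classify(np.array([2.8, 3.0]), X_train, y_train, k=3))  # -> 'red'
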
3. How to calculate the distance between points

The first step is to calculate the distance between the new point and each training point. There are several ways to calculate this distance; the most common are the Euclidean distance, the Manhattan distance (both for continuous variables) and the Hamming distance (for categorical variables). A short code sketch of all three is shown after the list below.

    • Euclidean distance: the square root of the sum of the squared differences between the new point (x) and the existing point (y).
    • Manhattan distance: the sum of the absolute differences between the components of the two vectors.
    • Hamming distance: used for categorical variables. If a value of x and the corresponding value of y are the same, the distance d for that position is 0; otherwise d = 1.
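
As a rough illustration (a minimal sketch with made-up values, not from the original article), the three distances can be computed as follows:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(x - y))

# Hamming distance: number of positions where the categorical values differ
a = np.array(['red', 'small', 'round'])
b = np.array(['red', 'large', 'round'])
hamming = np.sum(a != b)

print(euclidean, manhattan, hamming)  # about 3.61, 5.0, 1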


Once you have measured the distance between the new observation and the points in the training set, the next step is to select the nearest points. The number of points to consider is defined by the value of K.

4. How to choose the K value

First, let's try to understand exactly what effect K has in the algorithm. Going back to the previous example, and assuming that all training observations stay fixed, a given K value lets us draw a boundary for each class. These boundaries separate the two classes. Let's look at the effect of the value of K on the class boundary. Below are the different boundaries separating the two classes for different K values:
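
Since the original figures are not reproduced here, the following is a hedged sketch of how such boundaries could be drawn (using scikit-learn and synthetic data; the variable names and K values are illustrative assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data as a stand-in for the blue squares and red triangles
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# A grid covering the feature space, used to colour each class region
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, k in zip(axes, [1, 9, 45]):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)  # class regions become smoother as k grows
    ax.scatter(X[:, 0], X[:, 1], c=y, s=15)
    ax.set_title('k = %d' % k)
plt.show()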

Looking closely, you can see that the boundary becomes smoother as K increases. As K grows towards infinity, the plot eventually becomes entirely blue or entirely red. We can determine the best K from the error on our training and validation sets (after all, our ultimate goal is to minimize the error).

The following are the training errors and validation errors with different K values:

For very small K values (say k = 1), the model overfits the training data, which results in a high error rate on the validation set. On the other hand, for large K values the model performs poorly on both the training and validation sets. If you look closely, you will find that the validation error curve reaches its minimum at k = 9; this K value is the best for the model (it will vary for different datasets). This curve is called the "elbow curve" and is typically used to determine the K value.

We can also use grid search techniques to find the best K values. We will implement this in the next section.

5. KNN instance (Python)

You can download the dataset here.

1. Import data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read the dataset (the file name here is an assumption; use the file you downloaded)
data = pd.read_csv('train.csv')

# View data
data.head()

2. Handling of variables

Because the class variable is non-numeric, we need to encode it.

name_list = data['Name'].unique()
for i in name_list:
    data.loc[data['Name'] == i, 'Name'] = list(name_list).index(i) + 1
data.head()

3. Split training sets and test sets

We use 70% of the data for training and 30% for testing.

# Create train set and test set
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.3)

x_train = train.drop('Name', axis=1)
y_train = train['Name']
x_test = test.drop('Name', axis=1)
y_test = test['Name']

4. Preprocessing: feature scaling

We need to scale the features to a common range, here [0, 1], before applying KNN.

Benefit: this eliminates the effect of features being on different magnitude scales.

# Preprocessing - scaling the features
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

x_train_scaled = scaler.fit_transform(x_train)
x_train = pd.DataFrame(x_train_scaled)

x_test_scaled = scaler.transform(x_test)
x_test = pd.DataFrame(x_test_scaled)

5. Check the error for different K values

We use the root mean square error (RMSE) as the measure.

# Import required packages
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt

rmse_val = []  # to store RMSE values for different k
for K in range(20):
    K = K + 1
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(x_train, y_train)  # fit the model
    pred = model.predict(x_test)  # make predictions on the test set
    error = sqrt(mean_squared_error(y_test, pred))  # calculate RMSE
    rmse_val.append(error)  # store RMSE values
    print('RMSE value for k =', K, 'is:', error)

To draw the error curve:

# Plotting the RMSE values against K values
curve = pd.DataFrame(rmse_val)  # elbow curve
curve.plot()

  

As we discussed, when we take k = 1 we get a very high RMSE value. As we increase K, the RMSE decreases. At k = 7 the RMSE is approximately 0.2266, and it rises again when K is increased further. In this case, we can conclude that k = 7 gives us the best result.

These were the predictions made using our training data set. Now let's predict the values for the test data set.
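
A minimal sketch of that step, assuming the chosen k = 7 and reusing the variables from the snippets above (a separate test file, if used, would be loaded and scaled in the same way):

# Fit the final model with the chosen K and predict on the test data
model = neighbors.KNeighborsRegressor(n_neighbors=7)
model.fit(x_train, y_train)
test_pred = model.predict(x_test)
print(test_pred[:10])  # first few predicted values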

6. Applying GridSearchCV

Drawing an "elbow curve" every time to determine K is a cumbersome and tedious process. You can simply use grid search to find the best value.

from sklearn.model_selection import GridSearchCV

params = {'n_neighbors': [2, 3, 4, 5, 6, 7, 8, 9]}

knn = neighbors.KNeighborsRegressor()
model = GridSearchCV(knn, params, cv=5)
model.fit(x_train, y_train)
model.best_params_

As you can see, k=4 is indeed the best choice.

Summary

  The KNN algorithm is one of the simplest classification algorithms. Even so, it can provide highly competitive results. KNN can also be used for regression problems; the only difference from the approach discussed here is that the prediction uses the average of the nearest neighbors' values instead of a vote among them.
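
For instance (a small sketch using scikit-learn with made-up data, not taken from the original article), the two variants differ only in how the neighbors' labels are combined:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y_class = np.array([0, 0, 0, 1, 1, 1])                 # discrete labels -> majority vote
y_value = np.array([1.0, 1.2, 0.8, 9.5, 10.1, 10.4])   # continuous targets -> average

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X, y_value)

print(clf.predict([[2.5]]))  # vote of the 3 nearest labels -> class 0
print(reg.predict([[2.5]]))  # mean of the 3 nearest targets -> about 1.0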

Advantages

    • Easy to understand and use; can be applied to both classification and regression;
    • Works with both numerical and discrete (categorical) data;
    • Training time complexity is O(n); it makes no assumptions about the input data;
    • Not sensitive to outliers.

Disadvantages:

    • High computational complexity and high memory (space) complexity;
    • Sensitive to class imbalance (when some categories have many samples while others have very few);
    • It is usually not used when the dataset is very large, because the amount of computation becomes too great; but the sample size also cannot be too small, otherwise misclassification becomes likely;
    • Its biggest drawback is that it cannot give any insight into the intrinsic structure of the data.

The above is a brief summary of KNN and its application. I hope it helps!
