Summary of k nearest neighbor algorithm

Source: Internet
Author: User

What is k nearest neighbor?

K-nearest neighbor (k-NN) is a non-parametric learning algorithm that can be used for classification problems as well as regression problems.

  • What is non-parametric learning?
    In general, machine learning algorithms have parameters to learn, such as the weight and bias parameters of a linear regression model or the C and gamma parameters of an SVM, and learning those parameters depends on some learning strategy. By contrast, k-nearest neighbor has no such parameters to fit, which makes it arguably the simplest and easiest-to-understand machine learning algorithm.
  • What is the idea of the k-nearest neighbor algorithm?
    Find the k points nearest to the sample to be predicted, then decide based on those k points: for a classification problem, the result is the most common category among the k points; for a regression problem, the result is the mean of the k points' target values.
  • So how do we choose the value of k?
    The choice of k has a significant impact on the results of the k-nearest neighbor algorithm.
    To see why, consider two extreme cases. If k takes its smallest value, k = 1, then each prediction depends only on the single nearest instance; if that point happens to be noise, the prediction will be wrong. A small k also makes the model more complex and prone to overfitting. If k becomes very large, say k = n (the total number of training instances), then no matter how distance is measured, the prediction is always the class that appears most often in the training set; the model becomes trivially simple, always outputting the majority class.
    Overall, if k is too small, prediction is based on training instances in a small neighborhood: the approximation error of learning decreases, but the estimation error increases, and the prediction becomes very sensitive to the nearest neighboring points; if a neighboring point happens to be noise, the prediction will be wrong. In other words, decreasing k makes the overall model more complex and prone to overfitting.
    If k is too large, prediction is based on training instances in a large neighborhood: the advantage is that the estimation error decreases, but the approximation error increases, because training instances far from the input also influence the prediction and can make it wrong. Increasing k makes the overall model simpler. If k = n, then whatever the input instance is, the model simply predicts the most common class in the training set; such a model is too simple and completely ignores the large amount of useful information in the training instances.
  • How is "nearest" determined?
    By a distance metric. Euclidean distance is the usual choice, but there are alternatives such as Manhattan distance, cosine similarity, and so on.
  • How is the final result determined? (Classification decision rules)
    Majority voting is generally used: among the labels of the selected k nearest neighbors, the most frequent one is returned as the predicted value for the input instance.
    Overall, for a given dataset, how the k-nearest neighbor algorithm behaves depends on the three elements above: the choice of k, the distance metric, and the classification decision rule.
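The distance metrics mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not from the original text; the function names are my own, and the points are plain tuples:

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors (1.0 = same direction)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(euclidean((0, 0), (3, 4)))          # 5.0
print(manhattan((0, 0), (3, 4)))          # 7
print(cosine_similarity((1, 0), (1, 0)))  # 1.0
```

Note that cosine measures similarity rather than distance: larger values mean the vectors point in more similar directions, so a k-NN implementation using it would select the k *largest* values instead of the smallest.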
Algorithm description

For each point of unknown category in the dataset, do the following:

    1. Calculate the distance between the current point and every point in the dataset of known categories;
    2. Sort by distance in ascending order;
    3. Select the k points nearest to the current point;
    4. Count the frequency of each category among these k points;
    5. Return the most frequent category among the k points as the predicted classification for the current point.
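The five steps above can be sketched directly in Python. This is a minimal illustration under assumed names and toy data, not a production implementation:

```python
import math
from collections import Counter

def knn_classify(query, data, labels, k):
    """Predict the class of `query` from labeled points via majority vote."""
    # Step 1: distance from the query to every labeled point (Euclidean)
    distances = [math.dist(query, point) for point in data]
    # Steps 2-3: sort indices by distance, keep the k nearest
    nearest = sorted(range(len(data)), key=lambda i: distances[i])[:k]
    # Step 4: count category frequencies among the k nearest points
    votes = Counter(labels[i] for i in nearest)
    # Step 5: return the most frequent category
    return votes.most_common(1)[0][0]

# Toy dataset: two well-separated clusters in the plane
data = [(1.0, 1.1), (1.0, 1.0), (0.9, 1.2), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_classify((1.1, 0.9), data, labels, k=3))  # A
print(knn_classify((5.0, 5.0), data, labels, k=3))  # B
```

Note that there is no training step: the "model" is just the stored data, and all work happens at prediction time.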
Advantages

The algorithm is simple and the model is easy to understand; there is no training process, and it usually performs reasonably well without much tuning. For this reason it is often used as a baseline (a basic reference solution) for a problem.

Limitations
    • When instances have many features, or the feature vectors are mostly sparse, the model performs poorly.
    • When the dataset is large, classification becomes very slow, since every prediction requires computing the distance to all training points.
      In practice, therefore, k-NN is mainly suitable for small datasets with few features, and is not commonly used beyond that.
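The sensitivity to k discussed earlier can be demonstrated with a small sketch (illustrative names and data of my own): a single mislabeled noise point fools k = 1, a moderate k outvotes it, and k = n always returns the majority class.

```python
from collections import Counter

def knn_vote(query, data, labels, k):
    """1-D k-NN by majority vote, using absolute difference as the distance."""
    nearest = sorted(range(len(data)), key=lambda i: abs(data[i] - query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Class "A" points near 0, class "B" points near 10,
# plus one mislabeled noise point ("B") at 1.0.
data = [0.0, 0.2, 0.4, 1.0, 10.0, 10.2, 10.4]
labels = ["A", "A", "A", "B", "B", "B", "B"]

print(knn_vote(0.9, data, labels, k=1))  # B: k=1 trusts the noise point
print(knn_vote(0.9, data, labels, k=3))  # A: a larger k outvotes the noise
print(knn_vote(0.9, data, labels, k=7))  # B: k=n always yields the majority class
```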

