Machine Learning Classic Algorithms and Python Implementation -- k-Nearest Neighbor (KNN) Algorithm


(i) KNN is a supervised learning algorithm

The KNN (k-Nearest Neighbors) algorithm is among the simplest and most intuitive of all machine learning algorithms. KNN is an instance-based learning method: it computes the distance between a new data point and the feature values of the training data, then selects the K (K >= 1) nearest neighbors to classify (by voting) or to regress. If K = 1, the new data point is simply assigned to the class of its single nearest neighbor.

Is KNN a supervised or an unsupervised learning algorithm? First, consider the definitions. In supervised learning, the data carries explicit labels (discrete labels for classification, continuous values for regression), and the learned model assigns a new data point to a definite class or predicts a definite value for it. In unsupervised learning, the data has no labels, and what the machine learns is the pattern in the data (extracting salient features, clustering, and so on). Clustering, for example, yields a model that tells which of the original data points a new data point is "most similar" to. When KNN is used for classification, every training point has an explicit label and the label of a new point is determined unambiguously; when KNN is used for regression, the prediction is likewise a definite value derived from the neighbors' values. Therefore KNN is a supervised learning algorithm.
The KNN algorithm proceeds as follows (a minimal sketch in Python follows the list):
    1. Choose a distance metric and compute the distance, over all features, from the new data point to every point in the labeled dataset.
    2. Sort the distances in ascending order and select the K points with the smallest distances.
    3. For classification, return the most frequent class among the K points as the prediction; for regression, return a weighted value of the K points.
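Below is a minimal sketch of this procedure (Euclidean distance, majority vote); names such as `knn_classify`, `dataset`, and `labels` are illustrative, not taken from the original program.

```python
import numpy as np
from collections import Counter

def knn_classify(new_point, dataset, labels, k):
    """Classify new_point by majority vote among its k nearest neighbors.

    new_point: 1-D array of features
    dataset:   2-D array, one row per training sample
    labels:    sequence of class labels, one per training sample
    """
    # Step 1: Euclidean distance from new_point to every training sample
    diffs = dataset - new_point
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances
    nearest = distances.argsort()[:k]
    # Step 3: majority vote over the labels of those k neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Example usage
dataset = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.1, 0.2]), dataset, labels, k=3))  # -> 'B'
```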
(ii) Key points of the KNN algorithm

Although the theory and procedure of KNN are simple, a few points deserve attention in order to obtain good results.
1. All features of the data must be quantified so they can be compared.
If a feature is non-numeric, it must be quantified by mapping it to a numeric value. For example, if a sample feature is a color (red, black, blue), there is no distance between colors as such, but the colors can be converted to grayscale values so that distances can be computed. In addition, a sample usually has several features, each with its own domain and range of values, and they affect the distance calculation unequally: a feature with large values will overwhelm features with small values. To treat all features fairly, the feature values must be rescaled; the simplest approach is to normalize all feature values, for example into the range [0, 1], as sketched below.
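A minimal min-max normalization sketch (the function name `auto_norm` is illustrative):

```python
import numpy as np

def auto_norm(dataset):
    """Scale every feature column of dataset linearly into [0, 1]."""
    min_vals = dataset.min(axis=0)          # per-feature minimum
    max_vals = dataset.max(axis=0)          # per-feature maximum
    ranges = max_vals - min_vals
    norm = (dataset - min_vals) / ranges    # broadcasts over all rows
    return norm, ranges, min_vals

# Example: miles flown and ice cream liters differ by orders of magnitude
raw = np.array([[40920.0, 0.95], [14488.0, 1.67], [75136.0, 0.13]])
norm, ranges, min_vals = auto_norm(raw)
print(norm)
```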
2. A distance function is needed to compute the distance between two samples.
There are many definitions of distance, such as Euclidean distance, cosine distance, Hamming distance, and Manhattan distance; for similarity measures see 'Ramble: methods of distance and similarity measurement in machine learning'. In general the Euclidean distance is chosen as the distance metric, but it applies only to continuous variables. For discrete variables, such as in text classification, the Hamming distance can be used instead. If a specialized algorithm is used to learn the metric, the accuracy of K-nearest-neighbor classification can be improved significantly, for example with the large margin nearest neighbor method or neighbourhood components analysis.
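A minimal sketch of a few common distance functions (illustrative helpers, not from the original program):

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance; suitable for continuous features."""
    return np.sqrt(((a - b) ** 2).sum())

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return np.abs(a - b).sum()

def cosine_distance(a, b):
    """1 - cosine similarity; measures difference in direction."""
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming(a, b):
    """Number of positions at which two discrete vectors differ."""
    return int((np.asarray(a) != np.asarray(b)).sum())

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b), cosine_distance(a, b))
print(hamming([1, 0, 1, 1], [1, 1, 1, 0]))
```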
3. Determining the value of K
K is a user-defined constant, and its value directly affects the final prediction. One way to choose K is the cross-validation error statistic method. Cross-validation, mentioned earlier, means splitting the data into a training set and a test set, for example using 95% of the samples for training and the rest for testing. A model is trained on the training data and its error rate is measured on the test data. The cross-validation error statistic method compares the average cross-validation error rate for different values of K and selects the K with the lowest error rate. For example, take K = 1, 2, 3, ..., 100; for each K = i perform cross-validation, compute the average error, and finally pick the K whose error is smallest.
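A minimal sketch of choosing K with a single hold-out split (a simplified stand-in for full cross-validation; `knn_classify` is the function sketched above, and `choose_k` is an illustrative name):

```python
def choose_k(dataset, labels, k_candidates, test_ratio=0.05):
    """Return the K with the lowest error rate on a hold-out test split."""
    n = dataset.shape[0]
    n_test = max(1, int(n * test_ratio))      # e.g. 5% of samples held out
    train_x, train_y = dataset[n_test:], labels[n_test:]
    best_k, best_err = None, float('inf')
    for k in k_candidates:
        errors = sum(
            knn_classify(dataset[i], train_x, train_y, k) != labels[i]
            for i in range(n_test)
        )
        err_rate = errors / n_test
        if err_rate < best_err:               # keep the K with the smallest error
            best_k, best_err = k, err_rate
    return best_k, best_err
```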
(iii) KNN classification

The training samples are vectors in a multidimensional feature space, and each training sample carries a category label (like or dislike, keep or delete). The classification step usually uses a "majority vote" rule: the class that appears most frequently among the K neighbors is taken as the predicted class. One drawback of majority voting is that classes with many samples tend to dominate the prediction, simply because they are more likely to appear among the K neighbors of the test point. One way to address this is to weight each of the K neighbors by its distance to the test point. For example, if a neighbor is at distance d from the test point, its vote is given weight 1/d; the weights of each class label among the K neighbors are then summed, and the class with the largest total weight is the predicted label of the new data point.
For example, with K = 5, suppose the distances from a new data point to its five nearest neighbors are (1, 3, 3, 4, 5) and the neighbors' class labels are (yes, no, no, yes, no).
Under majority voting the predicted category is no (3 no vs. 2 yes); under distance-weighted voting it is yes (no: 1/3 + 1/3 + 1/5 ≈ 0.87, yes: 1/1 + 1/4 = 1.25).
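A minimal distance-weighted voting sketch that reproduces the numbers above (the function name is illustrative):

```python
from collections import defaultdict

def weighted_vote(distances, neighbor_labels):
    """Sum 1/d per class and return the class with the largest total weight."""
    weights = defaultdict(float)
    for d, label in zip(distances, neighbor_labels):
        weights[label] += 1.0 / d        # closer neighbors get larger weight
    return max(weights, key=weights.get)

print(weighted_vote([1, 3, 3, 4, 5], ['yes', 'no', 'no', 'yes', 'no']))  # -> 'yes'
```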
The following Python programs illustrate the KNN algorithm (Euclidean distance, majority-vote decision): one uses KNN to improve the match-making results of a dating site, and the other uses KNN for handwritten digit recognition.
The dating-site example predicts which type Helen considers a man to be (the categories are "likes very much", "neutral", and "dislikes") from three features: his annual frequent-flyer mileage, the percentage of time he spends playing video games, and the liters of ice cream he consumes per week. Because the three features have very different value ranges, the rescaling strategy used here is normalization.
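A minimal sketch of loading the dating data, normalizing it, and classifying a new candidate; it reuses `auto_norm` and `knn_classify` from the sketches above, and the file name `datingTestSet.txt` and its tab-separated three-features-plus-label format are assumptions for illustration.

```python
import numpy as np

def file2matrix(filename):
    """Parse a tab-separated file: three numeric features, then a label."""
    features, labels = [], []
    with open(filename) as f:
        for line in f:
            parts = line.strip().split('\t')
            features.append([float(x) for x in parts[:3]])
            labels.append(parts[-1])
    return np.array(features), labels

# Normalize the three features, then classify a new candidate on the same scale
data, labels = file2matrix('datingTestSet.txt')        # assumed file name/format
norm, ranges, min_vals = auto_norm(data)
candidate = (np.array([40000.0, 10.0, 0.5]) - min_vals) / ranges
print(knn_classify(candidate, norm, labels, k=3))
```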
The handwriting-recognition system based on a KNN classifier recognizes only the digits 0 through 9. The digits to be recognized have already been processed with image-editing software into black-and-white images of the same size: 32 x 32 pixels. Although storing the images as text does not use memory efficiently, the images are converted to text format for ease of understanding. There are about 200 training samples per digit, and the program formats each image sample as a vector, converting the 32 x 32 binary image matrix into a 1 x 1024 vector.
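A minimal sketch of that conversion, assuming each text file holds 32 lines of 32 characters, each '0' or '1' (the file name in the usage comment is illustrative):

```python
import numpy as np

def img2vector(filename):
    """Read a 32x32 text image and flatten it into a 1x1024 vector."""
    vect = np.zeros((1, 1024))
    with open(filename) as f:
        for row, line in enumerate(f):
            for col in range(32):
                vect[0, 32 * row + col] = int(line[col])
    return vect

# vect = img2vector('digits/0_13.txt')   # then classify with knn_classify
```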
Program TBD

(iv) KNN regression

When the label of a data point is a continuous value, the KNN algorithm works the same way as the KNN classification algorithm; the difference lies in how the K neighbors are combined. KNN regression takes a weighted combination of the K neighbors' label values as the predicted value of the new data point. Typical weighting schemes are: the simple average of the K nearest neighbors' values (the weakest option); weights of 1/d, where d is the distance to the neighbor (an effective choice that makes close neighbors count far more than distant ones); and a Gaussian function (or another suitable decreasing function) of the distance, weight = gaussian(distance), which shrinks as the distance grows and often gives a more accurate weighted estimate.
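A minimal KNN regression sketch covering these weighting schemes (function and parameter names are illustrative):

```python
import numpy as np

def knn_regress(new_point, dataset, values, k, weighting='inverse'):
    """Predict a continuous value as a weighted average of the k nearest neighbors."""
    diffs = dataset - new_point
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    d = distances[nearest]
    v = values[nearest]
    if weighting == 'average':
        w = np.ones(k)                  # plain average of neighbor values
    elif weighting == 'inverse':
        w = 1.0 / (d + 1e-8)            # 1/d, guarded against zero distance
    else:                               # 'gaussian'
        w = np.exp(-d ** 2 / 2.0)       # weight decreases with distance
    return (w * v).sum() / w.sum()

dataset = np.array([[1.0], [2.0], [3.0], [10.0]])
values = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(np.array([2.5]), dataset, values, k=3, weighting='inverse'))
```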
(v) Summary

The K-nearest-neighbor algorithm is the simplest and most effective algorithm for classifying data. Its learning is instance-based, so the training samples must be close to the real data we expect to see when the algorithm is used. The K-nearest-neighbor algorithm must keep the entire dataset; if the training set is large, a large amount of storage is required. Moreover, because the distance to every sample in the dataset must be computed, it can be very time-consuming in practice. Another drawback of the K-nearest-neighbor algorithm is that it gives no information about the underlying structure of the data, so we cannot know what an average or a typical sample of each class looks like.
