Common Machine Learning Algorithms, Principles + Practice Series 5 (KNN Classification + K-means Clustering)


One, KNN Classification

K-Nearest Neighbors (KNN) is a supervised classification algorithm whose working principle is very simple. We start from a sample set, also called the training set, in which every sample carries a label. To classify a new data point, each of its features is compared with the corresponding features of the samples in the training set, and the labels of the most similar samples are extracted; k is the number of most similar data points selected, and the label that occurs most frequently among those k points becomes the classification of the new data. In general, k does not exceed 20. Two details of KNN deserve attention. One is the similarity measure; common choices include Euclidean distance and cosine distance. The other is that features need to be normalized before similarity is computed, for example with min-max scaling, which maps all features into [0, 1] via newX = (x - min) / (max - min).

KNN classification is widely used in many applications, such as movie classification and handwriting recognition. In movie classification, for example, two features can be used, the number of kisses and the number of fights, and there are two categories, romance and action.

The following is a simple KNN example in Python. It is a minimal sketch of the algorithm described above, using Euclidean distance and majority voting; the function name and the toy movie data are illustrative, not taken from any particular library:
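    import numpy as np
    from collections import Counter

    def knn_classify(new_point, train_data, train_labels, k=3):
        """Classify new_point by majority vote among its k nearest training samples."""
        # Euclidean distance from the new point to every training sample
        distances = np.linalg.norm(train_data - new_point, axis=1)
        # Indices of the k most similar (closest) samples
        nearest = np.argsort(distances)[:k]
        # The most frequent label among the k neighbors wins
        votes = Counter(train_labels[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Toy movie data (illustrative): features are [number of kisses, number of fights]
    train_data = np.array([[104, 3], [100, 2], [81, 1], [10, 101], [5, 99], [2, 98]])
    train_labels = ["romance", "romance", "romance", "action", "action", "action"]

    print(knn_classify(np.array([18, 90]), train_data, train_labels, k=3))  # -> "action"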

In a handwriting recognition system, for example, each handwritten image is first turned into a pixel vector and stored together with its label (the character it represents, such as 1, 2, 3, 4, and so on) as a training sample; a new handwritten input is then vectorized the same way, its k nearest neighbors are found, and the character is decided by voting.
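For the image-to-vector step, here is a minimal sketch assuming each image is stored as a 32x32 text file of 0s and 1s (a format borrowed from the classic digits example; adapt the reader to your own data):

    import numpy as np

    def img2vector(filename):
        """Flatten a 32x32 text image of 0s and 1s into a 1024-element feature vector."""
        vec = np.zeros(32 * 32)
        with open(filename) as f:
            for row, line in enumerate(f):
                for col in range(32):
                    vec[row * 32 + col] = int(line[col])
        return vec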

Two, K-means Clustering

The K-means clustering algorithm is a typical distance-based, non-hierarchical clustering algorithm and belongs to the unsupervised family. Simply put, given a data set and a number k of partitions, the data is divided into k clusters over and over again according to a distance function until convergence is reached. The algorithm uses the mean of the objects in a cluster to represent that cluster. The approximate steps are: randomly pick k data points as the initial cluster centers; compute the distance from each point to every center and assign each data point to the center closest to it; once all data points have been assigned, recompute the center of each cluster from all the data points in that cluster; repeat this process until it converges or a termination condition is satisfied, such as the sum of squared errors (SSE) reaching a local minimum.
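A minimal sketch of these steps with NumPy; the function name, tolerance, and iteration cap are illustrative choices, and the sketch assumes no cluster ever ends up empty:

    import numpy as np

    def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
        """Plain K-means: random initial centers, assign points, recompute, repeat."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(max_iters):
            # distance from every point to every center, then nearest-center assignment
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each center as the mean of its assigned points
            # (assumes every cluster keeps at least one point)
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.linalg.norm(new_centers - centers) < tol:  # centers stopped moving
                break
            centers = new_centers
        return centers, labels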

The difference between K-means clustering and K-medoids (the K center points method):

The K-medoids method does not use the mean of the objects in a cluster as the cluster center; instead, it chooses the object in the cluster that is closest to that mean.
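In code, that choice is a one-liner; a sketch assuming a cluster stored as a NumPy array:

    import numpy as np

    def medoid(cluster):
        """Return the actual point in the cluster closest to the cluster mean."""
        mean = cluster.mean(axis=0)
        return cluster[np.argmin(np.linalg.norm(cluster - mean, axis=1))]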

Application Scenarios:

1) Refined operations: apply different strategies to different clusters, or build a separate, more refined model for each cluster.

2) The cluster label can be added as a new field when building other models.

3) Data exploration and outlier detection: outliers may be noise and need to be excluded.

K-means clustering has several details to note:

1) How to determine k

There is no single best way; k can be chosen flexibly. One interesting strategy, for example, is to first run a hierarchical clustering algorithm to get an initial estimate of k, as sketched below.
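A minimal sketch of that strategy with SciPy; the random data and the "ward" linkage method are illustrative choices:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = np.random.rand(50, 2)        # illustrative data
    Z = linkage(X, method="ward")    # agglomerative clustering, Ward's criterion
    dendrogram(Z)                    # large gaps between merge heights suggest a k
    plt.show()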

2) How to choose the k initial points

The common approach is random selection, but the results are often not very good. Similar to the strategy above, one can first use a hierarchical clustering algorithm to divide the data into k clusters and use those clusters' centroids as the initial centroids, as sketched below.
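A sketch of that initialization with SciPy and scikit-learn; note that passing an explicit init array to KMeans requires n_init=1:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    X = np.random.rand(50, 2)                          # illustrative data
    k = 3
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=k, criterion="maxclust")    # cut the tree into k clusters
    # the centroid of each hierarchical cluster becomes an initial K-means center
    init = np.array([X[labels == j].mean(axis=0) for j in range(1, k + 1)])
    km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)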

3) The method of calculating distances

Common choices include Euclidean distance and cosine similarity, both sketched below.
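Both measures in a few NumPy lines, with illustrative vectors:

    import numpy as np

    a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 5.0])
    euclidean = np.linalg.norm(a - b)                            # straight-line distance
    cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity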

4) The algorithm's stopping condition

Typical conditions are that the maximum number of iterations has been reached, or that an objective function has reached an optimum (for example, the cluster centers stop moving).

5) Feature standardization

For example, the min-max method: newX = (x - min) / (max - min).

Or the z-score method: newX = (x - mean) / s, where mean is the sample mean and s is the sample standard deviation. Both are sketched below.
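Both transformations sketched with NumPy, applied per column of an illustrative feature matrix (scikit-learn's MinMaxScaler and StandardScaler do the same):

    import numpy as np

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # maps each feature to [0, 1]
    zscore = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)           # sample standard deviation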

6) Clustering variables should be few but good

There are several ways to reduce the dimensionality, such as correlation screening or principal component analysis (PCA), as sketched below.
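A minimal PCA sketch with scikit-learn; the data and the choice of two components are illustrative:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 10)           # illustrative: 10 candidate clustering variables
    pca = PCA(n_components=2)             # keep the two strongest components
    X_reduced = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)  # variance retained by each component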

7) Evaluation metrics

RMSSTD (Root Mean Square Standard Deviation) is the pooled standard deviation of all the variables within a cluster: the smaller it is, the higher the within-cluster similarity and the better the clustering effect. It is computed as RMSSTD = sqrt( (S1 + S2 + ... + Sp) / (p * (n - 1)) ), where Si is the sum of squared deviations from the mean for the i-th variable, p is the number of variables, and n is the number of objects in the cluster.
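A sketch of the metric for a single cluster with NumPy, following the definition above:

    import numpy as np

    def rmsstd(cluster):
        """Pooled standard deviation of all variables in one cluster (smaller is better)."""
        n, p = cluster.shape
        ss = ((cluster - cluster.mean(axis=0)) ** 2).sum()  # total sum of squared deviations
        return np.sqrt(ss / (p * (n - 1)))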
