Common Machine Learning Algorithms, Principles + Practice Series 5 (KNN Classification + K-means Clustering)


One, KNN Classification

K-Nearest Neighbors (KNN) is a supervised classification algorithm whose working principle is very simple. We start from a sample set, also called the training set, in which every sample carries a label. To classify a new data point, each of its features is compared with the corresponding features of the samples in the training set, and the labels of the most similar samples are extracted; k is the number of most similar data points selected, and the label that occurs most frequently among those k points becomes the classification of the new data. In general, k does not exceed 20. Two details of KNN deserve attention. One is the similarity measure; common choices include Euclidean distance and cosine distance. The other is that features need to be normalized before similarity is computed, for example with min-max scaling, which maps all features into [0, 1] via newX = (x - min) / (max - min).

KNN classification is widely used in many applications, such as movie classification and handwriting recognition. In movie classification, for example, two features can be used, the number of kisses and the number of fights, and there are two categories, romance and action.

The following is a simple KNN example in Python. It is a minimal sketch of the algorithm described above, using Euclidean distance and majority voting; the function name and the toy movie data are illustrative, not taken from any particular library:
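    import numpy as np
    from collections import Counter

    def knn_classify(new_point, train_data, train_labels, k=3):
        """Classify new_point by majority vote among its k nearest training samples."""
        # Euclidean distance from the new point to every training sample
        distances = np.linalg.norm(train_data - new_point, axis=1)
        # Indices of the k most similar (closest) samples
        nearest = np.argsort(distances)[:k]
        # The most frequent label among the k neighbors wins
        votes = Counter(train_labels[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Toy movie data (illustrative): features are [number of kisses, number of fights]
    train_data = np.array([[104, 3], [100, 2], [81, 1], [10, 101], [5, 99], [2, 98]])
    train_labels = ["romance", "romance", "romance", "action", "action", "action"]

    print(knn_classify(np.array([18, 90]), train_data, train_labels, k=3))  # -> "action"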

In a handwriting recognition system, for example, each handwritten image is first turned into a pixel vector and stored together with its label (the character it represents, such as 1, 2, 3, 4, and so on) as a training sample; a new handwritten input is then vectorized the same way, its k nearest neighbors are found, and the character is decided by voting.
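For the image-to-vector step, here is a minimal sketch assuming each image is stored as a 32x32 text file of 0s and 1s (a format borrowed from the classic digits example; adapt the reader to your own data):

    import numpy as np

    def img2vector(filename):
        """Flatten a 32x32 text image of 0s and 1s into a 1024-element feature vector."""
        vec = np.zeros(32 * 32)
        with open(filename) as f:
            for row, line in enumerate(f):
                for col in range(32):
                    vec[row * 32 + col] = int(line[col])
        return vec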

Two, K-means Clustering

The K-means clustering algorithm is a typical distance-based, non-hierarchical clustering algorithm and belongs to the unsupervised family. Simply put, given a data set and a number k of partitions, the data is divided into k clusters over and over again according to a distance function until convergence is reached. The algorithm uses the mean of the objects in a cluster to represent that cluster. The approximate steps are: randomly pick k data points as the initial cluster centers; compute the distance from each point to every center and assign each data point to the center closest to it; once all data points have been assigned, recompute the center of each cluster from all the data points in that cluster; repeat this process until it converges or a termination condition is satisfied, such as the sum of squared errors (SSE) reaching a local minimum.
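A minimal sketch of these steps with NumPy; the function name, tolerance, and iteration cap are illustrative choices, and the sketch assumes no cluster ever ends up empty:

    import numpy as np

    def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
        """Plain K-means: random initial centers, assign points, recompute, repeat."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
        for _ in range(max_iters):
            # distance from every point to every center, then nearest-center assignment
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each center as the mean of its assigned points
            # (assumes every cluster keeps at least one point)
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.linalg.norm(new_centers - centers) < tol:  # centers stopped moving
                break
            centers = new_centers
        return centers, labels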

The difference between K-means clustering and K-medoids (the K center points method):

The K-medoids method does not use the mean of the objects in a cluster as the cluster center; instead, it chooses the object in the cluster that is closest to that mean.
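In code, that choice is a one-liner; a sketch assuming a cluster stored as a NumPy array:

    import numpy as np

    def medoid(cluster):
        """Return the actual point in the cluster closest to the cluster mean."""
        mean = cluster.mean(axis=0)
        return cluster[np.argmin(np.linalg.norm(cluster - mean, axis=1))]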

Application Scenarios:

1) Refined operations: apply different strategies to different clusters, or build a separate, more refined model for each cluster.

2) The cluster label can be added as a new field when building other models.

3) Data exploration and outlier detection: outliers may be noise and need to be excluded.

K-means clustering has several details to note:

1) How to determine k

There is no single best way; k can be chosen flexibly. One interesting strategy, for example, is to first run a hierarchical clustering algorithm to get an initial estimate of k, as sketched below.
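A minimal sketch of that strategy with SciPy; the random data and the "ward" linkage method are illustrative choices:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = np.random.rand(50, 2)        # illustrative data
    Z = linkage(X, method="ward")    # agglomerative clustering, Ward's criterion
    dendrogram(Z)                    # large gaps between merge heights suggest a k
    plt.show()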

2) How to choose the k initial points

The common approach is random selection, but the results are often not very good. Similar to the strategy above, one can first use a hierarchical clustering algorithm to divide the data into k clusters and use those clusters' centroids as the initial centroids, as sketched below.
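A sketch of that initialization with SciPy and scikit-learn; note that passing an explicit init array to KMeans requires n_init=1:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    X = np.random.rand(50, 2)                          # illustrative data
    k = 3
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=k, criterion="maxclust")    # cut the tree into k clusters
    # the centroid of each hierarchical cluster becomes an initial K-means center
    init = np.array([X[labels == j].mean(axis=0) for j in range(1, k + 1)])
    km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)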

3) The method of calculating distances

Common choices include Euclidean distance and cosine similarity, both sketched below.
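Both measures in a few NumPy lines, with illustrative vectors:

    import numpy as np

    a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 5.0])
    euclidean = np.linalg.norm(a - b)                            # straight-line distance
    cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity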

4) The algorithm's stopping condition

Typical conditions are that the maximum number of iterations has been reached, or that an objective function has reached an optimum (for example, the cluster centers stop moving).

5) Feature standardization

For example, the min-max method: newX = (x - min) / (max - min).

Or the z-score method: newX = (x - mean) / s, where mean is the sample mean and s is the sample standard deviation. Both are sketched below.
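Both transformations sketched with NumPy, applied per column of an illustrative feature matrix (scikit-learn's MinMaxScaler and StandardScaler do the same):

    import numpy as np

    X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # maps each feature to [0, 1]
    zscore = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)           # sample standard deviation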

6) Clustering variables should be few but good

There are several ways to reduce the dimensionality, such as correlation screening or principal component analysis (PCA), as sketched below.
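A minimal PCA sketch with scikit-learn; the data and the choice of two components are illustrative:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 10)           # illustrative: 10 candidate clustering variables
    pca = PCA(n_components=2)             # keep the two strongest components
    X_reduced = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)  # variance retained by each component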

7) Evaluation metrics

RMSSTD (Root Mean Square Standard Deviation) is the pooled standard deviation of all the variables within a cluster: the smaller it is, the higher the within-cluster similarity and the better the clustering effect. It is computed as RMSSTD = sqrt( (S1 + S2 + ... + Sp) / (p * (n - 1)) ), where Si is the sum of squared deviations from the mean for the i-th variable, p is the number of variables, and n is the number of objects in the cluster.
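A sketch of the metric for a single cluster with NumPy, following the definition above:

    import numpy as np

    def rmsstd(cluster):
        """Pooled standard deviation of all variables in one cluster (smaller is better)."""
        n, p = cluster.shape
        ss = ((cluster - cluster.mean(axis=0)) ** 2).sum()  # total sum of squared deviations
        return np.sqrt(ss / (p * (n - 1)))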
