K-means Algorithm Summary

Last Update: 2015-09-19 Source: Internet

Author: User


1. Principle

Clustering is an unsupervised learning method. Using some distance measure, it groups objects so that similarity within a cluster is maximized and similarity between different clusters is minimized; that is, similar objects are placed in the same cluster and dissimilar objects in different clusters. Clustering differs from classification in that the input objects do not need category labels, and the resulting grouping is determined by the algorithm used.

Among clustering algorithms, K-means is widely used because it is simple and easy to implement.

Suppose $X = \{x_1, \dots, x_n\}$ is a set of points in a $d$-dimensional vector space, where $x_i$ denotes the $i$-th object (or "data point") in the set; $C = \{c_1, \dots, c_K\}$ denotes the set of $K$ cluster centers, where $c_j$ is the $j$-th cluster center; and $m = (m_1, \dots, m_n)$ denotes the cluster membership, where $m_i$ is the index of the cluster to which $x_i$ belongs.

The K-means algorithm is an iterative greedy descent algorithm. Its objective function is non-convex, so the algorithm is only guaranteed to reach a local optimum. The objective function is:

$$J(C, m) = \sum_{i=1}^{n} \lVert x_i - c_{m_i} \rVert^2$$
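As a rough illustration, the usual K-means objective (the sum of squared Euclidean distances from each point to its nearest center) can be evaluated in a few lines of Python. The function names and the toy data below are assumptions made for this sketch, not part of the original article.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length coordinate sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_objective(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(x, c) for c in centers) for x in points)


# Toy data (hypothetical): two pairs of points, 10 units apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_centers = [(0.0, 1.0), (10.0, 1.0)]  # one center in the middle of each pair
poor_centers = [(5.0, 0.0), (5.0, 2.0)]   # both centers between the pairs

print(kmeans_objective(points, good_centers))
print(kmeans_objective(points, poor_centers))
```

The first set of centers gives a much smaller value than the second, which is the sense in which the objective measures how well the centers fit the data.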

The algorithm proceeds as follows. First, K points are selected at random from the set as the initial cluster centers. Then each point in the set is assigned to its nearest cluster center, and each cluster center is updated from the data points assigned to it. These last two steps are repeated until the algorithm converges. In this way K-means iteratively groups the data points into K clusters; the core steps are:

1) Assign each data point to the cluster center closest to it

2) Update each cluster center (the mean of the coordinates of the data points in the cluster)
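The two core steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the `max_iter` and `seed` parameters, the stopping rule (stop when no assignment changes), and the handling of empty clusters (keep the old center) are all assumptions of this sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center(x, centers):
    """Index of the center closest to x (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))


def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers drawn from the data
    labels = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest center.
        new_labels = [nearest_center(x, centers) for x in points]
        if new_labels == labels:  # no assignment changed: treat as converged
            break
        labels = new_labels
        # Step 2: move each center to the mean of its assigned points.
        for j in range(k):
            members = [x for x, m in zip(points, labels) if m == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
    return centers, labels


# Toy data (hypothetical): two well-separated groups of three points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)
```

Because the initial centers are random and the objective is non-convex, different seeds can lead to different local optima on harder data; running the algorithm several times and keeping the best result is a common practical remedy.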

The detailed steps of the algorithm are shown in Table 1.

Table 1 Specific steps of the K-means algorithm

2. Defects

The K-means algorithm has a number of defects. Table 2 lists the common ones and the methods used to address them.

Table 2 K-means algorithm defects

3. Extensions

3.1 Kernel methods

To handle clusters with complex shapes, K-means can be combined with the kernel method. A cluster boundary that is nonlinear in the original space can become linear in the high-dimensional feature space implied by the kernel function.
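One way to see how this works in practice: the squared distance between a mapped point and a cluster mean in feature space can be expanded so that it needs only kernel evaluations, never the explicit mapping. The sketch below assumes this standard expansion; the function names, the RBF kernel, and its `gamma` parameter are illustrative choices, not taken from the original article.

```python
import math


def linear_kernel(x, y):
    """Plain dot product; with this kernel the feature space is the original space."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an illustrative bandwidth choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def feature_space_sq_dist(x, cluster, kernel):
    """Squared distance from the image of x to the mean image of `cluster`:
    k(x, x) - (2/n) * sum_y k(x, y) + (1/n^2) * sum_{y, z} k(y, z)."""
    n = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, y) for y in cluster) / n
    within_term = sum(kernel(y, z) for y in cluster for z in cluster) / (n * n)
    return self_term - 2.0 * cross_term + within_term


cluster = [(0.0, 0.0), (2.0, 0.0)]  # hypothetical cluster with mean (1, 0)
x = (1.0, 1.0)

# With the linear kernel this matches the ordinary squared distance to the mean.
print(feature_space_sq_dist(x, cluster, linear_kernel))
# With the RBF kernel the same formula measures distance in the implied feature space.
print(feature_space_sq_dist(x, cluster, rbf_kernel))
```

In a kernel K-means loop, this distance would replace the Euclidean distance in the assignment step; the cluster means are never formed explicitly.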

3.2 Accelerated K-means

The K-means algorithm is slow on very large data sets, and many improved algorithms have been proposed to address this. For example, the cost of the assignment step can be reduced by using a kd-tree or by exploiting the triangle inequality.
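The triangle-inequality idea can be sketched as follows. If the distance between the current best center and another center is at least twice the distance from the point to the current best center, the other center cannot be closer, so its distance need not be computed. The code below is a simplified illustration of this pruning rule only; the function name and the `skipped` counter are assumptions of the sketch, and published accelerated variants keep additional bounds between iterations.

```python
import math


def assign_with_pruning(points, centers):
    """Nearest-center assignment that skips distance computations
    ruled out by the triangle inequality."""
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    cc = [[math.dist(centers[a], centers[b]) for b in range(k)] for a in range(k)]
    labels, skipped = [], 0
    for x in points:
        best, best_d = 0, math.dist(x, centers[0])
        for j in range(1, k):
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1  # center j cannot be closer than the current best
                continue
            d = math.dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


# Toy data (hypothetical): three well-separated centers.
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0), (20.0, 0.0)]

labels, skipped = assign_with_pruning(points, centers)
print(labels)
print(skipped)
```

The assignments are the same as those of a brute-force search; only the number of point-to-center distance computations changes, and the saving grows when the clusters are well separated.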

3.3 Soft K-means

Soft K-means is defined in contrast to hard K-means, the basic K-means algorithm, which assigns each data point to exactly one cluster. In soft K-means, each data point is assigned to every cluster with some probability: each data point has a weight (probability) vector describing the probability that it belongs to each cluster.
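The article does not say how the weight vector is computed. One common choice, assumed here purely for illustration, is to make each weight proportional to exp(-beta * squared distance), where the hypothetical parameter `beta` controls how sharply the weights concentrate on the nearest center.

```python
import math


def soft_assignments(x, centers, beta=1.0):
    """Weight of x for each center, proportional to exp(-beta * squared distance).
    The weights are non-negative and sum to 1."""
    sq = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    shift = min(sq)  # subtract the minimum so the largest exponent is 0
    scores = [math.exp(-beta * (d - shift)) for d in sq]
    total = sum(scores)
    return [s / total for s in scores]


centers = [(-1.0, 0.0), (1.0, 0.0)]  # hypothetical cluster centers

print(soft_assignments((0.0, 0.0), centers))             # equidistant point
print(soft_assignments((-0.8, 0.0), centers))            # point near the first center
print(soft_assignments((-0.8, 0.0), centers, beta=10.0))  # larger beta: closer to a hard assignment
```

As `beta` grows, the weight vector approaches the hard assignment of basic K-means; in a full soft K-means loop the centers would then be updated as weighted means of all points.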

4. Summary

The K-means algorithm uses simple iterations to group the data into K clusters. The core steps of each iteration are: (1) update the cluster centers, and (2) update the cluster memberships. Although it has many drawbacks, its simplicity, portability, and good scalability make it one of the most commonly used clustering algorithms.



