K-means Algorithm Summary

Source: Internet
Author: User

1. Principle

Clustering is an unsupervised learning method. In essence, it uses some distance measure to maximize the similarity within each cluster and minimize the similarity between different clusters; that is, similar objects are placed in the same cluster and dissimilar objects in different clusters. Clustering differs from classification in that the input objects do not need category labels, and the resulting grouping is determined by the algorithm used.

Among clustering algorithms, K-means is widely used because it is simple and easy to implement.

Suppose $X = \{x_i \mid i = 1, \dots, N\}$ is a set of points in a D-dimensional vector space, where $x_i$ denotes the i-th object (or "data point") in the set; $C = \{c_j \mid j = 1, \dots, K\}$ is the set of K cluster centers, where $c_j$ denotes the j-th cluster center; and $m = (m_1, \dots, m_N)$ is the cluster membership vector, where $m_i$ identifies the cluster to which data point $x_i$ belongs.

The K-means algorithm is an iterative greedy descent algorithm. Because its objective function is non-convex, the algorithm is only guaranteed to reach a local optimum. The objective function is the sum of squared Euclidean distances between each data point and the center of its cluster:

$$J(C, m) = \sum_{i=1}^{N} \lVert x_i - c_{m_i} \rVert^2$$
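A minimal sketch of how this objective can be evaluated, assuming it is the usual within-cluster sum of squared Euclidean distances between each point and its assigned center. The function name `kmeans_objective` and the sample data are illustrative and do not come from the source.

```python
def kmeans_objective(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centers[m]))
        for p, m in zip(points, labels)
    )


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]  # labels[i] is the index of the center assigned to points[i]

# Every point lies exactly 1 unit from its center, so the objective is 4.0.
print(kmeans_objective(points, centers, labels))
```

A lower value means the points sit closer to their assigned centers. Both core steps of K-means can only keep this value the same or reduce it, which is why the iteration settles at a local optimum.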

The algorithm proceeds as follows. First, K points are selected at random from the set as the initial cluster centers. Each point in the set is then assigned to its nearest cluster center, and each cluster center is updated from the data points currently assigned to it. These last two steps are repeated until the algorithm converges. In other words, K-means groups the data points in the set into K clusters by iteration, and the core steps are:

1) Assign each data point to the cluster center closest to it

2) Update each cluster center (the mean of the coordinates of the data points in the cluster)
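The two core steps can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation: the function names, the `max_iter` and `seed` parameters, and the rule that an empty cluster keeps its previous center are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) after alternating the assign and update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers drawn from the data
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed, so the loop has converged
            break
        labels = new_labels
        # Step 2: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, m in zip(points, labels) if m == j]
            if members:  # assumption: an empty cluster keeps its previous center
                centers[j] = mean_point(members)
    return centers, labels


data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```

Because the initial centers are random, different seeds can lead to different local optima on less cleanly separated data.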

The detailed steps of the algorithm are shown in Table 1.

Table 1 Specific steps of the K-means algorithm

2. Defects

The K-means algorithm has a number of defects. Table 2 lists the common ones and the methods used to address them.

Table 2 K-means algorithm defects

3. Extensions

3.1 Kernel methods

To handle clusters with complex shapes, the K-means algorithm can be combined with the kernel method. A cluster boundary that is nonlinear in the original space can become linear in the high-dimensional space implied by the kernel function.
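One way to make this concrete is the distance computation used in the assignment step of kernel K-means. The squared distance between a point and a cluster mean in the implied feature space can be expanded so that it needs only kernel evaluations, never the high-dimensional coordinates themselves. The sketch below assumes that standard expansion; the function names and the `gamma` bandwidth are illustrative choices, not taken from the source.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an illustrative bandwidth choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def linear_kernel(x, y):
    """Plain dot product, which corresponds to ordinary K-means distances."""
    return sum(a * b for a, b in zip(x, y))


def kernel_distance_sq(x, cluster, kernel):
    """Squared feature-space distance from x to the mean of `cluster`.

    Expands ||phi(x) - mean(phi(c))||^2 using only kernel evaluations.
    """
    n = len(cluster)
    self_term = kernel(x, x)
    cross_term = sum(kernel(x, c) for c in cluster) / n
    cluster_term = sum(kernel(a, b) for a in cluster for b in cluster) / (n * n)
    return self_term - 2.0 * cross_term + cluster_term


cluster = [(0.0, 0.0), (2.0, 0.0)]
# With the linear kernel this is the squared Euclidean distance
# from (3, 4) to the cluster mean (1, 0), which is 20.0.
print(kernel_distance_sq((3.0, 4.0), cluster, linear_kernel))
# With the RBF kernel the same call measures distance in the implied feature space.
print(kernel_distance_sq((3.0, 4.0), cluster, rbf_kernel))
```

In a full kernel K-means loop, each point would be assigned to the cluster with the smallest such distance, and cluster means would be represented only implicitly by their member points.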

3.2 Accelerated K-means

The K-means algorithm is slow on very large data sets, so many improved algorithms have been proposed to address this. For example, the cost of the assignment step can be reduced by using a kd-tree or by exploiting the triangle inequality.
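The triangle-inequality idea can be sketched as follows. If the distance between the current best center and another center is at least twice the distance from the point to the current best center, the other center cannot be closer, so its distance need not be computed. This is a simplified illustration of that single pruning rule, not a full accelerated algorithm; the function names and the `skipped` counter are assumptions made for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_with_pruning(points, centers):
    """Nearest-center assignment that skips provably farther centers.

    Returns (labels, skipped), where skipped counts avoided distance computations.
    """
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    cc = [[euclidean(centers[a], centers[b]) for b in range(k)] for a in range(k)]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, euclidean(p, centers[0])
        for j in range(1, k):
            # Triangle inequality: d(p, c_j) >= d(c_best, c_j) - d(p, c_best),
            # so if d(c_best, c_j) >= 2 * d(p, c_best), c_j cannot be closer.
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = euclidean(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(0.5, 0.5), (9.0, 1.0), (1.0, 9.0)]
labels, skipped = assign_with_pruning(points, centers)
print(labels, skipped)  # [0, 1, 2] with 3 of 6 candidate distance computations skipped
```

The saving grows when clusters are well separated, because most points are much closer to one center than the centers are to each other.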

3.3 Soft K-means

Soft K-means is defined in contrast to hard K-means, the basic K-means algorithm, which places each data point in exactly one cluster. In soft K-means, each data point is assigned to every cluster with some probability: each data point has a weight (probability) vector that describes the probability that it belongs to each cluster.
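The source does not say how the weight vector is computed. As one possible illustration, the sketch below assumes weights proportional to `exp(-beta * squared distance)`, where `beta` is a hypothetical stiffness parameter, and updates each center as a weighted mean of all points. Other weighting schemes exist, and the function names are illustrative.

```python
import math


def soft_assignments(point, centers, beta=1.0):
    """Return membership weights of `point` over `centers` that sum to 1.

    Weights are proportional to exp(-beta * squared distance); beta is a
    hypothetical stiffness parameter (larger beta approaches hard assignment).
    """
    d2 = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    shift = min(d2)  # subtract the smallest distance for numerical stability
    scores = [math.exp(-beta * (d - shift)) for d in d2]
    total = sum(scores)
    return [s / total for s in scores]


def soft_center_update(points, weights, j):
    """Weighted mean of all points, using each point's weight for cluster j."""
    total = sum(w[j] for w in weights)
    dim = len(points[0])
    return tuple(
        sum(w[j] * p[i] for p, w in zip(points, weights)) / total
        for i in range(dim)
    )


centers = [(1.0, 0.0), (-1.0, 0.0)]
points = [(0.9, 0.0), (-1.1, 0.0), (0.0, 0.0)]
weights = [soft_assignments(p, centers, beta=2.0) for p in points]
new_centers = [soft_center_update(points, weights, j) for j in range(len(centers))]
print(weights)
print(new_centers)
```

A point midway between two centers receives equal weight for both, whereas hard K-means would have to pick one of them.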

4. Summary

The K-means algorithm uses simple iterations to group the data into K clusters, and the core steps of each iteration are: (1) update the cluster membership of each data point, and (2) update the cluster centers. Although it has many drawbacks, its simplicity, flexibility, and good scalability make it one of the most commonly used clustering algorithms.

