[Clustering algorithm] Advantages and disadvantages of K-means and its improvement

Last Update:2017-05-08 Source: Internet

Author: User

Reposted from: http://blog.csdn.net/u010536377/article/details/50884416

A brief review of K-means clustering

For nine out of ten people, the first clustering method they encounter is K-means. The algorithm is easy to understand and easy to implement. Still, like almost every machine learning and data mining algorithm, it has weaknesses as well as strengths. So what are the disadvantages of K-means?
In summary:
(1) It is sensitive to outliers and isolated points;
(2) The value of K must be chosen in advance;
(3) The result depends on the choice of the initial cluster centers;
(4) It can only find spherical clusters.
The reasons behind these four points are not hard to work out, and readers can think them through for themselves. The improvements for each of the four shortcomings are introduced in turn below.
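
To make shortcoming (3) concrete, here is a minimal sketch of plain K-means in pure Python. The function names kmeans and sse, and the four-point data set, are my own illustration and do not come from the original post. The same four points are clustered twice, starting from two different pairs of initial centers.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain K-means: alternate assignment and mean update until stable."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its old center).
        new_centers = [tuple(sum(col) / len(m) for col in zip(*m)) if m else c
                       for m, c in zip(clusters, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

def sse(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum((a - b) ** 2
               for c, members in zip(centers, clusters)
               for p in members
               for a, b in zip(p, c))

# Four corners of a 4 x 1 rectangle (illustrative data).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
bad = kmeans(points, [(0, 0), (0, 1)])    # both initial centers on the left edge
good = kmeans(points, [(0, 0), (4, 0)])   # one initial center on each side
print(sse(*bad), sse(*good))              # 16.0 1.0
```

Both runs converge, but the first settles on a bottom/top split with a much larger error than the left/right split found by the second. Nothing in the algorithm itself detects this, which is why the choice of initial centers matters.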

Improvement 1

First, shortcoming (1): sensitivity to outliers and isolated points. How can it be addressed? In a previous blog post, the author described LOF (Local Outlier Factor), an outlier detection algorithm. Removing the outliers it flags before clustering reduces the influence of outliers and isolated points on the clustering result.
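
As a rough illustration of this "filter first, cluster afterwards" idea, here is a small pure-Python sketch of the standard LOF score. The names lof_scores and remove_outliers, the neighborhood size k, and the cutoff threshold are assumptions made for this example. It also simplifies the textbook definition by always using exactly k neighbors (ties are cut by index) and assumes the data has no large groups of duplicate points.

```python
import math

def lof_scores(points, k=3):
    """Local Outlier Factor of every point (about 1 = inlier, much larger = outlier)."""
    n = len(points)
    # neighbors[i]: the k nearest other points as (distance, index), nearest first.
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        neighbors.append(dists[:k])
    # k-distance: distance to the k-th nearest neighbor.
    k_dist = [nb[-1][0] for nb in neighbors]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

def remove_outliers(points, k=3, threshold=2.0):
    """Keep only the points whose LOF does not exceed the (assumed) threshold."""
    scores = lof_scores(points, k)
    return [p for p, s in zip(points, scores) if s <= threshold]

# A 3 x 3 grid plus one far-away point (illustrative data).
data = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
cleaned = remove_outliers(data, k=3, threshold=2.0)
print(len(data), len(cleaned))
```

The cleaned list can then be passed to K-means. The right threshold depends on the data, so treat 2.0 as a starting point and not as a recommendation.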

Improvement 2

For the problem of choosing K, Li Fang's master's thesis at Anhui University proposes an adaptive optimization method for the K value of the K-means algorithm. The method is summarized below.
The method targets the following two major drawbacks of K-means:
1) K (the number of clusters to generate) must be given in advance, and a good value is hard to choose, because it is not known beforehand how many categories the data should be divided into.
2) The result depends on how the initial cluster centers are selected.
Li Fang's idea is as follows. Start by giving K a suitable initial value and run K-means once to obtain the cluster centers. Because the later steps only ever merge clusters, this starting value should be no smaller than the number of clusters you expect. Then, based on the distances between the K clusters, merge the nearest ones, so that the number of cluster centers decreases and the next round of clustering runs with fewer clusters. An evaluation value E is used to decide where to stop merging the cluster centers. The cycle repeats until the evaluation function converges, which yields the clustering result for a better number of clusters.
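
The loop described above can be sketched roughly as follows. This is not Li Fang's algorithm: the post does not say how the evaluation value E is computed, so the sketch swaps in a much simpler stopping rule, a hypothetical min_gap threshold on the distance between the two closest centers. The names adaptive_k and k_start, the strided choice of starting centers, and merging two centers at their midpoint are likewise assumptions made for illustration.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain K-means from the given starting centers; empty clusters are dropped."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [tuple(sum(col) / len(m) for col in zip(*m))
                       for m in clusters if m]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def adaptive_k(points, k_start, min_gap):
    """Start with k_start clusters and merge the closest pair until all
    centers are at least min_gap apart (assumed stand-in for the value E)."""
    step = max(1, len(points) // k_start)
    centers = list(points[::step][:k_start])   # simple deterministic start
    while True:
        centers = kmeans(points, centers)
        if len(centers) < 2:
            return centers
        # Find the two centers that are closest to each other.
        gap, i, j = min((math.dist(centers[i], centers[j]), i, j)
                        for i in range(len(centers))
                        for j in range(i + 1, len(centers)))
        if gap >= min_gap:
            return centers
        # Merge the closest pair at their midpoint, then re-cluster with K - 1.
        merged = tuple((a + b) / 2 for a, b in zip(centers[i], centers[j]))
        centers = [c for idx, c in enumerate(centers) if idx not in (i, j)]
        centers.append(merged)

# Two tight groups of four points each (illustrative data), starting from K = 4.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(adaptive_k(data, k_start=4, min_gap=5.0))
```

On this toy data the loop merges down from four centers to two, one per group. A real implementation would replace the min_gap test with the thesis's evaluation function.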

Resources

Li Fang. Research on K-value adaptive optimization method for K-means algorithm [D]. Anhui University, 2015.

Improvement 3

Next, optimizing the selection of the initial cluster centers. In one sentence: select K points that are as far apart from one another as possible. The specific steps are as follows.

First, randomly select one point as the center of the first initial cluster. Then select the point farthest from it as the center of the second initial cluster. Then select the point whose distance to the nearer of the first two centers is largest as the center of the third initial cluster, and so on until K initial cluster centers have been selected.
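
Here is a minimal sketch of this selection procedure. It reads the third and later steps as "pick the point whose distance to its nearest already-chosen center is largest" (often called farthest-first selection). The function name farthest_first_centers and the seed parameter are my own additions for illustration.

```python
import math
import random

def farthest_first_centers(points, k, seed=None):
    """Pick k initial centers that are spread as far apart as possible."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]             # first center: a random point
    # min_dist[i]: distance from points[i] to its nearest chosen center.
    min_dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        idx = max(range(len(points)), key=min_dist.__getitem__)
        centers.append(points[idx])
        min_dist = [min(d, math.dist(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return centers

# Three small, well-separated groups (illustrative data).
data = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20), (5, 21)]
print(farthest_first_centers(data, 3, seed=0))
```

The returned points can be passed to K-means as its starting centers. Because the rule always chases the most distant point, it tends to pick outliers when they are present, so it pairs naturally with the outlier removal from Improvement 1.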

There is another solution to this problem, which I have used before. Readers familiar with Weka will know that its clustering tools include an algorithm called Canopy.
Use hierarchical clustering or the Canopy algorithm for an initial clustering, then take the centers of the resulting clusters as the initial cluster centers for K-means. This approach also works well for choosing K.
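
For readers who have not met Canopy, the following is a simplified pure-Python sketch of the general idea, not Weka's implementation. The two distance thresholds t1 (loose) and t2 (tight), the function names canopy and initial_centers, and always taking the first remaining point as the next canopy center are assumptions made for this example.

```python
import math

def canopy(points, t1, t2):
    """Group points into canopies using a loose threshold t1 and a tight one t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        # Every remaining point within t1 of the center joins this canopy.
        members = [center] + [p for p in remaining if math.dist(p, center) < t1]
        canopies.append(members)
        # Points within t2 are close enough that they cannot start a new canopy.
        remaining = [p for p in remaining if math.dist(p, center) >= t2]
    return canopies

def initial_centers(points, t1, t2):
    """Mean of each canopy, usable as the starting centers for K-means."""
    return [tuple(sum(col) / len(m) for col in zip(*m))
            for m in canopy(points, t1, t2)]

# Two tight groups of four points each (illustrative data).
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers = initial_centers(data, t1=4.0, t2=2.0)
print(len(centers), centers)
```

The number of canopies gives a candidate value for K, and their means give the starting centers, which is how this one step addresses shortcomings (2) and (3) together. The result does depend on the two thresholds, so they still need to be tuned to the scale of the data.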

Improvement 4

The root cause of K-means finding only spherical clusters is the way distance is measured. Li Hui Rao's master's thesis, "Improvement and application of K-means clustering method", improves two kinds of distance measures so that non-spherical, ellipse-like clusters can be found. Personally, though, I do not think this improvement really solves this shortcoming of K-means. If the data set contains irregularly shaped clusters, a density-based clustering algorithm such as DBSCAN is often more suitable.
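
To show what a density-based alternative looks like, here is a compact pure-Python sketch of DBSCAN. The parameter names eps and min_pts follow common usage. The brute-force neighbor search and the test data (two long parallel lines plus one stray point) are choices made for this illustration, and the sketch is meant for small data sets only.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def region(i):
        # Indices of all points within eps of points[i], including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE               # may later become a border point
            continue
        cluster += 1                        # points[i] is a core point
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster         # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # core point: keep growing the cluster
    return labels

# Two long parallel lines and one isolated point (illustrative data).
line_a = [(x, 0) for x in range(20)]
line_b = [(x, 5) for x in range(20)]
labels = dbscan(line_a + line_b + [(50, 50)], eps=1.5, min_pts=3)
print(labels)
```

Each line comes out as one cluster even though it is far from spherical, and the isolated point is labeled as noise instead of being forced into a cluster. Note also that DBSCAN needs no K at all; the price is that eps and min_pts have to be chosen instead.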

Resources

Li Hui Rao. Improvement and application of K-means clustering method [D]. Northeast Agricultural University, 2014.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
