[Clustering algorithm] Advantages and disadvantages of K-means and its improvement

Reposted from: http://blog.csdn.net/u010536377/article/details/50884416

A brief review of K-means clustering

Nine times out of ten, the first clustering method anyone encounters is K-means. The algorithm is easy to understand and easy to implement. Like almost every machine learning and data mining algorithm, though, it has weaknesses as well as strengths. So what are the disadvantages of K-means?
They can be summarized as follows:
(1) it is sensitive to outliers and isolated points;
(2) the value of K has to be chosen in advance;
(3) the result depends on the choice of the initial cluster centers;
(4) it can only find spherical clusters.
The reasons behind these four points are not hard to understand, and readers can work them out for themselves. The improvements for each of the four shortcomings are introduced in turn below.

Improvement 1

First, (1): K-means is sensitive to outliers and isolated points. How can this be solved? In a previous blog post, the author discussed the LOF (Local Outlier Factor) algorithm for outlier detection. Removing the outliers it flags and only then clustering reduces the influence of outliers and isolated points on the clustering result.
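A minimal sketch of this "remove outliers, then cluster" idea, assuming a brute-force LOF written in plain Python. The function name lof_scores, the neighbourhood size k=3 and the cut-off of 1.5 are illustrative assumptions, not values from the original post; ties at the k-th distance and duplicate points are ignored for brevity.

```python
import math

def lof_scores(points, k=3):
    """Local Outlier Factor per point (brute force, O(n^2)).

    Scores near 1 mean "as dense as the neighbours"; scores well
    above 1 mean the point sits in a much sparser region.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neighbors, k_dist = [], []
    for i in range(n):
        # Indices of the other points, nearest first.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])  # distance to the k-th neighbour
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbours relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Two tight groups plus one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
scores = lof_scores(points, k=3)

# Hypothetical cut-off: drop anything clearly sparser than its neighbourhood.
cleaned = [p for p, s in zip(points, scores) if s <= 1.5]
print(len(points), "->", len(cleaned))
```

The list `cleaned` is what would then be passed to K-means. In practice the cut-off and k need tuning for the data at hand.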

Improvement 2

For the K-value selection problem, Li Fang's master's thesis at Anhui University proposes an adaptive optimization method for the K value of the K-means algorithm. The method is summarized below.
First, the algorithm targets the following major drawbacks of K-means:
1) K (the number of clusters to generate) must be given in advance, and it is hard to choose. It is not known beforehand how many clusters a given data set should ideally be divided into.
2) The selection of the initial cluster centers is a problem for K-means.
Li Fang's idea is as follows. Start by giving K a suitable initial value and run K-means once to obtain the cluster centers. Because the method only ever merges clusters, this starting value should be no smaller than the number of clusters expected at the end. Then, based on the distances between the K clusters, merge the nearest ones. This reduces the number of cluster centers, so the next round of clustering uses fewer clusters, and eventually an appropriate number of clusters is reached. A judging value E is used to decide when to stop merging the cluster centers. The cycle repeats until the evaluation function converges, which gives a clustering result with a better number of clusters.
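The merge loop described above can be sketched roughly as follows. This is not the thesis's algorithm: its judging value E is not defined in this post, so the sketch substitutes a hypothetical stopping rule (stop once the two closest centers are at least min_gap apart). The function names, the choice of the first k_start points as initial centers, and the sample data are likewise assumptions for illustration.

```python
import math

def mean(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, centers, iters=50):
    """Plain Lloyd iterations starting from the given centers."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its old center.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def merge_until_separated(points, k_start, min_gap):
    """Start with k_start clusters and merge the closest pair of centers
    until every pair is at least min_gap apart (stand-in for the E test)."""
    centers = kmeans(points, list(points[:k_start]))
    while len(centers) > 1:
        gap, i, j = min(
            (math.dist(centers[a], centers[b]), a, b)
            for a in range(len(centers))
            for b in range(a + 1, len(centers))
        )
        if gap >= min_gap:
            break
        merged = mean([centers[i], centers[j]])
        centers = [c for idx, c in enumerate(centers) if idx not in (i, j)]
        centers.append(merged)
        # Re-cluster with one fewer center.
        centers = kmeans(points, centers)
    return centers

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers = merge_until_separated(points, k_start=4, min_gap=3.0)
print(len(centers), sorted(centers))
```

On this toy data the loop starts with four clusters and ends with two, one per group. A real evaluation function would replace the fixed distance threshold.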

Resources

Li Fang. Research on K-value adaptive optimization method for K-means algorithm [D]. Anhui University, 2015.

Improvement 3

Next, optimizing the selection of the initial cluster centers. In one sentence: select K points that are as far away from one another as possible. The specific selection steps are as follows.

First, randomly select a point as the center of the first initial cluster. Then select the point farthest from it as the center of the second initial cluster. Next, select the point whose distance to the nearest of the first two centers is largest as the center of the third initial cluster, and so on until K initial cluster centers have been selected.

There is another solution to this problem, which I have used before. Readers familiar with Weka will know that its clustering package includes an algorithm called Canopy.
Hierarchical clustering or the Canopy algorithm is used for an initial clustering pass, and the centers of the resulting clusters are then taken as the initial cluster centers for K-means. This method is also very effective for choosing the value of K.
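A rough sketch of the Canopy idea in plain Python, not Weka's implementation. It uses the usual two thresholds, a loose one (t1) and a tight one (t2), with t1 > t2; the function names, the threshold values and the sample data are assumptions for illustration.

```python
import math

def mean(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def canopy_centers(points, t1, t2):
    """Cheap one-pass pre-clustering; returns one center per canopy."""
    if t1 <= t2:
        raise ValueError("t1 (loose threshold) must be larger than t2 (tight)")
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        # Everything within the loose threshold belongs to this canopy.
        members = [seed] + [p for p in remaining if math.dist(p, seed) < t1]
        centers.append(mean(members))
        # Points within the tight threshold can no longer seed a new canopy.
        remaining = [p for p in remaining if math.dist(p, seed) >= t2]
    return centers

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
initial_centers = canopy_centers(points, t1=4.0, t2=2.0)
k = len(initial_centers)  # the number of canopies suggests a value for K
print(k, initial_centers)
```

Both the number of canopies and their centers then feed into K-means as K and the initial centers. The result depends on t1 and t2, which still have to be chosen for the data.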

Improvement 4

The root cause of only being able to find spherical clusters is the way distance is measured. Li Hui Rao's master's thesis, "Improvement and application of K-means clustering method", proposes an improvement based on two kinds of measures, after which non-negative, ellipse-like data can be found. Personally, though, I do not think this improvement really solves this shortcoming of K-means. If the data set contains irregularly shaped clusters, a density-based clustering algorithm such as DBSCAN is often more suitable.
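To illustrate why a density-based method suits irregular shapes, here is a compact DBSCAN sketch in plain Python, run on a made-up data set: a small blob surrounded by a ring. K-means with K=2 cannot separate these two shapes, because both have the same centroid. The eps and min_pts values are assumptions picked for this toy data.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    labels = [None] * n

    def region(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbours = region(j)
            if len(neighbours) >= min_pts:
                queue.extend(neighbours)  # j is a core point: keep expanding
    return labels

blob = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5)]
ring = [(5 * math.cos(2 * math.pi * i / 16), 5 * math.sin(2 * math.pi * i / 16))
        for i in range(16)]
labels = dbscan(blob + ring, eps=2.2, min_pts=3)
print(labels)
```

The blob and the ring come out as two separate clusters, because DBSCAN follows chains of nearby points instead of measuring distance to a single center.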

Resources

Li Hui Rao. Improvement and application of K-means clustering method [D]. Northeast Agricultural University, 2014.
