Improving the clustering accuracy of Kmeans by simulated annealing

Last Update:2015-06-30 Source: Internet

Author: User


Kmeans is an unsupervised clustering algorithm. Because its principle is simple, it is widely used in industry, and when a clustering problem comes up in practice, Kmeans is usually the first thing people try. In my own work I have applied Kmeans many times, for example to user behavior clustering (a MapReduce version) and image clustering (an MPI version). In practice, however, the clustering result turns out to depend heavily on the choice of initial points: if the initial points are poorly chosen, the clustering result will be poor. To address this problem, this article combines the simulated annealing heuristic with Kmeans clustering. The experiments show that the method works well, and it has been adopted in practice.

1 K-means algorithm

Input: the number of clusters K and a data set containing N data objects.

Output: K clusters that satisfy the minimum-variance criterion.

Process flow:

(1) Select K objects from the N data objects as the initial cluster centers.

(2) Repeat steps (3) and (4) until no cluster changes any more.

(3) Using the mean of the objects in each cluster as that cluster's center object, compute the distance from every object to each center and reassign every object to the nearest center.

(4) Recompute the mean (center object) of every cluster that changed.
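The process flow above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the MapReduce or MPI versions mentioned earlier; the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy data are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means following steps (1) to (4) of the process flow."""
    rng = random.Random(seed)
    # (1) choose K distinct objects as the initial cluster centers
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # (3) assign every object to its nearest center
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # (2) stop as soon as no object changes cluster
        if new_labels == labels:
            break
        labels = new_labels
        # (4) recompute the mean of every non-empty cluster
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels

# Toy data (hypothetical): two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(data, k=2)
print(labels)
```

On data this cleanly separated, the first three points end up in one cluster and the last three in the other, whichever objects are drawn as initial centers.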

[Figures 1.1 to 1.5: Step 1 through Step 5 of a K-means run]

2 The relationship between the initial points and the clustering result

The result of K-means depends heavily on the choice of initial points, and the algorithm is often trapped in a local optimum.
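A tiny constructed case makes the local-optimum problem concrete. The sketch below is an illustration with hypothetical data, not the author's experiment: four points sit at the corners of a 4 x 1 rectangle, and K-means with K = 2 is started from two different sets of initial centers. The helper name `kmeans_from` is an assumption made for the example.

```python
import math

def kmeans_from(points, centers, max_iter=100):
    """Run K-means from the given initial centers; return (centers, labels, sse)."""
    centers = [list(c) for c in centers]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    # Total squared distance from each point to the center of its cluster.
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return centers, labels, sse

points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start 1: one center near each short side, so the clusters are left and right.
_, _, sse_good = kmeans_from(points, [(0.0, 0.5), (4.0, 0.5)])
# Start 2: centers on the long sides, so the clusters are bottom and top.
_, _, sse_bad = kmeans_from(points, [(2.0, 0.0), (2.0, 1.0)])

print(sse_good, sse_bad)  # 1.0 16.0
```

Both runs converge, because in each case no point changes cluster after the first assignment, but the second start is stuck at a total squared error sixteen times larger than the first.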

2.1 Examples

The following practical example shows how the choice of initial points affects the clustering result. First, three center points (red, green, and blue) are initialized randomly. None of the data points have been clustered yet, so all of them are marked red by default, as shown below:

The final result of the iteration is as follows:

If the initial points are instead chosen as follows:

the algorithm eventually converges to this result:

3 Workaround

How can this be solved? In practice, we randomly generate several batches of initial center points, run the clustering once for each batch, and then pick the best of the results. This approach is not very automated, and there is still a considerable chance that none of the runs finds a good result. Much current research instead combines heuristic algorithms such as simulated annealing and genetic algorithms with Kmeans clustering, which can greatly reduce the risk of getting stuck in a local optimum.

[Figure: flowchart of the simulated annealing algorithm]
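The article does not spell out how simulated annealing and Kmeans are combined, so the following is only one plausible sketch, not the author's implementation. It assumes that a neighbor state is produced by randomly shifting one center and letting K-means settle again, that worse states are accepted by the Metropolis rule, and that the temperature is cooled geometrically. All names and parameter values (`sa_kmeans`, `t_start`, `t_end`, `alpha`, `moves`, the step size) are hypothetical.

```python
import math
import random

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)

def lloyd(points, centers, max_iter=100):
    """Standard K-means refinement starting from the given centers."""
    centers = [list(c) for c in centers]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers

def sa_kmeans(points, k, t_start=1.0, t_end=0.01, alpha=0.9, moves=5, seed=0):
    """Annealed K-means (assumed scheme): returns (best centers, best SSE, plain SSE)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Range of the data per coordinate, used to size the random shifts.
    spread = [max(p[d] for p in points) - min(p[d] for p in points)
              for d in range(dim)]
    current = lloyd(points, rng.sample(points, k))  # ordinary K-means result
    current_cost = plain_cost = sse(points, current)
    best, best_cost = current, current_cost
    ref = max(plain_cost, 1e-12)  # makes the cost difference dimensionless
    t = t_start
    while t > t_end:
        for _ in range(moves):
            # Neighbor: shift one center at random, then run K-means again.
            cand = [list(c) for c in current]
            j = rng.randrange(k)
            cand[j] = [x + rng.gauss(0.0, 0.5 * s * t)
                       for x, s in zip(cand[j], spread)]
            cand = lloyd(points, cand)
            cost = sse(points, cand)
            delta = (cost - current_cost) / ref
            # Metropolis rule: always accept improvements, sometimes accept worse.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = cand, cost
                if cost < best_cost:
                    best, best_cost = cand, cost
        t *= alpha  # geometric cooling
    return best, best_cost, plain_cost

# Hypothetical test data: three Gaussian blobs in the plane.
data_rng = random.Random(42)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in blobs for _ in range(40)]
best, best_cost, plain_cost = sa_kmeans(points, k=3)
print(f"plain K-means SSE: {plain_cost:.1f}")
print(f"annealed SSE:      {best_cost:.1f}")
```

Because the search starts from an ordinary K-means result and keeps the best state it has seen, the returned error can never be worse than that of the plain run it started from. The price is many more K-means runs, one per proposed move.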

4 Hands-on practice

As the old line goes, "What is learned on paper always feels shallow; to truly understand something you must practice it yourself." Knowing only the principle without putting it into practice never leads to a deep grasp of a subject. I implemented a Kmeans algorithm based on simulated annealing and compared it with ordinary Kmeans.

4.1 Experimental Steps

1) First, we randomly generate two-dimensional data points for clustering.

2) Results obtained with plain Kmeans.

3) Results obtained with Kmeans based on simulated annealing.
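Steps 1) and 2) can be approximated with a short script. The data generator below is a stand-in, since the article does not say how its points were produced: it assumes three Gaussian blobs, and the names `make_points` and `kmeans_sse` are hypothetical. Running plain Kmeans from several random starts also shows how much the error can vary with the initial centers.

```python
import math
import random

def make_points(n_per_blob=50, seed=1):
    """Hypothetical test data: three Gaussian blobs in the plane."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in blob_centers for _ in range(n_per_blob)]

def kmeans_sse(points, k, seed, max_iter=100):
    """Run plain K-means from a random start; return its total squared error."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

points = make_points()
# One run per seed; each seed draws a different set of initial centers.
errors = [kmeans_sse(points, k=3, seed=s) for s in range(10)]
print("best of 10 starts: ", round(min(errors), 1))
print("worst of 10 starts:", round(max(errors), 1))
```

If the best and worst values differ, the gap is the effect of the initial centers alone, since the data and the algorithm are identical across runs.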

4.2 Conclusion

In the experiment, Kmeans based on simulated annealing reaches an overall error criterion of 19309.9.

Plain Kmeans reaches an overall error criterion of 23678.8.
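The article does not define its error criterion. Assuming it is the usual within-cluster sum of squared errors, which matches the minimum-variance criterion that K-means minimizes, it can be written as follows, where C_k is the k-th cluster and \mu_k is its mean:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Under this reading, a smaller J means the points lie closer to their cluster centers.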

Kmeans based on simulated annealing therefore gives the better result. The price is that the algorithm is more complex and takes longer to converge.

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
