Improving the clustering accuracy of Kmeans by simulated annealing

Source: Internet
Author: User


K-means is an unsupervised clustering algorithm. Because its principle is simple, it is widely used in industry, and it is usually the first method people try when they meet a clustering problem in practice. In my own work I have applied K-means many times, for example to user behavior clustering (a MapReduce version) and image clustering (an MPI version). In practice, however, the clustering result turns out to depend closely on the choice of initial points: if the initial points are poorly chosen, the clustering result is poor. To address this problem, this article combines the simulated annealing heuristic with K-means clustering. The experiment shows that the method works well, and it has been put to use in practice.

1 K-means algorithm

Input: the number of clusters K, and a data set containing N data objects.

Output: K clusters that satisfy the minimum variance criterion.

Process flow:

(1) Select K objects from the N data objects as the initial cluster centers.

(2) Repeat steps (3) and (4) until no cluster changes.

(3) Using the mean (center object) of each cluster, calculate the distance from every object to each center, and reassign each object to the cluster whose center is nearest.

(4) Recalculate the mean (center object) of each cluster that changed.
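The process flow above can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function names (`kmeans`, `squared_distance`, `mean_point`), the `seed` and `max_iter` parameters, and the handling of empty clusters are assumptions made for this example, not details of the original MapReduce or MPI implementations.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # (1) Pick K distinct objects as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # (3) Assign every object to the cluster with the nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # (2) Stop once no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # (4) Recompute the mean of each cluster; an empty cluster
        # keeps its previous center (an assumption of this sketch).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = mean_point(members)
    return centers, labels


# Usage: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(data, k=2)
print(centers, labels)
```

On data this well separated, the two groups should end up in different clusters whichever objects are drawn as initial centers. The cases discussed later in the article are the ones where that is not true.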

[Figures: Step 1 to Step 5]

2 The relationship between the initial point and the cluster result

The result of K-means depends closely on the selection of the initial points, and the algorithm often gets trapped in a local optimum.
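A small, deterministic case makes this dependence concrete. The sketch below is an illustration constructed for this purpose, not data from the article: four points sit at the corners of a 4 x 1 rectangle, and the same K-means iteration is started from two different pairs of initial centers. The helper name `lloyd` and the use of the sum of squared errors as the quality measure are assumptions of the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, centers, max_iter=100):
    """Run K-means from the given initial centers.

    Returns (centers, labels, sse), where sse is the sum of squared
    distances from each point to the center of its cluster.
    """
    centers = list(centers)
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


# Corners of a rectangle that is 4 units wide and 1 unit tall.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# One initial center on each short side: left and right pairs are found.
_, _, sse_good = lloyd(points, [(0.0, 0.0), (4.0, 0.0)])

# Both initial centers on the same short side: the run converges to a
# top/bottom split, which is a local optimum.
_, _, sse_bad = lloyd(points, [(0.0, 0.0), (0.0, 1.0)])

print(sse_good, sse_bad)  # prints 1.0 16.0
```

Both runs satisfy the stopping rule, yet the second one ends with an error sixteen times larger, purely because of where it started.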

2.1 Examples

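The following practical example shows how the selection of initial points affects the clustering result. First, 3 center points (red, green, and blue) are randomly initialized. No data point has been clustered yet, so all of them are marked red by default.

[Figure: randomly initialized center points and unclustered data]

The final result of the iteration is as follows:

[Figure: clustering result after convergence]

If the initial points are instead chosen as follows:

[Figure: a different set of initial center points]

the algorithm eventually converges to this result:

[Figure: clustering result for the second set of initial points]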

3 Workaround

How can this be solved? In practice, we randomly initialize several batches of initial center points, run the clustering once for each batch, and then select the best result among the runs. This method is not automatic enough, and there is still a high probability of not obtaining a good result. Current research more often combines heuristic algorithms such as simulated annealing and genetic algorithms with K-means clustering, which can greatly reduce the risk of getting trapped in a local optimum.

[Figure: flowchart of the simulated annealing algorithm]
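The article does not spell out how simulated annealing and K-means are combined, so the following is only one possible sketch, not the author's implementation. It assumes that the state is the list of cluster centers, that a neighboring state is produced by nudging one center with Gaussian noise, that the energy is the sum of squared distances from each point to its nearest center, and that the temperature follows a geometric cooling schedule. All names and default parameter values (`anneal_centers`, `t_start`, `cooling`, `step`, and so on) are hypothetical.

```python
import math
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def total_error(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def anneal_centers(points, centers, t_start=10.0, t_end=1e-3, cooling=0.95,
                   moves_per_temp=50, step=1.0, seed=0):
    """Simulated annealing over center positions; returns (best, best_err)."""
    rng = random.Random(seed)
    current = [tuple(c) for c in centers]
    current_err = total_error(points, current)
    best, best_err = list(current), current_err
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            # Neighboring state: nudge one randomly chosen center.
            j = rng.randrange(len(current))
            candidate = list(current)
            candidate[j] = tuple(x + rng.gauss(0.0, step) for x in current[j])
            cand_err = total_error(points, candidate)
            delta = cand_err - current_err
            # Metropolis rule: always accept an improvement, and accept a
            # worse state with probability exp(-delta / t) so the search
            # can climb out of a local optimum while t is still high.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_err = candidate, cand_err
                if current_err < best_err:
                    best, best_err = list(current), current_err
        t *= cooling  # geometric cooling schedule
    return best, best_err


# Usage: start from a poor configuration on a 4 x 1 rectangle of points,
# where plain K-means would stay stuck on a top/bottom split.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
stuck = [(2.0, 0.0), (2.0, 1.0)]
best, best_err = anneal_centers(points, stuck)
print(total_error(points, stuck), best_err)
```

Because the best state seen so far is kept, the returned error is never worse than that of the starting centers. The returned centers can then be used as the initial centers for an ordinary K-means run, which refines them to the nearest local optimum.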

4 Practice

As the saying goes, "What is learned on paper always feels shallow; to truly understand something, you must practice it yourself." Knowing only the principle without practicing it never leads to a deep grasp of a subject. I implemented a K-means algorithm based on simulated annealing and compared it with the ordinary K-means algorithm.

4.1 Experimental Steps

1) First, we randomly generate two-dimensional data points for clustering.

2) Results obtained with the native K-means.

3) Results obtained with K-means based on simulated annealing.

4.2 Conclusion

In the experiment, the overall error criterion obtained by K-means based on simulated annealing is 19309.9.

The overall error criterion obtained by the ordinary K-means is 23678.8.

The result of K-means based on simulated annealing is therefore better, with an error roughly 18% lower. Of course, the complexity of this algorithm is higher, and it takes longer to converge.
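The article does not define the error criterion it reports. A common choice for K-means, assumed here rather than confirmed by the source, is the total within-cluster sum of squared errors, where $C_j$ is the $j$-th cluster and $\mu_j$ is its mean:

```latex
J = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

Under this reading, a lower value means tighter clusters, which is consistent with 19309.9 being the better of the two results.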
