K-means clustering: choosing the value of K and selecting the initial cluster centers


This article is based on the book "Big Data: Internet-Scale Data Mining and Distributed Processing" co-authored by Anand Rajaraman and Jeffrey David Ullman (the Chinese translation, by Wang Bin, of Mining of Massive Datasets).

K-means is the most commonly used clustering algorithm. Its main idea: given a value of K and K initial cluster centers, assign each point (i.e., each data record) to the cluster whose center is nearest to it; once all points have been assigned, recompute each cluster's center from the points now in that cluster (e.g., by averaging them); then repeat the assignment and center-update steps until the centers move very little or a specified number of iterations is reached.
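The iteration described above can be sketched in a few lines of pure Python (a minimal sketch; the function and variable names are mine, not from the book):

```python
def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(s / n for s in map(sum, zip(*points)))

def kmeans(points, centers, max_iter=100, tol=1e-12):
    """Lloyd's iteration: assign each point to its nearest center,
    then recompute each center as the mean of its cluster."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Keep an old center if its cluster ended up empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if max(dist2(a, b) for a, b in zip(centers, new_centers)) < tol:
            return new_centers, clusters
        centers = new_centers
    return centers, clusters
```

For example, with two obvious groups and one starting center in each, the loop converges in two passes to the group means.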

The K-means algorithm itself is fairly simple, but choosing the value of K and the K initial cluster centers well or badly has a large effect on the quality of the clustering.

1. Determining the K initial cluster centers

The simplest way to determine the initial cluster centers is to pick K points at random, but this method can work poorly, as the following example shows (the data were generated from five bivariate Gaussian distributions; color indicates the resulting clustering):

The book mentions two ways to select the K initial cluster points: 1) choose K points that are as far apart from each other as possible; 2) first cluster the data with a hierarchical clustering algorithm or the canopy algorithm, and once K clusters are obtained, pick one point from each cluster, either the cluster's center or the point closest to that center.

1) Choose K points as far apart from each other as possible

First pick a point at random as the first initial cluster center, then pick the point farthest from it as the second center, then pick as the third center the point whose nearest distance to the first two centers is largest, and so on, until K initial cluster centers have been chosen.
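This farthest-first selection can be sketched as follows (each new center is the point that maximizes the distance to its nearest already-chosen center; names are mine):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centers(points, k):
    """Pick the first center at random; each subsequent center is the point
    whose distance to its nearest already-chosen center is largest."""
    centers = [random.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers
```

Note that the widely used k-means++ initialization is a randomized refinement of the same idea, choosing each new center with probability proportional to its squared distance from the nearest existing center.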

In my tests this method works very well: running K-means after determining the initial cluster points this way separates the five clusters perfectly:

2) Use hierarchical clustering or the canopy algorithm to do an initial clustering, then take the centers of the resulting clusters as the initial cluster centers for K-means.

Commonly used hierarchical clustering algorithms include BIRCH and ROCK, which are not covered here. Below is a brief introduction to the canopy algorithm, drawn mainly from the Mahout wiki:

First define two distance thresholds T1 and T2, with T1 > T2. Randomly remove a point P from the initial set of points S. Then, for each point I in S, compute the distance between I and P: if it is less than T1, add I to the canopy represented by P; if it is also less than T2, remove I from S as well (it now belongs firmly to this canopy). After the pass is finished, randomly pick another point from S as a new P, and repeat the steps above until S is empty.
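The steps above can be sketched directly (a minimal sketch of the canopy pass as just described; function names are mine):

```python
import math
import random

def canopy(points, t1, t2):
    """One canopy per randomly chosen center P: points within T1 of P join
    P's canopy; points within T2 are also removed from further consideration.
    Requires T1 > T2."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        members = [center]
        survivors = []
        for p in remaining:
            d = math.dist(p, center)
            if d < t1:
                members.append(p)      # inside T1: belongs to this canopy
            if d >= t2:
                survivors.append(p)    # outside T2: may still seed or join other canopies
        remaining = survivors
        canopies.append((center, members))
    return canopies
```

A point whose distance falls between T2 and T1 both joins the current canopy and stays available, which is exactly why canopies can overlap.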

Running the canopy algorithm yields a number of canopies, each of which can be regarded as a cluster. Unlike K-means and other hard-partitioning algorithms, in canopy clustering a point may belong to more than one canopy. For each canopy we can then pick either the data point closest to the canopy's center, or the center itself, as one of the K initial cluster centers for K-means.

2. Determining the value of K

The book suggests: given a suitable cluster indicator, such as the average radius or diameter, the indicator will grow only slowly as long as the assumed number of clusters is equal to or greater than the true number of clusters, but it rises sharply once we try fewer clusters than there really are.

The diameter of a cluster is the maximum distance between any two points in the cluster.

The radius of a cluster is the maximum distance from any point in the cluster to the cluster's centroid.
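These indicators are straightforward to compute. A sketch, including the average centroid distance and the size-weighted averaging across clusters used below (names are mine):

```python
import math
from itertools import combinations

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(s / n for s in map(sum, zip(*cluster)))

def radius(cluster):
    """Maximum distance from any point in the cluster to its centroid."""
    c = centroid(cluster)
    return max(math.dist(p, c) for p in cluster)

def diameter(cluster):
    """Maximum distance between any two points in the cluster."""
    return max(math.dist(a, b) for a, b in combinations(cluster, 2))

def avg_centroid_distance(cluster):
    """Mean distance from the cluster's points to its centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)

def weighted_average(per_cluster_values, clusters):
    """Average of a per-cluster indicator, weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(v * len(c) for v, c in zip(per_cluster_values, clusters)) / total
```

Weighting by cluster size prevents a tiny cluster from dominating the indicator.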

Without further ado, here are the clustering results and the cluster indicator as K ranges from 2 to 9:

On the left are the clustering results for K from 2 to 7; on the right is the indicator curve for K from 2 to 9. The indicator I chose is the weighted average, over the K clusters, of each cluster's average centroid distance. It is clear that the indicator drops fastest up to K = 5 and flattens out afterwards, so the correct value of K is 5. The specific data follow:

    K    Weighted average radius    Weighted average of average centroid distance
    2    8.51916676443              4.82716260322
    3    7.58444829472              3.37661824845
    4    5.65489660064              2.22135360453
    5    3.67478798553              1.25657641195
    6    3.44686996398              1.20944264145
    7    3.3036641135               1.16653919186
    8    3.30268530308              1.11361639906
    9    3.17924400582              1.07431888569
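The elbow in the average-centroid-distance column can also be found mechanically, e.g. by picking the K whose drop in the indicator dwarfs the following drop (a sketch using the numbers from the table; the ratio criterion is one reasonable choice among several):

```python
# Weighted average of the average centroid distance, for K = 2..9 (from the table above).
indicator = {2: 4.82716260322, 3: 3.37661824845, 4: 2.22135360453,
             5: 1.25657641195, 6: 1.20944264145, 7: 1.16653919186,
             8: 1.11361639906, 9: 1.07431888569}

# Drop in the indicator when going from K-1 clusters to K clusters.
drops = {k: indicator[k - 1] - indicator[k] for k in range(3, 10)}

# Elbow: the K whose drop is largest relative to the next drop.
best_k = max(drops, key=lambda k: drops[k] / drops[k + 1] if k + 1 in drops else 0.0)
print(best_k)  # -> 5
```

Here the drop from K = 4 to K = 5 (about 0.96) is roughly twenty times the drop from K = 5 to K = 6 (about 0.047), which is why K = 5 stands out.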

Translated from: http://www.cnblogs.com/kemaswill/archive/2013/01/26/2877434.html
