K-means clustering: choosing the value of K and selecting the initial cluster centers


This article is based on the book "Big Data: Internet-Scale Data Mining and Distributed Processing" co-authored by Anand Rajaraman and Jeffrey David Ullman (the Chinese translation, by Wang Bin, of Mining of Massive Datasets).

K-means is the most commonly used clustering algorithm. Its main idea: given a value of K and K initial cluster centers, assign each point (i.e., each data record) to the cluster whose center is nearest to it; once all points have been assigned, recompute each cluster's center from the points now in that cluster (e.g., by averaging them); then repeat the assignment and center-update steps until the centers move very little or a specified number of iterations is reached.
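The iteration described above can be sketched in a few lines of pure Python (a minimal sketch; the function and variable names are mine, not from the book):

```python
def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(s / n for s in map(sum, zip(*points)))

def kmeans(points, centers, max_iter=100, tol=1e-12):
    """Lloyd's iteration: assign each point to its nearest center,
    then recompute each center as the mean of its cluster."""
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Keep an old center if its cluster ended up empty.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if max(dist2(a, b) for a, b in zip(centers, new_centers)) < tol:
            return new_centers, clusters
        centers = new_centers
    return centers, clusters
```

For example, with two obvious groups and one starting center in each, the loop converges in two passes to the group means.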

The K-means algorithm itself is fairly simple, but choosing the value of K and the K initial cluster centers well or badly has a large effect on the quality of the clustering.

1. Determining the K initial cluster centers

The simplest way to determine the initial cluster centers is to pick K points at random, but this method can work poorly, as the following example shows (the data were generated from five bivariate Gaussian distributions; color indicates the resulting clustering):

The book mentions two ways to select the K initial cluster points: 1) choose K points that are as far apart from each other as possible; 2) first cluster the data with a hierarchical clustering algorithm or the canopy algorithm, and once K clusters are obtained, pick one point from each cluster, either the cluster's center or the point closest to that center.

1) Choose K points as far apart from each other as possible

First pick a point at random as the first initial cluster center, then pick the point farthest from it as the second center, then pick as the third center the point whose nearest distance to the first two centers is largest, and so on, until K initial cluster centers have been chosen.
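This farthest-first selection can be sketched as follows (each new center is the point that maximizes the distance to its nearest already-chosen center; names are mine):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centers(points, k):
    """Pick the first center at random; each subsequent center is the point
    whose distance to its nearest already-chosen center is largest."""
    centers = [random.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers
```

Note that the widely used k-means++ initialization is a randomized refinement of the same idea, choosing each new center with probability proportional to its squared distance from the nearest existing center.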

In my tests this method works very well: running K-means after determining the initial cluster points this way separates the five clusters perfectly:

2) Use hierarchical clustering or the canopy algorithm to do an initial clustering, then take the centers of the resulting clusters as the initial cluster centers for K-means.

Commonly used hierarchical clustering algorithms include BIRCH and ROCK, which are not covered here. Below is a brief introduction to the canopy algorithm, drawn mainly from the Mahout wiki:

First define two distance thresholds T1 and T2, with T1 > T2. Randomly remove a point P from the initial set of points S. Then, for each point I in S, compute the distance between I and P: if it is less than T1, add I to the canopy represented by P; if it is also less than T2, remove I from S as well (it now belongs firmly to this canopy). After the pass is finished, randomly pick another point from S as a new P, and repeat the steps above until S is empty.
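The steps above can be sketched directly (a minimal sketch of the canopy pass as just described; function names are mine):

```python
import math
import random

def canopy(points, t1, t2):
    """One canopy per randomly chosen center P: points within T1 of P join
    P's canopy; points within T2 are also removed from further consideration.
    Requires T1 > T2."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        members = [center]
        survivors = []
        for p in remaining:
            d = math.dist(p, center)
            if d < t1:
                members.append(p)      # inside T1: belongs to this canopy
            if d >= t2:
                survivors.append(p)    # outside T2: may still seed or join other canopies
        remaining = survivors
        canopies.append((center, members))
    return canopies
```

A point whose distance falls between T2 and T1 both joins the current canopy and stays available, which is exactly why canopies can overlap.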

Running the canopy algorithm yields a number of canopies, each of which can be regarded as a cluster. Unlike K-means and other hard-partitioning algorithms, in canopy clustering a point may belong to more than one canopy. For each canopy we can then pick either the data point closest to the canopy's center, or the center itself, as one of the K initial cluster centers for K-means.

2. Determining the value of K

The book suggests: given a suitable cluster indicator, such as the average radius or diameter, the indicator will grow only slowly as long as the assumed number of clusters is equal to or greater than the true number of clusters, but it rises sharply once we try fewer clusters than there really are.

The diameter of a cluster is the maximum distance between any two points in the cluster.

The radius of a cluster is the maximum distance from any point in the cluster to the cluster's centroid.
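These indicators are straightforward to compute. A sketch, including the average centroid distance and the size-weighted averaging across clusters used below (names are mine):

```python
import math
from itertools import combinations

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(s / n for s in map(sum, zip(*cluster)))

def radius(cluster):
    """Maximum distance from any point in the cluster to its centroid."""
    c = centroid(cluster)
    return max(math.dist(p, c) for p in cluster)

def diameter(cluster):
    """Maximum distance between any two points in the cluster."""
    return max(math.dist(a, b) for a, b in combinations(cluster, 2))

def avg_centroid_distance(cluster):
    """Mean distance from the cluster's points to its centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)

def weighted_average(per_cluster_values, clusters):
    """Average of a per-cluster indicator, weighted by cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(v * len(c) for v, c in zip(per_cluster_values, clusters)) / total
```

Weighting by cluster size prevents a tiny cluster from dominating the indicator.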

Without further ado, here are the clustering results and the cluster indicator as K ranges from 2 to 9:

On the left are the clustering results for K from 2 to 7; on the right is the indicator curve for K from 2 to 9. The indicator I chose is the weighted average, over the K clusters, of each cluster's average centroid distance. It is clear that the indicator drops fastest up to K = 5 and flattens out afterwards, so the correct value of K is 5. The specific data follow:

    K    Weighted average radius    Weighted average of average centroid distance
    2    8.51916676443              4.82716260322
    3    7.58444829472              3.37661824845
    4    5.65489660064              2.22135360453
    5    3.67478798553              1.25657641195
    6    3.44686996398              1.20944264145
    7    3.3036641135               1.16653919186
    8    3.30268530308              1.11361639906
    9    3.17924400582              1.07431888569
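The elbow in the average-centroid-distance column can also be found mechanically, e.g. by picking the K whose drop in the indicator dwarfs the following drop (a sketch using the numbers from the table; the ratio criterion is one reasonable choice among several):

```python
# Weighted average of the average centroid distance, for K = 2..9 (from the table above).
indicator = {2: 4.82716260322, 3: 3.37661824845, 4: 2.22135360453,
             5: 1.25657641195, 6: 1.20944264145, 7: 1.16653919186,
             8: 1.11361639906, 9: 1.07431888569}

# Drop in the indicator when going from K-1 clusters to K clusters.
drops = {k: indicator[k - 1] - indicator[k] for k in range(3, 10)}

# Elbow: the K whose drop is largest relative to the next drop.
best_k = max(drops, key=lambda k: drops[k] / drops[k + 1] if k + 1 in drops else 0.0)
print(best_k)  # -> 5
```

Here the drop from K = 4 to K = 5 (about 0.96) is roughly twenty times the drop from K = 5 to K = 6 (about 0.047), which is why K = 5 stands out.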

Translated from: http://www.cnblogs.com/kemaswill/archive/2013/01/26/2877434.html
