Clustering is a form of unsupervised learning: the data carry no explicit class labels.
We are given n training samples {X1, X2, X3, ..., Xn}.
The K-means algorithm proceeds as follows:
1. Choose K points as the initial centroids, C1, C2, ..., CK
2. Repeat the following until convergence:
   For each sample Xi:
      For each centroid Cj:
         Compute the distance between Xi and Cj
      Assign Xi to its nearest centroid
   For each cluster, compute the mean of all its samples and use it as the new centroid
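The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name kmeans, the parameters max_iter and seed, the choice of Euclidean distance, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import math
import random


def kmeans(samples, k, max_iter=100, seed=0):
    """Cluster `samples` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k of the samples as the starting centroids.
    centroids = rng.sample(samples, k)
    labels = None
    for _ in range(max_iter):
        # Step 2a: assign every sample to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centroids[j]))
            for x in samples
        ]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
        # Step 2b: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, labels
```

Calling kmeans(points, 2) on two well-separated groups of points returns two centroids and one label per point, with the points of each group sharing a label.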
Figure: the result of K-means clustering on n sample points, with K = 2.
A few things to note:
How to choose the K initial points
1. Select K points that are as far apart from one another as possible
First pick a random point P1 as the centroid of the first cluster, then pick the point farthest from P1 as P2, the centroid of the second cluster.
Then pick as the centroid of the third cluster the point P whose distance to the nearer of P1 and P2 is largest, that is, the P that maximizes min(d(P, P1), d(P, P2)).
Continue in the same way until K points have been chosen.
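The farthest-point selection just described might look like the following sketch. The function name farthest_point_init and the seed parameter are made up for the example, and Euclidean distance is assumed for d.

```python
import math
import random


def farthest_point_init(samples, k, seed=0):
    """Pick k initial centroids that are far apart from one another."""
    rng = random.Random(seed)
    # P1: a randomly chosen sample.
    centroids = [rng.choice(samples)]
    while len(centroids) < k:
        # For each sample, take its distance to the closest centroid chosen
        # so far, then pick the sample where that distance is largest:
        # max over x of min(d(x, P1), d(x, P2), ...).
        nxt = max(
            samples,
            key=lambda x: min(math.dist(x, c) for c in centroids),
        )
        centroids.append(nxt)
    return centroids
```

One known weakness of this scheme is that it tends to pick outliers, since an outlier is by definition far from everything else.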
2. Run hierarchical clustering or the canopy algorithm first, then use the centers of the resulting clusters as the initial K-means centroids
Requirements: the sample set is fairly small, on the order of hundreds to thousands of points (hierarchical clustering is expensive), and K is small relative to the sample size
How to determine the value of K
PS: each class is called a cluster. The diameter of a cluster is the maximum distance between any two points in it; the radius of a cluster is the maximum distance from a point in the cluster to the cluster centroid.
Choose a suitable cluster indicator: for example, a weighted average over all clusters of the cluster radius, the cluster diameter, or the average distance to the centroid (the weight can be the number of points in the cluster).
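As an illustration of these definitions, the sketch below computes the radius and diameter of a cluster and a size-weighted average of either one. The helper names (centroid, radius, diameter, weighted_average) are chosen for the example, and Euclidean distance is assumed.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean of the points, coordinate by coordinate."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def radius(points):
    """Maximum distance from a point in the cluster to the cluster centroid."""
    c = centroid(points)
    return max(math.dist(p, c) for p in points)


def diameter(points):
    """Maximum distance between any two points (0 for a single point)."""
    return max(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        default=0.0,
    )


def weighted_average(clusters, measure):
    """Average `measure` over clusters, weighting each by its point count."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * measure(c) for c in clusters) / total
```

For example, weighted_average(clusters, radius) gives one number per clustering, which can then be compared across different values of K.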
Try K = 1, 2, 4, 8, 16, ... in turn.
Roughly speaking, while the number of clusters is below the true number, the indicator keeps decreasing as the number of clusters grows; once the number of clusters exceeds the true number, the indicator levels off.
Find the turning point shown in the figure: first determine the approximate range of K, then locate the value of K within that range by binary search.
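The doubling-then-binary-search idea can be sketched as follows. The text does not say how to decide that the indicator has levelled off, so the relative threshold tol, the cap k_max, and the function name find_k are all assumptions of this example. The callback indicator(k) is hypothetical: it stands for "run K-means with k clusters and return the chosen indicator", and the sketch assumes it is non-increasing in k, which real (noisy) runs only approximate.

```python
def find_k(indicator, tol=0.1, k_max=1024):
    """Estimate K at the turning point of a non-increasing indicator(k)."""
    # Phase 1: try K = 1, 2, 4, 8, ... until doubling K improves the
    # indicator by no more than a fraction `tol` (or k_max is reached).
    k = 1
    prev = indicator(k)
    while 2 * k <= k_max:
        cur = indicator(2 * k)
        if prev - cur <= tol * prev:
            break  # levelled off: the turning point lies in (k/2, k]
        k, prev = 2 * k, cur
    plateau = prev  # indicator value on the flat part of the curve
    # Phase 2: binary search in (k/2, k] for the smallest K whose
    # indicator is already within `tol` of the plateau.
    lo, hi = k // 2 + 1, k
    while lo < hi:
        mid = (lo + hi) // 2
        if indicator(mid) <= plateau * (1 + tol):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

With a synthetic indicator that falls linearly until K = 5 and is flat afterwards, the doubling phase stops at K = 8 and the binary search narrows the range (4, 8] down to 5.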
Stopping conditions
1. Stop after a specified number of iterations
2. Stop when the objective function converges
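Both conditions can be checked with a small helper like the one below. The text does not name the objective function; this sketch assumes the usual K-means objective, the sum of squared distances from each sample to its assigned centroid. The names objective and should_stop and the defaults max_iter=100 and tol=1e-6 are illustrative.

```python
import math


def objective(samples, centroids, labels):
    """Sum of squared distances from each sample to its assigned centroid."""
    return sum(
        math.dist(x, centroids[j]) ** 2 for x, j in zip(samples, labels)
    )


def should_stop(iteration, prev_obj, cur_obj, max_iter=100, tol=1e-6):
    """Return True when either stopping condition is met."""
    # Condition 1: the iteration budget is used up.
    if iteration >= max_iter:
        return True
    # Condition 2: the objective changed by less than tol since the
    # previous pass (prev_obj is None on the first pass).
    return prev_obj is not None and abs(prev_obj - cur_obj) < tol
```

Because neither the assignment step nor the centroid update can increase this objective, the second condition is eventually met; the iteration limit simply bounds the running time.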
Reference: http://www.cnblogs.com/jerrylead/archive/2011/04/06/2006910.html
K-means Clustering