Cite: http://www.cnblogs.com/jerrylead/archive/2011/04/06/2006910.html
1. Concept
The K-means algorithm takes k as input and divides n data objects into k clusters such that objects in the same cluster are highly similar to one another, while objects in different clusters have low similarity. Cluster similarity is measured against a "central object" (the center of gravity) obtained by taking the mean of the objects in each cluster.
2. General Introduction
Clustering belongs to unsupervised learning. The methods covered earlier, such as regression, Naive Bayes, and SVM, all come with a class label y; that is, the classification of each sample is already given in the training data. In clustering, no y is given and only the features x are available. For example, suppose the stars in the universe are represented as a set of points in 3D space. The purpose of clustering is to find the latent category y of each sample x and to put samples with the same category y together. For the stars above, the result of clustering is a set of star clusters: the points within a cluster are close to each other, while stars in different clusters are far apart.
In the clustering problem, the training samples we are given are $\{x^{(1)}, \dots, x^{(m)}\}$, each $x^{(i)} \in \mathbb{R}^n$, with no y.
The K-means algorithm clusters samples into k clusters. The specific algorithm is described as follows:
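1. Randomly select k cluster centroids $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$.
2. Repeat the following process until convergence {
       For each sample i, compute the class it belongs to:
       $c^{(i)} := \arg\min_j \|x^{(i)} - \mu_j\|^2$
       For each class j, recompute the centroid of that class:
       $\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$
   }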
k is the number of clusters, fixed in advance. $c^{(i)}$ denotes the class, among the k classes, that is closest to sample i; its value is one of 1 to k. The centroid $\mu_j$ represents our current guess of the center of the samples that belong to class j. To illustrate with the star-cluster model: to cluster all the stars into k star clusters, first pick k points (or k stars) in the universe at random as the centroids of the k clusters. Then, in the first step, compute the distance from each star to each of the k centroids and assign the star to the nearest one, so that after the first step every star belongs to a cluster. In the second step, recompute the centroid of each cluster as the mean of all the stars in it. Repeat steps one and two until the centroids no longer change or change very little.
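The two alternating steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`squared_distance`, `assign_clusters`, `update_centroids`, `kmeans`) are hypothetical, initial centroids are drawn from the samples themselves, an empty cluster simply keeps its previous centroid, and "convergence" is taken to mean that the assignments stop changing.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign_clusters(points, centroids):
    """Step one: for each sample, pick the index j of the nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Step two: move each centroid to the mean of the samples assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, c in zip(points, labels) if c == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(tuple(old))
        else:
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members))
            )
    return new_centroids

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k clusters."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct samples chosen at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = assign_clusters(points, centroids)
    for _ in range(max_iter):
        centroids = update_centroids(points, labels, centroids)
        new_labels = assign_clusters(points, centroids)
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
    return labels, centroids

if __name__ == "__main__":
    stars = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
             (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, centroids = kmeans(stars, k=2)
    print(labels, centroids)
```

Stopping when the assignments repeat is one possible reading of "the centroids no longer change"; a tolerance on centroid movement would be an equally reasonable choice.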
[Figure: K-means clustering of n sample points, with k = 2.]
The first question K-means faces is how to guarantee convergence. The algorithm above uses convergence as its stopping condition, and it can be proved that K-means is guaranteed to converge. The following describes the convergence qualitatively. We define the distortion function as follows:
$J(c, \mu) = \sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2$
The function J is the sum of squared distances between each sample point and its centroid. K-means adjusts c and $\mu$ to drive J to a minimum. Suppose J has not yet reached its minimum. First, with the centroid $\mu_j$ of each class fixed, adjusting the class $c^{(i)}$ of each sample reduces J (or leaves it unchanged). Similarly, with $c^{(i)}$ fixed, adjusting the centroid $\mu_j$ of each class also reduces J. These two steps are exactly what the inner loop performs, so J never increases from one iteration to the next. When J reaches its minimum, $\mu$ and c converge as well. (In theory, several different pairs of $\mu$ and c can attain the same minimum of J, but this is rare in practice.)
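To make the monotonicity argument concrete, the sketch below records J after each half-step: once after reassigning samples with the centroids fixed, and once after recomputing centroids with the assignments fixed. The helper names (`sq_dist`, `distortion`, `distortion_trace`) are hypothetical, and the rule that an empty cluster keeps its old centroid is an assumption made only for this illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distortion(points, labels, centroids):
    """J(c, mu): sum of squared distances from each sample to its assigned centroid."""
    return sum(sq_dist(p, centroids[c]) for p, c in zip(points, labels))

def distortion_trace(points, centroids, n_iter=5):
    """Run n_iter K-means iterations and record J after every half-step."""
    centroids = [tuple(c) for c in centroids]
    trace = []
    for _ in range(n_iter):
        # Fix mu, adjust c: each sample moves to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        trace.append(distortion(points, labels, centroids))
        # Fix c, adjust mu: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        trace.append(distortion(points, labels, centroids))
    return trace

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    # Deliberately poor start: both centroids sit in the same group of points.
    for value in distortion_trace(pts, [(0.0, 0.0), (0.0, 1.0)], n_iter=4):
        print(value)
```

Each printed value should be no larger than the one before it, which is the behaviour the qualitative argument predicts.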
Because the distortion function J is non-convex, the minimum found is not guaranteed to be the global minimum. In other words, K-means is sensitive to the choice of initial centroid positions. In general, however, the local optimum that K-means reaches is good enough. If you are worried about ending up in a poor local optimum, you can run K-means several times with different initial values and take the $\mu$ and c that give the smallest J.
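The restart strategy can be sketched as follows. This is an illustrative outline rather than a definitive implementation: `kmeans_from` and `best_of_restarts` are hypothetical names, initial centroids are assumed to be drawn from the samples, and an empty cluster is assumed to keep its old centroid.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_from(points, centroids, max_iter=100):
    """Run K-means from the given initial centroids; return (J, labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    j_value = sum(sq_dist(p, centroids[c]) for p, c in zip(points, labels))
    return j_value, labels, centroids

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run K-means n_restarts times from random starts; keep the run with the smallest J."""
    rng = random.Random(seed)
    runs = [kmeans_from(points, rng.sample(points, k)) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])

if __name__ == "__main__":
    # Four corners of a 4-by-1 rectangle: a small case with two distinct local optima.
    corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
    print(kmeans_from(corners, [(0.0, 0.0), (0.0, 1.0)])[0])
    print(kmeans_from(corners, [(0.0, 0.0), (4.0, 0.0)])[0])
    print(best_of_restarts(corners, k=2)[0])
```

On the rectangle example, starting with both centroids on the same short side converges to a top/bottom split with J = 16, while a start with one centroid on each short side converges to the left/right split with J = 1. Keeping the run with the smallest J is what lets the restart loop discard the poorer local optimum.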