In supervised learning, label information helps the machine learn the similarity between samples of the same class; at prediction time it only has to decide which category of training samples a given sample most resembles. In unsupervised learning there is no such guidance from labels. Faced with a one- or two-dimensional partitioning problem, a person can solve it easily by eye, but the machine is left at a loss; Figure (1) illustrates this vividly.
For high-dimensional data, however, the human brain is helpless, and in the end we have to design algorithms for the machine. The goal of cluster analysis is to partition all samples into several clusters so that the samples within each cluster are highly similar to one another. The classic K-means algorithm is the natural starting point. Given a set of $m$ samples $X=\{x^{(1)}, x^{(2)}, \dots, x^{(m)} \mid x^{(i)} \in \mathbb{R}^n\}$, K-means partitions them into $K$ clusters ($K \le m$) by minimizing the criterion function

$$J(c,\mu)=\sum_{i=1}^{m}\left\|x^{(i)}-\mu_{c^{(i)}}\right\|^{2} \tag{1}$$

where $c$ is the cluster assignment of the samples, $\mu$ is the set of cluster centers, and $\mu_{c^{(i)}}$ is the cluster center corresponding to sample $x^{(i)}$. The criterion function sums the squared distances between all sample points and their corresponding cluster centers; the partition with the smallest criterion function value is taken as the optimal clustering. The K-means algorithm is described in the figure below.
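As a concrete reference, here is a minimal sketch of the criterion function in Python (NumPy); the names `X`, `centers`, and `assignments` are my own and not from the original post.

```python
import numpy as np

def kmeans_objective(X, centers, assignments):
    """Sum of squared distances between each sample and its assigned cluster center.

    X           : (m, n) array of samples
    centers     : (K, n) array of cluster centers
    assignments : (m,)  array, assignments[i] = cluster index c^(i) of sample x^(i)
    """
    diffs = X - centers[assignments]            # x^(i) - mu_{c^(i)} for every sample
    return float(np.sum(np.sum(diffs ** 2, axis=1)))   # J(c, mu)
```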
The inner loop of the algorithm does two things: first, it assigns each sample to its nearest cluster center; second, it recomputes each cluster center as the mean of the samples currently in that cluster. Three termination conditions are common: (1) the change in the criterion function value falls below a threshold; (2) the cluster centers no longer move beyond some tolerance; (3) a specified number of iterations $T$ is reached. The execution of K-means is shown in Figure (2): (a) the randomly generated sample points; (b) cluster centers initialized at random; (c) each sample assigned to its nearest cluster center; (d) each cluster center updated to the mean of all samples in its cluster; (c) and (d) are repeated until convergence.
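The two inner-loop steps translate almost directly into code. The sketch below is my own illustration, not the blog's MATLAB implementation; it stops when the decrease of $J$ falls below a tolerance or after $T$ iterations.

```python
import numpy as np

def kmeans(X, K, T=100, tol=1e-6, seed=0):
    """Plain batch K-means: assign each sample to its nearest center,
    then move each center to the mean of the samples assigned to it."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    centers = X[rng.choice(m, size=K, replace=False)].astype(float)  # random init
    prev_J = np.inf
    for _ in range(T):
        # Step 1: assign every sample to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignments = np.argmin(dists, axis=1)
        # Step 2: move each center to the mean of the samples assigned to it.
        for j in range(K):
            members = X[assignments == j]
            if len(members) > 0:                    # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
        J = np.sum((X - centers[assignments]) ** 2)
        if prev_J - J < tol:                        # termination condition 1: J barely changes
            break
        prev_J = J
    return centers, assignments
```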
The criterion function here is not convex, so the global optimum cannot be guaranteed, but convergence to a local optimum can be. The argument is as follows. When the cluster assignment of sample $x^{(i)}$ is updated, the nearest cluster center is always chosen, so $\|x^{(i)}-\mu_{c^{(i)}}\|^{2}$ never increases in any iteration, and therefore the criterion function $J$ never increases; updating each cluster center to the mean of the samples in its cluster likewise never increases $J$. Taking the partial derivative of the criterion function with respect to a cluster center and setting it to zero yields the update rule for the cluster centers:

$$\frac{\partial J}{\partial \mu_j}=\frac{\partial}{\partial \mu_j}\sum_{i=1}^{m}1\{c^{(i)}=j\}\left\|x^{(i)}-\mu_{c^{(i)}}\right\|^{2}=2\sum_{i=1}^{m}1\{c^{(i)}=j\}\left(\mu_{c^{(i)}}-x^{(i)}\right)=0 \;\Rightarrow\; \mu_j=\frac{\sum_{i=1}^{m}1\{c^{(i)}=j\}\,x^{(i)}}{\sum_{i=1}^{m}1\{c^{(i)}=j\}} \tag{2}$$
The left side of Figure (3) shows the clustering result after running K-means on four randomly generated groups of data drawn from Gaussian distributions; the right side shows the criterion function value at each iteration. The algorithm converges after 16 iterations, which confirms its convergence experimentally. The final clustering is essentially perfect because the four groups of data are well separated. The 16 iterations were needed only because the random initialization of the cluster centers happened to be poor; normally the algorithm stabilizes within about 8 iterations.
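A quick way to reproduce this kind of convergence check, with my own synthetic data rather than the blog's, is to record $J$ after every iteration and confirm that it never increases:

```python
import numpy as np

rng = np.random.default_rng(1)
# Four Gaussian blobs, roughly mimicking the experiment described above.
means = np.array([[0, 0], [5, 5], [0, 5], [5, 0]], dtype=float)
X = np.vstack([rng.normal(mu, 0.8, size=(100, 2)) for mu in means])

K = 4
centers = X[rng.choice(len(X), K, replace=False)].copy()
history = []
for _ in range(50):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    c = d.argmin(axis=1)
    history.append(np.sum((X - centers[c]) ** 2))   # J for the current assignment
    centers = np.array([X[c == j].mean(axis=0) if np.any(c == j) else centers[j]
                        for j in range(K)])
    if len(history) > 1 and history[-2] - history[-1] < 1e-9:
        break

# J should be non-increasing from one iteration to the next.
assert all(a >= b - 1e-9 for a, b in zip(history, history[1:]))
print(f"converged after {len(history)} iterations, final J = {history[-1]:.2f}")
```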
If the samples have several attributes measured on different scales, the data must be preprocessed so that an attribute with large values does not dominate the distance computation. The most common preprocessing is standardization, which gives every attribute zero mean and unit variance. K-means is also very sensitive to the initialization of the cluster centers. In Figure (4) I marked 6 possible initial points; the algorithm converges to the 6 corresponding local optima, of which only the 2nd is the global optimum. To avoid getting stuck in a very poor local optimum (such as the 1st one), the usual strategy is to run K-means several times, each time with randomly initialized cluster centers, and finally keep the clustering with the smallest criterion function value.
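A minimal sketch of both remedies, using SciPy's `kmeans2` for the clustering itself (my choice of library, not the blog's code); the data here are hypothetical:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def standardize(X):
    """Zero mean, unit variance per attribute, so no single attribute dominates the distance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def best_of_n_runs(X, K, n_runs=10):
    """Run K-means several times with random initialization and keep the run
    with the smallest criterion function J (guards against bad local optima)."""
    best = None
    for _ in range(n_runs):
        centers, labels = kmeans2(X, K, minit='random')
        J = np.sum((X - centers[labels]) ** 2)
        if best is None or J < best[0]:
            best = (J, centers, labels)
    return best[1], best[2]

# Example: hypothetical data whose two attributes live on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])
centers, labels = best_of_n_runs(standardize(X), K=3)
```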
The ultimate goal of clustering is to make samples within the same cluster as similar as possible and samples from different clusters as far apart as possible. If we follow this principle when initializing the cluster centers, we can greatly reduce the number of iterations needed to converge. Algorithm (2) describes such a cluster-center initialization; its time complexity is $O(m^2+km)$. Intuitively, this initialization selects cluster centers from the edge of the sample distribution, which avoids placing several initial centers inside the same dense region of data and greatly reduces the number of iterations the algorithm needs to converge. Of course there is a price to pay for this gain, so whether it is worthwhile depends on the situation. A sketch of one such scheme is given below.
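The blog's Algorithm (2) is given only as a figure, so the sketch below is a guess at one common scheme consistent with the description and the quoted $O(m^2+km)$ cost: pick a sample on the edge of the data first, then repeatedly pick the sample farthest from all centers chosen so far (a farthest-first traversal). The names and details here are my assumptions, not the post's exact procedure.

```python
import numpy as np

def farthest_first_init(X, K):
    """Spread the initial centers toward the edge of the data.

    Precomputing all pairwise distances costs O(m^2) distance evaluations;
    the K greedy selection passes cost O(km), matching the complexity quoted above."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # m x m pairwise distances
    # First center: the sample farthest from the overall mean (an "edge" point).
    first = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    min_dist = D[first].copy()            # distance from each sample to its nearest chosen center
    for _ in range(K - 1):
        nxt = int(np.argmax(min_dist))    # the sample farthest from all chosen centers
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, D[nxt])
    return X[chosen].copy()
```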
In the standard K-means algorithm, every sample point must compute its Euclidean distance to every updated cluster center; if the sample dimension is high, the time cost becomes substantial. Researchers have proposed ways to accelerate K-means using the triangle inequality or tree structures to avoid unnecessary distance calculations; see Elkan's 2003 ICML paper "Using the triangle inequality to accelerate k-means" and "A generalized optimization of the k-d tree for fast nearest neighbour search". The k-d tree acceleration of K-means is used in the open-source project Vlfeat. In the batch version of K-means we update the cluster centers using all of the data at once. For applications that must process data online, however, processing time is critical, and a further challenge is that the data arrive dynamically, so an online version of K-means is needed. Within the time allowed, we can process one sample at a time, or collect a few samples and process them together. In proving the convergence of K-means we already derived the partial derivative of the criterion function with respect to a cluster center $\mu_j$, so the update is easily turned into the online Algorithm (3) based on stochastic gradient descent, in which the learning rate $\alpha$ should gradually decrease as more data are processed.
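Algorithm (3) is again shown only as a figure in the original post, so the following is my own rendering of the standard stochastic-gradient update it describes: each incoming sample nudges its nearest center by $\mu_j \leftarrow \mu_j + \alpha\,(x - \mu_j)$. Using $\alpha = 1/(\text{samples seen by that center})$ is my assumption of a decaying learning rate; it makes each center the running mean of the samples routed to it.

```python
import numpy as np

class OnlineKMeans:
    """Streaming K-means: every incoming sample nudges its nearest center toward it."""

    def __init__(self, init_centers):
        self.centers = np.asarray(init_centers, dtype=float)
        self.counts = np.zeros(len(self.centers))          # samples seen per center

    def partial_fit(self, x):
        x = np.asarray(x, dtype=float)
        j = int(np.argmin(np.linalg.norm(self.centers - x, axis=1)))  # nearest center
        self.counts[j] += 1
        alpha = 1.0 / self.counts[j]                        # learning rate decays with data seen
        self.centers[j] += alpha * (x - self.centers[j])    # stochastic gradient step on J
        return j
```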
A defining feature of the K-means algorithm is that each sample is rigidly assigned (hard assignment) to exactly one cluster, which is not always the most reasonable choice, because clustering is inherently a problem with uncertainty. As shown in Figure (5), real clusters may well overlap, and the assignment of the samples in the overlapping region is quite debatable; likewise, given a new sample that is equally distant from all the cluster centers, what should we do? Using a probabilistic approach, such as the Gaussian Mixture Model (GMM), which gives the probability that a sample belongs to each cluster, is more reasonable and reflects the uncertainty of the clustering to some extent.
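For contrast, here is a minimal soft-assignment sketch using scikit-learn's `GaussianMixture` (my choice of library, not something the post uses): every sample gets a probability for each cluster instead of a single hard label, and samples near the overlap receive non-extreme probabilities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping hypothetical clusters, like the ambiguous case described above.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(2.0, 1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)          # (400, 2): soft membership per sample
hard = probs.argmax(axis=1)           # a hard assignment can still be recovered if needed
print(probs[:3])                      # samples near the overlap get non-extreme probabilities
```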
Here are two simple applications of K-means: image segmentation and data compression. The goal of image segmentation is to divide an image into regions, each of which is visually homogeneous. Using K-means for segmentation, every pixel in the image is treated as a sample point, similar pixels are grouped into the same cluster as far as possible, and the result is $K$ regions; when displaying the segmentation, every pixel is replaced by its cluster center. As shown in Figure (6), I chose the classic Lena image and a bird image for segmentation; the number of clusters $K$ is 3, 6, 12 from left to right, and the rightmost image is the original. The Lena image has few colors, so even $K=3$ works reasonably well, but the colors of the bird image are much more complex, and the segmentation only becomes barely satisfactory at $K=12$. Image segmentation is in fact a very hard problem, and the K-means algorithm alone is still rather weak in this field.
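A rough sketch of that pipeline in Python (pixel colors as samples, every pixel repainted with its cluster center); the use of Pillow and SciPy and the file names are my own assumptions, not the blog's MATLAB code.

```python
import numpy as np
from PIL import Image
from scipy.cluster.vq import kmeans2

def segment_image(path, K):
    """Cluster pixel colors with K-means and repaint each pixel with its cluster center."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=float)
    h, w, _ = img.shape
    pixels = img.reshape(-1, 3)                      # every pixel is one 3-D sample (R, G, B)
    centers, labels = kmeans2(pixels, K, minit='points')
    segmented = centers[labels].reshape(h, w, 3)     # replace each pixel by its cluster center
    return Image.fromarray(np.uint8(np.clip(segmented, 0, 255)))

# Hypothetical usage: segment_image("lena.png", K=12).save("lena_k12.png")
```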
Data compression falls into two categories, lossless and lossy. Lossless compression requires that the data be reconstructed exactly as the original, while lossy compression tolerates some deviation between the reconstructed data and the original. K-means can only be used for lossy compression: the smaller $K$ is, the more severe the distortion. The idea is to run K-means on the $n$ samples to obtain the $K$ cluster centers and the cluster assignment of each sample, and then store only the cluster centers and each sample's assignment. Assuming each sample occupies $a$ bytes of storage, the $K$ cluster centers require $Ka$ bytes and the cluster assignments require $n\lceil \log_2 K\rceil$ bytes, so the compression ratio is $na/(Ka+n\lceil \log_2 K\rceil)$.
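A small hypothetical example of that arithmetic (the numbers are mine, not from the post), using the formula exactly as stated above:

```python
import math

def kmeans_compression_ratio(n, a, K):
    """Compression ratio n*a / (K*a + n*ceil(log2 K)) from the text above."""
    return n * a / (K * a + n * math.ceil(math.log2(K)))

# Hypothetical example: n = 10000 samples of a = 12 bytes each, compressed with K = 16 clusters:
# 10000*12 / (16*12 + 10000*4) = 120000 / 40192, roughly a 3x reduction.
print(kmeans_compression_ratio(n=10000, a=12, K=16))
```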
The complete MATLAB code for the K-means experiments can be downloaded here.
Source: http://www.cnblogs.com/jeromeblog/p/3425919.html