K-means Clustering Algorithm


Reposted from JerryLead's blog

K-means is the simplest of the clustering algorithms, but the ideas behind it are anything but ordinary. I first encountered and implemented this algorithm while studying Jiawei Han's data mining book, which focuses mostly on applications. After reading this handout from Andrew Ng, I gained some understanding of the EM idea behind K-means.

Clustering belongs to unsupervised learning. The methods discussed earlier, such as regression, naive Bayes, and SVM, all have a class label y; that is, each training example comes with its classification. In clustering, the samples have no y, only the features x. Imagine, for example, the stars in the universe represented as points in three-dimensional space. The goal of clustering is to find, for each sample x, its latent class y, and to group together the samples that share the same y. In the star example, the result is clusters of stars: points within a cluster are close to each other, while different clusters are relatively far apart.

In the clustering problem, the training samples given to us are $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, with each $x^{(i)} \in \mathbb{R}^n$ and no labels $y^{(i)}$.

The K-means algorithm groups the samples into k clusters. The algorithm is described as follows:

1. Randomly initialize k cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$.

2. Repeat the following process until convergence {

For every example $i$, assign it to the class of the nearest centroid:

$c^{(i)} := \arg\min_j \| x^{(i)} - \mu_j \|^2$

For every class $j$, recompute the centroid of the class:

$\mu_j := \dfrac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$

}

Here k is the number of clusters we fix in advance; $c^{(i)}$ denotes the index of the class among the k classes whose centroid is nearest to example $x^{(i)}$, so its value ranges from 1 to k. The centroid $\mu_j$ represents our current guess for the center of the samples belonging to class j. To illustrate with the star example: to cluster all the stars into k clusters, we first randomly pick k points in space (or k stars) as the centroids of the k clusters. In the first step, for each star we compute its distance to each of the k centroids and assign the star to the nearest cluster, so that after this step every star has its own cluster. In the second step, for each cluster we recompute its centroid by averaging all the stars assigned to it. We then iterate the first and second steps until the centroids stop changing, or change very little.
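To make the two alternating steps concrete, here is a minimal NumPy sketch of the algorithm above. The function name kmeans, the seeding scheme, and the convergence check are my own illustrative choices, not part of the original notes:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A minimal K-means: X is an (m, n) array of samples, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # First inner step: assign each sample to the class of its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)  # shape (m, k)
        c = dists.argmin(axis=1)
        # Second inner step: move each centroid to the mean of the samples assigned to it
        # (a cluster that ends up empty keeps its old centroid).
        new_centroids = np.array([X[c == j].mean(axis=0) if np.any(c == j) else centroids[j]
                                  for j in range(k)])
        # Converged: the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return c, centroids
```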

(Figure omitted: the effect of K-means clustering on n sample points, with k = 2.)

The first question K-means faces is how to guarantee convergence. The algorithm above takes convergence as its stopping condition, and it can be shown that K-means does converge. To characterize convergence, we define the distortion function as follows:

$$J(c, \mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

The function J represents the sum of squared distances from each sample point to its assigned centroid, and K-means tries to drive J to a minimum. Suppose the current J has not reached a minimum. We can first hold the centroids $\mu$ fixed and adjust each sample's class assignment $c^{(i)}$ to decrease J; then, symmetrically, hold the assignments $c$ fixed and adjust the centroids $\mu$ to decrease J. These two processes form the inner loop in which J decreases monotonically. When J decreases to a minimum, $\mu$ and $c$ have converged as well. (In theory there can be several different pairs of $\mu$ and $c$ that achieve the same minimal J, but in practice this phenomenon is rare.)
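For reference, the distortion J can be computed directly from the assignments and centroids. A small sketch, assuming the same array conventions as the kmeans snippet above:

```python
import numpy as np

def distortion(X, c, centroids):
    """J(c, mu): sum of squared distances from each sample to its assigned centroid."""
    return ((X - centroids[c]) ** 2).sum()
```

Each inner step of the algorithm can only decrease this quantity, which is what underlies the convergence argument above.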

Because the distortion function J is non-convex, we cannot guarantee that the minimum reached is the global minimum; in other words, K-means is sensitive to the initial choice of centroids. In general, though, the local optimum K-means reaches is good enough. But if you worry about landing in a bad local optimum, you can run K-means several times with different random initializations, and then output the $c$ and $\mu$ corresponding to the smallest J.
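A sketch of that multiple-restart strategy, reusing the illustrative kmeans and distortion helpers defined above:

```python
def kmeans_restarts(X, k, n_restarts=10):
    """Run K-means with several random initializations; keep the run with smallest J."""
    best = None
    for seed in range(n_restarts):
        c, centroids = kmeans(X, k, seed=seed)
        J = distortion(X, c, centroids)
        if best is None or J < best[0]:
            best = (J, c, centroids)
    return best  # (J, assignments, centroids)
```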

Now consider the relationship between K-means and EM. Return to the original problem: our goal is to partition the samples into k classes, which really means finding, for each sample x, its hidden class y, and then grouping the x's by that y. Since we do not know y in advance, we can first hypothesize a y for each example, but how do we know whether the hypothesis is correct? How do we evaluate how good it is? We use the maximum likelihood of the samples as the measure, here the joint distribution p(x, y) of x and y. If we find the y that maximizes p(x, y), then that y is the best class for sample x, and clustering x accordingly becomes easy. But the y we specify the first time does not necessarily maximize p(x, y), and p(x, y) also depends on other unknown parameters. Of course, with y fixed, we can adjust the other parameters to maximize p(x, y). But after adjusting the parameters, we may find that a better y could be assigned; we then re-assign y, again maximize p(x, y) over the parameters, and iterate until no better y can be specified.

This process has several difficulties. First, how do we hypothesize y? Is each sample hard-assigned a single y, or do different y's carry different probabilities, and if so, how are those probabilities measured? Second, how do we estimate p(x, y)? It may also depend on many other parameters; how do we adjust them to maximize p(x, y)? These questions are answered in a later chapter.

Here we only point out the idea of EM: the E-step estimates the expected value of the hidden class y; the M-step adjusts the other parameters so that, given the class y, the likelihood p(x, y) reaches its maximum. Then, with the other parameters fixed, y is re-estimated, and the cycle repeats until convergence.

The explanation above may still be puzzling, so map it onto K-means. At the outset we do not know the best hidden class $c^{(i)}$ for each sample, but we can assign one arbitrarily. Then, to maximize p(x, y) (here, to minimize J), we find, for the given assignments $c$, the centroids $\mu$ that minimize J ($\mu$ plays the role of the "other unknown parameters" mentioned earlier). We then discover that better assignments exist (the class of each sample's nearest centroid), so we re-assign $c$, and the process repeats until no better assignment can be found. So K-means is really an embodiment of EM: the E-step determines the hidden class variables, and the M-step updates the other parameters (the centroids) to minimize J. One special feature is how the hidden class is specified: a hard assignment, choosing exactly one of the k classes, rather than giving each class a probability. The general idea remains an iterative optimization process: there is an objective function, there are parameters, and there are hidden variables; fix the other parameters to estimate the hidden variables, then fix the hidden variables to estimate the other parameters, until the objective function is optimal.
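Read this way, the inner loop of K-means can be written as an explicit E-step / M-step alternation. The sketch below is a schematic illustration of that correspondence (the function name and step labels are mine), not a general EM implementation:

```python
import numpy as np

def kmeans_em_view(X, centroids, n_iters=100):
    """The K-means inner loop as an E-step / M-step alternation.

    Schematic only: empty clusters are not handled here.
    """
    for _ in range(n_iters):
        # E-step: estimate the hidden class of each sample by hard assignment --
        # all probability mass goes to the single nearest centroid.
        c = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        # M-step: with the hidden classes fixed, choose the centroids that
        # minimize J, i.e. the mean of the samples in each class.
        centroids = np.array([X[c == j].mean(axis=0) for j in range(len(centroids))])
    return c, centroids
```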
