Random Talk Clustering (1): K-means


It has been a long time since I last blogged. For one thing, the blog was offline for a while and I never got the DreamHost hosting sorted out; for another, I have not had much time, running off to the laboratory every day. These days I am mostly tinkering with things related to machine learning, and since there is a lot I do not understand, I usually dig up material to read. Given my old update pace, going this long without writing was bound to feel stifling, so I have decided to sort out what I know from time to time and put it on the blog: going over things always helps deepen one's understanding, and it also counts as sharing knowledge. Well then, let's start the conversation with clustering.

Clustering, simply put, means gathering similar things into groups. Contrast this with classification: a classifier usually needs to be told, through examples, that "this thing belongs to class XXX"; ideally, the classifier "learns" from the training data it receives and thereby gains the ability to classify unknown data. Because training data must be provided, that process is usually called supervised learning. In clustering, on the other hand, we do not know or care what each class actually is; the goal we need to achieve is simply to gather similar things together, so a clustering algorithm usually only needs to know how to compute similarity in order to get started. Since clustering normally does not learn from training data, it is called unsupervised learning in machine learning.

To give a simple example: suppose there is a group of pupils, and you must divide them into groups so that members within a group are as similar as possible while members of different groups differ as much as possible. The final result depends on your definition of "similar". For example, you might decide that boys are similar to boys and girls are similar to girls, while boys and girls are very different from each other; in that case you are effectively representing each pupil by a discrete variable taking the two possible values "male" and "female". We usually call such variables "features". In this particular case all the pupils are mapped onto one of two points, which naturally forms two groups without any special clustering. Another possibility is to use a "height" feature. When I was in primary school, we would assemble on the playground every Friday for a talk, lining up according to where each of us lived and how far away, so that when the assembly ended we could walk home together in groups. Besides mapping things to a single feature, a common practice is to extract n features at once and put them together to form an n-dimensional vector, yielding a mapping from the original data set into an n-dimensional vector space. You always need to complete such a process, explicitly or implicitly, because many machine-learning algorithms need to work in a vector space.
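As a small sketch of this feature-extraction step (the pupils and the feature values below are made up purely for illustration), mapping each raw record onto an n-dimensional vector might look like:

```python
import numpy as np

# Hypothetical records: each pupil is described by two features,
# height in cm and weight in kg (so n = 2).
pupils = [
    {"name": "A", "height": 120.0, "weight": 25.0},
    {"name": "B", "height": 145.0, "weight": 38.0},
    {"name": "C", "height": 122.0, "weight": 26.0},
]

# Map the raw records into a 2-dimensional vector space; most
# machine-learning algorithms operate on such an array directly.
X = np.array([[p["height"], p["weight"]] for p in pupils])

print(X.shape)  # (3, 2): 3 pupils, 2 features each
```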

So let us come back to the problem of clustering. Leaving aside what the original data actually is, assume it has already been mapped into a Euclidean space; for ease of display we use a two-dimensional space here, as shown in the figure below:

From the rough shape of the data points it can be seen that they fall into three clusters, of which two are rather compact while the remaining one is looser. Our goal is to group these data points so that data belonging to different clusters can be distinguished; if we mark them in different colors, it looks like this:

So how does a computer accomplish this task? Of course, computers are not yet advanced enough to "see through" shapes the way we do, but for clustering points in an n-dimensional Euclidean space there is a very simple classical algorithm: the K-means of this article's title. Before introducing the concrete steps of K-means, let us look at a basic assumption about the data to be clustered: for each cluster we can select a center point such that every point in that cluster is closer to this center than to the centers of the other clusters. Data obtained in practice is not guaranteed to always satisfy this constraint, but it is usually the best we can achieve, and the remaining errors are either inherent or caused by the non-separability of the problem itself. For example, consider the two Gaussian distributions shown below: we randomly draw some data points from each distribution and mix them together, and the task is now to separate the mixed points according to the distribution that generated them:

Since the two distributions overlap substantially, for the data point at x = 2.5, say, the probability that it was generated by either distribution is equal, and whichever way you assign it is only a guess. The situation at x = 2 is slightly better: we would usually assign it to the left-hand distribution, because that probability is greater; even so, the probability that it was generated by the right-hand distribution is still fairly large, and we still have a chance of guessing wrong. The whole shaded area is the smallest probability of guessing wrong that we can possibly achieve; it comes from the non-separability inherent in the problem itself and cannot be avoided. Therefore, we regard the assumption that K-means relies on as a reasonable one.
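To make the overlap concrete, here is a small sketch (the means 2 and 3 and the unit variances are illustrative assumptions, not values taken from the figure) that compares the two Gaussian densities at a given point; with equal priors, the best we can do is pick the distribution with the larger density:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of the normal distribution N(mu, sigma^2) at x.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_left, mu_right = 2.0, 3.0  # assumed centers of the two distributions

# Exactly between the two means the densities are equal: any
# assignment of x = 2.5 is a pure guess.
p_l = gaussian_pdf(2.5, mu_left, 1.0)
p_r = gaussian_pdf(2.5, mu_right, 1.0)
print(math.isclose(p_l, p_r))  # True

# At x = 2.0 the left-hand density is larger, so we guess "left",
# but the right-hand density is still non-negligible: we can be wrong.
print(gaussian_pdf(2.0, mu_left, 1.0) > gaussian_pdf(2.0, mu_right, 1.0))  # True
```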

Based on this assumption, we can derive the objective function that K-means optimizes: suppose we have N data points that need to be divided into K clusters; what K-means does is minimize

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \| x_n - \mu_k \|^2

where r_{nk} is 1 when data point x_n is assigned to cluster k and 0 otherwise, and \mu_k is the center of cluster k. Minimizing J directly to find the optimal r_{nk} and \mu_k is not easy, but we can take an iterative approach: first fix \mu_k and choose the optimal r_{nk}; it is easy to see that J is minimized simply by assigning each data point to its nearest center. The next step is to fix r_{nk} and solve for the optimal \mu_k: taking the derivative of J with respect to \mu_k and setting it to zero, it is easy to see that J is minimized when

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

That is, the value of \mu_k should be the mean of all the data points in cluster k. Since each iteration step takes the minimum of J, J can only keep decreasing (or stay unchanged) and never increase, which guarantees that K-means will eventually reach a (local) minimum. Although K-means is not guaranteed to always find the global optimum, for a problem like this, and for an algorithm as cheap as K-means, such a result is already very good.
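This monotone behavior is easy to check numerically. The following sketch (with made-up toy data, not the data set used in this post) alternates the two optimization steps a few times and records J after each iteration; the recorded values never increase:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two blobs in the plane.
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(3.0, 0.5, (20, 2))])
centers = X[rng.choice(len(X), 2, replace=False)].copy()

def J(X, centers, labels):
    # The objective: sum of squared distances to the assigned centers.
    return float(((X - centers[labels]) ** 2).sum())

history = []
for _ in range(10):
    # Fix the centers, choose optimal assignments: nearest center.
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    # Fix the assignments, choose optimal centers: cluster means.
    for k in range(2):
        if (labels == k).any():
            centers[k] = X[labels == k].mean(axis=0)
    history.append(J(X, centers, labels))

# J only ever decreases (or stays the same), up to rounding error.
print(all(a + 1e-9 >= b for a, b in zip(history, history[1:])))  # True
```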

Let's summarize the specific steps of the K-means algorithm:

    1. Select initial values for the K centers. This is usually done with some heuristic tailored to the specific problem, or, in most cases, by random selection. Since, as noted above, K-means cannot guarantee the global optimum, and whether it converges to the global optimum in fact depends heavily on the choice of initial values, we will sometimes run K-means several times with different initial values and take the best of the results.
    2. Assign each data point to the cluster represented by its nearest center point.
    3. Recompute the center of each cluster using the formula above, i.e. as the mean of the points in the cluster.
    4. Repeat from step 2 until the maximum number of iterations has been reached or the change in J between successive iterations falls below a threshold.
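The remark in step 1 about running K-means several times can be sketched as follows (a plain-NumPy sketch of the steps above; the function names kmeans_once and kmeans_restarts are my own, not part of any library): run from several random initializations and keep the run with the smallest J.

```python
import numpy as np

def kmeans_once(X, k, rng, maxiter=300, threshold=1e-15):
    # One run of steps 1-4 above; returns (J, labels, centers).
    centers = X[rng.choice(len(X), k, replace=False)].copy()   # step 1
    labels = np.zeros(len(X), dtype=int)
    J_prev = np.inf
    for _ in range(maxiter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)                           # step 2
        for j in range(k):                                     # step 3
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
        J = float(((X - centers[labels]) ** 2).sum())
        if J_prev - J < threshold:                             # step 4
            break
        J_prev = J
    return J, labels, centers

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    # Run K-means several times and keep the best (smallest-J) result.
    rng = np.random.default_rng(seed)
    return min((kmeans_once(X, k, rng) for _ in range(n_restarts)),
               key=lambda result: result[0])

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in (0.0, 2.0, 4.0)])
best_J, best_labels, best_centers = kmeans_restarts(X, 3)
```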

Writing a K-means implementation following these steps is actually quite easy; SciPy and Matlab both ship with built-in K-means implementations. But in order to see the concrete effect of each iteration of K-means, we may as well implement it ourselves. The code is as follows (SciPy and Matplotlib need to be installed):

#!/usr/bin/python
from __future__ import with_statement
import cPickle as pickle
from matplotlib import pyplot
from numpy import zeros, array, tile
from scipy.linalg import norm
import numpy.matlib as ml
import random

def kmeans(X, k, observer=None, threshold=1e-15, maxiter=300):
    N = len(X)
    labels = zeros(N, dtype=int)
    centers = array(random.sample(X, k))
    iter = 0

    def calc_J():
        sum = 0
        for i in xrange(N):
            sum += norm(X[i] - centers[labels[i]])
        return sum

    def distmat(X, Y):
        n = len(X)
        m = len(Y)
        xx = ml.sum(X*X, axis=1)
        yy = ml.sum(Y*Y, axis=1)
        xy = ml.dot(X, Y.T)
        return tile(xx, (m, 1)).T + tile(yy, (n, 1)) - 2*xy

    Jprev = calc_J()
    while True:
        # notify the observer
        if observer is not None:
            observer(iter, labels, centers)

        # calculate distance from x to each center
        # distance_matrix is only available in scipy newer than 0.7
        # dist = distance_matrix(X, centers)
        dist = distmat(X, centers)
        # assign x to nearest center
        labels = dist.argmin(axis=1)
        # re-calculate each center
        for j in range(k):
            idx_j = (labels == j).nonzero()
            centers[j] = X[idx_j].mean(axis=0)

        J = calc_J()
        iter += 1

        if Jprev - J < threshold:
            break
        Jprev = J
        if iter >= maxiter:
            break

    # final notification
    if observer is not None:
        observer(iter, labels, centers)

if __name__ == '__main__':
    # load previously generated points
    with open('cluster.pkl') as inf:
        samples = pickle.load(inf)

    N = 0
    for smp in samples:
        N += len(smp[0])
    X = zeros((N, 2))
    idxfrm = 0
    for i in range(len(samples)):
        idxto = idxfrm + len(samples[i][0])
        X[idxfrm:idxto, 0] = samples[i][0]
        X[idxfrm:idxto, 1] = samples[i][1]
        idxfrm = idxto

    def observer(iter, labels, centers):
        print "iter %d." % iter
        colors = array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
        pyplot.plot(hold=False)  # clear previous plot
        pyplot.hold(True)

        # draw points
        data_colors = [colors[lbl] for lbl in labels]
        pyplot.scatter(X[:, 0], X[:, 1], c=data_colors, alpha=0.5)
        # draw centers
        pyplot.scatter(centers[:, 0], centers[:, 1], s=200, c=colors)

        pyplot.savefig('kmeans/iter_%02d.png' % iter, format='png')

    kmeans(X, 3, observer=observer)

The code is a bit long, but doing this kind of thing in Python really is not as convenient as in Matlab; the K-means algorithm proper occupies only a handful of lines inside the main loop of the kmeans function. The 3 center points are first initialized randomly; at that point no data point has been clustered yet, and all of them are marked red by default, as shown below:

Then comes the first iteration: each data point is colored according to which of the initial centers it is closest to (the assignment step in the code), and then the 3 center points are recomputed (the re-centering step), with the result shown below:

As you can see, since the initial center points were selected randomly, the result is not very good yet. The result of the next iteration is:

You can see that the rough shape has already emerged. After two more iterations the result has basically converged; the final result is as follows:

However, as mentioned above, K-means is not almighty: although it converges to a fairly good result most of the time, with bad luck it can also converge to an unsatisfactory local optimum. For example, with the following initial center points:

it will eventually converge to this result:

One has to admit this is not a good result. In most cases, however, the results K-means gives are satisfactory; it remains a simple, efficient, and widely applicable clustering method.

Update 2010.04.25: many people have asked me for cluster.pkl, so I have simply uploaded it; it is actually very easy to generate yourself. Click here to download.
