Clustering algorithm: K-means

Source: Internet
Author: User

In data mining, K-means is a cluster analysis algorithm. It groups data mainly by repeatedly moving each seed point to the mean of the points nearest to it.

Basic K-means: select K initial centroids, where K is a user-specified parameter, namely the expected number of clusters. In each iteration, every point is assigned to its nearest centroid, and the set of points assigned to the same centroid forms a cluster. The centroid of each cluster is then updated from the points assigned to it. The assignment and update steps repeat until the centroids no longer change significantly.

To define the concept of "nearest" between data points in a two-dimensional space, we use the squared Euclidean distance: for point A (x1, y1) and point B (x2, y2), dist(A, B) = (x1 - x2)^2 + (y1 - y2)^2. In addition, we use the sum of squared errors (SSE) as the global objective function, that is, we minimize the sum of squared Euclidean distances from each point to its nearest centroid. With SSE as the objective, it can be proved mathematically that the best centroid of a cluster is the average of all the data points within the cluster.
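The distance and the SSE described above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample points are assumptions, not part of the original article. It also checks numerically, for one small cluster, that the mean gives a lower SSE than a nearby candidate center.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points (x, y)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def mean_point(points):
    """Coordinate-wise average of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# Hypothetical sample data, chosen only for illustration.
cluster = [(1, 1), (2, 3), (3, 2)]
center = mean_point(cluster)            # (2.0, 2.0)
shifted = (center[0] + 0.5, center[1])  # another candidate center

print(sse(cluster, [center]))   # 4.0
print(sse(cluster, [shifted]))  # 4.75
```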

Problem

The K-means algorithm mainly solves the problem shown in the figure. On the left side of the figure there are some points, and with the naked eye we can see that they form four point groups. But how do we find these groups with a computer program? That is where the K-means algorithm comes in (Wikipedia link).

The problem that K-means solves

Algorithm overview

This algorithm is actually very simple, as shown in the figure:

In the figure, A, B, C, D, and E are five points. The gray points are our seed points, that is, the points we use to find the point groups. There are two seed points, so K=2.

Then, the K-means algorithm is as follows:

    1. Randomly pick K (here K=2) seed points in the graph.
    2. For every point in the graph, compute the distance to each of the K seed points. If point Pi is nearest to seed point Si, then Pi belongs to the point group of Si. (In the figure, A and B belong to the upper seed point, and C, D, and E belong to the lower middle seed point.)
    3. Next, move each seed point to the center of its point group. (See the third step in the figure.)
    4. Repeat steps 2 and 3 until the seed points no longer move. (In the fourth step of the figure, the upper seed point gathers A, B, and C, and the lower seed point gathers D and E.)
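The four steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the article's own implementation: the function name kmeans, its max_iter and seed parameters, the rule that an empty group keeps its old seed point, and the sample points are all hypothetical choices made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, groups) for 2-D points using basic K-means."""
    rng = random.Random(seed)
    # Step 1: pick k distinct input points as the initial seed points.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest seed point.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        # Step 3: move each seed point to the mean of its group
        # (assumption: an empty group keeps its previous seed point).
        new_centroids = []
        for old, g in zip(centroids, groups):
            if g:
                new_centroids.append((sum(p[0] for p in g) / len(g),
                                      sum(p[1] for p in g) / len(g)))
            else:
                new_centroids.append(old)
        # Step 4: stop once no seed point moves.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups


# Hypothetical sample data: two visibly separate point groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, groups = kmeans(points, 2)
print(centroids)
print(groups)
```

Because the seed points are chosen at random, a different seed value may number the groups differently, and on harder data it may end in a different clustering.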

This algorithm is very simple, but there are some details I want to mention. I will not go over the distance formula, since anyone with a junior high school education should know how to calculate it. I would like to focus on the algorithm for finding the center of a point group.

Algorithm for finding the center of a point group

In general, the center of a point group is found by averaging the x and y coordinates of its points. However, I would also like to introduce three distance formulas that can be used when finding the center:

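1) Minkowski distance formula: d(i, j) = (sum over k of |x_ik - x_jk|^λ)^(1/λ). λ can take an arbitrary value: it can be negative, positive, or infinity.

2) Euclidean distance formula: the case λ=2 of the first formula, d(i, j) = sqrt(sum over k of (x_ik - x_jk)^2).

3) Cityblock distance formula: the case λ=1 of the first formula, d(i, j) = sum over k of |x_ik - x_jk|.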
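To make the relationship between the three formulas concrete, here is a small Python sketch in which the Euclidean and Cityblock distances are simply the Minkowski distance with λ=2 and λ=1. The function names and sample points are assumptions for illustration, and the sketch only handles positive, finite λ.

```python
def minkowski_distance(a, b, lam):
    """Minkowski distance of order lam (positive, finite) between two points."""
    return sum(abs(x - y) ** lam for x, y in zip(a, b)) ** (1.0 / lam)


def euclidean_distance(a, b):
    """Minkowski distance with lam = 2."""
    return minkowski_distance(a, b, 2)


def cityblock_distance(a, b):
    """Minkowski distance with lam = 1."""
    return minkowski_distance(a, b, 1)


# Hypothetical sample points.
a, b = (0, 0), (3, 4)
print(euclidean_distance(a, b))       # 5.0
print(cityblock_distance(a, b))       # 7.0
print(minkowski_distance(a, b, 0.5))  # a value of lam between 0 and 1
```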

The three formulas approach the center point in different ways, as the following figures show (for the first one, λ is between 0 and 1).

(1) Minkowski Distance (2) Euclidean Distance (3) Cityblock Distance

The figures show how each formula approaches the center: the first in a star shape, the second in concentric circles, and the third in a diamond shape.

The implementation code is as follows:

    # coding=utf-8
    import pylab as pl

    # Each line of the "points" file holds one point in the form "x#y"
    points = [[int(eachpoint.split("#")[0]), int(eachpoint.split("#")[1])] for eachpoint in open("points", "r")]

    # Specify three initial centroids.
    # NOTE: the original coordinates were lost from this listing; the values
    # below are placeholders (an assumption), so choose ones that suit your data.
    currentCenter1 = [20, 190]; currentCenter2 = [120, 90]; currentCenter3 = [170, 140]
    pl.plot([currentCenter1[0]], [currentCenter1[1]], 'ok')
    pl.plot([currentCenter2[0]], [currentCenter2[1]], 'ok')
    pl.plot([currentCenter3[0]], [currentCenter3[1]], 'ok')

    # Record the update trajectory of each cluster's centroid after each iteration
    center1 = [currentCenter1]; center2 = [currentCenter2]; center3 = [currentCenter3]

    # The three clusters
    group1 = []; group2 = []; group3 = []

    # NOTE: the iteration count was also lost from this listing; 50 is an assumed value
    for runtime in range(50):
        group1 = []; group2 = []; group3 = []
        for eachpoint in points:
            # Calculate the squared distance from the point to each of the three centroids
            distance1 = pow(abs(eachpoint[0]-currentCenter1[0]), 2) + pow(abs(eachpoint[1]-currentCenter1[1]), 2)
            distance2 = pow(abs(eachpoint[0]-currentCenter2[0]), 2) + pow(abs(eachpoint[1]-currentCenter2[1]), 2)
            distance3 = pow(abs(eachpoint[0]-currentCenter3[0]), 2) + pow(abs(eachpoint[1]-currentCenter3[1]), 2)
            # Assign the point to the cluster whose centroid is closest to it
            mindis = min(distance1, distance2, distance3)
            if mindis == distance1:
                group1.append(eachpoint)
            elif mindis == distance2:
                group2.append(eachpoint)
            else:
                group3.append(eachpoint)

        # After assigning all the points, update the centroid of each cluster
        # (an empty cluster keeps its previous centroid to avoid dividing by zero)
        if group1:
            currentCenter1 = [sum([eachpoint[0] for eachpoint in group1])/len(group1), sum([eachpoint[1] for eachpoint in group1])/len(group1)]
        if group2:
            currentCenter2 = [sum([eachpoint[0] for eachpoint in group2])/len(group2), sum([eachpoint[1] for eachpoint in group2])/len(group2)]
        if group3:
            currentCenter3 = [sum([eachpoint[0] for eachpoint in group3])/len(group3), sum([eachpoint[1] for eachpoint in group3])/len(group3)]

        # Record the centroid updates
        center1.append(currentCenter1)
        center2.append(currentCenter2)
        center3.append(currentCenter3)

    # Plot all the points, colored by the cluster each point belongs to
    pl.plot([eachpoint[0] for eachpoint in group1], [eachpoint[1] for eachpoint in group1], 'or')
    pl.plot([eachpoint[0] for eachpoint in group2], [eachpoint[1] for eachpoint in group2], 'oy')
    pl.plot([eachpoint[0] for eachpoint in group3], [eachpoint[1] for eachpoint in group3], 'og')

    # Plot the update trajectory of each cluster's centroid
    for center in [center1, center2, center3]:
        pl.plot([eachcenter[0] for eachcenter in center], [eachcenter[1] for eachcenter in center], 'k')
    pl.show()
