In data mining, K-means is a cluster analysis algorithm. It groups data by repeatedly moving each seed point to the mean of the points nearest to it.
Basic K-means: select k initial centroids, where k is a user-specified parameter, namely the number of clusters expected. In each pass, every point is assigned to its nearest centroid, and the set of points assigned to the same centroid forms a cluster. The centroid of each cluster is then updated from the points assigned to that cluster. The assignment and update steps repeat until the centroids no longer change significantly.
To define what "nearest" means between data points in a two-dimensional space, we use the squared Euclidean distance: for point A $(x_1, y_1)$ and point B $(x_2, y_2)$, $\mathrm{dist}(A, B) = (x_1 - x_2)^2 + (y_1 - y_2)^2$. In addition, we use the sum of squared errors (SSE) as the global objective function, that is, we minimize the sum of squared Euclidean distances from each point to its nearest centroid. With SSE as the objective, it can be proved mathematically that the best centroid for a cluster is the average of all the data points within that cluster.
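To make the objective concrete, here is a minimal Python sketch of the squared distance and the SSE for a single cluster. The function names and the sample points are illustrative assumptions, not part of the original article. It also checks numerically what the proof says: setting the derivative of the SSE with respect to the center to zero gives the coordinate-wise mean, so nudging the center away from the mean can only raise the SSE.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def sse(points, center):
    """Sum of squared distances from every point to one center."""
    return sum(squared_distance(p, center) for p in points)


def mean_point(points):
    """Coordinate-wise average of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# Hypothetical cluster, chosen only for illustration.
cluster = [(1, 1), (2, 3), (3, 2), (6, 6)]
center = mean_point(cluster)  # (3.0, 3.0)

# Nudging the center away from the mean raises the SSE in every direction tried.
for dx, dy in [(0.5, 0), (-0.5, 0), (0, 0.5), (0, -0.5)]:
    shifted = (center[0] + dx, center[1] + dy)
    assert sse(cluster, shifted) > sse(cluster, center)
```

For a clustering with several centroids, the global SSE is just the sum of this per-cluster value over all clusters.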
Problem
The K-means algorithm mainly solves the problem shown in the figure. On the left side of the figure there are some points, and with the naked eye we can see that they form four point groups. But how does a computer program find these groups? That is where the K-means algorithm comes in (Wikipedia link).
The problem K-means solves
Algorithm overview
This algorithm is actually very simple, as shown in the figure:
In the figure, A, B, C, D, and E are five points. The gray points are our seed points, that is, the points we use to find the point groups. There are two seed points, so k = 2.
Then, the K-means algorithm works as follows:
- Randomly place k (here k = 2) seed points in the graph.
- For every point in the graph, compute its distance to each of the k seed points. If point Pi is nearest to seed point Si, then Pi belongs to the point group of Si. (In the figure, A and B belong to the upper seed point, and C, D, and E belong to the lower middle seed point.)
- Next, move each seed point to the center of its point group. (See the third step in the figure.)
- Then repeat steps 2 and 3 until the seed points no longer move. (In the fourth step of the figure, the upper seed point has gathered A, B, and C, and the lower seed point has gathered D and E.)
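The four steps above can be sketched in Python roughly as follows. This is a minimal sketch under assumptions: the function name `k_means`, the `max_iterations` safety cap, the fixed random seed, and the coordinates given to the five points are all made up for illustration (the figure's real coordinates are not known). The seed points are drawn from the input points themselves, which is only one way to read "randomly take k seed points".

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster 2-D points into k groups; returns (centers, groups)."""
    rng = random.Random(seed)
    # Step 1: pick k of the input points as the initial seed points (assumption).
    centers = rng.sample(points, k)
    groups = []
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest seed point.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Step 3: move each seed point to the center of its point group.
        new_centers = []
        for old, group in zip(centers, groups):
            if group:
                n = len(group)
                new_centers.append((sum(p[0] for p in group) / n,
                                    sum(p[1] for p in group) / n))
            else:
                new_centers.append(old)  # empty group: leave the seed in place
        # Step 4: stop once no seed point moves.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# Hypothetical coordinates standing in for points A to E.
points = [(1, 5), (2, 5), (3, 3), (4, 1), (5, 1)]
centers, groups = k_means(points, k=2)
```

Which points end up together depends on the initial seed points, so a different `seed` value may give a different grouping.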
This algorithm is very simple, but there are some details I want to mention. I won't go over the distance formula; anyone who has finished junior high school should know how to calculate it. I'd like to focus on the algorithm for finding the center of a point group.
Algorithm for finding the center of a point group
In general, you can find the center of a point group by averaging the x and y coordinates of its points. However, I would like to show you three other formulas for finding the center point:
1) Minkowski distance formula: λ can take any value; it can be negative, positive, or infinite.
2) Euclidean distance formula: the first formula with λ = 2.
3) Cityblock distance formula: the first formula with λ = 1.
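The formulas themselves did not survive in this copy of the article, so the sketch below falls back on the standard definition of the Minkowski distance, d(a, b) = (Σ |a_i − b_i|^λ)^(1/λ). The function name `minkowski_distance` and the sample point are illustrative assumptions, and the sketch only handles positive, finite λ, not the negative or infinite cases mentioned above.

```python
def minkowski_distance(a, b, lam):
    """Minkowski distance between two equal-length points for finite lam > 0."""
    if lam <= 0:
        raise ValueError("this sketch only handles positive, finite lam")
    return sum(abs(x - y) ** lam for x, y in zip(a, b)) ** (1 / lam)


origin, p = (0, 0), (3, 4)
euclidean = minkowski_distance(origin, p, 2)   # lam = 2: sqrt(9 + 16) = 5.0
cityblock = minkowski_distance(origin, p, 1)   # lam = 1: 3 + 4 = 7.0
fractional = minkowski_distance(origin, p, 0.5)  # lam between 0 and 1
```

For the same pair of points the result grows as λ shrinks, which is why the three formulas treat "near the center" differently.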
These three formulas approach the center point in somewhat different ways. Let's take a look (for the first one, λ is between 0 and 1).
(1) Minkowski Distance (2) Euclidean Distance (3) Cityblock Distance
The figures above mainly show how each formula approaches the center: the first in a star shape, the second in concentric circles, and the third in a diamond shape.
The implementation code is as follows:
# coding=utf-8
import pylab as pl

# Each line of the "points" file holds one point in the form "x#y"; blank lines are skipped.
with open("points", "r") as f:
    points = [[int(line.split("#")[0]), int(line.split("#")[1])] for line in f if line.strip()]

# Specify three initial centroids.
# NOTE: the original coordinates were lost from this copy; these values are placeholders.
currentCenter1 = [20, 190]; currentCenter2 = [120, 90]; currentCenter3 = [170, 140]
pl.plot([currentCenter1[0]], [currentCenter1[1]], 'ok')
pl.plot([currentCenter2[0]], [currentCenter2[1]], 'ok')
pl.plot([currentCenter3[0]], [currentCenter3[1]], 'ok')

# Record the update trajectory of the centroid of each cluster after each iteration
center1 = [currentCenter1]; center2 = [currentCenter2]; center3 = [currentCenter3]

# The three clusters
group1 = []; group2 = []; group3 = []

# NOTE: the original iteration count was lost from this copy; 50 is a placeholder.
for runtime in range(50):
    group1 = []; group2 = []; group3 = []
    for eachpoint in points:
        # Calculate the squared distance from each point to the three centroids
        distance1 = pow(eachpoint[0] - currentCenter1[0], 2) + pow(eachpoint[1] - currentCenter1[1], 2)
        distance2 = pow(eachpoint[0] - currentCenter2[0], 2) + pow(eachpoint[1] - currentCenter2[1], 2)
        distance3 = pow(eachpoint[0] - currentCenter3[0], 2) + pow(eachpoint[1] - currentCenter3[1], 2)

        # Assign the point to the cluster whose centroid is closest to it
        mindis = min(distance1, distance2, distance3)
        if mindis == distance1:
            group1.append(eachpoint)
        elif mindis == distance2:
            group2.append(eachpoint)
        else:
            group3.append(eachpoint)

    # After assigning all the points, update the centroid of each cluster.
    # An empty cluster keeps its previous centroid to avoid dividing by zero.
    if group1:
        currentCenter1 = [sum(p[0] for p in group1) / len(group1), sum(p[1] for p in group1) / len(group1)]
    if group2:
        currentCenter2 = [sum(p[0] for p in group2) / len(group2), sum(p[1] for p in group2) / len(group2)]
    if group3:
        currentCenter3 = [sum(p[0] for p in group3) / len(group3), sum(p[1] for p in group3) / len(group3)]

    # Record the centroid updates
    center1.append(currentCenter1)
    center2.append(currentCenter2)
    center3.append(currentCenter3)

# Plot all the points, colored by the cluster each one belongs to
pl.plot([p[0] for p in group1], [p[1] for p in group1], 'or')
pl.plot([p[0] for p in group2], [p[1] for p in group2], 'oy')
pl.plot([p[0] for p in group3], [p[1] for p in group3], 'og')

# Plot the update trajectory of the centroid of each cluster
for center in [center1, center2, center3]:
    pl.plot([eachcenter[0] for eachcenter in center], [eachcenter[1] for eachcenter in center], 'k')

pl.show()