Overview
In many practical applications, a large number of data points need to be grouped into clusters, and the center of each cluster must be computed. This is the well-known K-means algorithm.
The input to the K-means algorithm is N D-dimensional data points x_1, ..., x_N and the number of clusters K into which they should be divided. The output is the center point of each cluster, m_1, ..., m_K, and optionally the set of data points belonging to each cluster. The algorithm first determines initial center locations randomly or by heuristic search, and then alternates the following two steps:
1. Assignment: based on the current K center points, determine which cluster each data point belongs to. That is, using the coordinates of the current centers, compute the distance from each point to every center and assign the point to the cluster whose center is closest;
2. Update: based on the coordinates of the points in each cluster, recompute the cluster's center as their average.
This algorithm is not guaranteed to find a good clustering (it may settle on a poor local solution), so it is common to run it several times with different initial points.
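As a minimal sketch of these two alternating steps (assuming D-dimensional points stored as double[] arrays and squared Euclidean distance; the names KMeansSketch, Run, and Dist2 are illustrative, not from the original), the core loop might look like this:
```
using System;
using System.Linq;

static class KMeansSketch
{
    // Squared Euclidean distance between two D-dimensional points.
    static double Dist2(double[] a, double[] b) =>
        a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();

    // Alternate the assignment and update steps for a fixed number of iterations.
    public static double[][] Run(double[][] points, double[][] centers, int maxIter = 100)
    {
        int k = centers.Length, d = points[0].Length;
        var assign = new int[points.Length];

        for (int iter = 0; iter < maxIter; iter++)
        {
            // Step 1 (assignment): give each point to its nearest center.
            for (int i = 0; i < points.Length; i++)
                assign[i] = Enumerable.Range(0, k)
                                      .OrderBy(c => Dist2(points[i], centers[c]))
                                      .First();

            // Step 2 (update): recompute each center as the mean of its points.
            var sums = new double[k][];
            var counts = new int[k];
            for (int c = 0; c < k; c++) sums[c] = new double[d];
            for (int i = 0; i < points.Length; i++)
            {
                counts[assign[i]]++;
                for (int j = 0; j < d; j++) sums[assign[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++) centers[c][j] = sums[c][j] / counts[c];
        }
        return centers;
    }
}
```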
The K-means algorithm can be parallelized. Using a master-slave scheme, one node is responsible for scheduling and partitioning the data, while the other nodes compute local center statistics and send their results to the master node. It can therefore be implemented with MPI parallel programming. The approximate process is as follows:
1. Node P_0 divides the set of data points into P blocks D_1, ..., D_P and assigns them to the P nodes P_1, ..., P_P;
2. P_0 produces the initial K center points (m_1, ..., m_K) and broadcasts them to all nodes;
3. Node P_r computes the distance from each point in its block D_r to the K center points (m_1, ..., m_K) and assigns each point to the cluster of the nearest center;
4. Node P_r computes, for each cluster, the sum and the number of its local data points in D_r and sends them to P_0;
5. P_0 receives all of this data, computes the new center points (m_1, ..., m_K), and broadcasts them to all nodes;
6. Repeat steps 3-5 until convergence.
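A rough sketch of the per-node work in steps 3 and 4 (the actual MPI calls that broadcast the centers and gather the partial results at P_0 are omitted; the name LocalSumsAndCounts is illustrative):
```
// What each node P_r computes for its local block D_r: per-cluster coordinate
// sums and point counts, which P_0 then combines into the new centers
// (new center = total sum / total count).
static class WorkerNode
{
    public static (double[][] sums, int[] counts) LocalSumsAndCounts(
        double[][] localPoints, double[][] centers)
    {
        int k = centers.Length, d = centers[0].Length;
        var sums = new double[k][];
        for (int c = 0; c < k; c++) sums[c] = new double[d];
        var counts = new int[k];

        foreach (var p in localPoints)
        {
            // Step 3: find the nearest center for this point.
            int best = 0;
            double bestDist = double.MaxValue;
            for (int c = 0; c < k; c++)
            {
                double dist = 0;
                for (int j = 0; j < d; j++)
                    dist += (p[j] - centers[c][j]) * (p[j] - centers[c][j]);
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            // Step 4: accumulate the sum and count for that cluster.
            counts[best]++;
            for (int j = 0; j < d; j++) sums[best][j] += p[j];
        }
        return (sums, counts);
    }
}
```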
Problem description
The K-means algorithm primarily solves the problem shown in the figure below. On the left side of the figure are some points that the naked eye can group into four clusters, but how does a computer program find these groups? This is where the K-means algorithm comes in.
Algorithm overview
This algorithm is actually very simple, as shown in the figure below:
K-means Algorithm Overview
From the figure, we can see that A, B, C, D, and E are five data points. The gray points are our seed points, the points we use to find the groups. There are two seed points, so K = 2.
Then, the K-means algorithm is as follows:
1. Randomly place K (here K = 2) seed points in the graph.
2. For every point in the graph, compute its distance to each of the K seed points. If point Pi is closest to seed point Si, then Pi belongs to Si's point group. (In the figure, A and B belong to the upper seed point, and C, D, E belong to the lower seed point.)
3. Next, move each seed point to the center of its point group. (See the third step in the figure.)
4. Repeat steps 2 and 3 until the seed points no longer move. (In the fourth step of the figure, the upper seed point has gathered A, B, C, and the lower seed point has gathered D, E.)
This algorithm is very simple, but a few details are worth mentioning. I will not go into the distance formula, which anyone with a junior-high-school education can compute; I want to focus on the algorithm for finding the center of a point group.
Algorithm for finding the center of a point group
In general, you can use the average of the X/Y coordinates of the points in a group to find the group's center. However, I would also like to tell you about three other formulas related to the center point:
1) Minkowski distance formula: λ can be any value, negative, positive, or even infinite.
2) Euclidean distance formula: the first formula with λ = 2.
3) CityBlock distance formula: the first formula with λ = 1.
The center points found by these three formulas differ somewhat; see the figure below (for the first formula, λ is taken between 0 and 1).
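For reference, the Minkowski distance between two D-dimensional points x and y is d(x, y) = (sum over i of |x_i - y_i|^λ)^(1/λ); λ = 2 gives the Euclidean distance and λ = 1 gives the CityBlock distance. A small sketch (the name MinkowskiDistance is illustrative):
```
using System;

static class Distances
{
    // Minkowski distance: (sum over i of |x_i - y_i|^lambda)^(1/lambda).
    // lambda = 2 -> Euclidean distance, lambda = 1 -> CityBlock distance.
    public static double MinkowskiDistance(double[] x, double[] y, double lambda)
    {
        double sum = 0;
        for (int i = 0; i < x.Length; i++)
            sum += Math.Pow(Math.Abs(x[i] - y[i]), lambda);
        return Math.Pow(sum, 1.0 / lambda);
    }
}
```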
K-means++ algorithm
K-means has two significant flaws, both related to initial values:
K is given in advance, and choosing this value of K is very difficult. In many cases there is no prior knowledge of how many categories a given dataset should be divided into. (The ISODATA algorithm obtains a more reasonable number of classes K through automatic merging and splitting of classes.)
The K-means algorithm has to start from randomly chosen initial seed points, and these random seed points matter so much that different choices can produce completely different results. (The K-means++ algorithm can be used to solve this problem; it selects the initial points effectively.)
Here I want to focus on the steps of the K-means++ algorithm:
1. Randomly pick one point from our dataset as the first "seed point".
2. For each point, compute the distance D(x) between it and the nearest seed point chosen so far, save these distances in an array, and add them up to get Sum(D(x)).
3. Then take a random value and use a weighted method to choose the next "seed point". Concretely, draw a random value that falls within Sum(D(x)), then repeatedly subtract D(x) from it over the points until it is <= 0; the point at which this happens is the next seed point.
4. Repeat steps 2 and 3 until all K seed points have been selected.
5. Run the standard K-means algorithm with these seed points.
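A minimal sketch of this seeding procedure (assuming the same double[] point representation as above; the names KMeansPlusPlusSeeds and DistToNearestSeed are illustrative, and the weighting follows the D(x) described in the steps):
```
using System;
using System.Collections.Generic;

static class Seeding
{
    // Euclidean distance from a point to the nearest already-chosen seed.
    static double DistToNearestSeed(double[] p, List<double[]> seeds)
    {
        double best = double.MaxValue;
        foreach (var s in seeds)
        {
            double d = 0;
            for (int i = 0; i < p.Length; i++) d += (p[i] - s[i]) * (p[i] - s[i]);
            best = Math.Min(best, Math.Sqrt(d));
        }
        return best;
    }

    public static List<double[]> KMeansPlusPlusSeeds(double[][] points, int k, Random rng)
    {
        // Step 1: pick the first seed point at random.
        var seeds = new List<double[]> { points[rng.Next(points.Length)] };

        while (seeds.Count < k)
        {
            // Step 2: D(x) for every point, plus Sum(D(x)).
            var d = new double[points.Length];
            double sum = 0;
            for (int i = 0; i < points.Length; i++)
            {
                d[i] = DistToNearestSeed(points[i], seeds);
                sum += d[i];
            }

            // Step 3: draw a random value in [0, Sum(D(x))) and walk it down.
            double random = rng.NextDouble() * sum;
            int next = points.Length - 1;
            for (int i = 0; i < points.Length; i++)
            {
                random -= d[i];
                if (random <= 0) { next = i; break; }
            }
            seeds.Add(points[next]);
        }
        return seeds;
    }
}
```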
Algorithm application example
This was used to group a large amount of image data; below is a C# code snippet recorded at the time:
```
// Calculation method
using System;
using System.Collections.Generic;
using System.Linq;

public class K_mean
{
    public List<double> List { get; set; }
    public double Average { get { return List.Average(); } }
    public double Center { get; set; }
    public double Change { get; private set; }
    public double S2 { get; set; } // standard deviation

    public K_mean()
    {
        Change = double.MaxValue;
        List = new List<double>();
        S2 = 1;
    }

    // Move the center to the current average and record how far it moved.
    public void RefreshCenter()
    {
        Change = Math.Abs(Average - Center);
        Center = Average;
    }

    // Recompute the standard deviation of the values in this cluster.
    public void RefreshS2()
    {
        double tempS2 = 0;
        for (int j = 0; j < this.List.Count; j++)
        {
            tempS2 = tempS2 + (this.List[j] - Average) * (this.List[j] - Average);
        }
        S2 = Math.Sqrt(tempS2 / this.List.Count);
    }
}
```
An application example of the K-means algorithm used for grouping data into clusters.
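As a minimal sketch of how this class might be driven for one-dimensional data (this driver loop is not part of the original snippet; the names clusters and values, the initial centers, and the stopping threshold are illustrative and assume the usings above):
```
// Assign each value to the cluster whose Center is nearest, then refresh
// the centers; repeat until no center moves noticeably.
var clusters = new List<K_mean> { new K_mean { Center = 10 }, new K_mean { Center = 200 } };
double[] values = { 8, 12, 11, 190, 205, 210 };

while (clusters.Any(c => c.Change > 0.001))
{
    foreach (var c in clusters) c.List.Clear();
    foreach (var v in values)
        clusters.OrderBy(c => Math.Abs(v - c.Center)).First().List.Add(v);
    foreach (var c in clusters)
    {
        c.RefreshCenter();
        c.RefreshS2();
    }
}
```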