0. Contents
- Prerequisites
- Introduction of the idea
- Details
- 1 Determining the center point of H
- 2 Algorithm steps
- Java implementation
1. Prerequisites
This article is based on "Accelerating Exact k-means Algorithms with Geometric Reasoning"
kd-tree
K-means
2. Introduction of the idea
After initializing the center points, the K-means algorithm iterates the following steps to reach a locally optimal solution:
A. Assign each point x in the data set D to its nearest center point.
B. Recompute the center point of each cluster.
In the traditional algorithm, step A needs to compute n*k distances (n is the size of D, k is the number of clusters), and step B needs to sum n points.
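For reference, one iteration of the traditional algorithm can be sketched as follows. This is a minimal illustration rather than the implementation from the paper: the class name LloydStep, the method names, and the choice to keep the old center for an empty cluster are all assumptions made for the example.

```java
class LloydStep {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** One iteration: assign every point to its nearest center, then average. */
    static double[][] iterate(double[][] data, double[][] centers) {
        int k = centers.length;
        int m = centers[0].length;
        double[][] sums = new double[k][m];
        int[] counts = new int[k];

        // Step A: n * k distance computations.
        for (double[] x : data) {
            int best = 0;
            double bestDist = squaredDistance(x, centers[0]);
            for (int j = 1; j < k; j++) {
                double dist = squaredDistance(x, centers[j]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = j;
                }
            }
            for (int d = 0; d < m; d++) sums[best][d] += x[d];
            counts[best]++;
        }

        // Step B: each new center is the mean of the points assigned to it.
        double[][] updated = new double[k][];
        for (int j = 0; j < k; j++) {
            updated[j] = centers[j].clone(); // assumption: an empty cluster keeps its old center
            for (int d = 0; d < m && counts[j] > 0; d++) {
                updated[j][d] = sums[j][d] / counts[j];
            }
        }
        return updated;
    }
}
```

Calling LloydStep.iterate repeatedly until the centers stop moving gives the usual K-means loop. The nested loop in step A is the cost that the kd-tree approach tries to avoid.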
In a kd-tree, each non-leaf node stores the range information H of the data it contains.
(Figure: in two-dimensional space, H can be represented by a rectangle. Each * is a data point, and the red rectangle is the data range H.)
A. If the range information shows that all data in the node belongs to a center point c, the distance calculations from that data to the center points can be skipped.
If it shows that none of the data in H belongs to a center point c, the distance calculations from the node's data to c can be skipped.
B. Once all data in the node is known to belong to c, the statistics accumulated in advance for H can be added directly to the statistics of c.
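To make the idea concrete, here is a rough sketch of what such a node might hold. The class name KdNodeSketch, the field names min and max, and the method addTo are assumptions made for illustration; sumOfPoints and n mirror the node properties listed in section 3.2.

```java
class KdNodeSketch {
    final double[] min;         // smallest value per dimension (lower corner of H)
    final double[] max;         // largest value per dimension (upper corner of H)
    final double[] sumOfPoints; // per-dimension sum of all points in the node
    final int n;                // number of points in the node
    KdNodeSketch left, right;   // children; both null for a leaf

    /** Computes the range H and the cached statistics for the given points. */
    KdNodeSketch(double[][] points) {
        int m = points[0].length;
        min = points[0].clone();
        max = points[0].clone();
        sumOfPoints = new double[m];
        n = points.length;
        for (double[] p : points) {
            for (int d = 0; d < m; d++) {
                min[d] = Math.min(min[d], p[d]);
                max[d] = Math.max(max[d], p[d]);
                sumOfPoints[d] += p[d];
            }
        }
    }

    /** Adds the cached statistics to center t without visiting the points. */
    void addTo(double[][] centers, int[] dataCount, int t) {
        for (int d = 0; d < sumOfPoints.length; d++) centers[t][d] += sumOfPoints[d];
        dataCount[t] += n;
    }
}
```

With this layout, step B costs one vector addition per node instead of one addition per point, which is where the saving comes from when a whole node belongs to a single center.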
3. Details
3.1 Determining the center point of H (all data in H is closer to this center point than to any other center point)
Each kd-tree node stores max (the maximum value in each dimension) and min (the minimum value in each dimension), which together determine the range of the data in the node.
Center points: (c1, c2, ..., ck)
A. Determine whether a candidate exists
Calculate the minimum distance d(ci, H) from each center point ci to H (refer to step 5 of the kd-tree nearest-neighbor lookup).
If exactly one center point attains the smallest distance, this ci may be the center point of H (further judgment is required).
If more than one center point attains the smallest distance, H has no center point, and the search continues in the smaller ranges of H's left and right subtrees.
(Figure: the center points drawn as squares lie in the interior of H, so they all have the same minimum distance to H, namely 0. This H has no center point.)
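Step A can be sketched as below, assuming H is given by its per-dimension min and max arrays. The class and method names are hypothetical, squared distances are used to avoid a square root, and a tie is detected by exact floating-point equality, which is a simplification.

```java
class BoxDistance {
    /** Squared minimum distance from center c to the box [min, max]; 0 if c is inside. */
    static double minSquaredDistance(double[] c, double[] min, double[] max) {
        double sum = 0.0;
        for (int d = 0; d < c.length; d++) {
            double diff = 0.0;
            if (c[d] < min[d]) diff = min[d] - c[d];
            else if (c[d] > max[d]) diff = c[d] - max[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the unique closest center, or -1 when the smallest distance is shared. */
    static int uniqueClosest(double[][] centers, double[] min, double[] max) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        boolean tie = false;
        for (int i = 0; i < centers.length; i++) {
            double dist = minSquaredDistance(centers[i], min, max);
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
                tie = false;
            } else if (dist == bestDist) {
                tie = true;
            }
        }
        return tie ? -1 : best;
    }
}
```

A return value of -1 corresponds to the case where H has to be split further. Any other value is only a candidate and still needs the check in step B.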
B. Further judgment: is ci the center point?
(Figure: L12 is the perpendicular bisector of c1 and c2. H falls entirely on the c1 side, so every point in H is closer to c1 than to c2; we say c1 is better than c2. For c1 and c3, part of H falls on the c1 side and part on the c3 side, so c1 is not better than c3.)
To determine whether c1 is better than c3, take the direction vector v = (c3 - c1) and find the point p in H that maximizes the inner product <v, p>. In the figure, the signs of v in the two dimensions are (+, -), so p is as large as possible on the x-axis and as small as possible on the y-axis, which gives the corner P13. P13 is closer to c3, so c1 is not better than c3.
If ci is better than every other center point, ci is the center point of H; otherwise it is not.
Even when ci is not the center point of H, the information obtained is still useful: for example, if ci is better than c2, then c2 can be removed from the candidate center list for H's subtrees.
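The check in step B can be sketched like this. The names Dominance, isBetter, and prune are assumptions, and H is again given by its min and max arrays. The sketch relies on the fact that the corner of H furthest in the direction v is the point of H most favorable to the other center, so testing that single corner is enough.

```java
import java.util.ArrayList;
import java.util.List;

class Dominance {
    /** True when every point of the box [min, max] is strictly closer to c1 than to c2. */
    static boolean isBetter(double[] c1, double[] c2, double[] min, double[] max) {
        double toC1 = 0.0;
        double toC2 = 0.0;
        for (int d = 0; d < c1.length; d++) {
            // v = c2 - c1; the corner maximizing <v, p> takes max where v > 0, min otherwise.
            double p = (c2[d] - c1[d] > 0) ? max[d] : min[d];
            toC1 += (p - c1[d]) * (p - c1[d]);
            toC2 += (p - c2[d]) * (p - c2[d]);
        }
        return toC1 < toC2;
    }

    /** Keeps ct itself plus every candidate that ct is not better than. */
    static List<double[]> prune(double[] ct, List<double[]> candidates,
                                double[] min, double[] max) {
        List<double[]> kept = new ArrayList<>();
        for (double[] c : candidates) {
            if (c == ct || !isBetter(ct, c, min, max)) kept.add(c);
        }
        return kept;
    }
}
```

If prune returns a list containing only ct, then ct is the center point of H. Otherwise the shorter list is what gets passed down to the subtrees.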
3.2 Algorithm steps
Each non-leaf node in the kd-tree has two additional properties: sumofpoints, an M-dimensional vector (M is the dimension of the data) whose i-th component is the sum of the i-th coordinates of the data in the node; and n, the number of data points in the node.
Input: kdtree, and C containing the center points (c1, c2, ..., ck)
Output: Cnew, the k new center points
node = kdtree.root
centers = k*m array    // row t stores the sum of the data belonging to center point t
datacount = k*1 array  // stores the number of data points belonging to each center point
UPDATE(node, C)
Cnew[t] = centers[t] / datacount[t] for each center point t

UPDATE(node, C):
    IF node is a leaf node
        traverse C to find the center point ct closest to node
        centers[t] += node.value
        datacount[t] += 1
        RETURN
    FOR (ci in C)
        compute d(ci, node.h)
    IF there are multiple minimum d(ci, node.h)
        UPDATE(node.left, C)
        UPDATE(node.right, C)
        RETURN
    let ct be the center point with the smallest d(ci, node.h)
    ctover = []    // stores the center points inferior to ct
    FOR (ci in C, except ct)
        IF (ct is better than ci)
            ctover.ADD(ci)
    IF (LEN(ctover) = LEN(C) - 1)    // ct is better than all other center points
        centers[t] += node.sumofpoints
        datacount[t] += node.n
        RETURN
    Ct = (ci in C and ci not in ctover)    // exclude the center points inferior to ct
    UPDATE(node.left, Ct)
    UPDATE(node.right, Ct)
    RETURN
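Putting the pieces together, the steps above might look roughly like this in Java. This is a sketch under several assumptions, not the implementation linked in section 4: the class KdKMeans, the Node layout, and the median-split build method are hypothetical, every leaf is assumed to hold exactly one point (so its cached sum is the point itself), and candidates are tracked as indices into the center array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KdKMeans {
    static class Node {
        double[] min, max, sumOfPoints; // range H and cached coordinate sums
        int n;                          // number of points in the node
        Node left, right;               // both null for a leaf
    }

    // Median-split construction; a leaf holds exactly one point (an assumption).
    static Node build(double[][] pts, int depth) {
        int m = pts[0].length;
        Node node = new Node();
        node.n = pts.length;
        node.min = pts[0].clone();
        node.max = pts[0].clone();
        node.sumOfPoints = new double[m];
        for (double[] p : pts) {
            for (int d = 0; d < m; d++) {
                node.min[d] = Math.min(node.min[d], p[d]);
                node.max[d] = Math.max(node.max[d], p[d]);
                node.sumOfPoints[d] += p[d];
            }
        }
        if (pts.length > 1) {
            int axis = depth % m;
            double[][] sorted = pts.clone();
            Arrays.sort(sorted, (a, b) -> Double.compare(a[axis], b[axis]));
            int mid = sorted.length / 2;
            node.left = build(Arrays.copyOfRange(sorted, 0, mid), depth + 1);
            node.right = build(Arrays.copyOfRange(sorted, mid, sorted.length), depth + 1);
        }
        return node;
    }

    // Squared minimum distance d(ci, H); 0 when the center lies inside H.
    static double minSqDist(double[] c, Node h) {
        double sum = 0.0;
        for (int d = 0; d < c.length; d++) {
            double diff = Math.max(0.0, Math.max(h.min[d] - c[d], c[d] - h.max[d]));
            sum += diff * diff;
        }
        return sum;
    }

    // True when every point of H is strictly closer to a than to b.
    static boolean isBetter(double[] a, double[] b, Node h) {
        double toA = 0.0, toB = 0.0;
        for (int d = 0; d < a.length; d++) {
            double p = (b[d] > a[d]) ? h.max[d] : h.min[d]; // extreme corner toward b
            toA += (p - a[d]) * (p - a[d]);
            toB += (p - b[d]) * (p - b[d]);
        }
        return toA < toB;
    }

    // UPDATE(node, C): cand holds the indices of the remaining candidate centers.
    static void update(Node node, double[][] c, List<Integer> cand, double[][] sums, int[] counts) {
        int t = -1;
        double best = Double.POSITIVE_INFINITY;
        boolean tie = false;
        for (int i : cand) {
            double dist = minSqDist(c[i], node);
            if (dist < best) { best = dist; t = i; tie = false; }
            else if (dist == best) tie = true;
        }
        List<Integer> next = cand;
        if (node.left != null && !tie) {
            next = new ArrayList<>(); // keep ct and the centers ct is not better than
            for (int i : cand) if (i == t || !isBetter(c[t], c[i], node)) next.add(i);
        }
        if (node.left == null || next.size() == 1) { // leaf, or ct beats all others
            for (int d = 0; d < node.sumOfPoints.length; d++) sums[t][d] += node.sumOfPoints[d];
            counts[t] += node.n;
            return;
        }
        update(node.left, c, next, sums, counts);
        update(node.right, c, next, sums, counts);
    }

    // One accelerated iteration over the whole tree: returns the k new centers.
    static double[][] iterate(Node root, double[][] c) {
        int k = c.length, m = c[0].length;
        double[][] sums = new double[k][m];
        int[] counts = new int[k];
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < k; i++) all.add(i);
        update(root, c, all, sums, counts);
        double[][] out = new double[k][];
        for (int i = 0; i < k; i++) {
            out[i] = c[i].clone(); // an empty cluster keeps its old center
            for (int d = 0; d < m && counts[i] > 0; d++) out[i][d] = sums[i][d] / counts[i];
        }
        return out;
    }
}
```

The tree is built once with KdKMeans.build(points, 0) and reused across iterations, since only the centers change. Each call to KdKMeans.iterate(root, centers) then plays the role of one assign-and-recompute round, and should produce the same centers as the traditional iteration up to floating-point rounding and tie-breaking.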
4. Java implementation
Using a kd-tree to accelerate K-means