Use Kd-tree to accelerate K-means

Last Update:2015-04-29 Source: Internet

Author: User


0. Contents

Prerequisites
Introduction of the idea
Details
- 1 Determining the center point of H
- 2 Algorithm steps
Java implementation

1. Prerequisites

This article is based on the paper "Accelerating Exact k-means Algorithms with Geometric Reasoning" (Pelleg and Moore).
Readers should already be familiar with:
Kd-tree
K-means

2. Introduction of the Idea

After initializing the center points, the K-means algorithm repeats the following two steps until it reaches a local optimum:
A. Assign each point x in data set D to its nearest center point.
B. In each cluster, recompute the center point as the mean of the points assigned to it.
In the traditional algorithm, step A computes n*k distances (n is the size of D, k is the number of clusters), and step B sums n points.
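To make the n*k cost concrete, here is a minimal Java sketch of one traditional iteration (steps A and B). The class and method names (NaiveKMeansStep, iterate, squaredDistance) are made up for illustration, and keeping the old center for an empty cluster is an assumption, not something the article specifies.

```java
final class NaiveKMeansStep {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** One iteration: step A costs n*k distance computations, step B sums n points. */
    static double[][] iterate(double[][] data, double[][] centers) {
        int k = centers.length;
        int m = centers[0].length;
        double[][] sums = new double[k][m];
        int[] counts = new int[k];
        for (double[] x : data) {
            // Step A: find the nearest center for x.
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int i = 0; i < k; i++) {
                double dist = squaredDistance(x, centers[i]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = i;
                }
            }
            // Accumulate the statistics needed by step B.
            for (int d = 0; d < m; d++) {
                sums[best][d] += x[d];
            }
            counts[best]++;
        }
        // Step B: the new center is the mean of the points assigned to it.
        double[][] next = new double[k][];
        for (int i = 0; i < k; i++) {
            if (counts[i] == 0) {
                next[i] = centers[i].clone(); // assumption: an empty cluster keeps its center
                continue;
            }
            next[i] = new double[m];
            for (int d = 0; d < m; d++) {
                next[i][d] = sums[i][d] / counts[i];
            }
        }
        return next;
    }
}
```

Calling NaiveKMeansStep.iterate(data, centers) repeatedly until the centers stop moving gives the usual algorithm; the kd-tree technique only changes how the per-center sums and counts are collected.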
In a kd-tree, each non-leaf node stores the range information H of the data it contains.

In two-dimensional space, H can be represented by a rectangle.
(Figure: each * is a data point; the red rectangle is the data range H.)

A. If the range information shows that all the data in a node belongs to center point C, the distance computations from the node's data to the center points can be skipped.
Likewise, if the range information shows that none of the data in H belongs to center point C, the distance computations from the node's data to C can be skipped.
B. When all the data in a node is known to belong to C, the statistics already accumulated for H can be added directly to the statistics of C.

3. Details

3.1 Determining the center point of H

A center point is the center point of H when every point in H is closer to it than to any other center point.

Each kd-tree node stores max (the maximum value in each dimension) and min (the minimum value in each dimension), which together determine the range H of the data in the node.
The center points are (c1, c2, ..., ck).
A. Determine whether a candidate exists
Calculate the minimum distance d(ci, H) from each center point ci to H (see step 5 of the kd-tree nearest-neighbor search).
If exactly one center point attains the smallest distance, that ci may be the center point of H (further judgment is required).
If several center points share the smallest distance, H has no center point, and the search continues in the smaller ranges of H's left and right subtrees.

For example, when the center points (drawn as squares in the original figure) all lie inside H,
each of them has the same minimum distance to H, namely 0.
Such an H has no center point.
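The minimum distance d(ci, H) can be sketched in Java as follows, assuming H is given by the per-dimension min and max arrays described above. The name BoxDistance is hypothetical, and the squared distance is returned because only comparisons between distances matter here.

```java
final class BoxDistance {
    /**
     * Squared minimum distance from point c to the axis-aligned box [min, max].
     * In each dimension the closest coordinate inside the box is c clamped to
     * [min[d], max[d]]; a point inside the box therefore has distance 0.
     */
    static double minSquaredDistance(double[] c, double[] min, double[] max) {
        double sum = 0.0;
        for (int d = 0; d < c.length; d++) {
            double clamped = Math.max(min[d], Math.min(c[d], max[d]));
            double diff = c[d] - clamped;
            sum += diff * diff;
        }
        return sum;
    }
}
```

Any two center points that both lie inside H produce 0 here, which is exactly the tie case in which H has to be split further.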

B. Further judgment: is ci the center point of H?

In the original figure, L12 is the perpendicular bisector of C1 and C2, and H falls entirely on the C1 side, so every point in H is closer to C1 than to C2. This is called "C1 is better than C2".
For C1 and C3, part of H falls on the C1 side and part on the C3 side, so C1 is not better than C3.
To determine whether C1 is better than C3, take the direction vector v = (C3 - C1) and find the point p in H that maximizes the inner product <v, p>.
In the figure, the signs of v in the two dimensions are (+, -), so p is as large as possible on the x-axis and as small as possible on the y-axis, which gives the corner P13.
P13 is closer to C3 than to C1, so C1 is not better than C3.

If ci is better than every other center point, ci is the center point of H; otherwise it is not.
Even when ci is not the center point of H, the information obtained is still useful: if, for example, ci is better than c2, then c2 can be removed from the candidate center point list for H's subtrees.
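The "better than" test can be sketched in Java as below. It picks the corner of H that lies furthest in the direction v = (c2 - c1) and checks whether even that corner is still closer to c1. The names Dominance and isBetter are hypothetical, and treating a tie as "not better" is an assumption.

```java
final class Dominance {
    /**
     * Returns true if c1 is better than c2 over the box [min, max], i.e. every
     * point of the box is strictly closer to c1 than to c2.
     */
    static boolean isBetter(double[] c1, double[] c2, double[] min, double[] max) {
        double distToC1 = 0.0;
        double distToC2 = 0.0;
        for (int d = 0; d < c1.length; d++) {
            double v = c2[d] - c1[d];
            // Corner maximizing <v, p>: the upper bound where v is positive, else the lower bound.
            double p = v > 0 ? max[d] : min[d];
            distToC1 += (p - c1[d]) * (p - c1[d]);
            distToC2 += (p - c2[d]) * (p - c2[d]);
        }
        // If the corner most favorable to c2 is still closer to c1, the whole box is.
        return distToC1 < distToC2;
    }
}
```

Only one corner is examined per pair of center points, so the test costs O(m) for m-dimensional data regardless of how many points the node contains.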

3.2 Algorithm Steps

Each non-leaf node in the kd-tree carries two extra properties:
sumofpoints: an m-dimensional vector (m is the dimension of the data) whose i-th component is the sum of the i-th components of all the data in the node
n: the number of data points in the node
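One way these properties could be filled in while the tree is built is sketched below. This is an illustrative layout, not the article's implementation: the class KdNode, the one-point-per-leaf rule, and the cycling split dimension are all assumptions, and the field is spelled sumOfPoints to follow Java naming.

```java
import java.util.Arrays;
import java.util.Comparator;

final class KdNode {
    final double[] min;          // per-dimension minimum of the data in this node
    final double[] max;          // per-dimension maximum of the data in this node
    final double[] sumOfPoints;  // per-dimension sum of the data in this node
    final int n;                 // number of data points in this node
    final KdNode left;           // null for a leaf
    final KdNode right;          // null for a leaf

    /** Leaf holding a single point (assumption: one point per leaf). */
    private KdNode(double[] p) {
        min = p.clone();
        max = p.clone();
        sumOfPoints = p.clone();
        n = 1;
        left = null;
        right = null;
    }

    /** Inner node: range and statistics are merged from the two children. */
    private KdNode(KdNode l, KdNode r) {
        int m = l.min.length;
        min = new double[m];
        max = new double[m];
        sumOfPoints = new double[m];
        for (int d = 0; d < m; d++) {
            min[d] = Math.min(l.min[d], r.min[d]);
            max[d] = Math.max(l.max[d], r.max[d]);
            sumOfPoints[d] = l.sumOfPoints[d] + r.sumOfPoints[d];
        }
        n = l.n + r.n;
        left = l;
        right = r;
    }

    /**
     * Builds a tree over points[from, to), which must be non-empty.
     * The slice is reordered in place; the split dimension cycles with depth.
     */
    static KdNode build(double[][] points, int from, int to, int depth) {
        if (to - from == 1) {
            return new KdNode(points[from]);
        }
        final int dim = depth % points[from].length;
        Arrays.sort(points, from, to, Comparator.comparingDouble((double[] p) -> p[dim]));
        int mid = (from + to) >>> 1;
        return new KdNode(build(points, from, mid, depth + 1),
                          build(points, mid, to, depth + 1));
    }
}
```

Because the statistics are computed once at build time, every later K-means iteration can reuse them without revisiting the individual points of a node.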

Input: the kd-tree and C, the set of center points (c1, c2, ..., ck)

Output: Cnew, the k new center points

node = kdtree.root
centers = k*m array  // row t stores the sum of the data belonging to center point ct
datacount = k*1 array  // entry t stores the number of data points belonging to center point ct

UPDATE(node, C):
    IF node is a leaf node
        traverse C to find the center point ct closest to node.value
        centers[t] += node.value
        datacount[t] += 1
        RETURN

    FOR (ci in C) calculate d(ci, node.h)
    IF several ci share the minimum d(ci, node.h)
        UPDATE(node.left, C)
        UPDATE(node.right, C)
        RETURN
    let ct be the center point with the smallest d(ci, node.h)
    ctover = []  // stores the center points that ct is better than
    FOR (ci in C, except ct) IF (ct is better than ci) ctover.ADD(ci)
    IF (LEN(ctover) = LEN(C) - 1)  // ct is better than every other center point
        centers[t] += node.sumofpoints
        datacount[t] += node.n
        RETURN
    Crest = (ci in C and ci not in ctover)  // drop the center points that ct is better than
    UPDATE(node.left, Crest)
    UPDATE(node.right, Crest)
    RETURN

After UPDATE(kdtree.root, C) returns, each new center point is Cnew[t] = centers[t] / datacount[t].
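The steps above can be sketched in Java roughly as follows. This is not the article's implementation: the class FilteringKMeans, the Node layout with one point per leaf, and keeping the old center for an empty cluster are all assumptions made for the sake of a compact example.

```java
import java.util.ArrayList;
import java.util.List;

final class FilteringKMeans {
    /** Hypothetical node layout: range H (min, max) plus the cached statistics. */
    static final class Node {
        double[] min, max, sumOfPoints;
        int n;
        Node left, right; // both null for a leaf

        static Node leaf(double[] p) {
            Node node = new Node();
            node.min = p.clone(); node.max = p.clone(); node.sumOfPoints = p.clone();
            node.n = 1;
            return node;
        }

        static Node inner(Node l, Node r) {
            Node node = new Node();
            int m = l.min.length;
            node.min = new double[m]; node.max = new double[m]; node.sumOfPoints = new double[m];
            for (int d = 0; d < m; d++) {
                node.min[d] = Math.min(l.min[d], r.min[d]);
                node.max[d] = Math.max(l.max[d], r.max[d]);
                node.sumOfPoints[d] = l.sumOfPoints[d] + r.sumOfPoints[d];
            }
            node.n = l.n + r.n; node.left = l; node.right = r;
            return node;
        }
    }

    /** Squared minimum distance d(c, H) from c to the node's range. */
    static double minSqDist(double[] c, Node h) {
        double sum = 0.0;
        for (int d = 0; d < c.length; d++) {
            double diff = c[d] - Math.max(h.min[d], Math.min(c[d], h.max[d]));
            sum += diff * diff;
        }
        return sum;
    }

    /** True if a is better than b over H: the corner furthest towards b is still closer to a. */
    static boolean isBetter(double[] a, double[] b, Node h) {
        double da = 0.0, db = 0.0;
        for (int d = 0; d < a.length; d++) {
            double p = (b[d] - a[d]) > 0 ? h.max[d] : h.min[d];
            da += (p - a[d]) * (p - a[d]);
            db += (p - b[d]) * (p - b[d]);
        }
        return da < db;
    }

    /** Adds the cached statistics of node to center t. */
    static void credit(Node node, int t, double[][] sums, int[] counts) {
        for (int d = 0; d < node.sumOfPoints.length; d++) sums[t][d] += node.sumOfPoints[d];
        counts[t] += node.n;
    }

    static void update(Node node, List<Integer> candidates, double[][] c, double[][] sums, int[] counts) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        boolean tie = false;
        for (int i : candidates) {
            double dist = minSqDist(c[i], node);
            if (dist < bestDist) { bestDist = dist; best = i; tie = false; }
            else if (dist == bestDist) { tie = true; }
        }
        // A one-point leaf has a degenerate range, so "best" is simply its nearest center.
        if (node.left == null) { credit(node, best, sums, counts); return; }
        List<Integer> survivors = candidates;
        if (!tie) {
            survivors = new ArrayList<>();
            for (int i : candidates) {
                if (i == best || !isBetter(c[best], c[i], node)) survivors.add(i);
            }
            // best is better than every other candidate: take the whole node at once.
            if (survivors.size() == 1) { credit(node, best, sums, counts); return; }
        }
        update(node.left, survivors, c, sums, counts);
        update(node.right, survivors, c, sums, counts);
    }

    /** One K-means iteration over the tree; returns the new center points. */
    static double[][] step(Node root, double[][] c) {
        int k = c.length, m = c[0].length;
        double[][] sums = new double[k][m];
        int[] counts = new int[k];
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < k; i++) all.add(i);
        update(root, all, c, sums, counts);
        double[][] next = new double[k][];
        for (int i = 0; i < k; i++) {
            next[i] = c[i].clone(); // assumption: an empty cluster keeps its center
            for (int d = 0; d < m && counts[i] > 0; d++) next[i][d] = sums[i][d] / counts[i];
        }
        return next;
    }
}
```

Passing the shrinking candidate list down the recursion is what lets deep nodes compare against only a few center points, and crediting a whole node at once is where the n*k distance computations are saved.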

4. Java implementation


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
