Use Kd-tree to accelerate K-means

Source: Internet
Author: User

0. Contents

    • Prerequisites
    • Introduction of the idea
    • Details
      • 1 Determining the center point of H
      • 2 Algorithm steps
    • Java implementation

1. Prerequisites

This article is based on the paper "Accelerating exact k-means algorithms with geometric reasoning". Readers should already be familiar with:
Kd-tree
K-means

2. Introduction of the idea

After initializing the center points, the K-means algorithm repeats the following steps until it reaches a locally optimal solution:
A. Assign each point x in dataset D to its nearest center point
B. Recompute the center point of each cluster
In the traditional algorithm, step A needs to compute n*k distances (n is the size of D, k is the number of clusters), and step B needs to sum n points.
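For reference, the traditional iteration described above can be sketched in Java as follows. This is only an illustration: the class name NaiveKMeansStep, the method names, and the choice to leave an empty cluster's center unchanged are assumptions, not part of the original article.

```java
final class NaiveKMeansStep {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            s += diff * diff;
        }
        return s;
    }

    /** One iteration: assign every point to its nearest center, then recompute the centers. */
    static double[][] iterate(double[][] data, double[][] centers) {
        int k = centers.length, m = centers[0].length;
        double[][] sums = new double[k][m];
        int[] counts = new int[k];
        for (double[] x : data) {
            // Step A: k distance computations per point, n * k in total.
            int best = 0;
            for (int t = 1; t < k; t++) {
                if (sqDist(x, centers[t]) < sqDist(x, centers[best])) best = t;
            }
            // Step B (first half): each of the n points is added to one running sum.
            counts[best]++;
            for (int d = 0; d < m; d++) sums[best][d] += x[d];
        }
        double[][] next = new double[k][];
        for (int t = 0; t < k; t++) {
            next[t] = centers[t].clone();   // assumption: an empty cluster keeps its old center
            if (counts[t] > 0) {
                for (int d = 0; d < m; d++) next[t][d] = sums[t][d] / counts[t];
            }
        }
        return next;
    }
}
```

Every point is compared against every center here, which is exactly the cost the Kd-tree variant tries to avoid.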
In a Kd-tree, each non-leaf node stores the range information H of the data it contains.

In two-dimensional space, H can be represented by a rectangle.
(Figure: each * is a data point; the red rectangle is the data range H.)

A. If the range information shows that all the data in a node belongs to center point C, the distance computations between the node's data and the center points can be skipped.
If the range information shows that none of the data in H can belong to a center point C, the distance computations from the node's data to C can be skipped.
B. When all the data in a node is known to belong to C, the statistics already accumulated for H can be added directly to the statistics of C.

3. Details

3.1 Determining the center point of H (all data in H is closer to this center point than to any other center point)

Each Kd-tree node stores max (the maximum value in each dimension) and min (the minimum value in each dimension), which together define the range H of the data in the node.
The center points are (c1, c2, ..., ck).
A. Determine whether H can have a center point
Compute the minimum distance d(ci, H) from each center point ci to H (see step 5 of the Kd-tree nearest-neighbor search).
If exactly one center point attains the smallest d(ci, H), that ci may be the center point of H (further checking is required).
If more than one center point attains the smallest d(ci, H), H has no center point; split H into its smaller parts (the left and right subtrees of H) and search there.

(Figure) The center points drawn as squares lie inside H,
so they have the same minimum distance to H, namely 0.
This H therefore has no center point.
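Check A can be sketched in Java as below. The names BoxDistance, minSqDist and candidate are hypothetical, and squared distances are used instead of distances, an assumption that leaves the comparison unchanged.

```java
final class BoxDistance {
    /** Squared minimum distance from center c to the range [min, max]; 0 if c lies inside. */
    static double minSqDist(double[] c, double[] min, double[] max) {
        double s = 0;
        for (int d = 0; d < c.length; d++) {
            double diff = 0;
            if (c[d] < min[d]) diff = min[d] - c[d];        // c is below the range in this dimension
            else if (c[d] > max[d]) diff = c[d] - max[d];   // c is above the range in this dimension
            s += diff * diff;
        }
        return s;
    }

    /** Index of the single center closest to the range, or -1 if the minimum is shared. */
    static int candidate(double[][] centers, double[] min, double[] max) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        boolean tie = false;
        for (int i = 0; i < centers.length; i++) {
            double dist = minSqDist(centers[i], min, max);
            if (dist < bestDist) {
                best = i;
                bestDist = dist;
                tie = false;
            } else if (dist == bestDist) {
                tie = true;
            }
        }
        return tie ? -1 : best;
    }
}
```

With two center points inside H, both distances are 0 and candidate returns -1, which corresponds to the situation in the figure.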

B. Further check whether ci is the center point

L12 is the perpendicular bisector of C1 and C2, and H lies entirely on the C1 side,
so every point in H is closer to C1 than to C2. We say that C1 is better than C2.

For C1 and C3, part of H falls on the C1 side and part falls on the C3 side,
so C1 is not better than C3.
To determine whether C1 is better than C3:
Take the vector v = (C3 - C1) and find the point p in H that maximizes the inner product <v, p>.
Look at the sign of v in each dimension: where it is positive, p takes the maximum of H; where it is negative, the minimum. Here the signs are (+, -), so p is as large as possible on the x-axis and as small as possible on the y-axis, which gives P13.
P13 is closer to C3 than to C1, so C1 is not better than C3.
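The "better than" test can be written in a few lines of Java. The sketch below uses hypothetical names (Dominance, isBetter) and treats an exact tie at the extreme corner as "not better", which is an assumption on the safe side.

```java
final class Dominance {
    /** Squared Euclidean distance between two points of equal dimension. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            s += diff * diff;
        }
        return s;
    }

    /** True if every point of the range [min, max] is closer to c1 than to c2. */
    static boolean isBetter(double[] c1, double[] c2, double[] min, double[] max) {
        double[] p = new double[c1.length];
        for (int d = 0; d < p.length; d++) {
            // v = c2 - c1. A positive component pushes p to the maximum of the range,
            // any other component to the minimum, so p maximizes <v, p> over the range.
            p[d] = (c2[d] - c1[d] > 0) ? max[d] : min[d];
        }
        // p is the point of the range that leans furthest toward c2.
        // If even p is closer to c1, the whole range is.
        return sqDist(p, c1) < sqDist(p, c2);
    }
}
```

Only one corner of H is examined per pair of center points, so the test costs O(M) regardless of how many data points the node contains.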

If ci is better than every other center point, ci is the center point of H; otherwise it is not.
Even when ci is not the center point of H, the information obtained is still useful: for example, if ci is better than C2, then C2 can be removed from the candidate center list for the subtrees of H.

3.2 Algorithm Steps

Extra attributes stored in each non-leaf node of the Kd-tree:
sumofpoints: an M-dimensional vector (M is the dimension of the data) whose i-th value is the sum of the i-th coordinates of the data in the node
n: the number of data points in the node
Input: kdtree; C, the list of center points (c1, c2, ..., ck)
Output: Cnew, the k new center points
node = kdtree.root
centers = k*M array      // row t stores the sum of the data assigned to center t
datacount = k*1 array    // stores the number of data points assigned to each center
UPDATE(node, C):
    IF node is a leaf node
        find the center ct in C closest to node.value
        centers[t] += node.value;
        datacount[t] += 1;
        RETURN;

    FOR (ci in C) compute d(ci, node.h)
    IF several ci share the minimum d(ci, node.h)
        UPDATE(node.left, C);
        UPDATE(node.right, C);
        RETURN;
    let ct be the center with the smallest d(ci, node.h)
    ctover = []    // centers that are inferior to ct
    FOR (ci in C, except ct) IF (ct is better than ci) ctover.ADD(ci)
    IF (LEN(ctover) = LEN(C) - 1)    // ct is better than every other center
        centers[t] += node.sumofpoints;
        datacount[t] += node.n;
        RETURN;
    C' = (ci in C and ci not in ctover)    // drop the centers that ct beats
    UPDATE(node.left, C');
    UPDATE(node.right, C');
    RETURN;
After UPDATE(kdtree.root, C) returns: Cnew[t] = centers[t] / datacount[t] for each t
4. Java implementation
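What follows is a minimal sketch of the UPDATE procedure from section 3.2, not the original author's code. All names (KdKMeans, Node, build, update, step) are hypothetical. It assumes that each leaf holds exactly one point, that the tree is split at the median while cycling through the dimensions, that ties at a leaf go to the lowest-numbered center, and that an empty cluster keeps its old center.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class KdKMeans {
    /** Kd-tree node; a leaf holds exactly one point, so min == max == sum. */
    static final class Node {
        double[] min, max, sum;   // range H and per-dimension sum (sumofpoints)
        int n;                    // number of points in the node
        Node left, right;
    }

    /** Builds the tree over pts[lo, hi) by median split; reorders pts in place. */
    static Node build(double[][] pts, int lo, int hi, int depth) {
        Node node = new Node();
        int m = pts[lo].length;
        if (hi - lo == 1) {
            node.min = node.max = node.sum = pts[lo].clone();
            node.n = 1;
            return node;
        }
        int axis = depth % m;
        Arrays.sort(pts, lo, hi, (a, b) -> Double.compare(a[axis], b[axis]));
        int mid = (lo + hi) >>> 1;
        node.left = build(pts, lo, mid, depth + 1);
        node.right = build(pts, mid, hi, depth + 1);
        node.n = node.left.n + node.right.n;
        node.min = new double[m]; node.max = new double[m]; node.sum = new double[m];
        for (int d = 0; d < m; d++) {
            node.min[d] = Math.min(node.left.min[d], node.right.min[d]);
            node.max[d] = Math.max(node.left.max[d], node.right.max[d]);
            node.sum[d] = node.left.sum[d] + node.right.sum[d];
        }
        return node;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return s;
    }

    /** Squared minimum distance d(c, H) from center c to the node's range. */
    static double minSqDist(double[] c, Node h) {
        double s = 0;
        for (int d = 0; d < c.length; d++) {
            double diff = Math.max(0, Math.max(h.min[d] - c[d], c[d] - h.max[d]));
            s += diff * diff;
        }
        return s;
    }

    /** True if every point in the node's range is closer to c1 than to c2. */
    static boolean isBetter(double[] c1, double[] c2, Node h) {
        double[] p = new double[c1.length];
        for (int d = 0; d < p.length; d++) p[d] = (c2[d] > c1[d]) ? h.max[d] : h.min[d];
        return sqDist(p, c1) < sqDist(p, c2);
    }

    /** UPDATE(node, C): cand lists the indices of the centers still in play. */
    static void update(Node node, List<Integer> cand, double[][] c, double[][] sums, int[] counts) {
        int t = -1;
        boolean tie = false;
        double best = Double.POSITIVE_INFINITY;
        for (int i : cand) {
            double dist = minSqDist(c[i], node);
            if (dist < best) { best = dist; t = i; tie = false; }
            else if (dist == best) tie = true;
        }
        List<Integer> next = cand;
        if (node.left != null && !tie) {   // prune the centers that c[t] beats
            next = new ArrayList<>();
            for (int i : cand) if (i == t || !isBetter(c[t], c[i], node)) next.add(i);
        }
        if (node.left == null || next.size() == 1) {   // leaf, or c[t] is the center of H
            counts[t] += node.n;
            for (int d = 0; d < node.sum.length; d++) sums[t][d] += node.sum[d];
            return;
        }
        update(node.left, next, c, sums, counts);
        update(node.right, next, c, sums, counts);
    }

    /** One k-means iteration over the tree; returns the new centers. */
    static double[][] step(Node root, double[][] c) {
        double[][] sums = new double[c.length][c[0].length];
        int[] counts = new int[c.length];
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < c.length; i++) all.add(i);
        update(root, all, c, sums, counts);
        for (int t = 0; t < c.length; t++) {
            if (counts[t] == 0) sums[t] = c[t].clone();   // empty cluster keeps its center
            else for (int d = 0; d < sums[t].length; d++) sums[t][d] /= counts[t];
        }
        return sums;
    }
}
```

A typical use would build the tree once with build(points, 0, points.length, 0) and then call step(root, centers) repeatedly until the centers stop changing. The candidate list carries center indices rather than the centers themselves, so that sums and counts always refer to the original numbering even after pruning.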




