From Kmeans to KD tree search

Source: Internet
Author: User

The Kmeans algorithm is a very common clustering algorithm.

The procedure of the algorithm is as follows:

(1) Analyze the problem to determine the number of clusters K. (K is usually hard to determine directly; methods such as cross-validation can be used.)

(2) Choose, according to the problem type, a similarity measure for comparing data points;

(3) Randomly select K points from the data set as the initial cluster centers;

(4) Using the similarity measure, compute the similarity between each data point and every cluster center, and assign the point to the class of the most similar center;

(5) Once step (4) has assigned every data point to a category, recompute each cluster center;

(6) Repeat steps (4) and (5) until all cluster centers are stable or a preset stopping condition is met (for example, the number of points that change category falls below a threshold).
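The steps above can be sketched in Python. This is a minimal illustration only, assuming Euclidean distance as the similarity measure of step (2) and a center-movement threshold as the stopping condition of step (6):

```python
import random

def kmeans(data, k, max_iter=100, tol=1e-6):
    """Minimal K-means sketch following steps (3)-(6).
    `data` is a list of points (tuples of floats)."""
    # Step (3): pick K random points as the initial cluster centers.
    centers = random.sample(data, k)
    for _ in range(max_iter):
        # Step (4): assign each point to its most similar (nearest) center.
        clusters = [[] for _ in range(k)]
        for p in data:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Step (5): recompute each center as the mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(xs) / len(cluster)
                                         for xs in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # keep a center with no points
        # Step (6): stop once no center moves more than the threshold.
        shift = max(sum((a - b) ** 2 for a, b in zip(c, n))
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters
```

Note that step (4) is the expensive part: every point is compared against every center, which is exactly the n×K cost discussed below.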

Kmeans clustering gives good results when the amount of data is small. With large data volumes and high-dimensional features, however, each clustering pass becomes very costly, because every iteration must scan the entire data set: the cost is n × K distance computations, where n is the number of data points and K the number of cluster categories. To reduce the search done in each iteration, a more elaborate data structure, the KD tree, can be used.

The KD tree material can be divided into three aspects: 1. construction of the KD tree; 2. modification of the KD tree; 3. use of the KD tree in Kmeans clustering.

1. Building a KD tree:

(1) Use i = (k % j) + 1 to select the dimension number i along which the k-th split is made, where % is the mod operation and j is the number of feature dimensions (that is, cycle through the dimensions);

(2) Compute the median of the data along that dimension, and take the corresponding data point as the dividing point of the current set;

(3) Use the dividing point to split the data set into two halves, then repeat steps (1) and (2) on each half until the sets can no longer be divided.

This procedure builds a binary tree: the KD tree. (Note: the "k" in the name refers to the k feature dimensions of each data point; a KD tree is a k-dimensional tree.)
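Construction steps (1)-(3) can be sketched as follows. This is a minimal illustration, assuming the splitting axis is cycled as depth mod k and the median point becomes the node:

```python
class KDNode:
    """A KD-tree node: the dividing point, its splitting axis, and children."""
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build_kdtree(points, depth=0):
    """Build a KD tree from a list of points (tuples of equal length)."""
    if not points:
        return None
    axis = depth % len(points[0])      # step (1): cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2             # step (2): median point is the divider
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),   # step (3): recurse
                  build_kdtree(points[mid + 1:], depth + 1))
```

Sorting at every level makes this O(n log² n); it is enough to show the structure, and the median split keeps the resulting binary tree balanced.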

2. KD tree modification:

KD tree modification includes node insertion and node deletion. (Note: modifying the KD tree is actually of little use in the Kmeans clustering algorithm; it is mentioned here only briefly, for completeness.)

Insertion into a KD tree proceeds much like insertion into a binary search tree; after the insertion, the tree is corrected in a way analogous to how a balanced binary tree is rebalanced after an insert.

Following the axis order fixed during construction, compare the new point with each node on that node's splitting dimension to decide which branch to descend, until a null pointer is reached; that is where the point is inserted. Insertion alone, however, will eventually unbalance the KD tree, so a correction step is needed, similar to the rebalancing a balanced binary tree performs after inserting a new node.
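The descend-to-null-pointer insertion described above can be sketched like this (a minimal, self-contained illustration; the rebalancing correction is deliberately not shown):

```python
class KDNode:
    """Minimal KD-tree node: point, splitting axis, children."""
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def kd_insert(node, point, depth=0):
    """Insert a point as in a binary search tree: compare on each node's
    splitting dimension and descend until a null link is found."""
    if node is None:
        # Reached a null pointer: this is where the new point belongs.
        return KDNode(point, depth % len(point))
    if point[node.axis] < node.point[node.axis]:
        node.left = kd_insert(node.left, point, depth + 1)
    else:
        node.right = kd_insert(node.right, point, depth + 1)
    return node
```

Unlike construction by median splitting, repeated insertion gives no balance guarantee, which is exactly why the correction step discussed above is needed.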

3. Using the KD tree in the Kmeans algorithm

If the Kmeans algorithm is executed naively, each clustering iteration must traverse all the data in the space, which is very costly when the data dimension and the data volume are large. A KD tree lets each Kmeans data search be done locally instead.

A KD tree lookup is most often used to find a query point's single nearest neighbor. What about the K nearest neighbors? Find the nearest neighbor, exclude it, find the next nearest, and repeat.
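The nearest-neighbor lookup can be sketched as follows. This is a minimal recursive search assuming squared Euclidean distance; the far branch is visited only when the splitting plane could hide a closer point, which is what keeps the search local:

```python
class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build(points, depth=0):
    """Same median-split construction as described in section 1."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return (squared distance, point) of the nearest neighbor of `target`.
    Descend toward the target first, then backtrack, pruning any branch
    whose splitting plane is farther away than the current best."""
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node.point, target))
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    if diff ** 2 < best[0]:   # the far side could still hold a closer point
        best = nearest(far, target, best)
    return best
```

In a Kmeans iteration, building one KD tree over the K cluster centers and running this search once per data point replaces the n × K full scan with n local searches.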
