From Kmeans to KD tree search

Source: Internet
Author: User

The Kmeans algorithm is a very common clustering algorithm.

The procedure of the algorithm is as follows:

(1) Analyze the problem to determine the number of clusters K. (K is usually hard to determine directly; methods such as cross-validation can be used.)

(2) Choose, according to the problem type, a similarity measure for comparing data points;

(3) Randomly select K points from the data set as the initial cluster centers;

(4) Using the similarity measure, compute the similarity between each data point and every cluster center, and assign the point to the class of the most similar center;

(5) Once step (4) has assigned every data point to a category, recompute each cluster center;

(6) Repeat steps (4) and (5) until all cluster centers are stable or a preset stopping condition is met (for example, the number of points that change category falls below a threshold).
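The steps above can be sketched in Python. This is a minimal illustration only, assuming Euclidean distance as the similarity measure of step (2) and a center-movement threshold as the stopping condition of step (6):

```python
import random

def kmeans(data, k, max_iter=100, tol=1e-6):
    """Minimal K-means sketch following steps (3)-(6).
    `data` is a list of points (tuples of floats)."""
    # Step (3): pick K random points as the initial cluster centers.
    centers = random.sample(data, k)
    for _ in range(max_iter):
        # Step (4): assign each point to its most similar (nearest) center.
        clusters = [[] for _ in range(k)]
        for p in data:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Step (5): recompute each center as the mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centers.append(tuple(sum(xs) / len(cluster)
                                         for xs in zip(*cluster)))
            else:
                new_centers.append(centers[i])  # keep a center with no points
        # Step (6): stop once no center moves more than the threshold.
        shift = max(sum((a - b) ** 2 for a, b in zip(c, n))
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters
```

Note that step (4) is the expensive part: every point is compared against every center, which is exactly the n×K cost discussed below.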

Kmeans clustering gives good results when the amount of data is small. With large data volumes and high-dimensional features, however, each clustering pass becomes very costly, because every iteration must scan the entire data set: the cost is n × K distance computations, where n is the number of data points and K the number of cluster categories. To reduce the search done in each iteration, a more elaborate data structure, the KD tree, can be used.

The KD tree material can be divided into three aspects: 1. construction of the KD tree; 2. modification of the KD tree; 3. use of the KD tree in Kmeans clustering.

1. Building a KD tree:

(1) Use i = (k % j) + 1 to select the dimension number i along which the k-th split is made, where % is the mod operation and j is the number of feature dimensions (that is, cycle through the dimensions);

(2) Compute the median of the data along that dimension, and take the corresponding data point as the dividing point of the current set;

(3) Use the dividing point to split the data set into two halves, then repeat steps (1) and (2) on each half until the sets can no longer be divided.

This procedure builds a binary tree: the KD tree. (Note: the "k" in the name refers to the k feature dimensions of each data point; a KD tree is a k-dimensional tree.)
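Construction steps (1)-(3) can be sketched as follows. This is a minimal illustration, assuming the splitting axis is cycled as depth mod k and the median point becomes the node:

```python
class KDNode:
    """A KD-tree node: the dividing point, its splitting axis, and children."""
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build_kdtree(points, depth=0):
    """Build a KD tree from a list of points (tuples of equal length)."""
    if not points:
        return None
    axis = depth % len(points[0])      # step (1): cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2             # step (2): median point is the divider
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),   # step (3): recurse
                  build_kdtree(points[mid + 1:], depth + 1))
```

Sorting at every level makes this O(n log² n); it is enough to show the structure, and the median split keeps the resulting binary tree balanced.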

2. KD tree modification:

KD tree modification includes node insertion and node deletion. (Note: modifying the KD tree is actually of little use in the Kmeans clustering algorithm; it is mentioned here only briefly, for completeness.)

Insertion into a KD tree proceeds much like insertion into a binary search tree; after the insertion, the tree is corrected in a way analogous to how a balanced binary tree is rebalanced after an insert.

Following the axis order fixed during construction, compare the new point with each node on that node's splitting dimension to decide which branch to descend, until a null pointer is reached; that is where the point is inserted. Insertion alone, however, will eventually unbalance the KD tree, so a correction step is needed, similar to the rebalancing a balanced binary tree performs after inserting a new node.
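The descend-to-null-pointer insertion described above can be sketched like this (a minimal, self-contained illustration; the rebalancing correction is deliberately not shown):

```python
class KDNode:
    """Minimal KD-tree node: point, splitting axis, children."""
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def kd_insert(node, point, depth=0):
    """Insert a point as in a binary search tree: compare on each node's
    splitting dimension and descend until a null link is found."""
    if node is None:
        # Reached a null pointer: this is where the new point belongs.
        return KDNode(point, depth % len(point))
    if point[node.axis] < node.point[node.axis]:
        node.left = kd_insert(node.left, point, depth + 1)
    else:
        node.right = kd_insert(node.right, point, depth + 1)
    return node
```

Unlike construction by median splitting, repeated insertion gives no balance guarantee, which is exactly why the correction step discussed above is needed.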

3. Using the KD tree in the Kmeans algorithm

If the Kmeans algorithm is executed naively, each clustering iteration must traverse all the data in the space, which is very costly when the data dimension and the data volume are large. A KD tree lets each Kmeans data search be done locally instead.

A KD tree lookup is most often used to find a query point's single nearest neighbor. What about the K nearest neighbors? Find the nearest neighbor, exclude it, find the next nearest, and repeat.
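The nearest-neighbor lookup can be sketched as follows. This is a minimal recursive search assuming squared Euclidean distance; the far branch is visited only when the splitting plane could hide a closer point, which is what keeps the search local:

```python
class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build(points, depth=0):
    """Same median-split construction as described in section 1."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return (squared distance, point) of the nearest neighbor of `target`.
    Descend toward the target first, then backtrack, pruning any branch
    whose splitting plane is farther away than the current best."""
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node.point, target))
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    if diff ** 2 < best[0]:   # the far side could still hold a closer point
        best = nearest(far, target, best)
    return best
```

In a Kmeans iteration, building one KD tree over the K cluster centers and running this search once per data point replaces the n × K full scan with n local searches.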
