Principles of the BIRCH Clustering Algorithm


In the article on the principles of the K-means clustering algorithm, we discussed K-means and Mini Batch K-means. Here we look at another common clustering algorithm, BIRCH. BIRCH is well suited to cases where the data volume is large and the number of categories K is also fairly large. It runs fast and needs only a single scan of the data set to produce a clustering, although it relies on a few tricks. Below we give a summary of the BIRCH algorithm.

1. BIRCH Overview

The full name of BIRCH is Balanced Iterative Reducing and Clustering using Hierarchies. The name is long, but that doesn't matter; all you really need to understand is that it clusters and condenses the data with a hierarchical approach. As mentioned, BIRCH needs only one scan of the data set to cluster it. How does it do that?

The BIRCH algorithm uses a tree structure, similar to a balanced B+ tree, to help us cluster quickly. It is generally called a Clustering Feature Tree (CF Tree for short). Each node of the tree is composed of several clustering features (Clustering Feature, CF). The figure shows what a clustering feature tree looks like: each node, including each leaf node, contains several CFs; the CFs of internal nodes carry pointers to child nodes, and all leaf nodes are connected by a doubly linked list.

With the concept of the clustering feature tree in place, let us look more closely at the clustering feature CF and the CF Tree built from such nodes.

2. The Clustering Feature CF and the CF Tree

In a clustering feature tree, a clustering feature CF is defined as a triple (N, LS, SS), where N is the number of sample points in the CF (this part is easy to understand), LS is the sum of the sample points over each feature dimension, and SS is the sum of the squares of the sample points over each feature dimension. For example, suppose one CF of a node in a CF Tree contains the 5 samples (3,4), (2,6), (4,5), (4,7), (3,8). Then N = 5, LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30), and SS = (3²+2²+4²+4²+3²) + (4²+6²+5²+7²+8²) = 54 + 190 = 244.
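The triple from this example can be checked in a few lines of NumPy:

```python
import numpy as np

# The five 2-D samples from the example above.
points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N = len(points)            # number of sample points in the CF
LS = points.sum(axis=0)    # linear sum over each feature dimension
SS = (points ** 2).sum()   # scalar sum of squares over all dimensions

print(N, LS, SS)  # 5 [16 30] 244
```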

CF has a very useful property: it is additive. That is, CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2), which follows directly from the definition. Applied to the CF Tree, this means that for each parent CF node, its (N, LS, SS) triple equals the sum of the triples of all the child nodes that the CF points to, as shown below:

As the figure shows, the value of the root node's CF1 can be obtained by summing the values of the 6 child nodes (CF7 through CF12) it points to. This additivity is what makes updating the CF Tree efficient.
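The additivity property is easy to verify: computing the CF of two disjoint groups and adding the triples gives the same result as computing the CF of all the points at once. A minimal check:

```python
import numpy as np

def cf(points):
    """Build the (N, LS, SS) clustering feature of a set of points."""
    return len(points), points.sum(axis=0), (points ** 2).sum()

def cf_add(a, b):
    """Additivity: merging two clusters just adds their CF triples."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

group1 = np.array([[3, 4], [2, 6]])
group2 = np.array([[4, 5], [4, 7], [3, 8]])

merged = cf_add(cf(group1), cf(group2))
whole = cf(np.vstack([group1, group2]))

same = merged[0] == whole[0] and (merged[1] == whole[1]).all() and merged[2] == whole[2]
print(same)  # True
```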

A CF Tree generally has three important parameters. The first is the maximum number B of CFs per internal node; the second is the maximum number L of CFs per leaf node; the third is the maximum sample radius threshold T of each CF in a leaf node. In other words, all sample points in a leaf CF must lie inside a hypersphere of radius less than T. For the CF Tree in the figure, B = 7 and L = 5: an internal node holds at most 7 CFs, and a leaf node holds at most 5.
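As a concrete sketch (the class and field names here are illustrative, not from any real library), the parameters B, L, T and the node structure they constrain might look like this:

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class CF:
    """A clustering feature triple (N, LS, SS).

    On an internal node, `child` points to the subtree this CF summarizes;
    on a leaf node, `child` stays None.
    """
    n: int
    ls: np.ndarray
    ss: float
    child: "Node | None" = None

@dataclass
class Node:
    """A CF Tree node: at most B CFs if internal, at most L CFs if a leaf."""
    is_leaf: bool
    cfs: list = field(default_factory=list)

# The limits from the example in the text: B = 7, L = 5, plus a radius
# threshold T (an assumed value here, since the text gives none).
B, L, T = 7, 5, 0.5
```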

3. Generating the Clustering Feature Tree (CF Tree)

Let's look at how the CF Tree is generated. We first fix the tree's parameters: the maximum number of CFs per internal node B, the maximum number of CFs per leaf node L, and the maximum sample radius threshold T of each leaf CF.

At the very beginning the CF Tree is empty, with no samples. We read the first sample point from the training set and put it into a new CF triple A, which has N = 1, and insert this CF into the root node. The CF Tree now looks like this:

We then read the second sample point and find that it lies within radius T of the first sample point A, i.e., the two points belong to the same CF. We add the second point to CF A and update A's triple, whose N becomes 2. The CF Tree now looks like this:

The third point, however, cannot be absorbed into the hypersphere formed by the previous points, so we need a new CF triple B to hold it. The root node now has two CF triples, A and B, and the CF Tree looks like this:

When the fourth sample point arrives, we find that it falls inside the hypersphere (of radius less than T) of an existing CF, so it is merged in, and the updated CF Tree looks like this:

When does a CF Tree node need to split? Suppose we now have the CF Tree shown below: leaf node LN1 has three CFs, and LN2 and LN3 each have two. The maximum number of CFs per leaf node is L = 3. A new sample point arrives and turns out to be closest to LN1, so we check whether it fits inside any of LN1's three CF hyperspheres sc1, sc2, sc3. Unfortunately it does not, so a new CF, sc8, is needed to hold it. But L = 3 means LN1 already has the maximum number of CFs and cannot take a new one. What now? The leaf node LN1 must be split.

Among all the CF tuples in LN1, we pick the two that are farthest apart as the seed CFs of two new leaf nodes, and then assign the remaining CF tuples in LN1 (sc1, sc2, sc3) together with the new sample's tuple sc8 to the two new leaf nodes by distance. The CF Tree after splitting LN1 looks like this:

If the maximum number of CFs per internal node is B = 3, splitting the leaf into two causes the root node to exceed its maximum CF count, so the root node must now split as well, in the same way a leaf node splits. The CF Tree after the split looks like this:

With the series of figures above, the insertion procedure for a CF Tree should be clear. To summarize, inserting a new sample works as follows:

1. Starting from the root node, descend to the leaf node nearest the new sample, and within that leaf find the nearest CF node.

2. If, after adding the new sample, the radius of that CF's hypersphere is still less than the threshold T, update all CF triples on the path and end the insertion. Otherwise go to 3.

3. If the number of CFs in the current leaf node is less than the threshold L, create a new CF node, put the new sample into it, add the new CF to the leaf node, update all CF triples on the path, and end the insertion. Otherwise go to 4.

4. Split the current leaf node into two new leaf nodes: choose the two CF tuples in the old leaf node that are farthest apart as the first CFs of the two new leaves, and distribute the remaining tuples, together with the new sample's tuple, to the two leaves by distance. Then check the parent nodes upward to see whether they must split too; if so, split them the same way a leaf node splits.
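The test in step 2 can be made concrete. For a CF (N, LS, SS) with scalar SS, the centroid is LS/N, and the mean squared distance of the cluster's points to the centroid is SS/N − ‖LS/N‖², so the cluster radius is the square root of that quantity. A minimal sketch of the check (function names are my own, not standard):

```python
import numpy as np

def radius_after_insert(N, LS, SS, x):
    """Radius of the cluster if sample x were absorbed into the CF (N, LS, SS).

    Uses the CF statistics only: mean squared distance to the centroid
    equals SS/N - ||LS/N||^2, and the radius is its square root.
    """
    x = np.asarray(x, dtype=float)
    N2, LS2, SS2 = N + 1, LS + x, SS + (x ** 2).sum()
    centroid = LS2 / N2
    return np.sqrt(max(SS2 / N2 - (centroid ** 2).sum(), 0.0))

def fits_in_cf(N, LS, SS, x, T):
    """Step 2 of the insertion procedure: does x keep the CF's radius below T?"""
    return radius_after_insert(N, LS, SS, x) < T

# With the CF from section 2 (N=5, LS=(16,30), SS=244), absorbing the
# point (3, 6) gives a radius of about 1.46, so it fits for T = 2 but
# not for T = 1.
print(fits_in_cf(5, np.array([16, 30]), 244, [3, 6], 2.0))  # True
print(fits_in_cf(5, np.array([16, 30]), 244, [3, 6], 1.0))  # False
```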

4. The BIRCH Algorithm

After all this discussion of the CF Tree, we can finally get to the BIRCH algorithm itself. In fact, building a CF Tree over all the training samples already constitutes a basic BIRCH algorithm: the output is a set of CF nodes, and the sample points in each node form a cluster. In other words, the main work of the BIRCH algorithm is the construction of the CF Tree.

Of course, the full BIRCH algorithm has a few optional steps besides building the CF Tree. The overall flow is:

1) Read all samples in turn and build a CF Tree in memory, following the method of the previous section.

2) (Optional) Condense the CF Tree built in step 1: remove abnormal CF nodes, which typically contain very few sample points, and merge tuples whose hyperspheres are very close together.

3) (Optional) Cluster all the CF tuples with another clustering algorithm, such as K-means, to obtain a better CF Tree. The main purpose of this step is to eliminate the unreasonable tree structures caused by the order in which the samples were read, and the extra splits caused by the limits on the number of CFs per node.

4) (Optional) Use the centroids of all the CF nodes of the tree produced in step 3 as initial centroids, and cluster all the sample points by distance. This further reduces the unreasonable clusterings caused by the CF Tree's limitations.

As can be seen from the above, the key to the BIRCH algorithm is step 1, the generation of the CF Tree; the other steps only refine the final clustering result.
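The flow above maps directly onto scikit-learn's `Birch` estimator: `threshold` plays the role of the radius bound T, `branching_factor` caps the number of CFs per node, and `n_clusters` controls the optional global clustering step (pass `n_clusters=None` to get the raw leaf subclusters instead). A small usage sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Three well-separated Gaussian blobs as toy data.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# threshold ~ the radius threshold T; branching_factor caps the CFs per node;
# n_clusters=3 enables the optional global clustering over the CF centroids.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(len(set(labels)))  # 3
```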

5. BIRCH Algorithm Summary

Unlike K-means and Mini Batch K-means, the BIRCH algorithm does not require the number of clusters K as input. If no K is given, the final number of CF tuples is the final number of clusters; otherwise the CF tuples are merged by distance according to the given K.

In general, BIRCH suits large sample sizes, similar to Mini Batch K-means, but BIRCH works well when the number of categories is relatively large, whereas Mini Batch K-means is generally used when the number of classes is moderate or small. Besides clustering, BIRCH can also perform some extra anomaly detection and a preliminary, category-wise condensation of the data as preprocessing. However, if the data has very many feature dimensions, say more than 20, BIRCH is not suitable; in that case Mini Batch K-means performs better.

Tuning BIRCH is more involved than tuning K-means or Mini Batch K-means, because several key parameters of the CF Tree must be tuned, and they have a significant impact on the final form of the tree.

Finally, a summary of the advantages and disadvantages of the BIRCH algorithm:

The main advantages of the BIRCH algorithm are:

1) It saves memory: all samples stay on disk, and the CF Tree stores only the CF nodes and the corresponding pointers.

2) Clustering is fast: a single scan of the training set builds the CF Tree, and insertions, deletions, and modifications of the CF Tree are all very fast.

3) It can identify noise points and also perform a preliminary, category-wise preprocessing of the data set.

The main drawbacks of the BIRCH algorithm are:

1) Because the CF Tree limits the number of CFs per node, the clustering result may differ from the true category distribution.

2) It clusters high-dimensional data poorly; in that case, Mini Batch K-means can be chosen instead.

3) If the clusters in the data set are not hypersphere-like, i.e., not convex, the clustering result is poor.
