BIRCH hierarchical clustering algorithm (multi-stage clustering with clustering feature trees)

Source: Internet
Author: User
BIRCH clustering algorithm (Java implementation)

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed for data sets that are very large (at least, too large for your memory), and it can run within any given amount of memory. I won't start by introducing BIRCH's other features. Instead, I'll first go through the full implementation details of the algorithm. Once the procedure is clear, other people's evaluations of the algorithm will make much more sense.

You don't need any knowledge of B-trees; I'll explain everything that is needed.

The BIRCH algorithm inserts the data points to be clustered into a tree, with the original data sitting at the leaf level. The tree is organized as follows.

There are 3 types of nodes in this tree: nonleaf, leaf, and mincluster. The root may be a nonleaf or a leaf. All leaves are linked together in a doubly linked list. Each node holds a CF (clustering feature) value. A CF is a triple (N, LS, SS), where N is the number of data points, and LS and SS are vectors with the same dimension as the data set: LS is the linear sum of the points and SS is the sum of their squares. For example, for a mincluster containing the 3 data points (1,2,3), (4,5,6), (7,8,9):

N = 3,

LS = (1+4+7, 2+5+8, 3+6+9) = (12, 15, 18),

SS = (1+16+49, 4+25+64, 9+36+81) = (66, 93, 126).
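As a quick check of the arithmetic, here is a minimal Java sketch that accumulates a CF from raw points. The class name `CfBuildSketch` and its methods are my own illustrative names, not part of the original implementation.

```java
import java.util.Arrays;

/** Illustrative CF triple (N, LS, SS); the class name is hypothetical. */
final class CfBuildSketch {
    int n;              // N: number of data points
    final double[] ls;  // LS: per-dimension linear sum
    final double[] ss;  // SS: per-dimension sum of squares

    CfBuildSketch(int dim) {
        ls = new double[dim];
        ss = new double[dim];
    }

    /** Folds one data point into the summary. */
    void add(double[] point) {
        n++;
        for (int k = 0; k < point.length; k++) {
            ls[k] += point[k];
            ss[k] += point[k] * point[k];
        }
    }

    /** Builds the CF of a non-empty set of points. */
    static CfBuildSketch of(double[][] points) {
        CfBuildSketch cf = new CfBuildSketch(points[0].length);
        for (double[] p : points) {
            cf.add(p);
        }
        return cf;
    }

    public static void main(String[] args) {
        CfBuildSketch cf = of(new double[][] {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}});
        System.out.println(cf.n + " " + Arrays.toString(cf.ls) + " " + Arrays.toString(cf.ss));
        // prints: 3 [12.0, 15.0, 18.0] [66.0, 93.0, 126.0]
    }
}
```

Note that the raw points are not kept: once a point is folded in, only N, LS and SS remain.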

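Taking this mincluster as an example, with data points $x_1, \dots, x_N$, we can calculate its

Cluster center

$$x_0 = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Cluster radius

$$R = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \lVert x_i - x_0 \rVert^2}$$

Cluster diameter

$$D = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} \lVert x_i - x_j \rVert^2}{N(N-1)}}$$

For the three points above, the center is (4, 5, 6), the radius is $\sqrt{18} \approx 4.24$, and the diameter is $\sqrt{54} \approx 7.35$.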

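We can also calculate the distance between two clusters $C_1$ and $C_2$. You could use D0, D1, D3 and so on, but here we use D2, the average inter-cluster distance:

$$D2 = \sqrt{\frac{\sum_{i \in C_1}\sum_{j \in C_2} \lVert x_i - x_j \rVert^2}{N_1 N_2}}$$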

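Interestingly, the cluster center, cluster radius, cluster diameter, and the inter-cluster distances D0 to D3 can all be computed from the CF alone. Writing $S$ for the sum of the components of SS, we have for example

Cluster diameter

$$D = \sqrt{\frac{2N S - 2\,\lVert \mathrm{LS} \rVert^2}{N(N-1)}}$$

The distance between clusters

$$D2 = \sqrt{\frac{S_1}{N_1} + \frac{S_2}{N_2} - \frac{2\,\mathrm{LS}_1 \cdot \mathrm{LS}_2}{N_1 N_2}}$$

where $N_1, \mathrm{LS}_1, S_1$ and $N_2, \mathrm{LS}_2, S_2$ belong to the two clusters. Merging two clusters only requires adding the two corresponding CFs: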

CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
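To make the "computed from the CF alone" claim concrete, here is a hedged Java sketch. It assumes the usual BIRCH definitions: radius as the root-mean-square distance to the center, diameter as the root-mean-square pairwise distance, and D2 as the root-mean-square distance between points of the two clusters. The class `CfMetricsSketch` and its method names are hypothetical.

```java
/** CF-only cluster metrics; the class and method names are hypothetical. */
final class CfMetricsSketch {
    final int n;        // N: number of points
    final double[] ls;  // LS: per-dimension linear sum
    final double[] ss;  // SS: per-dimension sum of squares

    CfMetricsSketch(int n, double[] ls, double[] ss) {
        this.n = n;
        this.ls = ls;
        this.ss = ss;
    }

    static double sum(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += a[k] * b[k];
        return s;
    }

    /** CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2). */
    CfMetricsSketch plus(CfMetricsSketch o) {
        double[] l = new double[ls.length];
        double[] s = new double[ss.length];
        for (int k = 0; k < l.length; k++) {
            l[k] = ls[k] + o.ls[k];
            s[k] = ss[k] + o.ss[k];
        }
        return new CfMetricsSketch(n + o.n, l, s);
    }

    double[] center() {
        double[] c = new double[ls.length];
        for (int k = 0; k < c.length; k++) c[k] = ls[k] / n;
        return c;
    }

    double radius() {
        double v = sum(ss) / n - dot(ls, ls) / ((double) n * n);
        return Math.sqrt(Math.max(0, v)); // clamp tiny negative rounding errors
    }

    double diameter() {
        if (n < 2) return 0; // a single point has no pairwise distances
        double v = (2.0 * n * sum(ss) - 2 * dot(ls, ls)) / ((double) n * (n - 1));
        return Math.sqrt(Math.max(0, v));
    }

    /** Average inter-cluster distance D2, computed from the two CFs only. */
    static double d2(CfMetricsSketch a, CfMetricsSketch b) {
        double v = sum(a.ss) / a.n + sum(b.ss) / b.n
                - 2 * dot(a.ls, b.ls) / ((double) a.n * b.n);
        return Math.sqrt(Math.max(0, v));
    }
}
```

For the example CF (3, (12,15,18), (66,93,126)), this gives a center of (4,5,6), a radius of sqrt(18) and a diameter of sqrt(54). For two single-point clusters, D2 reduces to the ordinary Euclidean distance between the points.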

The CF value of each node is the sum of the CF values of all its child nodes, and the subtree rooted at any node can be regarded as a cluster.

Nonleaf, leaf, and mincluster are all limited in size: a nonleaf can have at most B child nodes, a leaf can hold at most L minclusters, and the diameter of a mincluster cannot exceed T.

At the start of the algorithm, we scan the database and get the first data point instance, (1,2,3). We create an empty leaf and an empty mincluster, put the ID of the point (1,2,3) into the mincluster, and update the mincluster's CF value to (1, (1,2,3), (1,4,9)). We then make the mincluster a child of the leaf and update the leaf's CF value to (1, (1,2,3), (1,4,9)). In fact, whenever a CF is put into the tree (here "a CF" stands for a nonleaf, leaf, or mincluster), we need to update the CF values of all the nodes on the path from the root to that leaf node.
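The path update can be sketched as follows. This is only an illustration under one assumption of mine: each node keeps a reference to its parent, so the update walks upward from the changed node instead of downward from the root. The names `PathUpdateSketch`, `Node` and `absorb` are hypothetical.

```java
/** Propagating a CF change to all ancestors; names are hypothetical. */
final class PathUpdateSketch {
    static final class Node {
        int n;               // N
        final double[] ls;   // LS
        final double[] ss;   // SS
        final Node parent;   // null for the root

        Node(int dim, Node parent) {
            this.ls = new double[dim];
            this.ss = new double[dim];
            this.parent = parent;
        }

        /** Adds one data point to this node's CF and to the CF of every ancestor. */
        void absorb(double[] point) {
            for (Node cur = this; cur != null; cur = cur.parent) {
                cur.n++;
                for (int k = 0; k < point.length; k++) {
                    cur.ls[k] += point[k];
                    cur.ss[k] += point[k] * point[k];
                }
            }
        }
    }

    public static void main(String[] args) {
        Node leaf = new Node(3, null);        // at this stage the root is a leaf
        Node minCluster = new Node(3, leaf);  // mincluster is a child of the leaf
        minCluster.absorb(new double[] {1, 2, 3});
        System.out.println(leaf.n + " " + java.util.Arrays.toString(leaf.ss));
        // prints: 1 [1.0, 4.0, 9.0]
    }
}
```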

When another data point is to be inserted into the tree, the point is first wrapped in a mincluster (so that it has a CF value), which we call CF_new. We take the CF values of the root's child nodes, use D2 to find the child closest to CF_new, and add CF_new to that subtree. This is a recursive process. The recursion ends when CF_new is to be added to a mincluster: if the diameter of the mincluster after absorbing CF_new does not exceed T, CF_new joins it directly; otherwise CF_new becomes a cluster of its own, a sibling of that mincluster. Note that after the insertion, the CF values of the affected node and all its ancestor nodes must be updated.
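The last step of this recursion, inside a single leaf, might look like the sketch below. It assumes the threshold test is applied to the diameter of the mincluster after the merge, and it leaves out the descent through nonleaf nodes and the ancestor updates. `LeafInsertSketch`, `Cf` and `insert` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Final step of an insertion, inside one leaf; all names are hypothetical. */
final class LeafInsertSketch {
    static final class Cf {
        final int n;
        final double[] ls;
        final double[] ss;

        Cf(int n, double[] ls, double[] ss) {
            this.n = n;
            this.ls = ls;
            this.ss = ss;
        }

        /** CF of a single data point. */
        static Cf ofPoint(double... p) {
            double[] sq = new double[p.length];
            for (int k = 0; k < p.length; k++) sq[k] = p[k] * p[k];
            return new Cf(1, p.clone(), sq);
        }

        Cf plus(Cf o) {
            double[] l = new double[ls.length];
            double[] s = new double[ss.length];
            for (int k = 0; k < l.length; k++) {
                l[k] = ls[k] + o.ls[k];
                s[k] = ss[k] + o.ss[k];
            }
            return new Cf(n + o.n, l, s);
        }

        double diameter() {
            if (n < 2) return 0;
            double v = (2.0 * n * sum(ss) - 2 * dot(ls, ls)) / ((double) n * (n - 1));
            return Math.sqrt(Math.max(0, v));
        }
    }

    static double sum(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += a[k] * b[k];
        return s;
    }

    static double d2(Cf a, Cf b) {
        double v = sum(a.ss) / a.n + sum(b.ss) / b.n
                - 2 * dot(a.ls, b.ls) / ((double) a.n * b.n);
        return Math.sqrt(Math.max(0, v));
    }

    /** Adds cfNew to the leaf's minclusters; returns the index where it ended up. */
    static int insert(List<Cf> minClusters, Cf cfNew, double t) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < minClusters.size(); i++) {
            double d = d2(minClusters.get(i), cfNew);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        if (best >= 0) {
            Cf merged = minClusters.get(best).plus(cfNew);
            if (merged.diameter() <= t) {   // still within threshold T: absorb
                minClusters.set(best, merged);
                return best;
            }
        }
        minClusters.add(cfNew);             // otherwise it becomes a sibling mincluster
        return minClusters.size() - 1;
    }

    public static void main(String[] args) {
        List<Cf> leaf = new ArrayList<>();
        insert(leaf, Cf.ofPoint(1, 2, 3), 2.0);
        insert(leaf, Cf.ofPoint(1, 2, 4), 2.0);    // close enough: absorbed
        insert(leaf, Cf.ofPoint(10, 10, 10), 2.0); // too far: new mincluster
        System.out.println(leaf.size()); // prints: 2
    }
}
```

In a full tree the leaf would then be checked against L, which is where splitting comes in.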

After a new node is inserted, some node may end up with more than B (or L) children, at which point that node must split. Take a leaf: it now has L+1 minclusters, so we create a new leaf as a sibling of the original one, remembering that every new leaf must be inserted into the doubly linked list. How should the L+1 minclusters be divided between the two leaves? Find the two minclusters that are farthest apart (according to D2) among the L+1, put one in each leaf, and assign each remaining mincluster to whichever of the two it is closer to. The CF values of the two leaves are then updated; the CF values of their ancestor nodes do not change and need no update. The split may cascade recursively up through the ancestor nodes, because splitting the leaf may push its parent's number of children above B. A nonleaf splits in much the same way as a leaf, except that the new nonleaf does not need to be placed in a doubly linked list. If the root node of the tree has to split, the height of the tree increases by 1.
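Here is a rough Java sketch of the division step. It assumes each remaining entry is compared against the two seed entries themselves, not against the growing groups, and it only returns the two groups of indices; creating the new leaf and linking it into the list is left out. `SplitSketch`, `Cf` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Dividing L+1 entries between two nodes; all names are hypothetical. */
final class SplitSketch {
    static final class Cf {
        final int n;
        final double[] ls;
        final double[] ss;

        Cf(int n, double[] ls, double[] ss) {
            this.n = n;
            this.ls = ls;
            this.ss = ss;
        }

        /** CF of a single data point. */
        static Cf ofPoint(double... p) {
            double[] sq = new double[p.length];
            for (int k = 0; k < p.length; k++) sq[k] = p[k] * p[k];
            return new Cf(1, p.clone(), sq);
        }
    }

    static double sum(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += a[k] * b[k];
        return s;
    }

    static double d2(Cf a, Cf b) {
        double v = sum(a.ss) / a.n + sum(b.ss) / b.n
                - 2 * dot(a.ls, b.ls) / ((double) a.n * b.n);
        return Math.sqrt(Math.max(0, v));
    }

    /** Returns two lists of entry indices; expects at least two entries. */
    static List<List<Integer>> split(List<Cf> entries) {
        int seedA = 0;
        int seedB = 1;
        double farthest = -1;
        // Pick the pair of entries with the largest D2 as seeds.
        for (int i = 0; i < entries.size(); i++) {
            for (int j = i + 1; j < entries.size(); j++) {
                double d = d2(entries.get(i), entries.get(j));
                if (d > farthest) {
                    farthest = d;
                    seedA = i;
                    seedB = j;
                }
            }
        }
        List<Integer> groupA = new ArrayList<>();
        List<Integer> groupB = new ArrayList<>();
        // Every other entry goes with the seed it is closer to.
        for (int i = 0; i < entries.size(); i++) {
            if (i == seedA) {
                groupA.add(i);
            } else if (i == seedB) {
                groupB.add(i);
            } else if (d2(entries.get(i), entries.get(seedA))
                    <= d2(entries.get(i), entries.get(seedB))) {
                groupA.add(i);
            } else {
                groupB.add(i);
            }
        }
        return Arrays.asList(groupA, groupB);
    }

    public static void main(String[] args) {
        List<Cf> entries = Arrays.asList(
                Cf.ofPoint(0, 0), Cf.ofPoint(1, 0), Cf.ofPoint(10, 0), Cf.ofPoint(11, 0));
        System.out.println(split(entries)); // prints: [[0, 1], [2, 3]]
    }
}
```

The same routine would serve for a nonleaf split, with the children's CFs as entries.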

CF.java

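    package birch;

    public class CF {
        private int N;        // number of data points
        private double[] LS;  // linear sum, one entry per dimension
        private double[] SS;  // sum of squares, one entry per dimension

        public CF() {
            // BIRCH.dimen: the data dimensionality, assumed to be a static
            // field of the main BIRCH class, which is not shown here
            LS = new double[BIRCH.dimen];
            SS = new double[BIRCH.dimen];
        }
        // remaining members are not shown
    }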
