The BIRCH clustering algorithm (Java implementation)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is designed from the ground up to handle very large data sets (at least, too large for your memory), and it can run within any given amount of memory. I won't introduce BIRCH's other features yet. I'll first go through the full details of the algorithm; once the implementation process is clear, other people's evaluations of the algorithm will make much more sense.
You don't need any prior knowledge of B-trees; I'll explain everything that is needed.
The BIRCH algorithm inserts the data to be clustered into a tree, with the original data stored at the leaf level.
[Figure: structure of the tree]
There are three types of nodes in this tree: nonleaf, leaf, and mincluster. The root may be a nonleaf or a leaf. All leaves are linked together in a doubly linked list. Each node holds a CF value. CF is a triple (N, LS, SS), where N is the number of data point instances, and LS and SS are vectors of the same dimension as the data set: LS is the linear sum and SS is the sum of squares. For example, for a mincluster containing the 3 data points (1,2,3), (4,5,6), (7,8,9):
N = 3,
LS = (1+4+7, 2+5+8, 3+6+9) = (12, 15, 18),
SS = (1+16+49, 4+25+64, 9+36+81) = (66, 93, 126).
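As a quick check of the arithmetic above, here is a minimal Java sketch that accumulates N, LS, and SS for the three example points. The class name CfFromPoints and the method clusteringFeature are hypothetical names chosen for this illustration; they are not part of the original listing.

```java
import java.util.Arrays;

// Illustrative only: class and method names are hypothetical, not from the original code.
class CfFromPoints {
    /** Returns {{N}, LS, SS}, where LS and SS are per-dimension arrays. */
    static double[][] clusteringFeature(double[][] points) {
        int dim = points[0].length;
        double[] n = {points.length};
        double[] ls = new double[dim];
        double[] ss = new double[dim];
        for (double[] p : points) {
            for (int d = 0; d < dim; d++) {
                ls[d] += p[d];          // linear sum
                ss[d] += p[d] * p[d];   // sum of squares
            }
        }
        return new double[][] {n, ls, ss};
    }

    public static void main(String[] args) {
        double[][] points = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double[][] cf = clusteringFeature(points);
        System.out.println("N  = " + (int) cf[0][0]);          // N  = 3
        System.out.println("LS = " + Arrays.toString(cf[1]));  // LS = [12.0, 15.0, 18.0]
        System.out.println("SS = " + Arrays.toString(cf[2]));  // SS = [66.0, 93.0, 126.0]
    }
}
```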
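Taking this mincluster as an example, we can calculate its
Cluster center: $X_0 = \frac{1}{N}\sum_{i=1}^{N} X_i$
Cluster radius: $R = \sqrt{\frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N}}$
Cluster diameter: $D = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)}}$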
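We can also calculate the distance between two clusters. You could use D0, D1, D3 and so on, but here we use D2, the average distance between points of the two clusters: $D_2 = \sqrt{\frac{\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} (X_i - Y_j)^2}{N_1 N_2}}$, where the $X_i$ belong to the first cluster and the $Y_j$ to the second.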
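Interestingly, the cluster center, cluster radius, cluster diameter, and the inter-cluster distances D0 to D3 can all be computed from CF alone. For example:
Cluster diameter: $D = \sqrt{\frac{2N \cdot SS - 2\,LS \cdot LS}{N(N-1)}}$
Distance between clusters: $D_2 = \sqrt{\frac{SS_1}{N_1} + \frac{SS_2}{N_2} - \frac{2\,LS_1 \cdot LS_2}{N_1 N_2}}$, where $N_1, LS_1, SS_1$ and $N_2, LS_2, SS_2$ are the N, LS, and SS of the two clusters. In these two formulas SS stands for the scalar sum of squares (the sum of the components of the SS vector) and $\cdot$ between vectors is the dot product.
Merging two clusters only requires adding the two corresponding CFs: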
CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
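To make the CF-only computations concrete, the following is a minimal sketch that derives the center, radius, diameter, and D2 from N, LS, and SS, and merges two CFs by component-wise addition. The class CfMath and all of its method names are assumptions made for this illustration and are not the author's CF class; SS is kept as a per-dimension vector, as in the example above, and summed where a scalar is needed.

```java
// Hypothetical helper for illustration; not the CF class from the original listing.
class CfMath {
    final int n;        // number of points
    final double[] ls;  // per-dimension linear sum
    final double[] ss;  // per-dimension sum of squares

    CfMath(int n, double[] ls, double[] ss) {
        this.n = n;
        this.ls = ls;
        this.ss = ss;
    }

    static double sum(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    /** Center = LS / N. */
    double[] center() {
        double[] c = new double[ls.length];
        for (int i = 0; i < c.length; i++) c[i] = ls[i] / n;
        return c;
    }

    /** R^2 = SS/N - |LS/N|^2 (clamped at 0 to absorb rounding error). */
    double radius() {
        double sq = sum(ss) / n - dot(ls, ls) / ((double) n * n);
        return Math.sqrt(Math.max(0.0, sq));
    }

    /** D^2 = (2N*SS - 2*LS.LS) / (N(N-1)); defined as 0 for a single point. */
    double diameter() {
        if (n < 2) return 0.0;
        double sq = (2.0 * n * sum(ss) - 2.0 * dot(ls, ls)) / ((double) n * (n - 1));
        return Math.sqrt(Math.max(0.0, sq));
    }

    /** D2^2 = SS1/N1 + SS2/N2 - 2*LS1.LS2/(N1*N2). */
    static double d2(CfMath a, CfMath b) {
        double sq = sum(a.ss) / a.n + sum(b.ss) / b.n
                - 2.0 * dot(a.ls, b.ls) / ((double) a.n * b.n);
        return Math.sqrt(Math.max(0.0, sq));
    }

    /** CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2). */
    static CfMath merge(CfMath a, CfMath b) {
        double[] ls = new double[a.ls.length];
        double[] ss = new double[a.ss.length];
        for (int i = 0; i < ls.length; i++) {
            ls[i] = a.ls[i] + b.ls[i];
            ss[i] = a.ss[i] + b.ss[i];
        }
        return new CfMath(a.n + b.n, ls, ss);
    }

    public static void main(String[] args) {
        CfMath cf = new CfMath(3, new double[] {12, 15, 18}, new double[] {66, 93, 126});
        System.out.println("radius   = " + cf.radius());
        System.out.println("diameter = " + cf.diameter());
    }
}
```

For the example mincluster this gives a radius of sqrt(18), about 4.24, and a diameter of sqrt(54), about 7.35, which matches a direct computation from the three points.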
The CF value of each node is the sum of the CF values of all of its child nodes, so the subtree rooted at any node can be regarded as a cluster.
Nonleaf, leaf, and mincluster are all limited in size: a nonleaf can have at most B child nodes, a leaf can hold at most L minclusters, and the diameter of a mincluster cannot exceed T.
At the start of the algorithm, we scan the database and get the first data point instance, (1,2,3). We create an empty leaf and a mincluster, put the ID of the point (1,2,3) into the mincluster, and update the mincluster's CF value to (1, (1,2,3), (1,4,9)). The mincluster then becomes a child of the leaf, and the leaf's CF value is updated to (1, (1,2,3), (1,4,9)). In fact, whenever a CF is put into the tree (here CF stands for any of nonleaf, leaf, or mincluster), we need to update the CF values of all nodes on the path from the root to that leaf.
When another data point is to be inserted into the tree, the point is first wrapped as a mincluster (so that it has a CF value); call it CF_new. We take the CF value of each child of the root, use D2 to find the child closest to CF_new, and add CF_new to that subtree. This is a recursive process. The recursion ends when CF_new is to be added to a mincluster: if the diameter of that mincluster after absorbing CF_new does not exceed T, CF_new is merged into it directly; otherwise CF_new becomes a cluster on its own, as a sibling of that mincluster. Note that after the insertion, the CF values of the node and of all its ancestors must be updated.
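The last step of that recursion, at the leaf level, can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's implementation: the leaf is modeled as a plain list of CFs, and the names LeafInsertSketch, Cf, plus, and insert are hypothetical. It finds the closest mincluster by D2, absorbs the new point if the merged diameter stays within T, and otherwise adds the point as a new mincluster.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: a leaf is modeled as a list of mincluster CFs. All names are hypothetical.
class LeafInsertSketch {
    static final class Cf {
        final int n;
        final double[] ls;
        final double[] ss;

        /** CF of a single data point. */
        Cf(double[] p) {
            n = 1;
            ls = p.clone();
            ss = new double[p.length];
            for (int i = 0; i < p.length; i++) ss[i] = p[i] * p[i];
        }

        Cf(int n, double[] ls, double[] ss) {
            this.n = n;
            this.ls = ls;
            this.ss = ss;
        }

        Cf plus(Cf o) {
            double[] l = new double[ls.length];
            double[] s = new double[ss.length];
            for (int i = 0; i < l.length; i++) {
                l[i] = ls[i] + o.ls[i];
                s[i] = ss[i] + o.ss[i];
            }
            return new Cf(n + o.n, l, s);
        }

        double ssTotal() {
            double t = 0;
            for (double x : ss) t += x;
            return t;
        }

        double dot(Cf o) {
            double t = 0;
            for (int i = 0; i < ls.length; i++) t += ls[i] * o.ls[i];
            return t;
        }

        double diameter() {
            if (n < 2) return 0.0;
            double sq = (2.0 * n * ssTotal() - 2.0 * dot(this)) / ((double) n * (n - 1));
            return Math.sqrt(Math.max(0.0, sq));
        }

        double d2(Cf o) {
            double sq = ssTotal() / n + o.ssTotal() / o.n - 2.0 * dot(o) / ((double) n * o.n);
            return Math.sqrt(Math.max(0.0, sq));
        }
    }

    /** Returns the index of the mincluster that ends up holding cfNew. */
    static int insert(List<Cf> minclusters, Cf cfNew, double t) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < minclusters.size(); i++) {
            double d = minclusters.get(i).d2(cfNew);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        if (best >= 0) {
            Cf merged = minclusters.get(best).plus(cfNew);
            if (merged.diameter() <= t) {       // absorb: diameter stays within T
                minclusters.set(best, merged);
                return best;
            }
        }
        minclusters.add(cfNew);                 // otherwise become a sibling mincluster
        return minclusters.size() - 1;
    }

    public static void main(String[] args) {
        List<Cf> leaf = new ArrayList<>();
        double t = 1.0;
        System.out.println(insert(leaf, new Cf(new double[] {1, 2, 3}), t));
        System.out.println(insert(leaf, new Cf(new double[] {1.5, 2, 3}), t));
        System.out.println(insert(leaf, new Cf(new double[] {10, 10, 10}), t));
    }
}
```

With T = 1.0, the second point joins the first mincluster (the merged diameter is 0.5), while the third point is too far away and starts a mincluster of its own. A full implementation would also check the limit L afterwards and update the ancestors' CF values.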
After a new node is inserted, some node may end up with more children than B (or L) allows, at which point that node must split. Take a leaf: it now has L+1 minclusters, so we create a new leaf as a sibling of the original leaf, remembering that every new leaf must be inserted into the doubly linked list. How should the L+1 minclusters be divided between the two leaves? Find the two minclusters that are farthest apart (according to D2) among the L+1, put one in each leaf, and assign each remaining mincluster to whichever of the two it is closer to. The CF values of the two leaves are updated; the CF values of their ancestors do not change, so they need no update. A leaf split may cause ancestors to split recursively, because the split adds one child to the parent and can push the parent's child count past B. A nonleaf splits in the same way as a leaf, except that the new nonleaf does not need to be placed in a doubly linked list. If the root of the tree has to split, the height of the tree increases by 1.
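The division step of a split can be sketched like this. It is a minimal illustration, assuming the entries are given as a list of CFs; the names SplitSketch, Cf, and split are hypothetical, and ties are broken in favor of the first seed, which is a choice of this sketch rather than something the text prescribes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of dividing L+1 entries between two nodes. All names are hypothetical.
class SplitSketch {
    static final class Cf {
        final int n;
        final double[] ls;
        final double[] ss;

        /** CF of a single data point. */
        Cf(double[] p) {
            n = 1;
            ls = p.clone();
            ss = new double[p.length];
            for (int i = 0; i < p.length; i++) ss[i] = p[i] * p[i];
        }

        double ssTotal() {
            double t = 0;
            for (double x : ss) t += x;
            return t;
        }

        double d2(Cf o) {
            double dot = 0;
            for (int i = 0; i < ls.length; i++) dot += ls[i] * o.ls[i];
            double sq = ssTotal() / n + o.ssTotal() / o.n - 2.0 * dot / ((double) n * o.n);
            return Math.sqrt(Math.max(0.0, sq));
        }
    }

    /** Returns two lists of indices into entries: the members of each resulting node. */
    static List<List<Integer>> split(List<Cf> entries) {
        // 1. Find the pair of entries that are farthest apart under D2.
        int seedA = 0, seedB = 1;
        double farthest = -1.0;
        for (int i = 0; i < entries.size(); i++) {
            for (int j = i + 1; j < entries.size(); j++) {
                double d = entries.get(i).d2(entries.get(j));
                if (d > farthest) {
                    farthest = d;
                    seedA = i;
                    seedB = j;
                }
            }
        }
        // 2. Assign every other entry to the closer of the two seeds.
        List<Integer> groupA = new ArrayList<>();
        List<Integer> groupB = new ArrayList<>();
        for (int k = 0; k < entries.size(); k++) {
            if (k == seedA) {
                groupA.add(k);
            } else if (k == seedB) {
                groupB.add(k);
            } else if (entries.get(k).d2(entries.get(seedA)) <= entries.get(k).d2(entries.get(seedB))) {
                groupA.add(k);
            } else {
                groupB.add(k);
            }
        }
        List<List<Integer>> result = new ArrayList<>();
        result.add(groupA);
        result.add(groupB);
        return result;
    }

    public static void main(String[] args) {
        List<Cf> entries = new ArrayList<>();
        entries.add(new Cf(new double[] {0, 0}));
        entries.add(new Cf(new double[] {1, 0}));
        entries.add(new Cf(new double[] {10, 0}));
        entries.add(new Cf(new double[] {11, 0}));
        System.out.println(split(entries)); // [[0, 1], [2, 3]]
    }
}
```

In the usage example the farthest pair is entries 0 and 3, so entry 1 goes with entry 0 and entry 2 goes with entry 3. After the division, each new node's CF would be recomputed as the sum of the CFs of its members.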
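CF.java
package birch;

public class CF {
    private int N;        // number of data points
    private double[] LS;  // linear sum, one entry per dimension
    private double[] SS;  // sum of squares, one entry per dimension

    public CF() {
        LS = new double[birch.dimen];
        SS = new double[birch.dimen];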