BIRCH algorithm---a multi-stage clustering algorithm using a clustering feature tree

Source: Internet
Author: User

More data mining code: https://github.com/linyiqun/DataMiningAlgorithm

Introduction

The BIRCH algorithm is itself a clustering algorithm, but it overcomes some shortcomings of the k-means algorithm, such as having to choose k in advance: BIRCH does not require the number of clusters to be set beforehand. It is implemented with a CF-Tree (ClusteringFeature-Tree). An important goal of BIRCH is to minimize I/O: the database is scanned once to build an initial CF-Tree that is held in memory, which can be seen as a multi-level compression of the data.

Algorithm Principle

CF Clustering Features

Before discussing the principle of the algorithm, we first have to know what a clustering feature (CF) is. It is defined as follows:

CF = <N, LS, SS>

A clustering feature is a 3-dimensional vector: N is the total number of points, LS is the linear sum of the N points, and SS is the sum of squares of the N points. From it we can derive:

x0 = LS/N, the centroid of the cluster, which is used to compute the distance between clusters.

The average distance between objects within a cluster (the cluster diameter) can be limited by a threshold T, which guarantees the compactness of the whole cluster. The CFs of two clusters can be superimposed, which is simply component-wise vector addition.
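As a minimal illustration of the CF triple, its additivity, and the centroid formula x0 = LS/N, here is a small self-contained sketch (the class and method names are illustrative, not taken from the code later in this article):

```java
// Minimal sketch of a CF triple <N, LS, SS> and its additivity.
// The names CFSketch, add, centroid are hypothetical, for illustration only.
public class CFSketch {
    int n;        // N: number of points
    double[] ls;  // LS: linear sum of the points
    double[] ss;  // SS: sum of squares of the points

    CFSketch(double[] point) {
        n = 1;
        ls = point.clone();
        ss = new double[point.length];
        for (int i = 0; i < point.length; i++) {
            ss[i] = point[i] * point[i];
        }
    }

    // superimposing two CFs is just component-wise vector addition
    void add(CFSketch other) {
        n += other.n;
        for (int i = 0; i < ls.length; i++) {
            ls[i] += other.ls[i];
            ss[i] += other.ss[i];
        }
    }

    // centroid x0 = LS / N
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int i = 0; i < ls.length; i++) {
            c[i] = ls[i] / n;
        }
        return c;
    }

    public static void main(String[] args) {
        CFSketch a = new CFSketch(new double[] { 1.0, 2.0 });
        a.add(new CFSketch(new double[] { 3.0, 4.0 }));
        System.out.println(a.n);             // 2
        System.out.println(a.centroid()[0]); // 2.0
        System.out.println(a.centroid()[1]); // 3.0
    }
}
```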

The Construction Process of the CF-Tree

Before introducing the CF-Tree itself, we introduce 3 parameters: the internal node balance factor B, the leaf node balance factor L, and the cluster diameter threshold T. B limits the number of children of a non-leaf node, L limits the number of sub-clusters in a leaf node, and T limits the extent of a cluster by bounding D, the average distance between objects in the cluster. The main construction steps are as follows:

1. Read the first data record and construct a leaf node containing one sub-cluster; the sub-cluster is a child of the leaf node.

2. When the 2nd, 3rd, ... records are read, each is wrapped as a cluster and added to a leaf node. If adding a record to cluster C would make C's diameter exceed T, a new cluster must be created as a sibling of C. If the leaf node then has more children than the threshold L, the leaf node must be split. The split rule is: choose the 2 children whose clusters are farthest apart as the seeds of 2 new leaves, then assign each remaining child to the nearer seed. The split rule for non-leaf nodes is the same. For the details, compare with the code I wrote below.
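The split rule in step 2 — take the two farthest sub-clusters as seeds, then assign every remaining sub-cluster to the nearer seed — can be sketched roughly as follows. The class, helper names, and the centroid representation here are illustrative, not the actual LeafNode code from this article:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the node-split rule; not the article's LeafNode code.
public class SplitSketch {
    // Euclidean distance between two centroids
    static double dist(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(d);
    }

    // Split the given centroids into two groups seeded by the farthest pair
    static List<List<double[]>> split(List<double[]> centroids) {
        int seedA = 0, seedB = 1;
        double maxDist = -1;
        // 1. find the two centroids that are farthest apart
        for (int i = 0; i < centroids.size() - 1; i++) {
            for (int j = i + 1; j < centroids.size(); j++) {
                double d = dist(centroids.get(i), centroids.get(j));
                if (d > maxDist) {
                    maxDist = d;
                    seedA = i;
                    seedB = j;
                }
            }
        }
        // 2. assign every centroid to the nearer of the two seeds
        List<double[]> groupA = new ArrayList<>();
        List<double[]> groupB = new ArrayList<>();
        for (double[] c : centroids) {
            if (dist(c, centroids.get(seedA)) <= dist(c, centroids.get(seedB))) {
                groupA.add(c);
            } else {
                groupB.add(c);
            }
        }
        List<List<double[]>> result = new ArrayList<>();
        result.add(groupA);
        result.add(groupB);
        return result;
    }
}
```

For example, splitting the three centroids (0, 0), (0.5, 0), (10, 0) seeds on (0, 0) and (10, 0), so the first two end up in one group and (10, 0) alone in the other.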

3. The final structure generally looks like this:


Advantages of the algorithm:

1. The algorithm can obtain a good clustering result with just one scan of the data, and does not require the number of clusters to be set in advance.

2. Clustering in the form of a clustering feature tree preserves the data in compressed form to a certain extent.

Disadvantages of the algorithm:

1. The algorithm is suited to spherical clusters; if the clusters are not spherical, the clustering result will not be very good.

The code implementation of the algorithm:

Some core code is provided below (for the complete code, see My Data Mining code):

Input of data:

5.1     3.5     1.4     0.2
4.9     3.0     1.4     0.2
4.7     3.2     1.3     0.8
4.6     3.1     1.5     0.8
5.0     3.6     1.8     0.6
4.7     3.2     1.4     0.8

ClusteringFeature.java:

package DataMining_BIRCH;

import java.util.ArrayList;

/**
 * Basic properties of a clustering feature (CF)
 *
 * @author lyq
 */
public abstract class ClusteringFeature {
    // total number of points in the sub-cluster
    protected int N;
    // linear sum of the N points
    protected double[] LS;
    // sum of squares of the N points
    protected double[] SS;
    // node depth, used when printing the CF tree
    protected int level;

    public int getN() {
        return N;
    }

    public void setN(int n) {
        N = n;
    }

    public double[] getLS() {
        return LS;
    }

    public void setLS(double[] ls) {
        LS = ls;
    }

    public double[] getSS() {
        return SS;
    }

    public void setSS(double[] ss) {
        SS = ss;
    }

    protected void setN(ArrayList<double[]> dataRecords) {
        this.N = dataRecords.size();
    }

    public int getLevel() {
        return level;
    }

    public void setLevel(int level) {
        this.level = level;
    }

    /**
     * Compute the linear sum from the node's data records
     *
     * @param dataRecords
     *            the node's data records
     */
    protected void setLS(ArrayList<double[]> dataRecords) {
        int num = dataRecords.get(0).length;
        double[] record;
        LS = new double[num];
        for (int j = 0; j < num; j++) {
            LS[j] = 0;
        }

        for (int i = 0; i < dataRecords.size(); i++) {
            record = dataRecords.get(i);
            for (int j = 0; j < record.length; j++) {
                LS[j] += record[j];
            }
        }
    }

    /**
     * Compute the square sum from the node's data records
     *
     * @param dataRecords
     *            the node's data records
     */
    protected void setSS(ArrayList<double[]> dataRecords) {
        int num = dataRecords.get(0).length;
        double[] record;
        SS = new double[num];
        for (int j = 0; j < num; j++) {
            SS[j] = 0;
        }

        for (int i = 0; i < dataRecords.size(); i++) {
            record = dataRecords.get(i);
            for (int j = 0; j < record.length; j++) {
                SS[j] += record[j] * record[j];
            }
        }
    }

    /**
     * Superimpose the CF vectors; no split handling needed here
     *
     * @param node
     */
    protected void directAddCluster(ClusteringFeature node) {
        int n = node.getN();
        double[] otherLS = node.getLS();
        double[] otherSS = node.getSS();

        if (LS == null) {
            this.N = 0;
            LS = new double[otherLS.length];
            SS = new double[otherLS.length];
            for (int i = 0; i < LS.length; i++) {
                LS[i] = 0;
                SS[i] = 0;
            }
        }

        // superimpose the 3 quantities
        for (int i = 0; i < LS.length; i++) {
            LS[i] += otherLS[i];
            SS[i] += otherSS[i];
        }
        this.N += n;
    }

    /**
     * Compute the distance between two clusters as the distance between their
     * centroids
     *
     * @return
     */
    protected double computerClusterDistance(ClusteringFeature cluster) {
        double distance = 0;
        double[] otherLS = cluster.LS;
        int num = N;
        int otherNum = cluster.N;

        for (int i = 0; i < LS.length; i++) {
            distance += (LS[i] / num - otherLS[i] / otherNum)
                    * (LS[i] / num - otherLS[i] / otherNum);
        }
        distance = Math.sqrt(distance);

        return distance;
    }

    /**
     * Compute the average distance between objects within a cluster
     *
     * @param records
     *            the data records in the cluster
     * @return
     */
    protected double computerInClusterDistance(ArrayList<double[]> records) {
        double sumDistance = 0;
        double[] data1;
        double[] data2;
        // total number of records
        int totalNum = records.size();

        for (int i = 0; i < totalNum - 1; i++) {
            data1 = records.get(i);
            for (int j = i + 1; j < totalNum; j++) {
                data2 = records.get(j);
                sumDistance += computeOuDistance(data1, data2);
            }
        }

        // divide by the total number of pairs, which is totalNum*(totalNum-1)/2
        return Math.sqrt(sumDistance / (totalNum * (totalNum - 1) / 2));
    }

    /**
     * Compute the (squared) Euclidean distance between 2 given vectors
     *
     * @param record1
     *            vector point 1
     * @param record2
     *            vector point 2
     */
    private double computeOuDistance(double[] record1, double[] record2) {
        double distance = 0;

        for (int i = 0; i < record1.length; i++) {
            distance += (record1[i] - record2[i]) * (record1[i] - record2[i]);
        }

        return distance;
    }

    /**
     * Add a cluster to this node, including the split operation when a
     * threshold is exceeded
     *
     * @param clusteringFeature
     *            the cluster to be added
     */
    public abstract void addingCluster(ClusteringFeature clusteringFeature);
}
BIRCHTool.java:

package DataMining_BIRCH;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.LinkedList;

/**
 * BIRCH clustering algorithm tool class
 *
 * @author lyq
 */
public class BIRCHTool {
    // node type names
    public static final String NON_LEAFNODE = "\"NonLeafNode\"";
    public static final String LEAFNODE = "\"LeafNode\"";
    public static final String CLUSTER = "\"Cluster\"";

    // path of the test data file
    private String filePath;
    // internal node balance factor B
    public static int B;
    // leaf node balance factor L
    public static int L;
    // cluster diameter threshold T
    public static double T;
    // all the test data records
    private ArrayList<String[]> totalDataRecords;

    public BIRCHTool(String filePath, int B, int L, double T) {
        this.filePath = filePath;
        this.B = B;
        this.L = L;
        this.T = T;
        readDataFile();
    }

    /**
     * Read data from file
     */
    private void readDataFile() {
        File file = new File(filePath);
        ArrayList<String[]> dataArray = new ArrayList<String[]>();

        try {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String str;
            String[] tempArray;
            while ((str = in.readLine()) != null) {
                tempArray = str.split("     ");
                dataArray.add(tempArray);
            }
            in.close();
        } catch (IOException e) {
            e.getStackTrace();
        }

        totalDataRecords = new ArrayList<>();
        for (String[] array : dataArray) {
            totalDataRecords.add(array);
        }
    }

    /**
     * Build the CF clustering feature tree
     *
     * @return
     */
    private ClusteringFeature buildCFTree() {
        NonLeafNode rootNode = null;
        LeafNode leafNode = null;
        Cluster cluster = null;

        for (String[] record : totalDataRecords) {
            cluster = new Cluster(record);

            if (rootNode == null) {
                // case where the CF tree has only 1 node
                if (leafNode == null) {
                    leafNode = new LeafNode();
                }
                leafNode.addingCluster(cluster);
                if (leafNode.getParentNode() != null) {
                    rootNode = leafNode.getParentNode();
                }
            } else {
                if (rootNode.getParentNode() != null) {
                    rootNode = rootNode.getParentNode();
                }

                // starting from the root node, search downward for the
                // closest leaf node to add the cluster to
                LeafNode temp = rootNode.findedClosestNode(cluster);
                temp.addingCluster(cluster);
            }
        }

        // find the root node from the bottom up
        LeafNode node = cluster.getParentNode();
        NonLeafNode upNode = node.getParentNode();
        if (upNode == null) {
            return node;
        } else {
            while (upNode.getParentNode() != null) {
                upNode = upNode.getParentNode();
            }
            return upNode;
        }
    }

    /**
     * Start building the CF clustering feature tree
     */
    public void startBuilding() {
        // tree depth
        int level = 1;
        ClusteringFeature rootNode = buildCFTree();

        setTreeLevel(rootNode, level);
        showCFTree(rootNode);
    }

    /**
     * Set the node depth
     *
     * @param clusteringFeature
     *            the current node
     * @param level
     *            the current depth value
     */
    private void setTreeLevel(ClusteringFeature clusteringFeature, int level) {
        LeafNode leafNode = null;
        NonLeafNode nonLeafNode = null;

        if (clusteringFeature instanceof LeafNode) {
            leafNode = (LeafNode) clusteringFeature;
        } else if (clusteringFeature instanceof NonLeafNode) {
            nonLeafNode = (NonLeafNode) clusteringFeature;
        }

        if (nonLeafNode != null) {
            nonLeafNode.setLevel(level);
            level++;
            // set the child nodes
            if (nonLeafNode.getNonLeafChilds() != null) {
                for (NonLeafNode n1 : nonLeafNode.getNonLeafChilds()) {
                    setTreeLevel(n1, level);
                }
            } else {
                for (LeafNode n2 : nonLeafNode.getLeafChilds()) {
                    setTreeLevel(n2, level);
                }
            }
        } else {
            leafNode.setLevel(level);
            level++;
            // set the sub-clusters
            for (Cluster c : leafNode.getClusterChilds()) {
                c.setLevel(level);
            }
        }
    }

    /**
     * Display the CF clustering feature tree
     *
     * @param rootNode
     *            the root node of the CF tree
     */
    private void showCFTree(ClusteringFeature rootNode) {
        // number of blanks, used for output formatting
        int blankNum = 5;
        // current tree depth
        int currentLevel = 1;
        LinkedList<ClusteringFeature> nodeQueue = new LinkedList<>();
        ClusteringFeature cf;
        LeafNode leafNode;
        NonLeafNode nonLeafNode;
        ArrayList<Cluster> clusterList = new ArrayList<>();
        String typeName;

        nodeQueue.add(rootNode);
        while (nodeQueue.size() > 0) {
            cf = nodeQueue.poll();

            if (cf instanceof LeafNode) {
                leafNode = (LeafNode) cf;
                typeName = LEAFNODE;

                if (leafNode.getClusterChilds() != null) {
                    for (Cluster c : leafNode.getClusterChilds()) {
                        nodeQueue.add(c);
                    }
                }
            } else if (cf instanceof NonLeafNode) {
                nonLeafNode = (NonLeafNode) cf;
                typeName = NON_LEAFNODE;

                if (nonLeafNode.getNonLeafChilds() != null) {
                    for (NonLeafNode n1 : nonLeafNode.getNonLeafChilds()) {
                        nodeQueue.add(n1);
                    }
                } else {
                    for (LeafNode n2 : nonLeafNode.getLeafChilds()) {
                        nodeQueue.add(n2);
                    }
                }
            } else {
                clusterList.add((Cluster) cf);
                typeName = CLUSTER;
            }

            if (currentLevel != cf.getLevel()) {
                currentLevel = cf.getLevel();
                System.out.println();
                System.out.println("|");
                System.out.println("|");
            } else if (currentLevel == cf.getLevel() && currentLevel != 1) {
                for (int i = 0; i < blankNum; i++) {
                    System.out.print("-");
                }
            }

            System.out.print(typeName);
            System.out.print(" N:" + cf.getN() + ", LS:");
            System.out.print("[");
            for (double d : cf.getLS()) {
                System.out.print(MessageFormat.format("{0}, ", d));
            }
            System.out.print("]");
        }

        System.out.println();
        System.out.println("******* final sub-clusters *******");
        // display the cluster points that have been assigned to each class
        for (int i = 0; i < clusterList.size(); i++) {
            System.out.println("Cluster" + (i + 1) + ":");
            for (double[] point : clusterList.get(i).getData()) {
                System.out.print("[");
                for (double d : point) {
                    System.out.print(MessageFormat.format("{0}, ", d));
                }
                System.out.println("]");
            }
        }
    }
}
Because of the large code size, the remaining LeafNode.java, NonLeafNode.java, and Cluster classes can be viewed in My Data Mining code.

Result output:

"NonLeafNode" N:6, LS:[29, 19.6, 8.8, 3.4, ]
|
|
"LeafNode" N:3, LS:[14, 9.5, 4.2, 2.4, ]-----"LeafNode" N:3, LS:[15, 10.1, 4.6, 1, ]
|
|
"Cluster" N:3, LS:[14, 9.5, 4.2, 2.4, ]-----"Cluster" N:1, LS:[5, 3.6, 1.8, 0.6, ]-----"Cluster" N:2, LS:[10, 6.5, 2.8, 0.4, ]
******* final sub-clusters *******
Cluster1:
[4.7, 3.2, 1.3, 0.8, ]
[4.6, 3.1, 1.5, 0.8, ]
[4.7, 3.2, 1.4, 0.8, ]
Cluster2:
[5, 3.6, 1.8, 0.6, ]
Cluster3:
[5.1, 3.5, 1.4, 0.2, ]
[4.9, 3, 1.4, 0.2, ]
Difficulties in implementing the algorithm

1. When computing the distance between clusters, I substituted into a formula and found something was wrong: the vector operations should not work that way, so I fell back to computing the distance between the cluster centroids. The same happened with the average distance of objects within a cluster: online sources give various versions of the vector formula and I did not know which was right, so I computed it in the most primitive way, measuring the distance of each pair and averaging.

2. When a node splits and its parent is not empty, the node must remove itself from the parent's child list and then add the 2 nodes produced by the split; removing oneself is easy to forget.

3. When a node's CF clustering feature value is updated, the change must also be applied at every parent up the tree. For this I used the chain-of-responsibility pattern, passing the update upward one level at a time; the split rule also uses this pattern and needs attention.
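The upward propagation described in point 3 can be sketched in a few lines: each node folds the new point into its own CF totals, then forwards the same update to its parent. The NodeSketch class below is hypothetical, not the article's actual LeafNode/NonLeafNode code:

```java
// Sketch of chain-of-responsibility CF updating: each node absorbs the new
// point into its own <N, LS> totals, then passes the update to its parent.
// NodeSketch is an illustrative stand-in for LeafNode/NonLeafNode.
public class NodeSketch {
    int n;            // point count of this node's CF
    double[] ls;      // linear sum of this node's CF
    NodeSketch parent;

    NodeSketch(int dim, NodeSketch parent) {
        n = 0;
        ls = new double[dim];
        this.parent = parent;
    }

    // add one point's CF here, then propagate the same update upward
    void absorb(double[] point) {
        n++;
        for (int i = 0; i < ls.length; i++) {
            ls[i] += point[i];
        }
        if (parent != null) {
            parent.absorb(point);
        }
    }

    public static void main(String[] args) {
        NodeSketch root = new NodeSketch(2, null);
        NodeSketch leaf = new NodeSketch(2, root);
        leaf.absorb(new double[] { 1.0, 2.0 });
        leaf.absorb(new double[] { 3.0, 4.0 });
        System.out.println(root.n);     // 2: the root saw both updates
        System.out.println(root.ls[0]); // 4.0
    }
}
```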

4. The code extracts the common CF behavior into an abstract class with shared methods, but because the node types differ, casts are still needed in the actual implementation.

5. The final difficulty was testing. After arduous writing the program was finally complete, but how to test it was a big problem: to exercise every split case you must accurately choose T, B, and L, especially the cluster diameter threshold T, so preparing the test data itself involved a lot of manual calculation.

My understanding of the BIRCH algorithm

Having implemented the whole process, my strongest impression of the BIRCH algorithm is how, through clustering features, a new point descends from the root node, at each level moving toward whichever cluster is nearest, and is finally assigned to that cluster, spontaneously forming a reasonably good clustering. That top-down search is the magic of the algorithm.

