Weka algorithm Classifier-tree-J48 source code analysis (2) ClassifierTree



I. Problems

This article examines the implementation of J48 through four questions.

1. How is the accuracy of the classification tree controlled?

2. How are missing values (MissingValue) handled?

3. How are continuous values discretized?

4. How is the classification tree pruned?


II. buildClassifier

Every classifier implements this method: it receives an Instances object and constructs a classification tree from it. The core code is as follows:

public void buildClassifier(Instances instances) throws Exception {

    ModelSelection modSelection;

    if (m_binarySplits)
      modSelection = new BinC45ModelSelection(m_minNumObj, instances);
    else
      modSelection = new C45ModelSelection(m_minNumObj, instances);
    if (!m_reducedErrorPruning)
      m_root = new C45PruneableClassifierTree(modSelection, !m_unpruned, m_CF,
                                              m_subtreeRaising, !m_noCleanup);
    else
      m_root = new PruneableClassifierTree(modSelection, !m_unpruned, m_numFolds,
                                           !m_noCleanup, m_Seed);
    m_root.buildClassifier(instances);
    if (m_binarySplits) {
      ((BinC45ModelSelection) modSelection).cleanup();
    } else {
      ((C45ModelSelection) modSelection).cleanup();
    }
  }
We can see that the logic is clear. First, a ModelSelection is constructed depending on whether binary splits are requested (m_binarySplits, i.e. every internal node has exactly two branches). Then, depending on the m_reducedErrorPruning flag, the corresponding ClassifierTree is constructed and the model is built on that tree. Finally the data is cleaned up, mainly to release references: because the tree holds a reference to the Instances, dropping it allows the GC to reclaim the data once the upper-layer caller releases the Instances.
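To make the flags concrete, here is a usage sketch of my own (not from the original post; the ARFF path is hypothetical) showing how J48's public options map onto the fields used above:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48FlagsDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather.nominal.arff"); // hypothetical path
    data.setClassIndex(data.numAttributes() - 1);

    J48 j48 = new J48();
    j48.setBinarySplits(true);         // m_binarySplits -> BinC45ModelSelection
    j48.setReducedErrorPruning(false); // m_reducedErrorPruning -> C45PruneableClassifierTree
    j48.setUnpruned(false);            // m_unpruned -> pruning enabled
    j48.setConfidenceFactor(0.25f);    // m_CF, used by the error estimates in prune()
    j48.setSubtreeRaising(true);       // m_subtreeRaising
    j48.setMinNumObj(2);               // m_minNumObj, handed to the ModelSelection

    j48.buildClassifier(data);         // runs the method shown above
    System.out.println(j48);
  }
}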


III. C45PruneableClassifierTree

(1) This class also implements the buildClassifier method to build a classifier. The main logic of this method is as follows:

public void buildClassifier(Instances data) throws Exception {

    // can classifier tree handle the data?
    getCapabilities().testWithFail(data);

    // remove instances with missing class
    data = new Instances(data);
    data.deleteWithMissingClass();

    buildTree(data, m_subtreeRaising || !m_cleanup);
    collapse();
    if (m_pruneTheTree) {
      prune();
    }
    if (m_cleanup) {
      cleanup(new Instances(data, 0));
    }
  }
First, testWithFail checks whether the incoming data can be handled by this classifier at all. For example, C4.5 can only build a tree when the class attribute is nominal (discrete), and testWithFail throws an exception when such a requirement is violated.

Then the invalid rows are removed from the Instances, i.e. the rows whose class value is missing.

On the cleaned data, buildTree is called to build the classification tree.

collapse() is then called to "collapse" the tree, folding useless subtrees back into leaves (more on this below).

If pruning is enabled, prune() prunes the tree.

Finally, the training data is cleaned up. (A small sketch of the missing-class cleanup follows below.)
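As an aside, here is a minimal self-contained sketch (my own construction, not from the Weka source; assumes Weka 3.7+) of what deleteWithMissingClass() does: rows whose class value is missing are dropped before training.

import java.util.ArrayList;
import java.util.Arrays;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.Utils;

public class MissingClassDemo {
  public static void main(String[] args) {
    // A toy dataset: one numeric attribute plus a nominal class.
    ArrayList<Attribute> attrs = new ArrayList<>();
    attrs.add(new Attribute("x"));
    attrs.add(new Attribute("class", new ArrayList<>(Arrays.asList("yes", "no"))));

    Instances data = new Instances("demo", attrs, 2);
    data.setClassIndex(1);

    data.add(new DenseInstance(1.0, new double[]{1.0, 0.0}));                  // class = "yes"
    data.add(new DenseInstance(1.0, new double[]{2.0, Utils.missingValue()})); // class missing

    data.deleteWithMissingClass();
    System.out.println(data.numInstances()); // 1: the unlabeled row is gone
  }
}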

(2) First, let's look at the buildTree function.

public void buildTree(Instances data, boolean keepData) throws Exception {

    Instances[] localInstances;

    if (keepData) {
      m_train = data;
    }
    m_test = null;
    m_isLeaf = false;
    m_isEmpty = false;
    m_sons = null;
    m_localModel = m_toSelectModel.selectModel(data);
    if (m_localModel.numSubsets() > 1) {
      localInstances = m_localModel.split(data);
      data = null;
      m_sons = new ClassifierTree[m_localModel.numSubsets()];
      for (int i = 0; i < m_sons.length; i++) {
        m_sons[i] = getNewTree(localInstances[i]);
        localInstances[i] = null;
      }
    } else {
      m_isLeaf = true;
      if (Utils.eq(data.sumOfWeights(), 0))
        m_isEmpty = true;
      data = null;
    }
  }
The logic of this function is also relatively simple. First, it decides whether to keep a reference to the data based on the keepData parameter.

Then a split model is selected via m_toSelectModel, which divides the incoming dataset into subsets according to the model's rules. The ModelSelection is passed in through the constructor (see the main flow described earlier). Mapped onto the algorithm description in the previous post, the subsets obtained in this step are the Dv sets.

Then the number of subsets is checked. If there is only one, the node is a leaf and nothing more is needed; the method returns (marking the node as empty if the total instance weight is zero).

Otherwise, data has been divided into sub-Instances by the local model; for each subset a new ClassifierTree node is created as a child by calling getNewTree, which builds a new subtree from it. A plain-Java illustration of this per-value split follows below.
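The Dv idea can be illustrated without any Weka machinery. A tiny sketch of my own (plain Java, 16+ for records): splitting a dataset on one nominal attribute simply groups the rows by that attribute's value.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SplitDemo {
  record Row(String outlook, String play) {}

  public static void main(String[] args) {
    List<Row> data = List.of(
        new Row("sunny", "no"), new Row("overcast", "yes"),
        new Row("rainy", "yes"), new Row("sunny", "no"));

    // Analogue of m_localModel.split(data): one subset Dv per value v of the split attribute.
    Map<String, List<Row>> subsets =
        data.stream().collect(Collectors.groupingBy(Row::outlook));
    subsets.forEach((v, rows) -> System.out.println(v + " -> " + rows.size() + " row(s)"));
  }
}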

(3) Following the recursion depth-first, let's check the logic of getNewTree.

protected ClassifierTree getNewTree(Instances data) throws Exception {

    ClassifierTree newTree = new ClassifierTree(m_toSelectModel);
    newTree.buildTree(data, false);

    return newTree;
  }
It is a recursive call.

(4) Returning to C45PruneableClassifierTree.buildClassifier, let's study the collapse function.

/**
   * Collapses a tree to a node if training error doesn't increase.
   */
  public final void collapse() {

    double errorsOfSubtree;
    double errorsOfTree;
    int i;

    if (!m_isLeaf) {
      errorsOfSubtree = getTrainingErrors();
      errorsOfTree = localModel().distribution().numIncorrect();
      if (errorsOfSubtree >= errorsOfTree - 1E-3) {

        // Free adjacent trees
        m_sons = null;
        m_isLeaf = true;

        // Get NoSplit Model for tree.
        m_localModel = new NoSplit(localModel().distribution());
      } else {
        for (i = 0; i < m_sons.length; i++)
          son(i).collapse();
      }
    }
  }
Note that if a node has child nodes but those children do not improve the training accuracy of the classification tree, the children are deleted and the node becomes a leaf; otherwise collapse recurses into the children. The collapse step can therefore reduce the depth of the decision tree without reducing training accuracy, improving efficiency.
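A made-up numeric trace of the collapse test (my numbers, not from the source):

public class CollapseRuleDemo {
  public static void main(String[] args) {
    double errorsOfSubtree = 12.0; // training errors summed over the subtree's leaves
    double errorsOfTree = 12.0;    // numIncorrect() if this node were a single leaf
    // The subtree is kept only if it is strictly better on the training data;
    // 1E-3 is the tolerance used in the Weka source.
    boolean collapseToLeaf = errorsOfSubtree >= errorsOfTree - 1E-3;
    System.out.println(collapseToLeaf); // true: the split buys nothing, fold it into a leaf
  }
}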
  

Let's briefly describe how the error of the current node is estimated, i.e. localModel().distribution().numIncorrect():

First, obtain the class distribution of the current training subset, then take the count of the largest (majority) class in that distribution. Those instances count as "correct"; all the rest count as errors.

getTrainingErrors() performs this computation on each child node and then sums the results.
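A worked example with hypothetical counts:

import java.util.Arrays;

public class NumIncorrectDemo {
  public static void main(String[] args) {
    // Hypothetical weighted class counts at a node: {yes: 30, no: 10, maybe: 5}.
    double[] classCounts = {30, 10, 5};
    double total = Arrays.stream(classCounts).sum();              // 45
    double majority = Arrays.stream(classCounts).max().orElse(0); // 30
    // numIncorrect(): majority-class instances count as correct, the rest as errors.
    System.out.println(total - majority); // 15.0
  }
}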

(5) Let's take a look at the prune() method, the last step in buildClassifier of C45PruneableClassifierTree.

This function is relatively long, so the analysis is written directly into the comments.

public void prune() throws Exception {

    double errorsLargestBranch; // estimated errors of the child that receives the most data
    double errorsLeaf;          // estimated errors if this node were turned into a leaf
    double errorsTree;          // estimated errors of the subtree rooted at this node
    int indexOfLargestBranch;   // index, within m_sons, of the child with the most data
    C45PruneableClassifierTree largestBranch; // son(indexOfLargestBranch)
    int i;

    if (!m_isLeaf) {

      // First, recursively prune() all subtrees.
      for (i = 0; i < m_sons.length; i++)
        son(i).prune();

      // The dataset distribution makes it easy to find indexOfLargestBranch.
      indexOfLargestBranch = localModel().distribution().maxBag();

      // m_subtreeRaising is a flag indicating whether a subtree may replace its
      // parent node. If set, the estimated errors of the largest branch are computed;
      // otherwise Double.MAX_VALUE is used so that branch can never win. The estimate
      // itself is again distribution-based statistics (with an m_CF-based correction);
      // for non-leaf nodes the statistics are accumulated recursively.
      if (m_subtreeRaising) {
        errorsLargestBranch = son(indexOfLargestBranch).
          getEstimatedErrorsForBranch((Instances) m_train);
      } else {
        errorsLargestBranch = Double.MAX_VALUE;
      }

      // Estimate the approximate number of errors if this node became a leaf.
      errorsLeaf = getEstimatedErrorsForDistribution(localModel().distribution());

      // Estimate the number of errors of the subtree as it stands.
      errorsTree = getEstimatedErrors();

      // Utils.smOrEq means "smaller or equal", i.e. <=.
      // If the node as a leaf has no more errors than the whole subtree, and also
      // no more errors than the largest branch, then a leaf is the best choice.
      if (Utils.smOrEq(errorsLeaf, errorsTree + 0.1) &&
          Utils.smOrEq(errorsLeaf, errorsLargestBranch + 0.1)) {

        // Free son trees.
        m_sons = null;
        m_isLeaf = true;

        // Get NoSplit model for node, then return directly.
        m_localModel = new NoSplit(localModel().distribution());
        return;
      }

      // Otherwise decide whether the largest branch is a better choice than the
      // whole subtree; if so, raise it into this node's place and prune again.
      if (Utils.smOrEq(errorsLargestBranch, errorsTree + 0.1)) {
        largestBranch = son(indexOfLargestBranch);
        m_sons = largestBranch.m_sons;
        m_localModel = largestBranch.localModel();
        m_isLeaf = largestBranch.m_isLeaf;
        newDistribution(m_train);
        prune();
      }
    }
  }
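To see the three-way decision in isolation, here is a trace with made-up error estimates (the 0.1 slack is the constant from the source):

public class PruneDecisionDemo {
  public static void main(String[] args) {
    double errorsTree = 8.5;          // estimated errors of the subtree as it stands
    double errorsLeaf = 8.4;          // estimated errors if collapsed to a leaf
    double errorsLargestBranch = 9.2; // estimated errors if replaced by its largest branch

    if (errorsLeaf <= errorsTree + 0.1 && errorsLeaf <= errorsLargestBranch + 0.1) {
      System.out.println("replace the subtree with a leaf"); // wins here: 8.4 vs 8.6 and 9.3
    } else if (errorsLargestBranch <= errorsTree + 0.1) {
      System.out.println("raise the largest branch and prune again");
    } else {
      System.out.println("keep the subtree");
    }
  }
}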

To sum up collapse and prune in one sentence: collapse never increases the error on the training set, while prune, which works from corrected error estimates, may trade training accuracy for better generalization.


IV. PruneableClassifierTree

In the main J48 flow, one of two different ClassifierTrees is chosen based on m_reducedErrorPruning. One has just been analyzed; the other is PruneableClassifierTree.

(1) buildClassifier

public void buildClassifier(Instances data) throws Exception {

    // can classifier tree handle the data?
    getCapabilities().testWithFail(data);

    // remove instances with missing class
    data = new Instances(data);
    data.deleteWithMissingClass();

    Random random = new Random(m_seed);
    data.stratify(numSets);
    buildTree(data.trainCV(numSets, numSets - 1, random),
              data.testCV(numSets, numSets - 1), !m_cleanup);
    if (pruneTheTree) {
      prune();
    }
    if (m_cleanup) {
      cleanup(new Instances(data, 0));
    }
  }
Different from C45PruneableClassifierTree, buildTree is passed a test set in addition to the training set: the data is stratified into numSets folds, the last fold is held out for pruning, and the rest grow the tree. Also, the collapse step is missing; the rest is the same.
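To make the fold arithmetic concrete, here is a sketch of my own (the ARFF path is hypothetical) of the stratify/trainCV/testCV calls used above, with numSets = 3:

import java.util.Random;

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HoldoutSplitDemo {
  public static void main(String[] args) throws Exception {
    Instances data = DataSource.read("weather.nominal.arff"); // hypothetical path
    data.setClassIndex(data.numAttributes() - 1);

    int numSets = 3;        // J48 option -N; one fold prunes, the rest grow the tree
    Random random = new Random(1);
    data.stratify(numSets); // keep class proportions similar across folds
    Instances grow = data.trainCV(numSets, numSets - 1, random); // roughly 2/3 of the data
    Instances hold = data.testCV(numSets, numSets - 1);          // remaining 1/3, for prune()
    System.out.println(grow.numInstances() + " grow / " + hold.numInstances() + " hold-out");
  }
}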

Next, let's look at how this buildTree, which also receives a test set, differs from the one analyzed earlier.

(2) buildTree

public void buildTree(Instances train, Instances test, boolean keepData)
       throws Exception {

    Instances[] localTrain, localTest;
    int i;

    if (keepData) {
      m_train = train;
    }
    m_isLeaf = false;
    m_isEmpty = false;
    m_sons = null;
    m_localModel = m_toSelectModel.selectModel(train, test);
    m_test = new Distribution(test, m_localModel);
    if (m_localModel.numSubsets() > 1) {
      localTrain = m_localModel.split(train);
      localTest = m_localModel.split(test);
      train = test = null;
      m_sons = new ClassifierTree[m_localModel.numSubsets()];
      for (i = 0; i < m_sons.length; i++) {
        m_sons[i] = getNewTree(localTrain[i], localTest[i]);
        localTrain[i] = null;
        localTest[i] = null;
      }
    } else {
      m_isLeaf = true;
      if (Utils.eq(train.sumOfWeights(), 0))
        m_isEmpty = true;
      train = test = null;
    }
  }
As you can see, the code is basically the same. The main difference is that the test set is passed into selectModel and split alongside the training set; the Model implementation will be detailed in the next post.
      

prune() is also simpler here, with the subtree raising feature removed.

public void prune() throws Exception {

    if (!m_isLeaf) {

      // Prune all subtrees.
      for (int i = 0; i < m_sons.length; i++)
        son(i).prune();

      // Decide if leaf is best choice.
      if (Utils.smOrEq(errorsForLeaf(), errorsForTree())) {

        // Free son Trees
        m_sons = null;
        m_isLeaf = true;

        // Get NoSplit Model for node.
        m_localModel = new NoSplit(localModel().distribution());
      }
    }
  }
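Unlike the C4.5 variant, the comparison here uses errors counted on the held-out fold rather than statistically corrected estimates. A made-up trace (my numbers):

public class ReducedErrorDemo {
  public static void main(String[] args) {
    double errorsForTree = 4.0; // hold-out errors of the subtree, summed over its leaves
    double errorsForLeaf = 3.0; // hold-out errors if this node becomes a leaf
    System.out.println(errorsForLeaf <= errorsForTree); // true -> collapse to a leaf
  }
}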


V. Summary

So far, the analysis of buildClassifier for the two ClassifierTrees is essentially complete. In general, a ClassifierTree builds and maintains the structure of the classification tree through the Model passed in, and after the build it trims the tree according to the chosen pruning logic.


Of the questions raised at the beginning of the article, only question 4 can be answered so far. In short, pruning works from the distribution of the existing dataset: it compares the estimated errors of the subtree as it stands, of its largest branch, and of the node collapsed to a leaf, and keeps whichever is best.

The next article mainly analyzes the implementation of the Model, that is, how an existing dataset is split into sub-Instances based on an attribute.











