Source code analysis of Weka algorithm Classifier-tree-J48 (3) ModelSelection

Source: Internet
Author: User


ModelSelection is responsible for choosing the attribute column on which to split the dataset. From the main J48 flow covered in the previous article, we know that the ModelSelection used is either C45ModelSelection or BinC45ModelSelection, so we begin with C45ModelSelection.


I. C45ModelSelection

As an implementation of the ModelSelection interface, it must provide two methods: selectModel(Instances) and selectModel(Instances, Instances). The latter in C45ModelSelection looks like this:

  public final ClassifierSplitModel selectModel(Instances train, Instances test) {
    return selectModel(train);
  }
We can see that the test set is simply ignored and the single-argument selectModel is called directly; that method does the actual work of selecting a split.

Here is the whole method; we will then walk through it section by section:

  public final ClassifierSplitModel selectModel(Instances data) {

    double minResult;
    double currentResult;
    C45Split[] currentModel;
    C45Split bestModel = null;
    NoSplit noSplitModel = null;
    double averageInfoGain = 0;
    int validModels = 0;
    boolean multiVal = true;
    Distribution checkDistribution;
    Attribute attribute;
    double sumOfWeights;
    int i;

    try {

      // Check if all Instances belong to one class or if not
      // enough Instances to split.
      checkDistribution = new Distribution(data);
      noSplitModel = new NoSplit(checkDistribution);
      if (Utils.sm(checkDistribution.total(), 2 * m_minNoObj) ||
          Utils.eq(checkDistribution.total(),
                   checkDistribution.perClass(checkDistribution.maxClass())))
        return noSplitModel;

      // Check if all attributes are nominal and have a
      // lot of values.
      if (m_allData != null) {
        Enumeration enu = data.enumerateAttributes();
        while (enu.hasMoreElements()) {
          attribute = (Attribute) enu.nextElement();
          if ((attribute.isNumeric()) ||
              (Utils.sm((double) attribute.numValues(),
                        (0.3 * (double) m_allData.numInstances())))) {
            multiVal = false;
            break;
          }
        }
      }

      currentModel = new C45Split[data.numAttributes()];
      sumOfWeights = data.sumOfWeights();

      // For each attribute.
      for (i = 0; i < data.numAttributes(); i++) {

        // Apart from class attribute.
        if (i != (data).classIndex()) {

          // Get models for current attribute.
          currentModel[i] = new C45Split(i, m_minNoObj, sumOfWeights);
          currentModel[i].buildClassifier(data);

          // Check if useful split for current attribute
          // exists and check for enumerated attributes with
          // a lot of values.
          if (currentModel[i].checkModel())
            if (m_allData != null) {
              if ((data.attribute(i).isNumeric()) ||
                  (multiVal || Utils.sm((double) data.attribute(i).numValues(),
                                        (0.3 * (double) m_allData.numInstances())))) {
                averageInfoGain = averageInfoGain + currentModel[i].infoGain();
                validModels++;
              }
            } else {
              averageInfoGain = averageInfoGain + currentModel[i].infoGain();
              validModels++;
            }
        } else
          currentModel[i] = null;
      }

      // Check if any useful split was found.
      if (validModels == 0)
        return noSplitModel;
      averageInfoGain = averageInfoGain / (double) validModels;

      // Find "best" attribute to split on.
      minResult = 0;
      for (i = 0; i < data.numAttributes(); i++) {
        if ((i != (data).classIndex()) && (currentModel[i].checkModel()))
          if ((currentModel[i].infoGain() >= (averageInfoGain - 1E-3)) &&
              Utils.gr(currentModel[i].gainRatio(), minResult)) {
            bestModel = currentModel[i];
            minResult = currentModel[i].gainRatio();
          }
      }

      // Check if useful split was found.
      if (Utils.eq(minResult, 0))
        return noSplitModel;

      // Add all Instances with unknown values for the corresponding
      // attribute to the distribution for the model, so that
      // the complete distribution is stored with the model.
      bestModel.distribution().addInstWithUnknown(data, bestModel.attIndex());

      // Set the split point analogue to C45 if attribute numeric.
      if (m_allData != null)
        bestModel.setSplitPoint(m_allData);
      return bestModel;
    } catch (Exception e) {
      e.printStackTrace();
    }
    return null;
  }
 
The first part mainly defines some local variables.

double minResult;               // the best gain ratio found so far
double currentResult;           // the current gain ratio
C45Split[] currentModel;        // candidate split models, one per attribute
C45Split bestModel = null;      // the best model so far
NoSplit noSplitModel = null;    // the "do not split" model
double averageInfoGain = 0;     // average information gain over the candidate models
int validModels = 0;            // number of valid candidate models
boolean multiVal = true;        // whether the data counts as multi-valued
Distribution checkDistribution; // class distribution of the training data
Attribute attribute;            // the current attribute
double sumOfWeights;            // sum of instance weights of the training data
int i;                          // loop variable

The second part is the recursion exit (the stopping condition).

      checkDistribution = new Distribution(data);
      noSplitModel = new NoSplit(checkDistribution);
      if (Utils.sm(checkDistribution.total(), 2 * m_minNoObj) ||
          Utils.eq(checkDistribution.total(),
                   checkDistribution.perClass(checkDistribution.maxClass())))
        return noSplitModel;
As you can see, if the total weight of the current dataset is less than 2 * m_minNoObj (m_minNoObj defaults to 2), or all instances in the current dataset already belong to the same class, noSplitModel is returned, meaning no split is needed. This is the condition under which a C4.5 tree node stops splitting.
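The stopping test can be sketched as a small stand-alone check. The class and method names here are illustrative, not Weka's API; the class-count array stands in for Weka's Distribution:

```java
public class StopCheck {

    // classCounts[c] = weight of instances of class c at this node
    static boolean shouldStopSplitting(double[] classCounts, double minNoObj) {
        double total = 0, max = 0;
        for (double c : classCounts) {
            total += c;
            max = Math.max(max, c);
        }
        // too few instances to form two subsets, or the node is already pure
        return total < 2 * minNoObj || total == max;
    }

    public static void main(String[] args) {
        System.out.println(shouldStopSplitting(new double[]{3, 0}, 2)); // pure node -> true
        System.out.println(shouldStopSplitting(new double[]{1, 2}, 2)); // 3 < 4 -> true
        System.out.println(shouldStopSplitting(new double[]{5, 3}, 2)); // splittable -> false
    }
}
```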

The third part determines whether the data is multi-valued:

      if (m_allData != null) {
        Enumeration enu = data.enumerateAttributes();
        while (enu.hasMoreElements()) {
          attribute = (Attribute) enu.nextElement();
          if ((attribute.isNumeric()) ||
              (Utils.sm((double) attribute.numValues(),
                        (0.3 * (double) m_allData.numInstances())))) {
            multiVal = false;
            break;
          }
        }
      }
If any attribute column is numeric, or has fewer distinct values than 0.3 times the number of training instances, the data is not treated as multi-valued; otherwise it is. Whether the data is multi-valued affects some of the logic below.
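The multiVal logic above can be sketched in isolation. The boolean and int arrays here are illustrative stand-ins for Weka's Attribute objects:

```java
public class MultiValCheck {

    // The data counts as multi-valued only if every attribute is nominal AND
    // has at least 0.3 * numInstances distinct values; a single numeric or
    // small-cardinality attribute flips the flag off.
    static boolean isMultiVal(boolean[] isNumeric, int[] numValues, int numInstances) {
        for (int i = 0; i < isNumeric.length; i++)
            if (isNumeric[i] || numValues[i] < 0.3 * numInstances)
                return false; // this attribute breaks the multi-value assumption
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isMultiVal(new boolean[]{false}, new int[]{40}, 100)); // true
        System.out.println(isMultiVal(new boolean[]{false}, new int[]{10}, 100)); // false
    }
}
```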

The fourth part constructs a splitter for each attribute column.

      for (i = 0; i < data.numAttributes(); i++) {

        // Apart from class attribute.
        if (i != (data).classIndex()) {

          // Get models for current attribute.
          currentModel[i] = new C45Split(i, m_minNoObj, sumOfWeights);
          currentModel[i].buildClassifier(data);

          // Check if useful split for current attribute
          // exists and check for enumerated attributes with
          // a lot of values.
          if (currentModel[i].checkModel())
            if (m_allData != null) {
              if ((data.attribute(i).isNumeric()) ||
                  (multiVal || Utils.sm((double) data.attribute(i).numValues(),
                                        (0.3 * (double) m_allData.numInstances())))) {
                averageInfoGain = averageInfoGain + currentModel[i].infoGain();
                validModels++;
              }
            } else {
              averageInfoGain = averageInfoGain + currentModel[i].infoGain();
              validModels++;
            }
        } else
          currentModel[i] = null;
      }

For each attribute column other than the class attribute, a C45Split object is constructed and built on the data. If the resulting model passes checkModel(), its information gain is added to averageInfoGain and validModels is incremented. We will look at the structure of C45Split later.

The fifth part selects the optimal model.

      if (validModels == 0)
        return noSplitModel;
      averageInfoGain = averageInfoGain / (double) validModels;

      // Find "best" attribute to split on.
      minResult = 0;
      for (i = 0; i < data.numAttributes(); i++) {
        if ((i != (data).classIndex()) && (currentModel[i].checkModel()))
          if ((currentModel[i].infoGain() >= (averageInfoGain - 1E-3)) &&
              Utils.gr(currentModel[i].gainRatio(), minResult)) {
            bestModel = currentModel[i];
            minResult = currentModel[i].gainRatio();
          }
      }

If any valid model exists, the best one is selected from among them. Note that the selection logic is not simply to take the largest gainRatio: a candidate's information gain must also be at least the average information gain (minus a 1E-3 tolerance), which also differs from the textbook description of C4.5.
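The selection rule can be sketched as follows. The parallel arrays of gains and gain ratios are illustrative stand-ins for Weka's C45Split objects:

```java
public class BestSplitDemo {

    // Returns the index of the chosen split, or -1 if no split qualifies.
    // A split qualifies only if its information gain reaches the average gain
    // (minus a small tolerance); among qualifying splits the highest gain
    // ratio wins, mirroring the selection loop in selectModel.
    static int selectBest(double[] infoGain, double[] gainRatio) {
        double avgGain = 0;
        for (double g : infoGain) avgGain += g;
        avgGain /= infoGain.length;

        int best = -1;
        double bestRatio = 0;
        for (int i = 0; i < infoGain.length; i++) {
            // gain must not fall clearly below the average
            if (infoGain[i] >= avgGain - 1e-3 && gainRatio[i] > bestRatio) {
                best = i;
                bestRatio = gainRatio[i];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // attribute 2 has the best gain ratio (0.40) but its gain (0.15) is far
        // below the average gain (0.45), so attribute 1 wins instead
        double[] gain  = {0.25, 0.95, 0.15};
        double[] ratio = {0.16, 0.30, 0.40};
        System.out.println(selectBest(gain, ratio)); // prints 1
    }
}
```

This shows why the rule matters: without the average-gain filter, attribute 2 would have been chosen on gain ratio alone.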

One further observation: in its C4.5 implementation, Weka does not search for the best split attribute only among the attributes not yet used on the path from the root; instead, it searches among all columns. The two are equivalent in their results (an attribute split on higher up can still yield a good gain ratio here), but in terms of efficiency I personally have reservations about Weka's choice.


II. C45Split

Within ModelSelection, C45Split is the class that splits the training set on an attribute and computes the information gain and gain ratio. We start with its buildClassifier method.

  public void buildClassifier(Instances trainInstances) throws Exception {

    // Initialize the remaining instance variables.
    m_numSubsets = 0;
    m_splitPoint = Double.MAX_VALUE;
    m_infoGain = 0;
    m_gainRatio = 0;

    // Different treatment for enumerated and numeric
    // attributes.
    if (trainInstances.attribute(m_attIndex).isNominal()) {
      m_complexityIndex = trainInstances.attribute(m_attIndex).numValues();
      m_index = m_complexityIndex;
      handleEnumeratedAttribute(trainInstances);
    } else {
      m_complexityIndex = 2;
      m_index = 0;
      trainInstances.sort(trainInstances.attribute(m_attIndex));
      handleNumericAttribute(trainInstances);
    }
  }
We can see that enumerated and numeric attributes are handled separately: the enumerated case calls handleEnumeratedAttribute and the numeric case calls handleNumericAttribute. Note that before a numeric attribute is processed, the instances are sorted on that column, and m_complexityIndex, the number of subsets to split into, is set to 2.

First, let's look at how enumeration types are handled.

  private void handleEnumeratedAttribute(Instances trainInstances) throws Exception {

    Instance instance;
    m_distribution = new Distribution(m_complexityIndex,
                                      trainInstances.numClasses());

    // Only Instances with known values are relevant.
    Enumeration enu = trainInstances.enumerateInstances();
    while (enu.hasMoreElements()) {
      instance = (Instance) enu.nextElement();
      if (!instance.isMissing(m_attIndex))
        m_distribution.add((int) instance.value(m_attIndex), instance);
    }

    // Check if minimum number of Instances in at least two
    // subsets.
    if (m_distribution.check(m_minNoObj)) {
      m_numSubsets = m_complexityIndex;
      m_infoGain = infoGainCrit.splitCritValue(m_distribution, m_sumOfWeights);
      m_gainRatio = gainRatioCrit.splitCritValue(m_distribution, m_sumOfWeights,
                                                 m_infoGain);
    }
  }
The general flow: create a new Distribution and traverse all instances; every instance whose split attribute is not missing is added to the bag for its attribute value. The distribution is then checked: at least two bags must each contain at least m_minNoObj instances. If the check passes, the number of subsets is set and the information gain and gain ratio are computed. Otherwise m_numSubsets keeps its default value of 0, so when the caller later invokes checkModel(), it returns false, marking this as an invalid model.
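The two criteria computed here follow the standard C4.5 formulas: information gain is the class entropy minus the weighted per-bag class entropies, and gain ratio divides that by the entropy of the bag sizes. A minimal self-contained sketch, with counts[bag][class] standing in for Weka's Distribution (this reproduces the formulas, not Weka's infoGainCrit/gainRatioCrit code exactly):

```java
public class EnumSplitDemo {

    // Shannon entropy (base 2) of a vector of class counts
    static double entropy(double[] counts) {
        double total = 0, e = 0;
        for (double c : counts) total += c;
        for (double c : counts)
            if (c > 0) e -= (c / total) * (Math.log(c / total) / Math.log(2));
        return e;
    }

    // info gain = H(class) - sum over bags of (bagWeight/total) * H(class | bag)
    static double infoGain(double[][] counts) {
        int numClasses = counts[0].length;
        double total = 0;
        double[] classTotals = new double[numClasses];
        for (double[] bag : counts)
            for (int c = 0; c < numClasses; c++) {
                classTotals[c] += bag[c];
                total += bag[c];
            }
        double gain = entropy(classTotals); // entropy before the split
        for (double[] bag : counts) {
            double bagTotal = 0;
            for (double v : bag) bagTotal += v;
            if (bagTotal > 0) gain -= (bagTotal / total) * entropy(bag);
        }
        return gain;
    }

    // gain ratio = info gain / entropy of the bag sizes (the split entropy)
    static double gainRatio(double[][] counts) {
        double total = 0;
        double[] bagTotals = new double[counts.length];
        for (int b = 0; b < counts.length; b++)
            for (double v : counts[b]) {
                bagTotals[b] += v;
                total += v;
            }
        double splitEnt = entropy(bagTotals);
        return splitEnt > 0 ? infoGain(counts) / splitEnt : 0;
    }

    public static void main(String[] args) {
        // two bags, two classes: a perfectly separating split
        double[][] counts = { {4, 0}, {0, 4} };
        System.out.println(infoGain(counts));  // prints 1.0
        System.out.println(gainRatio(counts)); // prints 1.0
    }
}
```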

Next, let's take a look at how the numeric type is handled:

  private void handleNumericAttribute(Instances trainInstances) throws Exception {

    int firstMiss;          // index of the first missing-value instance
    int next = 1;           // index of the next instance
    int last = 0;           // index of the current instance
    int splitIndex = -1;    // index of the split point
    double currentInfoGain; // the current information gain
    double defaultEnt;      // the entropy before the split
    double minSplit;        // minimum weight required in each bag
    Instance instance;
    int i;

    // Create a new distribution. A numeric attribute is handled as a two-bag
    // distribution: values below the split point go into one bag, the rest
    // into the other.
    m_distribution = new Distribution(2, trainInstances.numClasses());
    Enumeration enu = trainInstances.enumerateInstances();
    i = 0;

    // Note that the input instances are sorted, which guarantees that missing
    // values come last; once a missing value is read, everything after it is
    // missing too. So after the loop, firstMiss is the index of the first
    // missing instance, i.e. the number of valid instances.
    while (enu.hasMoreElements()) {
      instance = (Instance) enu.nextElement();
      if (instance.isMissing(m_attIndex))
        break;
      m_distribution.add(1, instance);
      i++;
    }
    firstMiss = i; // m_distribution now holds all valid instances, all in bag 1

    // minSplit is the minimum amount of data per bag: 0.1 times the average
    // class size, clamped between m_minNoObj and 25.
    minSplit = 0.1 * (m_distribution.total()) / ((double) trainInstances.numClasses());
    if (Utils.smOrEq(minSplit, m_minNoObj))
      minSplit = m_minNoObj;
    else if (Utils.gr(minSplit, 25))
      minSplit = 25;

    // If the total valid weight is below 2 * minSplit, the two bags cannot
    // both reach minSplit, so return directly.
    if (Utils.sm((double) firstMiss, 2 * minSplit))
      return;

    // defaultEnt is the old entropy: the class entropy before splitting on
    // this attribute.
    defaultEnt = infoGainCrit.oldEnt(m_distribution);
    while (next < firstMiss) {

      // The instances are sorted in ascending order; this condition treats
      // values that differ by less than 1e-5 as the same value.
      if (trainInstances.instance(next - 1).value(m_attIndex) + 1e-5 <
          trainInstances.instance(next).value(m_attIndex)) {

        // last marks the current record and next the following one; with the
        // initial values next = 1 and last = 0, shiftRange moves the records
        // in [last, next) from bag 1 to bag 0. Remember that initially all
        // instances are in bag 1.
        m_distribution.shiftRange(1, 0, trainInstances, last, next);

        // Only consider this cut if both bags reach the minimum size minSplit.
        if (Utils.grOrEq(m_distribution.perBag(0), minSplit) &&
            Utils.grOrEq(m_distribution.perBag(1), minSplit)) {

          // Compute the information gain of this cut.
          currentInfoGain = infoGainCrit.splitCritValue(m_distribution,
                                                        m_sumOfWeights,
                                                        defaultEnt);

          // If it beats the current maximum, record the new best split index.
          if (Utils.gr(currentInfoGain, m_infoGain)) {
            m_infoGain = currentInfoGain;
            splitIndex = next - 1;
          }
          m_index++;
        }
        last = next;
      }
      next++;
    }

    // m_index == 0 means no suitable split point was found; return directly.
    if (m_index == 0)
      return;

    // Compute the corrected information gain for the best split.
    m_infoGain = m_infoGain - (Utils.log2(m_index) / m_sumOfWeights);

    // If the corrected gain is not positive, no suitable split point was
    // found; return directly.
    if (Utils.smOrEq(m_infoGain, 0))
      return;

    // The rest sets up the attribute split at the chosen point.
    m_numSubsets = 2;
    m_splitPoint = (trainInstances.instance(splitIndex + 1).value(m_attIndex) +
                    trainInstances.instance(splitIndex).value(m_attIndex)) / 2;

    // In case we have a numerical precision problem we need to choose the
    // smaller value.
    if (m_splitPoint == trainInstances.instance(splitIndex + 1).value(m_attIndex)) {
      m_splitPoint = trainInstances.instance(splitIndex).value(m_attIndex);
    }

    // Restore distribution for best split.
    m_distribution = new Distribution(2, trainInstances.numClasses());
    m_distribution.addRange(0, trainInstances, 0, splitIndex + 1);
    m_distribution.addRange(1, trainInstances, splitIndex + 1, firstMiss);

    // Compute modified gain ratio for best split.
    m_gainRatio = gainRatioCrit.splitCritValue(m_distribution,
                                               m_sumOfWeights,
                                               m_infoGain);
  }
This method is more involved; the detailed logic is written into the inline comments above.
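The core of the search above can be compressed into a self-contained sketch: sort by the attribute, try a cut between each pair of adjacent distinct values, keep the cut with the best information gain, and place the split point at the midpoint, falling back to the smaller value on precision ties. The arrays and method name here are illustrative stand-ins for Weka's Instances; the minSplit constraint and the log2(m_index)/m_sumOfWeights gain correction from the real code are omitted for brevity:

```java
public class NumericSplitDemo {

    static double entropy(double[] counts) {
        double total = 0, e = 0;
        for (double c : counts) total += c;
        for (double c : counts)
            if (c > 0) e -= (c / total) * (Math.log(c / total) / Math.log(2));
        return e;
    }

    // values must be sorted ascending; classes[i] is the class index of values[i].
    // Returns the chosen split point, or Double.MAX_VALUE if no cut exists.
    static double findSplitPoint(double[] values, int[] classes, int numClasses) {
        int n = values.length;
        double[] left = new double[numClasses];
        double[] right = new double[numClasses];
        for (int c : classes) right[c]++;           // everything starts on the right
        double oldEnt = entropy(right.clone());     // entropy before any split
        double bestGain = 0;
        int bestIndex = -1;
        int numCuts = 0; // Weka would subtract log2(numCuts)/totalWeight from the gain
        for (int i = 1; i < n; i++) {
            left[classes[i - 1]]++;                 // move one instance to the left bag
            right[classes[i - 1]]--;
            if (values[i - 1] + 1e-5 < values[i]) { // only cut between distinct values
                numCuts++;
                double gain = oldEnt
                        - (i * entropy(left) + (n - i) * entropy(right)) / n;
                if (gain > bestGain) {
                    bestGain = gain;
                    bestIndex = i - 1;
                }
            }
        }
        if (bestIndex < 0) return Double.MAX_VALUE; // no admissible cut found
        double split = (values[bestIndex] + values[bestIndex + 1]) / 2;
        if (split == values[bestIndex + 1])         // precision guard: take smaller value
            split = values[bestIndex];
        return split;
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, 10, 11, 12};
        int[] classes  = {0, 0, 0,  1,  1,  1};
        System.out.println(findSplitPoint(values, classes, 2)); // prints 6.5
    }
}
```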


III. BinC45ModelSelection

This class generates only binary split models. Its selectModel method is almost identical to that of C45ModelSelection; the difference is that it uses BinC45Split instead of C45Split.


IV. BinC45Split

Its handleNumericAttribute treats numeric attributes exactly as C45Split does. Below we analyze handleEnumeratedAttribute.

  private void handleEnumeratedAttribute(Instances trainInstances) throws Exception {

    Distribution newDistribution, secondDistribution;
    int numAttValues;
    double currIG, currGR;
    Instance instance;
    int i;

    numAttValues = trainInstances.attribute(m_attIndex).numValues();
    newDistribution = new Distribution(numAttValues,
                                       trainInstances.numClasses());

    // Only Instances with known values are relevant.
    Enumeration enu = trainInstances.enumerateInstances();
    while (enu.hasMoreElements()) {
      instance = (Instance) enu.nextElement();
      if (!instance.isMissing(m_attIndex))
        newDistribution.add((int) instance.value(m_attIndex), instance);
    }
    m_distribution = newDistribution;

    // For all values
    for (i = 0; i < numAttValues; i++) {
      if (Utils.grOrEq(newDistribution.perBag(i), m_minNoObj)) {
        secondDistribution = new Distribution(newDistribution, i);

        // Check if minimum number of Instances in the two
        // subsets.
        if (secondDistribution.check(m_minNoObj)) {
          m_numSubsets = 2;
          currIG = m_infoGainCrit.splitCritValue(secondDistribution,
                                                 m_sumOfWeights);
          currGR = m_gainRatioCrit.splitCritValue(secondDistribution,
                                                  m_sumOfWeights, currIG);
          if ((i == 0) || Utils.gr(currGR, m_gainRatio)) {
            m_gainRatio = currGR;
            m_infoGain = currIG;
            m_splitPoint = (double) i;
            m_distribution = secondDistribution;
          }
        }
      }
    }
  }
As the code shows, for each distinct value of the attribute a new secondDistribution is built from the existing distribution:

secondDistribution = new Distribution(newDistribution,i);

This distribution has two bags: instances whose attribute value has index i, and all the rest. The information gain and gain ratio are computed on this two-bag distribution, and the best value is kept.

In other words, the binary split of a nominal attribute puts one value on one branch and all remaining values on the other. This is certainly not always the best possible binary partition structurally, but it is simple and cheap to compute.
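The one-value-versus-rest strategy can be sketched as follows. counts[v][c] stands in for Weka's Distribution, and the standard C4.5 formulas are used rather than Weka's criterion classes:

```java
public class BinNominalSplitDemo {

    static double entropy(double[] counts) {
        double total = 0, e = 0;
        for (double c : counts) total += c;
        for (double c : counts)
            if (c > 0) e -= (c / total) * (Math.log(c / total) / Math.log(2));
        return e;
    }

    // counts[v][c] = weight of class c among instances with attribute value v.
    // Returns the value index whose "this value vs. the rest" split has the
    // best gain ratio, or -1 if no split qualifies.
    static int bestValueIndex(double[][] counts) {
        int numClasses = counts[0].length;
        int best = -1;
        double bestRatio = 0;
        for (int v = 0; v < counts.length; v++) {
            double[] in = counts[v];                 // bag for value v
            double[] out = new double[numClasses];   // bag for everything else
            double[] all = new double[numClasses];
            for (int u = 0; u < counts.length; u++)
                for (int c = 0; c < numClasses; c++) {
                    all[c] += counts[u][c];
                    if (u != v) out[c] += counts[u][c];
                }
            double total = 0, inTotal = 0, outTotal = 0;
            for (int c = 0; c < numClasses; c++) {
                total += all[c];
                inTotal += in[c];
                outTotal += out[c];
            }
            // information gain of the two-bag split, then the gain ratio
            double gain = entropy(all)
                    - (inTotal * entropy(in) + outTotal * entropy(out)) / total;
            double splitEnt = entropy(new double[]{inTotal, outTotal});
            double ratio = splitEnt > 0 ? gain / splitEnt : 0;
            if (ratio > bestRatio) {
                bestRatio = ratio;
                best = v;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // value 2 alone isolates class 1, so "value 2 vs. rest" should win
        double[][] counts = { {3, 0}, {3, 0}, {0, 4} };
        System.out.println(bestValueIndex(counts)); // prints 2
    }
}
```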


This basically completes the analysis of the two ModelSelections used by J48. The next article will analyze the classifyInstance process and give a brief summary.





