Weka algorithm Classifier-tree-RandomForest source code analysis (2): code implementation

Source: Internet
Author: User




The implementation of RandomForest is exceptionally simple, far simpler than the blogger expected. Weka implements it by combining Bagging with RandomTree.


1. RandomForest Training

The code for building RandomForest is as follows:

  public void buildClassifier(Instances data) throws Exception {

    // can classifier handle the data?
    getCapabilities().testWithFail(data);

    // remove instances with missing class
    data = new Instances(data);
    data.deleteWithMissingClass();

    m_bagger = new Bagging();
    RandomTree rTree = new RandomTree();

    // set up the random tree options
    m_KValue = m_numFeatures;
    if (m_KValue < 1)
      m_KValue = (int) Utils.log2(data.numAttributes()) + 1;
    rTree.setKValue(m_KValue);
    rTree.setMaxDepth(getMaxDepth());

    // set up the bagger and build the forest
    m_bagger.setClassifier(rTree);
    m_bagger.setSeed(m_randomSeed);
    m_bagger.setNumIterations(m_numTrees);
    m_bagger.setCalcOutOfBag(true);
    m_bagger.buildClassifier(data);
  }
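The code reads straightforwardly. First, the capabilities check is run and instances with a missing class value are removed from a copy of the data. Next, a Bagging object and a RandomTree are created. The RandomTree is configured with the K value (the number of attributes each tree samples, taken from m_numFeatures, or derived from the attribute count when m_numFeatures is less than 1) and the maximum depth. Finally, the RandomTree is passed to Bagging as the base classifier, the seed, the number of iterations (one per tree) and out-of-bag calculation are set, and Bagging's buildClassifier method is called to do the training.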
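The fallback for the K value can be tried outside Weka. The following is a minimal sketch, not Weka code: the class DefaultKValue and the method resolveK are hypothetical names, and it assumes that Utils.log2 computes ln(a) / ln(2).

```java
// Hypothetical helper, not part of Weka: mirrors the K-value fallback in buildClassifier.
final class DefaultKValue {

    // Assumption: Utils.log2(a) is computed as ln(a) / ln(2).
    static double log2(double a) {
        return Math.log(a) / Math.log(2);
    }

    // Returns numFeatures when it is at least 1; otherwise (int) log2(numAttributes) + 1.
    // The cast truncates the logarithm before 1 is added, as in the original expression.
    static int resolveK(int numFeatures, int numAttributes) {
        if (numFeatures >= 1) {
            return numFeatures;
        }
        return (int) log2(numAttributes) + 1;
    }

    public static void main(String[] args) {
        // Print the fallback K for a few attribute counts.
        for (int n : new int[] {5, 10, 20, 100}) {
            System.out.println(n + " attributes -> K = " + resolveK(0, n));
        }
    }
}
```

For example, with 10 attributes log2(10) is about 3.32, which truncates to 3, so K becomes 4. An explicit setting such as resolveK(3, 100) is returned unchanged.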


2. RandomForest Classification

Having covered training, we turn to the classification process, that is, the classifyInstance function. Note that RandomForest inherits from Classifier but does not override classifyInstance; it uses the implementation in the base class Classifier and overrides only distributionForInstance. The distributionForInstance function is called by the base class's classifyInstance and returns the probability of an instance for each class. The code is as follows:

  public double[] distributionForInstance(Instance instance) throws Exception {
    return m_bagger.distributionForInstance(instance);
  }
As we can see, computing the distribution of the given instance over the classes is delegated entirely to the bagger (really lazy), so no detailed analysis is given here. That is left for the analysis of Bagging.
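To make the delegation concrete, here is a minimal sketch of a forest object that simply forwards to an ensemble. All types here (DistributionModel, AveragingEnsemble, ForestSketch) are hypothetical stand-ins, not Weka classes. How Weka's Bagging actually combines its members is not shown in the code above; summing and normalizing the member distributions is an assumption made only for illustration.

```java
// Hypothetical stand-ins for illustration only; these are not Weka's classes.
interface DistributionModel {
    double[] distributionFor(double[] features);
}

// Assumption: the ensemble sums its members' class distributions and normalizes the sum.
final class AveragingEnsemble implements DistributionModel {
    private final DistributionModel[] members;

    AveragingEnsemble(DistributionModel... members) {
        this.members = members;
    }

    @Override
    public double[] distributionFor(double[] features) {
        if (members.length == 0) {
            throw new IllegalStateException("ensemble has no members");
        }
        double[] sum = null;
        for (DistributionModel member : members) {
            double[] d = member.distributionFor(features);
            if (sum == null) {
                sum = new double[d.length];
            }
            for (int i = 0; i < d.length; i++) {
                sum[i] += d[i];
            }
        }
        double total = 0;
        for (double v : sum) {
            total += v;
        }
        if (total > 0) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= total; // normalize so the entries add up to 1
            }
        }
        return sum;
    }
}

// The forest holds an ensemble and forwards the call, like the one-line method above.
final class ForestSketch implements DistributionModel {
    private final DistributionModel bagger;

    ForestSketch(DistributionModel bagger) {
        this.bagger = bagger;
    }

    @Override
    public double[] distributionFor(double[] features) {
        return bagger.distributionFor(features); // pure delegation
    }
}
```

With two members that return {0.8, 0.2} and {0.4, 0.6}, the sketch yields approximately {0.6, 0.4}. The forest class adds nothing of its own, which matches the one-line method shown above.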

Next, let's take a look at how the base class Classifier uses distribution to give the classification result.

  public double classifyInstance(Instance instance) throws Exception {

    double[] dist = distributionForInstance(instance);
    if (dist == null) {
      throw new Exception("Null distribution predicted");
    }
    switch (instance.classAttribute().type()) {
    case Attribute.NOMINAL:
      double max = 0;
      int maxIndex = 0;

      for (int i = 0; i < dist.length; i++) {
        if (dist[i] > max) {
          maxIndex = i;
          max = dist[i];
        }
      }
      if (max > 0) {
        return maxIndex;
      } else {
        return Instance.missingValue();
      }
    case Attribute.NUMERIC:
    case Attribute.DATE:
      return dist[0];
    default:
      return Instance.missingValue();
    }
  }
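The logic is easy to follow. For classification (a nominal class attribute), the index of the class with the highest probability is returned; because the comparison is a strict dist[i] > max, ties go to the lowest index, and if no probability is greater than zero a missing value is returned. For regression (the attribute at classIndex is numeric or a date), dist[0] is returned. This relies on a convention: the first element holds the regression value.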
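The nominal branch can be exercised on a plain array to see the tie and all-zero cases. This is a minimal sketch, not Weka code: the class ArgMaxClass is a hypothetical name, and Double.NaN is used here as a stand-in for Instance.missingValue().

```java
// Hypothetical helper, not part of Weka: the NOMINAL branch of classifyInstance on a plain array.
final class ArgMaxClass {

    // Assumption: Double.NaN stands in for Instance.missingValue().
    static final double MISSING = Double.NaN;

    // Returns the index of the largest probability, or MISSING if none is greater than zero.
    static double classify(double[] dist) {
        double max = 0;
        int maxIndex = 0;
        for (int i = 0; i < dist.length; i++) {
            if (dist[i] > max) { // strict '>' keeps the first index on ties
                maxIndex = i;
                max = dist[i];
            }
        }
        return max > 0 ? maxIndex : MISSING;
    }

    public static void main(String[] args) {
        System.out.println(classify(new double[] {0.2, 0.5, 0.3})); // clear winner
        System.out.println(classify(new double[] {0.5, 0.5}));      // tie
        System.out.println(classify(new double[] {0.0, 0.0}));      // nothing above zero
    }
}
```

The three calls in main cover a clear winner (index 1), a tie (resolved to index 0) and an all-zero distribution (treated as missing).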


3. Summary

That nearly completes the analysis of the RandomForest code. It has little substance of its own, because the main work of the algorithm is done by Bagging and RandomTree. One point is worth noting: when the number of attributes to sample is not specified (m_numFeatures is less than 1), Weka falls back to (int) log2(M) + 1 as an empirical default, where M is data.numAttributes().


The next posts will analyze Weka's RandomTree and then Bagging, to fill in the parts of RandomForest left open here.




