Weka Algorithm Classifier-tree-RandomForest Source Code Analysis (2): Code Implementation
The implementation of RandomForest is exceptionally simple, far simpler than I expected. Weka implements it by combining Bagging with RandomTree.
1. RandomForest Training
The code for building RandomForest is as follows:
    public void buildClassifier(Instances data) throws Exception {

        // can classifier handle the data?
        getCapabilities().testWithFail(data);

        // remove instances with missing class
        data = new Instances(data);
        data.deleteWithMissingClass();

        m_bagger = new Bagging();
        RandomTree rTree = new RandomTree();

        // set up the random tree options
        m_KValue = m_numFeatures;
        if (m_KValue < 1)
            m_KValue = (int) Utils.log2(data.numAttributes()) + 1;
        rTree.setKValue(m_KValue);
        rTree.setMaxDepth(getMaxDepth());

        // set up the bagger and build the forest
        m_bagger.setClassifier(rTree);
        m_bagger.setSeed(m_randomSeed);
        m_bagger.setNumIterations(m_numTrees);
        m_bagger.setCalcOutOfBag(true);
        m_bagger.buildClassifier(data);
    }
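The code is easy to follow. It first checks that the classifier can handle the data and removes the instances whose class value is missing. It then creates a Bagging object and a RandomTree, and sets two options on the tree: the number of randomly selected attributes K (m_KValue) and the maximum depth. Next, the RandomTree is passed to Bagging as the base classifier, along with the random seed and the number of trees (the number of bagging iterations). Finally, Bagging's training method is called to build the forest.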
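The default for K is the least obvious line in the code above, so here is a minimal sketch of just that expression, written without any Weka dependency. The class name DefaultKSketch and the method resolveK are hypothetical names chosen for illustration, and the log2 helper is an assumed stand-in for weka.core.Utils.log2 (natural log divided by ln 2).

```java
// Sketch only: mirrors the default-K expression from buildClassifier.
// DefaultKSketch and resolveK are hypothetical names, not part of Weka.
final class DefaultKSketch {

    // Base-2 logarithm; assumed stand-in for weka.core.Utils.log2.
    static double log2(double a) {
        return Math.log(a) / Math.log(2);
    }

    // numFeatures < 1 means "not specified by the user".
    // numAttributes plays the role of data.numAttributes().
    static int resolveK(int numFeatures, int numAttributes) {
        if (numFeatures < 1) {
            // The cast applies to the logarithm only, so the result is floor(log2) + 1.
            return (int) log2(numAttributes) + 1;
        }
        return numFeatures;
    }

    public static void main(String[] args) {
        System.out.println(resolveK(0, 10)); // prints 4
        System.out.println(resolveK(5, 10)); // prints 5
    }
}
```

With 10 attributes and no explicit setting, log2(10) is about 3.32, which truncates to 3 and gives K = 4. An explicit positive value is used unchanged.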
2. RandomForest Classification
Having read the training process, we can look at how classification works, that is, the classifyInstance function. It is worth noting that RandomForest inherits from Classifier but does not override classifyInstance. It uses the classifyInstance of the base class Classifier and overrides only distributionForInstance. The base class's classifyInstance calls distributionForInstance, which returns the probability of an instance for every class. The code is as follows:
    public double[] distributionForInstance(Instance instance) throws Exception {
        return m_bagger.distributionForInstance(instance);
    }
We can see that computing the class distribution of the given instance is delegated entirely to the bagger (really lazy), so it is not analyzed in detail here. The detailed analysis is left for the post on Bagging.
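Until that analysis, the following sketch gives a rough idea of what such a delegate might do for a nominal class. It assumes the bagger sums the per-tree distributions and normalizes the result, which is how bagging is commonly described. It is not taken from Weka's Bagging source, and the names AverageVoteSketch and combine are hypothetical.

```java
// Sketch only: ASSUMES an ensemble combines per-tree class distributions by
// summing and normalizing. Not copied from Weka's Bagging class.
final class AverageVoteSketch {

    // perTree[t][c] is the probability tree t assigns to class c.
    static double[] combine(double[][] perTree) {
        int numClasses = perTree[0].length;
        double[] sums = new double[numClasses];
        for (double[] dist : perTree) {
            for (int c = 0; c < numClasses; c++) {
                sums[c] += dist[c];
            }
        }
        // Normalize so the combined distribution sums to 1 (if anything was predicted).
        double total = 0;
        for (double s : sums) {
            total += s;
        }
        if (total > 0) {
            for (int c = 0; c < numClasses; c++) {
                sums[c] /= total;
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        double[][] votes = { {1.0, 0.0}, {0.0, 1.0}, {1.0, 0.0} };
        double[] dist = combine(votes);
        System.out.println(dist[0] + " " + dist[1]);
    }
}
```

In this example two of three trees vote for the first class, so the combined distribution puts about two thirds of the probability on it.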
Next, let's look at how the base class Classifier turns this distribution into a classification result:
    public double classifyInstance(Instance instance) throws Exception {

        double[] dist = distributionForInstance(instance);
        if (dist == null) {
            throw new Exception("Null distribution predicted");
        }
        switch (instance.classAttribute().type()) {
            case Attribute.NOMINAL:
                double max = 0;
                int maxIndex = 0;

                for (int i = 0; i < dist.length; i++) {
                    if (dist[i] > max) {
                        maxIndex = i;
                        max = dist[i];
                    }
                }
                if (max > 0) {
                    return maxIndex;
                } else {
                    return Instance.missingValue();
                }
            case Attribute.NUMERIC:
            case Attribute.DATE:
                return dist[0];
            default:
                return Instance.missingValue();
        }
    }
We can see that for classification (a nominal class attribute), the function returns the index of the class with the highest probability, or a missing value if no class has a probability greater than zero. For regression (that is, the attribute at classIndex is numeric or a date), it returns dist[0]. This relies on a convention: the first element of the array holds the predicted value.
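The nominal branch can be tried in isolation with the sketch below. It reproduces the same loop on a plain array. ArgMaxSketch and predictNominal are hypothetical names, and -1 is used here as an assumed stand-in for the value Weka returns from Instance.missingValue().

```java
// Sketch only: the nominal branch of classifyInstance on a plain array.
// ArgMaxSketch and predictNominal are hypothetical names; -1 stands in
// for Weka's missing-value marker.
final class ArgMaxSketch {

    static int predictNominal(double[] dist) {
        double max = 0;
        int maxIndex = 0;
        for (int i = 0; i < dist.length; i++) {
            // Strict comparison: on a tie, the earlier class index wins.
            if (dist[i] > max) {
                maxIndex = i;
                max = dist[i];
            }
        }
        // If every probability is zero, no class can be predicted.
        return max > 0 ? maxIndex : -1;
    }

    public static void main(String[] args) {
        System.out.println(predictNominal(new double[] {0.2, 0.5, 0.3})); // prints 1
        System.out.println(predictNominal(new double[] {0.0, 0.0}));      // prints -1
    }
}
```

Because the comparison is strict, two classes with equal probability resolve to the one with the lower index.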
3. Summary
That almost completes the analysis of the RandomForest code. There is little of substance in it, because the main work of the algorithm is done by Bagging and RandomTree. It is worth noting that when the number of sampled attributes is not specified, Weka uses (int) log2(M) + 1 as the default, where M is the value of data.numAttributes().
The next post will analyze Weka's RandomTree, and the one after that will analyze Bagging, which will fill in the parts of RandomForest skipped here.