WEKA algorithm classifier-trees-REPTree source code analysis (2)

Source: Internet
Author: User

(Part 1)


I. Pruning process

The previous article analyzed how the tree nodes are constructed. If pruning is enabled, REPTree.buildClassifier follows tree construction with a pruning and backfit step.

    if (!m_NoPruning) {
      m_Tree.insertHoldOutSet(prune);
      m_Tree.reducedErrorPrune();
      m_Tree.backfitHoldOutSet();
    }


Of these, insertHoldOutSet passes in the dataset used for pruning; its code is not analyzed in detail here.

The key points are the reducedErrorPrune and backfitHoldOutSet processes.


II. Tree.reducedErrorPrune

    protected double reducedErrorPrune() throws Exception {
      // Returns the hold-out error of this node and its subtree: for a nominal
      // class, the number of misclassified instances; for a numeric class, the
      // sum of squared differences between the predicted and the true values.

      // A leaf node needs no further work.
      if (m_Attribute == -1) {
        // How this error is obtained: when insertHoldOutSet passes the pruning
        // data down the tree, the class of each incoming instance is predicted
        // from the distribution learned during training and then compared with
        // the true class value.
        return m_HoldOutError;
      }

      // Sum the errors of all subtrees.
      double errorTree = 0;
      for (int i = 0; i < m_Successors.length; i++) {
        errorTree += m_Successors[i].reducedErrorPrune();
      }

      if (errorTree >= m_HoldOutError) {
        // The subtrees do no better than this node on its own, so they are
        // pointless: remove them and turn this node into a leaf.
        m_Attribute = -1;
        m_Successors = null;
        return m_HoldOutError;
      } else {
        return errorTree;
      }
    }

As you can see, this pruning process is much simpler than J48's.
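The bottom-up behaviour of this method is easier to see on a toy tree. The following is a minimal sketch, not WEKA's code: PruneNode and PruneDemo are hypothetical classes, and the hold-out error of each node is simply passed in as a number instead of being computed from a dataset.

```java
// Hypothetical toy node; the field names mirror the article, but this is not WEKA's Tree class.
class PruneNode {
    int attribute = -1;       // -1 marks a leaf, as in the article
    PruneNode[] successors;   // child nodes, null for a leaf
    double holdOutError;      // error this node would make on the hold-out set as a leaf

    // Leaf constructor.
    PruneNode(double holdOutError) {
        this.holdOutError = holdOutError;
    }

    // Inner-node constructor.
    PruneNode(int attribute, double holdOutError, PruneNode... successors) {
        this.attribute = attribute;
        this.holdOutError = holdOutError;
        this.successors = successors;
    }

    // Prunes the children first, then decides whether to collapse this node.
    // Returns the hold-out error of the (possibly pruned) subtree.
    double reducedErrorPrune() {
        if (attribute == -1) {
            return holdOutError;
        }
        double errorTree = 0;
        for (PruneNode s : successors) {
            errorTree += s.reducedErrorPrune();
        }
        if (errorTree >= holdOutError) {
            // The subtree is no better than a single leaf: collapse it.
            attribute = -1;
            successors = null;
            return holdOutError;
        }
        return errorTree;
    }
}

class PruneDemo {
    public static void main(String[] args) {
        // Left: children make 3 + 2 = 5 errors, the node alone makes 4, so it is collapsed.
        PruneNode left = new PruneNode(0, 4.0, new PruneNode(3.0), new PruneNode(2.0));
        // Right: children make 1 + 1 = 2 errors, the node alone makes 5, so it is kept.
        PruneNode right = new PruneNode(1, 5.0, new PruneNode(1.0), new PruneNode(1.0));
        PruneNode root = new PruneNode(2, 10.0, left, right);
        System.out.println(root.reducedErrorPrune()); // error of the pruned tree: 4 + 2
        System.out.println(left.attribute);           // -1, the left subtree became a leaf
    }
}
```

Because the comparison is >=, a subtree that merely ties with its parent is also removed, which favours the smaller tree.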


III. Tree.backfitHoldOutSet

    protected void backfitHoldOutSet() throws Exception {

      // Insert instance into hold-out class distribution
      if (m_Info.classAttribute().isNominal()) {

        // Nominal case
        if (m_ClassProbs == null) {
          m_ClassProbs = new double[m_Info.numClasses()];
        }
        System.arraycopy(m_Distribution, 0, m_ClassProbs, 0, m_Info.numClasses());
        for (int i = 0; i < m_HoldOutDist.length; i++) {
          m_ClassProbs[i] += m_HoldOutDist[i];
        }
        if (Utils.sum(m_ClassProbs) > 0) {
          Utils.normalize(m_ClassProbs);
        } else {
          m_ClassProbs = null;
        }
      } else {

        // Numeric case
        double sumOfWeightsTrainAndHoldout = m_Distribution[1] + m_HoldOutDist[0];
        if (sumOfWeightsTrainAndHoldout <= 0) {
          return;
        }
        if (m_ClassProbs == null) {
          m_ClassProbs = new double[1];
        } else {
          m_ClassProbs[0] *= m_Distribution[1];
        }
        m_ClassProbs[0] += m_HoldOutDist[1];
        m_ClassProbs[0] /= sumOfWeightsTrainAndHoldout;
      }

      // The process is recursive
      if (m_Attribute != -1) {
        for (int i = 0; i < m_Successors.length; i++) {
          m_Successors[i].backfitHoldOutSet();
        }
      }
    }

As you can see, this method recomputes each node's distribution by merging the hold-out data into the distribution obtained from the original training data, and then calls backfit recursively on the child nodes. The code is not annotated in further detail here.
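The arithmetic behind the two branches can be isolated in a small sketch. BackfitSketch and its method names are hypothetical, not part of WEKA. The numeric case assumes, from reading the code above, that m_Distribution[1] is the training weight, m_HoldOutDist[0] the hold-out weight, and m_HoldOutDist[1] the weighted sum of the hold-out class values.

```java
// Hypothetical helper that reproduces only the arithmetic of the backfit step.
class BackfitSketch {

    // Nominal class: add the hold-out class counts to the training counts and normalize.
    // Returns null when there is no weight at all, mirroring the code above.
    static double[] backfitNominal(double[] trainCounts, double[] holdOutCounts) {
        double[] probs = new double[trainCounts.length];
        double sum = 0;
        for (int i = 0; i < probs.length; i++) {
            probs[i] = trainCounts[i] + holdOutCounts[i];
            sum += probs[i];
        }
        if (sum <= 0) {
            return null;
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    // Numeric class: weighted mean over training and hold-out data.
    // If there is no weight, the training mean is returned unchanged.
    static double backfitNumeric(double trainMean, double trainWeight,
                                 double holdOutSum, double holdOutWeight) {
        double totalWeight = trainWeight + holdOutWeight;
        if (totalWeight <= 0) {
            return trainMean;
        }
        return (trainMean * trainWeight + holdOutSum) / totalWeight;
    }

    public static void main(String[] args) {
        double[] p = backfitNominal(new double[] {3, 1}, new double[] {1, 3});
        System.out.println(p[0] + " " + p[1]);
        System.out.println(backfitNumeric(2.0, 4.0, 16.0, 4.0));
    }
}
```

In the first call the training data favours class 0 and the hold-out data favours class 1, so the merged distribution ends up balanced.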


IV. Comparison of REPTree and J48

Both are classification trees, but REPTree and J48 differ in many ways. The main differences are described below.

1. Sorting of continuous values

When processing continuous values, J48 sorts each subset at every node. REPTree sorts all attributes once in the main procedure and passes the resulting index arrays down to the tree nodes for processing.

As a result, J48 takes more time, while REPTree uses more memory (number of instances * number of attributes; this is also why the REPTree code repeatedly sets references to null explicitly, to release memory as early as possible). This is a typical time-space tradeoff.
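The presorting idea can be sketched as follows. This is an illustration under assumptions, not REPTree's actual code: PresortSketch and sortedIndices are hypothetical names, and the dataset is a plain double[instance][attribute] array.

```java
// Hypothetical sketch: sort every numeric attribute once, keep only the index arrays.
class PresortSketch {

    // For each attribute (column), returns the instance (row) indices
    // ordered by that attribute's value.
    static int[][] sortedIndices(double[][] data) {
        int numInstances = data.length;
        int numAttributes = numInstances == 0 ? 0 : data[0].length;
        int[][] result = new int[numAttributes][];
        for (int a = 0; a < numAttributes; a++) {
            final int att = a;
            Integer[] idx = new Integer[numInstances];
            for (int i = 0; i < numInstances; i++) {
                idx[i] = i;
            }
            // Sort the row indices by the value in column att.
            java.util.Arrays.sort(idx, (x, y) -> Double.compare(data[x][att], data[y][att]));
            result[a] = new int[numInstances];
            for (int i = 0; i < numInstances; i++) {
                result[a][i] = idx[i];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = { {3.0, 1.0}, {1.0, 2.0}, {2.0, 0.0} };
        int[][] sorted = sortedIndices(data);
        System.out.println(java.util.Arrays.toString(sorted[0]));
        System.out.println(java.util.Arrays.toString(sorted[1]));
    }
}
```

The result holds one int per instance and attribute, which is the memory cost mentioned above; in exchange, the sort is paid once rather than at every node.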

2. Recursive exit conditions

J48 stops splitting under five conditions:

(1) All instances belong to the same class (in selectModel)

(2) The number of instances is less than 2 * minNoObj (in selectModel)

(3) The information gain produced by a split is 0 (in selectModel)

(4) When splitting on a discrete attribute, fewer than two bags contain at least minNoObj instances (in the splitter)

(5) When splitting on a continuous attribute, the number of valid instances is less than 2 * minNoObj (in the splitter)

REPTree has four stopping conditions:

(1) The number of training instances is less than 2 * minNum

(2) If the class is nominal, all instances belong to the same class

(3) If the class is numeric, the variance is smaller than a given threshold

(4) The maximum depth is reached

We can see that the main difference is that REPTree uses the variance to decide whether to stop splitting when the class is numeric.
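REPTree's stopping conditions can be combined into a single check, sketched below. StopSketch and shouldStop are hypothetical names and the signature is invented for illustration. The nominal-class condition (all instances in one class) and the convention that maxDepth = -1 means "no limit" are assumptions of this sketch.

```java
// Hypothetical sketch of a combined stopping check; not WEKA's method.
class StopSketch {

    static boolean shouldStop(double totalWeight, double minNum,
                              boolean nominalClass, double[] classCounts,
                              double variance, double minVariance,
                              int depth, int maxDepth) {
        // (1) Too few training instances to produce two valid branches.
        if (totalWeight < 2 * minNum) {
            return true;
        }
        if (nominalClass) {
            // (2) Assumed condition: every instance falls into a single class.
            double sum = 0;
            double max = 0;
            for (double c : classCounts) {
                sum += c;
                max = Math.max(max, c);
            }
            if (max == sum) {
                return true;
            }
        } else if (variance < minVariance) {
            // (3) Numeric class: the target barely varies any more.
            return true;
        }
        // (4) Depth limit reached; with maxDepth = -1 this never matches.
        return depth == maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(shouldStop(3, 2, true, new double[] {2, 1}, 0, 0, 0, -1));
        System.out.println(shouldStop(10, 2, true, new double[] {6, 4}, 0, 0, 1, -1));
    }
}
```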

3. Node selection method

J48 uses the information gain ratio, whereas REPTree uses information gain.
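To make the difference between the two criteria concrete, here is a minimal sketch that computes both from a table of class counts per branch. SplitCriteria and its methods are hypothetical and not taken from WEKA; counts[b][k] is assumed to hold the number of instances of class k that fall into branch b.

```java
// Hypothetical sketch comparing information gain with gain ratio.
class SplitCriteria {

    static double sum(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Entropy in bits of a vector of counts.
    static double entropy(double[] counts) {
        double total = sum(counts);
        if (total <= 0) {
            return 0;
        }
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Information gain: entropy before the split minus weighted entropy after it.
    static double infoGain(double[][] counts) {
        double[] classTotals = new double[counts[0].length];
        for (double[] branch : counts) {
            for (int k = 0; k < branch.length; k++) {
                classTotals[k] += branch[k];
            }
        }
        double total = sum(classTotals);
        double gain = entropy(classTotals);
        for (double[] branch : counts) {
            gain -= sum(branch) / total * entropy(branch);
        }
        return gain;
    }

    // Gain ratio: information gain divided by the entropy of the branch sizes.
    static double gainRatio(double[][] counts) {
        double[] branchSizes = new double[counts.length];
        for (int b = 0; b < counts.length; b++) {
            branchSizes[b] = sum(counts[b]);
        }
        double splitInfo = entropy(branchSizes);
        return splitInfo > 0 ? infoGain(counts) / splitInfo : 0;
    }

    public static void main(String[] args) {
        double[][] twoWay = { {4, 0}, {0, 4} };
        double[][] fourWay = { {2, 0}, {2, 0}, {0, 2}, {0, 2} };
        System.out.println(infoGain(twoWay) + " " + gainRatio(twoWay));
        System.out.println(infoGain(fourWay) + " " + gainRatio(fourWay));
    }
}
```

Both example splits separate the classes perfectly and so have the same information gain, but the four-way split has a lower gain ratio because it fragments the data into more branches.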

4. Pruning and backfit

J48's pruning is more complex and consists of two operations, collapse() and prune(). REPTree's pruning is logically equivalent to J48's collapse operation alone; it applies no more aggressive pruning strategy to the subtrees.

J48 has no backfit, whereas REPTree does. This is because J48's own classifyInstance process does not rely on a stored distribution of the sample set, while REPTree relies on the base class's classifyInstance and therefore has to store a distribution itself. Backfit is used to prevent overfitting.






