WEKA source notes: the ID3 algorithm

Source: Internet
Author: User
Tags: id3, split

ID3 is one of the most basic decision tree algorithms. It selects the splitting attribute by information gain, and performs no pruning.

buildClassifier:

  public void buildClassifier(Instances data) throws Exception {

    // Can the classifier handle the data?
    getCapabilities().testWithFail(data);

    // Remove instances with missing class
    data = new Instances(data);
    data.deleteWithMissingClass();

    // Recursively build the decision tree
    makeTree(data);
  }

There is not much to note here; the interesting work happens in the last line, the makeTree function.

makeTree:

    // Check if no instances have reached this node.
    if (data.numInstances() == 0) {
      m_Attribute = null;
      m_ClassValue = Instance.missingValue();
      m_Distribution = new double[data.numClasses()];
      return;
    }

First check whether any instances reach this node. If none do, the recursion stops here: the node becomes a leaf with a missing class value.

    // Compute attribute with maximum information gain.
    double[] infoGains = new double[data.numAttributes()];
    Enumeration attEnum = data.enumerateAttributes();
    while (attEnum.hasMoreElements()) {
      Attribute att = (Attribute) attEnum.nextElement();
      infoGains[att.index()] = computeInfoGain(data, att);
    }
    m_Attribute = data.attribute(Utils.maxIndex(infoGains));

This computes the information gain of every attribute and picks the one with the largest gain as the split attribute. Easy-to-understand code (computeInfoGain itself is covered below).
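To make the selection step concrete, here is a small self-contained sketch of entropy, information gain, and the argmax choice, on plain int arrays. This is not WEKA's code: all names (ToyInfoGain and its methods) are made up, and class labels and attribute values are assumed to be encoded as 0-based ints.

```java
// Toy re-implementation of the information-gain selection step (hypothetical
// names; not WEKA's API).
public class ToyInfoGain {

    // Entropy of a class-label column, in bits.
    static double entropy(int[] classes, int numClasses) {
        double[] counts = new double[numClasses];
        for (int c : classes) counts[c]++;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / classes.length;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // Information gain of splitting `classes` by the nominal attribute
    // values in `attr` (values 0..numValues-1).
    static double infoGain(int[] attr, int numValues, int[] classes, int numClasses) {
        double gain = entropy(classes, numClasses);
        for (int v = 0; v < numValues; v++) {
            // collect the class labels of the subset where attr == v
            int n = 0;
            for (int a : attr) if (a == v) n++;
            if (n == 0) continue;
            int[] subset = new int[n];
            int k = 0;
            for (int i = 0; i < attr.length; i++)
                if (attr[i] == v) subset[k++] = classes[i];
            gain -= ((double) n / attr.length) * entropy(subset, numClasses);
        }
        return gain;
    }

    // Index of the attribute with the largest gain (WEKA uses Utils.maxIndex).
    static int bestAttribute(int[][] attrs, int[] numValues, int[] classes, int numClasses) {
        int best = 0;
        double bestGain = -1;
        for (int a = 0; a < attrs.length; a++) {
            double g = infoGain(attrs[a], numValues[a], classes, numClasses);
            if (g > bestGain) { bestGain = g; best = a; }
        }
        return best;
    }
}
```

With labels {0,0,1,1}, an attribute {0,0,1,1} predicts the class perfectly (gain 1 bit), while {0,1,0,1} is uninformative (gain 0), so bestAttribute picks the first.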

    // Make leaf if information gain is zero.
    // Otherwise create successors.
    if (Utils.eq(infoGains[m_Attribute.index()], 0)) {
      m_Attribute = null;
      m_Distribution = new double[data.numClasses()];
      Enumeration instEnum = data.enumerateInstances();
      while (instEnum.hasMoreElements()) {
        Instance inst = (Instance) instEnum.nextElement();
        m_Distribution[(int) inst.classValue()]++;
      }
      Utils.normalize(m_Distribution);
      m_ClassValue = Utils.maxIndex(m_Distribution);
      m_ClassAttribute = data.classAttribute();
    } else {
      Instances[] splitData = splitData(data, m_Attribute);
      m_Successors = new Id3[m_Attribute.numValues()];
      for (int j = 0; j < m_Attribute.numValues(); j++) {
        m_Successors[j] = new Id3();
        m_Successors[j].makeTree(splitData[j]);
      }
    }

The first test checks whether the best information gain is 0 at this point. If it is, the node is made a leaf (typically because all remaining samples belong to the same class).

The m_Distribution computed at the start of that branch is actually not very informative in this case: when all samples in the subtree belong to one class, every other entry is 0.

If the information gain is non-zero, splitting continues. We already know which attribute to split on, so the data is partitioned by that attribute (one part per attribute value) and makeTree is called recursively on each part. The resulting subtrees are stored in m_Successors.
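The recursive structure can be sketched with plain arrays. This is a toy, not WEKA's code: to stay short it makes a leaf when all labels agree and splits on the first attribute that still varies, rather than using the information-gain choice described above; rows are {attr0, ..., classLabel} and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Skeleton of the makeTree recursion only (toy analogue, not WEKA's Id3).
public class TreeSketch {
    int splitAttr = -1;          // analogue of m_Attribute (-1 at a leaf)
    int classValue = -1;         // analogue of m_ClassValue
    TreeSketch[] successors;     // analogue of m_Successors
    final int[] numValues;       // number of values per attribute

    TreeSketch(int[] numValues) { this.numValues = numValues; }

    void makeTree(List<int[]> rows) {
        if (rows.isEmpty()) return;               // empty node stays a "missing" leaf
        classValue = rows.get(0)[numValues.length];
        boolean pure = true;
        for (int[] r : rows) pure &= r[numValues.length] == classValue;
        if (pure) return;                          // leaf: only one class left
        // pick the first attribute that still takes more than one value
        for (int a = 0; a < numValues.length && splitAttr < 0; a++)
            for (int[] r : rows)
                if (r[a] != rows.get(0)[a]) { splitAttr = a; break; }
        if (splitAttr < 0) return;                 // nothing left to split on
        // one successor per attribute value, like m_Successors in Id3
        successors = new TreeSketch[numValues[splitAttr]];
        for (int v = 0; v < successors.length; v++) {
            List<int[]> part = new ArrayList<>();
            for (int[] r : rows) if (r[splitAttr] == v) part.add(r);
            successors[v] = new TreeSketch(numValues);
            successors[v].makeTree(part);
        }
    }

    int classify(int[] attrs) {
        if (splitAttr < 0) return classValue;
        return successors[attrs[splitAttr]].classify(attrs);
    }

    // builds a tiny XOR demo dataset and trains a tree on it
    static TreeSketch xorDemo() {
        List<int[]> rows = new ArrayList<>();
        rows.add(new int[]{0, 0, 0});
        rows.add(new int[]{0, 1, 1});
        rows.add(new int[]{1, 0, 1});
        rows.add(new int[]{1, 1, 0});
        TreeSketch t = new TreeSketch(new int[]{2, 2});
        t.makeTree(rows);
        return t;
    }
}
```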

computeInfoGain:

  private double computeInfoGain(Instances data, Attribute att)
    throws Exception {

    double infoGain = computeEntropy(data);
    // If att has k values, the data is split into k parts
    Instances[] splitData = splitData(data, att);
    for (int j = 0; j < att.numValues(); j++) {
      if (splitData[j].numInstances() > 0) {
        infoGain -= ((double) splitData[j].numInstances() /
                     (double) data.numInstances()) *
          computeEntropy(splitData[j]);
      }
    }
    return infoGain;
  }

This is also easy to understand once you know the information-gain formula:

Gain(D, A) = H(D) − Σ_v (|D_v| / |D|) · H(D_v)

where H(D) is the entropy of the class distribution in D (the first line of the function) and D_v is the subset of D in which attribute A takes its v-th value.

computeEntropy:

  private double computeEntropy(Instances data) throws Exception {

    // Count how many samples each class has
    double[] classCounts = new double[data.numClasses()];
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      classCounts[(int) inst.classValue()]++;
    }
    double entropy = 0;
    for (int j = 0; j < data.numClasses(); j++) {
      // If classCounts[j] is 0, the term p_i * log(p_i) is 0
      if (classCounts[j] > 0) {
        entropy -= classCounts[j] * Utils.log2(classCounts[j]);
      }
    }
    // The loop above omits the denominator; divide by it here
    entropy /= (double) data.numInstances();
    return entropy + Utils.log2(data.numInstances());
  }

It's all in the comments.
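The only non-obvious part is the last two lines: computeEntropy relies on the rearrangement −Σ (c_i/n)·log2(c_i/n) = (−Σ c_i·log2(c_i))/n + log2(n), which is why it can accumulate c_i·log2(c_i) first, divide by n once, and add log2(n) at the end. A quick numeric check of that identity (hypothetical names, not WEKA code):

```java
// Verifies the rearrangement computeEntropy relies on:
//   -sum (c_i/n) log2(c_i/n)  ==  (-sum c_i log2 c_i) / n + log2 n
public class EntropyIdentity {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // textbook form of the entropy, from class counts and total n
    static double direct(double[] counts, double n) {
        double h = 0;
        for (double c : counts) if (c > 0) h -= (c / n) * log2(c / n);
        return h;
    }

    // the rearranged form used in computeEntropy
    static double rearranged(double[] counts, double n) {
        double h = 0;
        for (double c : counts) if (c > 0) h -= c * log2(c);
        return h / n + log2(n);
    }
}
```

For class counts {9, 5} (n = 14) both forms give the familiar ≈0.940 bits.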

splitData:

  private Instances[] splitData(Instances data, Attribute att) {

    Instances[] splitData = new Instances[att.numValues()];
    for (int j = 0; j < att.numValues(); j++) {
      // Initialize the subsets; no data has been copied into splitData yet!
      splitData[j] = new Instances(data, data.numInstances());
    }
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      // inst.value(att) returns the index of inst's value for this attribute
      splitData[(int) inst.value(att)].add(inst);
    }
    for (int i = 0; i < splitData.length; i++) {
      splitData[i].compactify();
    }
    return splitData;
  }

This part is straightforward: create an Instances array, then drop each instance into its bucket. The one subtle point is inst.value(att): for a nominal attribute it returns the value's index, so every attribute value is already mapped to 0..k−1 and can be used directly as an array index.
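The reason (int) inst.value(att) works as an array index is that WEKA stores a nominal value internally as the index of that value in the attribute's declared list of values. A toy analogue with strings (hypothetical names, not WEKA's API):

```java
import java.util.ArrayList;
import java.util.List;

// Toy analogue of splitData's bucketing for a single nominal column.
public class NominalBuckets {
    // WEKA-style: map a nominal string to its index in the declared value list
    static int valueIndex(String[] declared, String v) {
        for (int i = 0; i < declared.length; i++)
            if (declared[i].equals(v)) return i;
        throw new IllegalArgumentException("unknown value: " + v);
    }

    // splitData analogue: one bucket per declared value, empty buckets allowed
    static List<List<String>> split(String[] declared, String[] column) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < declared.length; i++) buckets.add(new ArrayList<>());
        for (String v : column) buckets.get(valueIndex(declared, v)).add(v);
        return buckets;
    }
}
```

Note that a value absent from the column still gets a (possibly empty) bucket, just as splitData allocates one Instances per attribute value.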

The compactify() call trims each subset's internal storage down to the number of instances it actually holds, since the subsets were allocated with the full dataset's capacity.

WEKA's Id3 basically consists of these functions. My strongest impression is that it only handles nominal data: getCapabilities() enables nominal attributes only, so numeric attributes are rejected outright and have to be discretized first (or you use a tree such as J48 that handles them natively).
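Since the algorithm only splits on nominal values, numeric columns are typically discretized before training. A minimal equal-width binning sketch (hypothetical names; WEKA itself provides a Discretize filter for this):

```java
// Equal-width discretization of a numeric column into `bins` nominal values.
public class EqualWidthBins {
    // maps each value to a bin index in [0, bins)
    static int[] discretize(double[] values, int bins) {
        double min = values[0], max = values[0];
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double width = (max - min) / bins;
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            int b = width == 0 ? 0 : (int) ((values[i] - min) / width);
            out[i] = Math.min(b, bins - 1);   // clamp the maximum into the last bin
        }
        return out;
    }
}
```

The resulting 0-based bin indices can then be treated exactly like the nominal value indices the tree code above expects.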

