WEKA algorithm classifier-meta-bagging source code analysis



Bagging is relatively simple, so the algorithm description and the code analysis are presented together.


I. Bagging Algorithm

Strictly speaking, bagging is not a classification algorithm itself. Like boosting, it is a way of combining base classifiers: it uses multiple base classifiers to obtain a more powerful classifier. Its core idea is sampling with replacement.


Training process of the Bagging Algorithm:

1. From a training set of m samples, draw m samples with replacement.

2. Use these m samples to train a base classifier C.

3. Repeat this process X times to obtain X base classifiers.


The prediction process of the Bagging Algorithm:

1. For a new input instance A, run each of the X classifiers to obtain a list of predictions.

2. If the class attribute is numeric (regression), return the arithmetic mean of the list.

3. If the class attribute is nominal (classification), take a vote over the list and return the class with the most votes.
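The training and prediction steps above can be sketched as follows. This is a minimal self-contained illustration, not the WEKA code; the class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal bagging sketch (illustrative only -- not the WEKA implementation).
public class BaggingSketch {

  // Training step 1: draw data.size() samples with replacement (a bootstrap bag).
  static List<double[]> bootstrap(List<double[]> data, Random rnd) {
    List<double[]> bag = new ArrayList<>();
    for (int i = 0; i < data.size(); i++) {
      bag.add(data.get(rnd.nextInt(data.size())));
    }
    return bag;
  }

  // Prediction, numeric class: arithmetic mean of the base predictions.
  static double averageVote(double[] predictions) {
    double sum = 0;
    for (double p : predictions) {
      sum += p;
    }
    return sum / predictions.length;
  }

  // Prediction, nominal class: majority vote over predicted class indices.
  static int majorityVote(int[] predictions, int numClasses) {
    int[] counts = new int[numClasses];
    for (int p : predictions) {
      counts[p]++;
    }
    int best = 0;
    for (int c = 1; c < numClasses; c++) {
      if (counts[c] > counts[best]) {
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(averageVote(new double[] {1.0, 2.0, 3.0})); // 2.0
    System.out.println(majorityVote(new int[] {0, 1, 1, 2, 1}, 3)); // 1
  }
}
```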


II. WEKA Code Implementation

(1) Base Classifier

The default base classifier in WEKA's Bagging is REPTree, i.e. a "fast decision tree learner". What is that? I will write a separate article to analyze it later.

  public Bagging() {
    m_Classifier = new weka.classifiers.trees.REPTree();
  }


(2) Build process: buildClassifier

The whole buildClassifier method revolves around m_CalcOutOfBag. The m_CalcOutOfBag flag indicates whether to calculate the out-of-bag error rate.

If we draw m samples from a training set that also contains m samples, some instances are certain to be missed (why? because the sampling is done with replacement). This flag controls whether the accuracy on those undrawn instances is evaluated; without it, that accuracy is simply not computed.
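As a quick sanity check on this "some samples are never drawn" claim: the probability that a particular instance is missed by all m draws is (1 - 1/m)^m, which approaches 1/e ≈ 0.368 as m grows, so roughly a third of the data ends up out-of-bag. A small standalone computation (my own illustration, not part of the WEKA code):

```java
public class OutOfBagFraction {
  // Probability that a given instance is never selected in m draws with replacement.
  static double missProbability(int m) {
    return Math.pow(1.0 - 1.0 / m, m);
  }

  public static void main(String[] args) {
    for (int m : new int[] {10, 100, 1000}) {
      System.out.printf("m=%d  P(miss)=%.4f%n", m, missProbability(m));
    }
    // Converges to 1/e ~= 0.3679 as m grows.
  }
}
```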

    if (m_CalcOutOfBag && (m_BagSizePercent != 100)) {
      throw new IllegalArgumentException("Bag size needs to be 100% if "
          + "out-of-bag error is to be calculated!");
    }

    int bagSize = data.numInstances() * m_BagSizePercent / 100;
    Random random = new Random(m_Seed);

    boolean[][] inBag = null;
    if (m_CalcOutOfBag)
      inBag = new boolean[m_Classifiers.length][];

    for (int j = 0; j < m_Classifiers.length; j++) {
      Instances bagData = null;

      // create the in-bag dataset
      if (m_CalcOutOfBag) {
        inBag[j] = new boolean[data.numInstances()];
        // bagData = resampleWithWeights(data, random, inBag[j]);
        bagData = data.resampleWithWeights(random, inBag[j]);
      } else {
        bagData = data.resampleWithWeights(random);
        if (bagSize < data.numInstances()) {
          bagData.randomize(random);
          Instances newBagData = new Instances(bagData, 0, bagSize);
          bagData = newBagData;
        }
      }
This part does the sampling. First, if m_CalcOutOfBag is set, the bag size must be 100%.

Second, the bag size is computed.

The inBag array records which instances in the dataset were drawn and which were not.

data.resampleWithWeights performs the sampling with replacement.

      if (m_Classifier instanceof Randomizable) {
        ((Randomizable) m_Classifiers[j]).setSeed(random.nextInt());
      }

      // build the classifier
      m_Classifiers[j].buildClassifier(bagData);
The next step is to build each tree by calling the buildClassifier method of the concrete base classifier.

The last step is the out-of-bag calculation. I have added comments to the code.

    if (getCalcOutOfBag()) { // only calculated if this flag is set
      double outOfBagCount = 0.0; // accumulated weight
      double errorSum = 0.0;      // accumulated error
      boolean numeric = data.classAttribute().isNumeric(); // is the class a continuous value?

      for (int i = 0; i < data.numInstances(); i++) {
        double vote;    // the voting result
        double[] votes; // the votes
        if (numeric)
          votes = new double[1]; // for numeric classes we take the mean; one array slot is enough
        else
          votes = new double[data.numClasses()]; // otherwise one vote slot per class is needed

        // determine predictions for instance
        int voteCount = 0;
        for (int j = 0; j < m_Classifiers.length; j++) {
          if (inBag[j][i])
            continue; // skip instances that were in the bag -- we are computing out-of-bag
          voteCount++; // record how many classifiers voted
          if (numeric) {
            votes[0] += m_Classifiers[j].classifyInstance(data.instance(i)); // numeric: accumulate predictions directly
          } else {
            double[] newProbs = m_Classifiers[j].distributionForInstance(data.instance(i));
            for (int k = 0; k < newProbs.length; k++) {
              votes[k] += newProbs[k]; // nominal: accumulate the probability of every class
            }
          }
        }

        // "vote"
        if (numeric) {
          vote = votes[0];
          if (voteCount > 0) {
            vote /= voteCount; // mean of the numeric predictions
          }
        } else {
          if (Utils.eq(Utils.sum(votes), 0)) {
          } else {
            Utils.normalize(votes); // normalization
          }
          vote = Utils.maxIndex(votes); // pick the index with the most votes
        }

        outOfBagCount += data.instance(i).weight(); // accumulate the weight
        if (numeric) {
          errorSum += StrictMath.abs(vote - data.instance(i).classValue())
              * data.instance(i).weight(); // accumulate the absolute error
        } else {
          if (vote != data.instance(i).classValue())
            errorSum += data.instance(i).weight(); // for nominal classes, count the errors
        }
      }

      m_OutOfBagError = errorSum / outOfBagCount; // final average
    } else {
      m_OutOfBagError = 0; // without the flag, nothing is computed
    }


III. Weighted sampling with replacement

That is, data.resampleWithWeights(random, inBag[j]). This method is quite interesting; let's take a look at it.

There are three overloaded forms, and the first two both call the third:

  public Instances resampleWithWeights(Random random, double[] weights) {
    return resampleWithWeights(random, weights, null);
  }

  public Instances resampleWithWeights(Random random, boolean[] sampled) {
    double[] weights = new double[numInstances()];
    for (int i = 0; i < weights.length; i++) {
      weights[i] = instance(i).weight();
    }
    return resampleWithWeights(random, weights, sampled);
  }


  public Instances resampleWithWeights(Random random, double[] weights,
      boolean[] sampled) {

    if (weights.length != numInstances()) {
      throw new IllegalArgumentException("weights.length != numInstances.");
    }

    Instances newData = new Instances(this, numInstances());
    if (numInstances() == 0) {
      return newData;
    }

    // Walker's method, see pp. 232 of "Stochastic Simulation" by B.D. Ripley
    double[] P = new double[weights.length];
    System.arraycopy(weights, 0, P, 0, weights.length);
    Utils.normalize(P);
    double[] Q = new double[weights.length];
    int[] A = new int[weights.length];
    int[] W = new int[weights.length];
    int M = weights.length;
    int NN = -1;
    int NP = M;
    for (int I = 0; I < M; I++) {
      if (P[I] < 0) {
        throw new IllegalArgumentException("Weights have to be positive.");
      }
      Q[I] = M * P[I];
      if (Q[I] < 1.0) {
        W[++NN] = I;
      } else {
        W[--NP] = I;
      }
    }
    if (NN > -1 && NP < M) {
      for (int S = 0; S < M - 1; S++) {
        int I = W[S];
        int J = W[NP];
        A[I] = J;
        Q[J] += Q[I] - 1.0;
        if (Q[J] < 1.0) {
          NP++;
        }
        if (NP >= M) {
          break;
        }
      }
      // A[W[M]] = W[M];
    }
    for (int I = 0; I < M; I++) {
      Q[I] += I;
    }
    for (int i = 0; i < numInstances(); i++) {
      int ALRV;
      double U = M * random.nextDouble();
      int I = (int) U;
      if (U < Q[I]) {
        ALRV = I;
      } else {
        ALRV = A[I];
      }
      newData.add(instance(ALRV));
      if (sampled != null) {
        sampled[ALRV] = true;
      }
      newData.instance(newData.numInstances() - 1).setWeight(1);
    }
    return newData;
  }

The comment refers to:
Walker's method, see pp. 232 of "Stochastic Simulation" by B.D. Ripley
I searched for a long time and still could not figure out what this algorithm is; the code has no comments, and I don't understand it at all. I will try to add an analysis of this function next time.
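For what it's worth, this appears to be Walker's alias method for sampling from a discrete weighted distribution: after O(n) setup, each draw costs O(1). The idea is to scale the probabilities so the average cell value is 1, then pair each under-full cell with an over-full "alias" so every cell holds at most two outcomes. Below is my own standalone reconstruction of the technique, not the WEKA code (all names here are invented for the sketch):

```java
import java.util.Random;

// Walker's alias method: O(n) setup, O(1) per draw.
public class AliasSampler {
  private final double[] prob; // scaled acceptance probability per cell
  private final int[] alias;   // fallback outcome per cell
  private final Random rnd;

  public AliasSampler(double[] weights, Random rnd) {
    int n = weights.length;
    this.rnd = rnd;
    prob = new double[n];
    alias = new int[n];

    // Normalize and scale so the average cell value is 1.
    double total = 0;
    for (double w : weights) {
      total += w;
    }
    double[] q = new double[n];
    int[] small = new int[n], large = new int[n];
    int ns = 0, nl = 0;
    for (int i = 0; i < n; i++) {
      q[i] = weights[i] * n / total;
      if (q[i] < 1.0) small[ns++] = i; else large[nl++] = i;
    }

    // Pair each under-full cell with an over-full one.
    while (ns > 0 && nl > 0) {
      int s = small[--ns], l = large[--nl];
      prob[s] = q[s];
      alias[s] = l;
      q[l] = q[l] + q[s] - 1.0; // l donates mass to fill cell s
      if (q[l] < 1.0) small[ns++] = l; else large[nl++] = l;
    }
    while (nl > 0) prob[large[--nl]] = 1.0;
    while (ns > 0) prob[small[--ns]] = 1.0;
  }

  // One O(1) draw: pick a cell uniformly, then accept it or its alias.
  public int next() {
    int i = rnd.nextInt(prob.length);
    return rnd.nextDouble() < prob[i] ? i : alias[i];
  }

  public static void main(String[] args) {
    AliasSampler s = new AliasSampler(new double[] {1.0, 3.0}, new Random(42));
    int ones = 0;
    for (int i = 0; i < 1_000_000; i++) {
      if (s.next() == 1) ones++;
    }
    // Index 1 carries 3/4 of the weight, so the fraction should be near 0.75.
    System.out.println("fraction of index 1: " + ones / 1_000_000.0);
  }
}
```

WEKA's version additionally shifts Q[I] by I so the uniform draw U and the table lookup share one random number, but the underlying scheme looks like the same alias construction.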








