WEKA algorithm classifier-meta-bagging source code analysis



Bagging is relatively simple, so the algorithm description and the code analysis are presented together.


I. Bagging Algorithm

Strictly speaking, bagging is not a classification algorithm itself. Like boosting, it is a way of combining base classifiers: it uses multiple base classifiers to obtain a more powerful classifier. Its core idea is sampling with replacement.


Training process of the Bagging Algorithm:

1. From a training set of m samples, draw m samples with replacement.

2. Use these m samples to train a base classifier C.

3. Repeat this process X times to obtain X base classifiers.


The prediction process of the Bagging Algorithm:

1. For a new input instance A, run each of the X classifiers to obtain a list of predictions.

2. If the class attribute is numeric (regression), return the arithmetic mean of the list.

3. If the class attribute is nominal (classification), take a vote over the list and return the class with the most votes.
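The training and prediction steps above can be sketched as follows. This is a minimal self-contained illustration, not the WEKA code; the class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal bagging sketch (illustrative only -- not the WEKA implementation).
public class BaggingSketch {

  // Training step 1: draw data.size() samples with replacement (a bootstrap bag).
  static List<double[]> bootstrap(List<double[]> data, Random rnd) {
    List<double[]> bag = new ArrayList<>();
    for (int i = 0; i < data.size(); i++) {
      bag.add(data.get(rnd.nextInt(data.size())));
    }
    return bag;
  }

  // Prediction, numeric class: arithmetic mean of the base predictions.
  static double averageVote(double[] predictions) {
    double sum = 0;
    for (double p : predictions) {
      sum += p;
    }
    return sum / predictions.length;
  }

  // Prediction, nominal class: majority vote over predicted class indices.
  static int majorityVote(int[] predictions, int numClasses) {
    int[] counts = new int[numClasses];
    for (int p : predictions) {
      counts[p]++;
    }
    int best = 0;
    for (int c = 1; c < numClasses; c++) {
      if (counts[c] > counts[best]) {
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(averageVote(new double[] {1.0, 2.0, 3.0})); // 2.0
    System.out.println(majorityVote(new int[] {0, 1, 1, 2, 1}, 3)); // 1
  }
}
```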


II. WEKA Code Implementation

(1) Base Classifier

The default base classifier in WEKA's Bagging is REPTree, i.e. a "fast decision tree learner". What is that? I will write a separate article to analyze it later.

  public Bagging() {
    m_Classifier = new weka.classifiers.trees.REPTree();
  }


(2) Build process: buildClassifier

The whole buildClassifier method revolves around m_CalcOutOfBag. The m_CalcOutOfBag flag indicates whether to calculate the out-of-bag error rate.

If we draw m samples from a training set that also contains m samples, some instances are certain to be missed (why? because the sampling is done with replacement). This flag controls whether the accuracy on those undrawn instances is evaluated; without it, that accuracy is simply not computed.
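As a quick sanity check on this "some samples are never drawn" claim: the probability that a particular instance is missed by all m draws is (1 - 1/m)^m, which approaches 1/e ≈ 0.368 as m grows, so roughly a third of the data ends up out-of-bag. A small standalone computation (my own illustration, not part of the WEKA code):

```java
public class OutOfBagFraction {
  // Probability that a given instance is never selected in m draws with replacement.
  static double missProbability(int m) {
    return Math.pow(1.0 - 1.0 / m, m);
  }

  public static void main(String[] args) {
    for (int m : new int[] {10, 100, 1000}) {
      System.out.printf("m=%d  P(miss)=%.4f%n", m, missProbability(m));
    }
    // Converges to 1/e ~= 0.3679 as m grows.
  }
}
```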

    if (m_CalcOutOfBag && (m_BagSizePercent != 100)) {
      throw new IllegalArgumentException("Bag size needs to be 100% if "
          + "out-of-bag error is to be calculated!");
    }

    int bagSize = data.numInstances() * m_BagSizePercent / 100;
    Random random = new Random(m_Seed);

    boolean[][] inBag = null;
    if (m_CalcOutOfBag)
      inBag = new boolean[m_Classifiers.length][];

    for (int j = 0; j < m_Classifiers.length; j++) {
      Instances bagData = null;

      // create the in-bag dataset
      if (m_CalcOutOfBag) {
        inBag[j] = new boolean[data.numInstances()];
        // bagData = resampleWithWeights(data, random, inBag[j]);
        bagData = data.resampleWithWeights(random, inBag[j]);
      } else {
        bagData = data.resampleWithWeights(random);
        if (bagSize < data.numInstances()) {
          bagData.randomize(random);
          Instances newBagData = new Instances(bagData, 0, bagSize);
          bagData = newBagData;
        }
      }
This part does the sampling. First, if m_CalcOutOfBag is set, the bag size must be 100%.

Second, the bag size is computed.

The inBag array records which instances in the dataset were drawn and which were not.

data.resampleWithWeights performs the sampling with replacement.

      if (m_Classifier instanceof Randomizable) {
        ((Randomizable) m_Classifiers[j]).setSeed(random.nextInt());
      }

      // build the classifier
      m_Classifiers[j].buildClassifier(bagData);
The next step is to build each tree by calling the buildClassifier method of the concrete base classifier.

The last step is the out-of-bag calculation. I have added comments to the code.

    if (getCalcOutOfBag()) { // only calculated if this flag is set
      double outOfBagCount = 0.0; // accumulated weight
      double errorSum = 0.0;      // accumulated error
      boolean numeric = data.classAttribute().isNumeric(); // is the class a continuous value?

      for (int i = 0; i < data.numInstances(); i++) {
        double vote;    // the voting result
        double[] votes; // the votes
        if (numeric)
          votes = new double[1]; // for numeric classes we take the mean; one array slot is enough
        else
          votes = new double[data.numClasses()]; // otherwise one vote slot per class is needed

        // determine predictions for instance
        int voteCount = 0;
        for (int j = 0; j < m_Classifiers.length; j++) {
          if (inBag[j][i])
            continue; // skip instances that were in the bag -- we are computing out-of-bag
          voteCount++; // record how many classifiers voted
          if (numeric) {
            votes[0] += m_Classifiers[j].classifyInstance(data.instance(i)); // numeric: accumulate predictions directly
          } else {
            double[] newProbs = m_Classifiers[j].distributionForInstance(data.instance(i));
            for (int k = 0; k < newProbs.length; k++) {
              votes[k] += newProbs[k]; // nominal: accumulate the probability of every class
            }
          }
        }

        // "vote"
        if (numeric) {
          vote = votes[0];
          if (voteCount > 0) {
            vote /= voteCount; // mean of the numeric predictions
          }
        } else {
          if (Utils.eq(Utils.sum(votes), 0)) {
          } else {
            Utils.normalize(votes); // normalization
          }
          vote = Utils.maxIndex(votes); // pick the index with the most votes
        }

        outOfBagCount += data.instance(i).weight(); // accumulate the weight
        if (numeric) {
          errorSum += StrictMath.abs(vote - data.instance(i).classValue())
              * data.instance(i).weight(); // accumulate the absolute error
        } else {
          if (vote != data.instance(i).classValue())
            errorSum += data.instance(i).weight(); // for nominal classes, count the errors
        }
      }

      m_OutOfBagError = errorSum / outOfBagCount; // final average
    } else {
      m_OutOfBagError = 0; // without the flag, nothing is computed
    }


III. Weighted sampling with replacement

That is, data.resampleWithWeights(random, inBag[j]). This method is quite interesting; let's take a look at it.

There are three overloaded forms, and the first two both call the third:

  public Instances resampleWithWeights(Random random, double[] weights) {
    return resampleWithWeights(random, weights, null);
  }

  public Instances resampleWithWeights(Random random, boolean[] sampled) {
    double[] weights = new double[numInstances()];
    for (int i = 0; i < weights.length; i++) {
      weights[i] = instance(i).weight();
    }
    return resampleWithWeights(random, weights, sampled);
  }


  public Instances resampleWithWeights(Random random, double[] weights,
      boolean[] sampled) {

    if (weights.length != numInstances()) {
      throw new IllegalArgumentException("weights.length != numInstances.");
    }

    Instances newData = new Instances(this, numInstances());
    if (numInstances() == 0) {
      return newData;
    }

    // Walker's method, see pp. 232 of "Stochastic Simulation" by B.D. Ripley
    double[] P = new double[weights.length];
    System.arraycopy(weights, 0, P, 0, weights.length);
    Utils.normalize(P);
    double[] Q = new double[weights.length];
    int[] A = new int[weights.length];
    int[] W = new int[weights.length];
    int M = weights.length;
    int NN = -1;
    int NP = M;
    for (int I = 0; I < M; I++) {
      if (P[I] < 0) {
        throw new IllegalArgumentException("Weights have to be positive.");
      }
      Q[I] = M * P[I];
      if (Q[I] < 1.0) {
        W[++NN] = I;
      } else {
        W[--NP] = I;
      }
    }
    if (NN > -1 && NP < M) {
      for (int S = 0; S < M - 1; S++) {
        int I = W[S];
        int J = W[NP];
        A[I] = J;
        Q[J] += Q[I] - 1.0;
        if (Q[J] < 1.0) {
          NP++;
        }
        if (NP >= M) {
          break;
        }
      }
      // A[W[M]] = W[M];
    }
    for (int I = 0; I < M; I++) {
      Q[I] += I;
    }
    for (int i = 0; i < numInstances(); i++) {
      int ALRV;
      double U = M * random.nextDouble();
      int I = (int) U;
      if (U < Q[I]) {
        ALRV = I;
      } else {
        ALRV = A[I];
      }
      newData.add(instance(ALRV));
      if (sampled != null) {
        sampled[ALRV] = true;
      }
      newData.instance(newData.numInstances() - 1).setWeight(1);
    }
    return newData;
  }

The comment refers to:
Walker's method, see pp. 232 of "Stochastic Simulation" by B.D. Ripley
I searched for a long time and still could not figure out what this algorithm is; the code has no comments, and I don't understand it at all. I will try to add an analysis of this function next time.
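For what it's worth, this appears to be Walker's alias method for sampling from a discrete weighted distribution: after O(n) setup, each draw costs O(1). The idea is to scale the probabilities so the average cell value is 1, then pair each under-full cell with an over-full "alias" so every cell holds at most two outcomes. Below is my own standalone reconstruction of the technique, not the WEKA code (all names here are invented for the sketch):

```java
import java.util.Random;

// Walker's alias method: O(n) setup, O(1) per draw.
public class AliasSampler {
  private final double[] prob; // scaled acceptance probability per cell
  private final int[] alias;   // fallback outcome per cell
  private final Random rnd;

  public AliasSampler(double[] weights, Random rnd) {
    int n = weights.length;
    this.rnd = rnd;
    prob = new double[n];
    alias = new int[n];

    // Normalize and scale so the average cell value is 1.
    double total = 0;
    for (double w : weights) {
      total += w;
    }
    double[] q = new double[n];
    int[] small = new int[n], large = new int[n];
    int ns = 0, nl = 0;
    for (int i = 0; i < n; i++) {
      q[i] = weights[i] * n / total;
      if (q[i] < 1.0) small[ns++] = i; else large[nl++] = i;
    }

    // Pair each under-full cell with an over-full one.
    while (ns > 0 && nl > 0) {
      int s = small[--ns], l = large[--nl];
      prob[s] = q[s];
      alias[s] = l;
      q[l] = q[l] + q[s] - 1.0; // l donates mass to fill cell s
      if (q[l] < 1.0) small[ns++] = l; else large[nl++] = l;
    }
    while (nl > 0) prob[large[--nl]] = 1.0;
    while (ns > 0) prob[small[--ns]] = 1.0;
  }

  // One O(1) draw: pick a cell uniformly, then accept it or its alias.
  public int next() {
    int i = rnd.nextInt(prob.length);
    return rnd.nextDouble() < prob[i] ? i : alias[i];
  }

  public static void main(String[] args) {
    AliasSampler s = new AliasSampler(new double[] {1.0, 3.0}, new Random(42));
    int ones = 0;
    for (int i = 0; i < 1_000_000; i++) {
      if (s.next() == 1) ones++;
    }
    // Index 1 carries 3/4 of the weight, so the fraction should be near 0.75.
    System.out.println("fraction of index 1: " + ones / 1_000_000.0);
  }
}
```

WEKA's version additionally shifts Q[I] by I so the uniform draw U and the table lookup share one random number, but the underlying scheme looks like the same alias construction.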








