AdaBoost Algorithm in R Data Analysis

Rattle Implementation of the AdaBoost Algorithm

Boosting is a simple, efficient, and easy-to-use modeling method. AdaBoost (Adaptive Boosting) is often referred to as the best off-the-shelf classifier in the world.

The boosting algorithm uses a weak learning algorithm to build a series of models, adjusting the weights of the observations that affect the classification as it moves from one model to the next: observations that are hard to classify receive larger weights. The final model is a weighted combination of the whole series, with each model weighted according to its score. Note that noisy data or an overly complex weak classifier can cause boosting to fail.

Boosting is somewhat similar to random forests in that it builds an ensemble whose final model performs better than any of the weak classifiers it combines. Unlike a random forest, however, each new tree is built to refine the previous model: after a model is established, any samples it misclassified have their weights raised (boosted). Raising a sample's weight effectively emphasizes it in the data set, as if that single sample had been observed many times. The goal is to make the next model more likely to classify that sample correctly; if the sample is still misclassified, its weight is raised again.

Compared with random forests, boosting is more flexible: any modeling method can serve as the learning algorithm, although decision trees are the most common choice.

1. Boosting Overview

A boosting model usually consists of a set of decision trees as its basic form of knowledge representation, and the key point is how the individual decisions are combined. Boosting combines them with a weighted score: each model is assigned a weight.
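
In standard AdaBoost notation, with weak classifiers f_1, ..., f_M, model weights A_1, ..., A_M (the same A computed in the worked example below), and the two classes coded as -1 and +1, the final classifier is the sign of the weighted vote:

    F(x) = sign( A_1*f_1(x) + A_2*f_2(x) + ... + A_M*f_M(x) )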

2. Algorithm

As a meta-learner, boosting uses a simple learning algorithm to build many models. It relies on a weak learning algorithm, and essentially any weak classifier can be used: a series of weak classification models is combined into a strong classifier.

A weak classifier only needs an error rate slightly better than random guessing, yet the combination can achieve considerable classification accuracy.

The algorithm begins by building an initial weak model from the training data. Every sample is given the same weight at the start, for example a weight of 1. After each model is built, the misclassified samples in the training data are promoted: their weights are updated by a formula and raised above their starting value.

These promoted samples are used to build the next model; because the problem samples now carry larger weights, the new model pays more attention to them.

We can illustrate the process with a simple example. Suppose there are 10 samples, each with an initial weight of 0.1. We build a decision tree and four samples are misclassified (samples 7, 8, 9, and 10). The sum of the weights of the misclassified samples is E = 0.4; this is the measure of the model's error rate. E is used to update the weights through the transformed value A = 0.5 * log((1 - E) / E), using the natural logarithm; each misclassified sample's new weight is its old weight multiplied by e^A. In our example A = 0.2027, so the new weight of samples 7, 8, 9, and 10 is 0.1 * e^A ≈ 0.1225.

Suppose the next model misclassifies samples 1 and 8, which now weigh 0.1 and 0.1225; the new E is 0.2225 and the new A is about 0.6256, so the weight of sample 1 becomes 0.1 * e^A ≈ 0.1869 and the weight of sample 8 becomes 0.1225 * e^A ≈ 0.2290. We can see that the weight of sample 8 has been raised yet again. The procedure keeps running until the error rate of a single tree exceeds 50% or the requested number of trees has been built.
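
The arithmetic above is easy to verify in R. This is a minimal sketch of just the weight-update step, not Rattle's or the ada package's internal code; the misclassified sample indices are the ones assumed in the example.

    # Minimal sketch of the AdaBoost weight updates from the worked example.
    w     <- rep(0.1, 10)               # 10 samples, initial weight 0.1 each
    wrong <- c(7, 8, 9, 10)             # samples misclassified by tree 1

    E <- sum(w[wrong])                  # weighted error: 0.4
    A <- 0.5 * log((1 - E) / E)         # model weight: ~0.2027 (natural log)
    w[wrong] <- w[wrong] * exp(A)       # boosted weights: ~0.1225

    wrong2 <- c(1, 8)                   # samples misclassified by tree 2
    E2 <- sum(w[wrong2])                # 0.1 + 0.1225 = 0.2225
    A2 <- 0.5 * log((1 - E2) / E2)      # ~0.6256
    w[wrong2] <- w[wrong2] * exp(A2)    # sample 1: ~0.1869, sample 8: ~0.2290

    round(w, 4)

Running the last line prints the final weight vector, confirming the values derived by hand.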

3. Experimental Example

Building a model using Rattle

There is a Boost option on Rattle's Model tab; the individual decision trees are built with rpart(). The resulting model information is printed to the text view. We use the weather data set (click the Execute button on the Data tab to load it automatically).

The text view begins with the call that built the model, showing its basic information:

The target variable of the model is RainTomorrow; data= names the training data; the control= parameter is passed straight through to rpart(); iter= is the number of trees to build; and the loss function is exponential.
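
Rattle generates R code behind the scenes; the following is a sketch of the kind of ada() call it issues. The dropped columns and the rpart.control() values here are illustrative assumptions, not necessarily Rattle's exact defaults.

    library(rattle)   # provides the weather data set
    library(ada)      # boosted trees via ada()
    library(rpart)    # rpart.control()

    data(weather)
    # Drop the identifier and outcome-leakage columns before modelling.
    vars  <- setdiff(names(weather), c("Date", "Location", "RISK_MM"))
    model <- ada(RainTomorrow ~ ., data = weather[vars],
                 iter = 50, loss = "exponential",
                 control = rpart.control(maxdepth = 30, cp = 0.01,
                                         minsplit = 20))
    print(model)

print(model) reproduces the kind of text-view output described below.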

Performance Evaluation:

The confusion matrix summarizes the model's performance, listing correct and incorrect predictions on the training data.

Train error is the model's training error rate: 1 - (214 + 29) / (214 + 1 + 12 + 29) ≈ 0.051, that is, one minus the number of correctly predicted samples divided by the total number of samples.
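
In R, the same figure can be recomputed from the four counts quoted above (which off-diagonal cell holds the 1 and which the 12 is an assumption here, but it does not change the error rate):

    # Training error from the confusion-matrix counts above.
    confusion <- matrix(c(214, 12, 1, 29), nrow = 2,
                        dimnames = list(actual    = c("No", "Yes"),
                                        predicted = c("No", "Yes")))
    train_error <- 1 - sum(diag(confusion)) / sum(confusion)
    round(train_error, 3)   # ~0.051: about 5% of training samples misclassified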

The output also reports the error rate of the out-of-bag (OOB) method and the iteration at which it was achieved, followed by the rest of the model summary:

    Additional Estimates of number of iterations:

    train.err1 train.kap1
            48        ...

    Variables actually used in tree construction:
     [1] "Cloud3pm"      "Cloud9am"      "Evaporation"   "Humidity3pm"
     [5] "Humidity9am"   "MaxTemp"       "MinTemp"       "Pressure3pm"
     [9] "Pressure9am"   "Rainfall"      "Sunshine"      "Temp3pm"
    [13] "Temp9am"       "WindDir3pm"    "WindDir9am"    "WindGustDir"
    [17] "WindGustSpeed" "WindSpeed3pm"  "WindSpeed9am"

    Frequency of variables actually used:
       WindDir9am   WindGustDir      Sunshine    WindDir3pm   Pressure3pm
              ...           ...           ...           ...           ...
         Cloud3pm       MaxTemp       MinTemp       Temp9am  WindSpeed3pm
                8             6             6             6           ...
      Evaporation WindGustSpeed      Cloud9am   Humidity3pm   Humidity9am
                5             5             3             3             2
      Pressure9am      Rainfall       Temp3pm  WindSpeed9am
                2             2             2             1

    Time taken: 0.70 secs

Variables actually used in tree construction lists the attributes that were actually used in building the model's decision trees.

Frequency of variables actually used shows how often each attribute is used across the trees, listed from most to least frequent.

Finally, the time taken was 0.7 seconds; the data set is small, so little time is needed.

Once the model is built, the Error button on the toolbar plots the error-rate curve: as more trees are added to the model, the error rate keeps decreasing, falling quickly at first and then flattening out.

The Importance button plots the model's most important variables:

The Continue button in the lower right corner adds more trees to the trained model.
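
For reference, the same three buttons have console equivalents in the ada package; this sketch reuses the model object from the earlier call, and n.iter = 100 is an arbitrary choice:

    plot(model)      # error rate versus number of trees (the Error button)
    varplot(model)   # relative variable importance (the Importance button)

    # Add more trees to the fitted model (the Continue button); update.ada()
    # needs the predictors and the response to be supplied again.
    x <- weather[setdiff(vars, "RainTomorrow")]
    y <- weather$RainTomorrow
    model <- update(model, x, y, n.iter = 100)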
