Decision Trees
What is a decision tree?
Input: Learning Set
Output: a classification model (the decision tree)
An overview of decision tree algorithms
From the late 1970s to the early 1980s, Quinlan developed the ID3 algorithm (Iterative Dichotomiser)
Quinlan later improved ID3, producing the C4.5 algorithm
In 1984, a group of statisticians proposed the CART algorithm in the well-known book "Classification and Regression Trees"
ID3 and CART appeared at almost the same time, sparking a wave of decision tree research; many algorithms have been proposed since
Core problems of these algorithms
In what order should variables (attributes) be selected?
Where is the best split point (for continuous attributes)?
ID3 algorithm
Information gain calculation
Recursion + divide and conquer
On the basis of this method, the split attribute of each child node is computed recursively, and the whole decision tree is finally obtained.
This method is called the ID3 algorithm; other algorithms can also produce decision trees.
For continuous feature attributes, ID3 can still be used: sort the elements of D by the attribute value, and treat the midpoint of every pair of adjacent values as a potential split point.
Starting from the first potential split point, split D and compute the expected information of the two resulting sets; the point with the smallest expected information is the best split point for the attribute, and that value is taken as the attribute's expected information.
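As a rough sketch (not the course's own code) of how entropy, information gain, and the midpoint search for a continuous attribute could be computed; all function and variable names below are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, split):
    """Information gain of splitting a continuous attribute at `split`."""
    left = [y for x, y in zip(values, labels) if x <= split]
    right = [y for x, y in zip(values, labels) if x > split]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def best_split(values, labels):
    """Sort the values, take the midpoint of every pair of adjacent values
    as a candidate, and keep the one with the largest gain (i.e. the
    smallest expected information after the split)."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda s: info_gain(values, labels, s))

# e.g. best_split([2.0, 2.5, 3.5, 4.0], ["no", "no", "yes", "yes"]) -> 3.0
```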
C4.5 algorithm
Information gain tends to favour attributes with many distinct values (levels)
Improvement on information gain: the gain ratio
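A minimal sketch of the gain ratio for a categorical attribute, reusing the `entropy` helper from the sketch above (again, the names are mine, not from the notes); the split information in the denominator is what penalises attributes with many distinct values:

```python
import math

def gain_ratio(attr_values, labels):
    """C4.5-style gain ratio: information gain divided by the split
    information of the partition induced by the attribute."""
    n = len(labels)
    groups = {}
    for x, y in zip(attr_values, labels):
        groups.setdefault(x, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information grows with the number of distinct attribute values,
    # which is exactly what the gain ratio penalises.
    split_info = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0
```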
CART algorithm
Selecting variables using the Gini index
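A corresponding sketch of the Gini index used by CART for variable selection (illustrative only; a binary split on a continuous value is assumed):

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, split):
    """Weighted Gini index of a binary split at `split`; CART picks the
    variable and split point that minimise this quantity."""
    left = [y for x, y in zip(values, labels) if x <= split]
    right = [y for x, y in zip(values, labels) if x > split]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```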
Pruning
Pruning in CART
Post-pruning: first grow the complete decision tree, then cut it back. The opposite approach is pre-pruning.
Cost complexity: a function of the number of leaf nodes (the candidates for cutting) and the error rate of the tree
If pruning a subtree reduces the cost complexity, the pruning is carried out
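The notes do not give the exact formula, but one common form of cost complexity is the tree's error rate plus a penalty proportional to the number of leaves, sketched below with an assumed penalty weight `alpha`:

```python
def cost_complexity(n_errors, n_samples, n_leaves, alpha):
    """Cost complexity of a (sub)tree: its error rate plus a penalty
    proportional to the number of leaf nodes. A subtree is replaced by a
    leaf if doing so does not increase this quantity."""
    return n_errors / n_samples + alpha * n_leaves
```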
Pruning set: a separate set of labelled tuples used to evaluate candidate prunings
How to evaluate classifier performance?
Combination (ensemble) methods for improving classifier accuracy
Combination methods include bagging, boosting, and random forests
Generate several training sets by sampling from the learning data set
Train one classifier on each training set
Each classifier makes a prediction, and a simple majority vote determines the final class label
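A tiny illustrative sketch of the majority vote over the individual predictions:

```python
from collections import Counter

def majority_vote(predictions):
    """Final class label = the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. majority_vote(["yes", "no", "yes"]) -> "yes"
```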
Why can combination methods improve classification accuracy?
Advantages of combination algorithms
1. Can significantly improve classification accuracy
2. More robust to errors and noise
3. Offsets overfitting to some extent
4. Well suited to parallel computation
Bagging (bootstrap aggregating) algorithm
Explanation: each training set is a bootstrap sample
Sampling with replacement
Bootstrap samples: see Jiawei Han's book, page 241
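A rough sketch of bagging under these assumptions: classifiers are represented as plain callables, `train_fn` is any routine that fits one classifier (for example a decision tree learner), and `majority_vote` is the helper sketched earlier; none of these names come from the original notes:

```python
import random

def bootstrap_sample(data):
    """Draw a bootstrap sample: len(data) draws with replacement."""
    return [random.choice(data) for _ in range(len(data))]

def bagging(data, train_fn, n_classifiers=10):
    """Train one classifier per bootstrap sample; `train_fn` fits a
    classifier on a data set and returns it as a callable record -> label."""
    return [train_fn(bootstrap_sample(data)) for _ in range(n_classifiers)]

def bagged_predict(classifiers, record):
    """Each classifier predicts; a simple majority vote decides the class."""
    return majority_vote([clf(record) for clf in classifiers])
```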
Advantages of the bagging algorithm
Accuracy is significantly higher than that of any single classifier in the group
Robust: performance does not degrade much even under heavy noise
Not prone to overfitting
Boosting algorithm idea
Tuples in the training set are assigned weights
The weights influence sampling: the larger a tuple's weight, the more likely it is to be drawn
Several classifiers are trained iteratively; tuples misclassified by the previous classifier get their weights increased so that later classifiers pay more "attention" to them
The final classification is again decided by a vote of all classifiers, with each classifier's voting weight depending on its accuracy
AdaBoost algorithm
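A simplified sketch of one round of AdaBoost's weight update (the course's exact variant may differ; the 0.5 * log term is the standard binary AdaBoost choice):

```python
import math

def adaboost_round(weights, correct):
    """One boosting round: `weights` are the current tuple weights and
    `correct` marks which tuples the newly trained classifier got right.
    Returns the classifier's voting weight and the updated, renormalised
    tuple weights. Assumes 0 < error < 1."""
    error = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - error) / error)          # more accurate -> bigger vote
    new_w = [w * math.exp(alpha if not ok else -alpha)   # grow wrong, shrink right
             for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]
```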
Advantages and disadvantages of boosting
Can achieve higher accuracy than bagging
Prone to overfitting
Random forest algorithm
Composed of a number of decision tree classifiers (hence the name "forest")
Each decision tree classifier is built with randomness. First, its learning set is a bootstrap sample drawn with replacement from the original training set.
Second, the variables used to construct each tree are also drawn at random, and their number is usually much smaller than the number of available variables.
Once the learning set and the candidate variables are fixed, each tree is grown with the CART algorithm and is not pruned.
The final classification is decided by a simple majority vote over all decision tree classifiers.
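A minimal usage sketch with scikit-learn's RandomForestClassifier, assuming that library is acceptable here; the data below is a toy placeholder:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy placeholder data: feature rows and their class labels.
X = [[2.0, 1.0], [2.5, 0.0], [3.5, 1.0], [4.0, 0.0]]
y = ["no", "no", "yes", "yes"]

# Each tree is grown on a bootstrap sample, considering only a random
# subset of the variables at each split ("sqrt" of the total here) and
# is left unpruned; the forest aggregates the trees' predictions.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)
print(forest.predict([[3.0, 1.0]]))
```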
Advantages of random forest algorithms
Accuracy is comparable to AdaBoost
More robust to errors and outliers
The tendency of a single decision tree to overfit is weakened as the forest grows
Fast, and performs well on large data sets
Machine Learning, Week 5 ("smelting numbers into gold" course): decision trees, combination and boosting algorithms, bagging and AdaBoost, random forests