Machine Learning Week 5 - Refining Numbers into Gold: decision trees, ensemble (boosting) algorithms, bagging and AdaBoost, random forests


Decision Trees

What is a decision tree?
Input: a learning (training) set
Output: a classification model (a decision tree)

An overview of decision tree algorithms

From the late 1970s to the early 1980s, Quinlan developed the ID3 algorithm (Iterative Dichotomiser).
Quinlan later improved ID3, producing the C4.5 algorithm.
In 1984, a group of statisticians proposed the CART algorithm in the well-known book "Classification and Regression Trees".
ID3 and CART appeared at almost the same time and set off a wave of research into decision tree algorithms; many algorithms have been proposed since.

The core problem of the algorithm

In what order should variables (attributes) be selected for splitting?
Where is the best split point (for continuous attributes)?

ID3 algorithm

Information gain calculation

Recursion + divide and conquer

On the basis of this method, the split attribute of each child node is computed recursively, and the whole decision tree is obtained in the end.
This method is called the ID3 algorithm; other algorithms can also produce decision trees.
For continuous feature attributes, ID3 can sort the elements of D by that attribute, and the midpoint of each pair of adjacent values is treated as a potential split point.

Starting from the first potential split point, D is split into two sets and the expected information of each split is computed; the point with the smallest expected information is the best split point for that attribute, and its expected information is taken as the attribute's expected information (see the sketch below).
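As a concrete illustration, here is a minimal sketch of entropy, information gain, and the midpoint-based search for the best split point of a continuous attribute. It uses only the Python standard library; the toy `values`/`labels` data and the helper names are illustrative assumptions, not part of the original text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Gain from splitting a continuous attribute at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = (len(left) / len(labels)) * entropy(left) + \
               (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - weighted

# Toy data: one continuous attribute and binary class labels.
values = [1.0, 2.0, 3.5, 4.0, 6.0, 7.5]
labels = ['no', 'no', 'no', 'yes', 'yes', 'yes']

# Candidate split points are the midpoints of adjacent sorted values.
pairs = sorted(zip(values, labels))
candidates = [(pairs[i][0] + pairs[i + 1][0]) / 2 for i in range(len(pairs) - 1)]
best = max(candidates, key=lambda t: information_gain(values, labels, t))
print("best split:", best, "gain:", information_gain(values, labels, best))
```

On this toy data the best split falls between 3.5 and 4.0, where the two classes separate perfectly.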

C4.5 algorithm

Information gain tends to favour variables with many distinct values.
C4.5 therefore replaces information gain with the gain ratio (sketched below).
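A minimal sketch of the gain ratio, assuming only the Python standard library; the toy attributes (a coarse two-valued attribute versus an ID-like attribute with a unique value per row) are illustrative assumptions meant to show why the ratio penalises many-valued variables.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_and_ratio(attribute, labels):
    """Information gain and C4.5 gain ratio for a categorical attribute."""
    total = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: the entropy of the partition sizes themselves.
    split_info = -sum(len(g) / total * math.log2(len(g) / total)
                      for g in groups.values())
    return gain, gain / split_info if split_info > 0 else 0.0

labels = ['yes', 'yes', 'yes', 'no', 'no', 'no']
coarse = ['a', 'a', 'a', 'b', 'b', 'b']       # two values
fine = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6']   # unique per row, like an ID column

print(gain_and_ratio(coarse, labels))  # gain 1.0, ratio 1.0
print(gain_and_ratio(fine, labels))    # gain 1.0, but ratio ~0.39
```

Both attributes achieve the same information gain here, but the ID-like attribute's gain ratio is divided by its large split information, so C4.5 no longer prefers it.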

CART algorithm

Split variables are selected using the Gini index (see the sketch below).
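A minimal sketch of the Gini index and of picking the binary split that minimises it, using only the Python standard library; the toy data and candidate thresholds are illustrative assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    """Weighted Gini index of the binary split x <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

values = [2.0, 3.0, 5.0, 7.0, 8.0]
labels = ['no', 'no', 'yes', 'yes', 'yes']

# CART chooses the split with the smallest weighted Gini index.
best = min([2.5, 4.0, 6.0, 7.5], key=lambda t: gini_split(values, labels, t))
print("best split:", best)
```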

Pruning

Pruning in CART

Post-pruning: first grow the complete decision tree, then cut it back. The opposite approach is pre-pruning.
Cost complexity: a function of the number of leaf nodes (the objects being cut) and the error rate of the tree.
If pruning a subtree reduces the cost complexity, the pruning is carried out.
Pruning set: a held-out set used to evaluate the candidate pruned trees. A sketch follows.
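A minimal sketch of cost-complexity post-pruning, assuming scikit-learn is available; its `ccp_alpha` parameter trades off the number of leaves against the tree's error, which matches the cost-complexity idea above. The iris data and parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the full tree first, then ask for the pruning path (candidate alphas).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# A larger ccp_alpha means more aggressive pruning and fewer leaves.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```

The held-out test split plays the role of the pruning set: it is used only to compare the pruned trees, not to grow them.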

How to evaluate classifier performance?

Combination (ensemble) methods to improve classifier accuracy

Combination methods include bagging, boosting and random forests.
Several training sets are generated by sampling from the learning data set.
A classifier is generated from each training set.
Each classifier makes a prediction, and a simple majority vote determines the final class.

Why can combination methods improve classification accuracy?

Advantages of combination algorithms

1. Can significantly improve classification accuracy
2. More robust to errors and noise
3. Offsets overfitting to a certain extent
4. Well suited to parallel computation

Bagging algorithm

Explanation: each training set is a bootstrap sample, drawn with replacement

Sampling with replacement
Bootstrap samples (see Jiawei Han's book, page 241); a sketch follows
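A minimal bagging sketch, assuming scikit-learn's DecisionTreeClassifier as the base learner; each round draws a bootstrap sample (sampling with replacement) and the final class is decided by a simple majority vote. The iris data and the number of rounds (25) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: n draws with replacement from the n training tuples.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over all classifiers for each point.
votes = np.stack([t.predict(X) for t in trees])        # shape (25, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("ensemble training accuracy:", (majority == y).mean())
```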

Advantages of the bagging algorithm

Accuracy is significantly higher than that of any single classifier in the ensemble
Performance remains reasonable even with substantial noise; the method is robust
Not prone to overfitting

Boosting algorithm idea

Tuples in the training set are assigned weights
Weights influence sampling: the greater a tuple's weight, the more likely it is to be drawn
Several classifiers are trained iteratively; tuples misclassified by the previous classifier have their weights increased, so that later classifiers pay more attention to them.
The final classification is again decided by a vote of all classifiers, with voting weights that depend on each classifier's accuracy.

AdaBoost algorithm
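A minimal AdaBoost-style sketch with decision stumps, assuming scikit-learn for the weak learner; misclassified tuples get larger weights so that later classifiers focus on them, and each classifier's vote is weighted by its accuracy (through alpha). The synthetic data, the number of rounds, and the -1/+1 label encoding are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = np.where(y == 0, -1, 1)               # -1/+1 labels for the weighted vote

w = np.full(len(X), 1 / len(X))           # start with uniform tuple weights
stumps, alphas = [], []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                              # weighted error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))     # classifier's voting weight
    # Increase the weights of misclassified tuples, decrease the correct ones.
    w = w * np.exp(-alpha * y * pred)
    w = w / w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final classification: accuracy-weighted vote of all stumps.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(scores) == y).mean())
```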

Advantages and disadvantages of boosting

Can achieve higher accuracy than bagging.
Prone to overfitting

Random forest algorithm

Composed of a number of decision tree classifiers (hence the name "forest")
Each decision tree classifier is built using randomness. First, its learning set is a bootstrap sample drawn from the original training set with replacement.

Second, the variables used to construct the tree are also drawn at random, and their number is usually much smaller than the number of available variables.
Once the learning set and the variable subset are fixed, each tree is grown with the CART algorithm and is not pruned.
The final classification is decided by a simple majority vote of the decision tree classifiers (see the sketch below).
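A minimal random forest sketch, assuming scikit-learn's RandomForestClassifier; each tree is grown on a bootstrap sample, only a random subset of variables (`max_features`) is considered at each split, and the trees' predictions are aggregated (scikit-learn averages predicted probabilities rather than taking a hard majority vote). The iris data and parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # random subset of variables tried at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the data
    random_state=0,
)
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())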

Advantages of random forest algorithms

Accuracy is comparable to that of AdaBoost
More robust against errors and outliers
The decision tree's tendency to overfit diminishes as the forest grows
Fast, and performs well on large data sets
