Decision Trees
What is a decision tree?
Input: Learning Set
Output: a classification model (the decision tree)
An overview of decision tree algorithms
From the late 1970s to the early 1980s, Quinlan developed the ID3 algorithm (Iterative Dichotomiser)
Quinlan later improved ID3, producing the C4.5 algorithm
In 1984, a group of statisticians proposed the CART algorithm in the well-known book "Classification and Regression Trees"
ID3 and CART appeared at almost the same time, sparking a wave of decision tree research; many algorithms have been proposed since
Core problems of these algorithms
In what order should variables (attributes) be selected?
Where is the best split point (for continuous attributes)?
ID3 algorithm
Information gain calculation
Recursion + divide and conquer
On the basis of this method, the split attribute of each child node is computed recursively, and the whole decision tree is finally obtained.
This method is called the ID3 algorithm; other algorithms can also produce decision trees.
For continuous feature attributes, ID3 can still be used: sort the elements of D by the attribute value, and treat the midpoint of every pair of adjacent values as a potential split point.
Starting from the first potential split point, split D and compute the expected information of the two resulting sets; the point with the smallest expected information is the best split point for the attribute, and that value is taken as the attribute's expected information.
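As a rough sketch (not the course's own code) of how entropy, information gain, and the midpoint search for a continuous attribute could be computed; all function and variable names below are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, split):
    """Information gain of splitting a continuous attribute at `split`."""
    left = [y for x, y in zip(values, labels) if x <= split]
    right = [y for x, y in zip(values, labels) if x > split]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

def best_split(values, labels):
    """Sort the values, take the midpoint of every pair of adjacent values
    as a candidate, and keep the one with the largest gain (i.e. the
    smallest expected information after the split)."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda s: info_gain(values, labels, s))

# e.g. best_split([2.0, 2.5, 3.5, 4.0], ["no", "no", "yes", "yes"]) -> 3.0
```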
C4.5 algorithm
Information gain tends to favour attributes with many distinct values (levels)
Improvement on information gain: the gain ratio
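A minimal sketch of the gain ratio for a categorical attribute, reusing the `entropy` helper from the sketch above (again, the names are mine, not from the notes); the split information in the denominator is what penalises attributes with many distinct values:

```python
import math

def gain_ratio(attr_values, labels):
    """C4.5-style gain ratio: information gain divided by the split
    information of the partition induced by the attribute."""
    n = len(labels)
    groups = {}
    for x, y in zip(attr_values, labels):
        groups.setdefault(x, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information grows with the number of distinct attribute values,
    # which is exactly what the gain ratio penalises.
    split_info = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0
```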
CART algorithm
Selecting variables using the Gini index
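A corresponding sketch of the Gini index used by CART for variable selection (illustrative only; a binary split on a continuous value is assumed):

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, split):
    """Weighted Gini index of a binary split at `split`; CART picks the
    variable and split point that minimise this quantity."""
    left = [y for x, y in zip(values, labels) if x <= split]
    right = [y for x, y in zip(values, labels) if x > split]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```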
Pruning
Pruning in CART
Post-pruning: first grow the complete decision tree, then cut it back. The opposite approach is pre-pruning.
Cost complexity: a function of the number of leaf nodes (the candidates for cutting) and the error rate of the tree
If pruning a subtree reduces the cost complexity, the pruning is carried out
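The notes do not give the exact formula, but one common form of cost complexity is the tree's error rate plus a penalty proportional to the number of leaves, sketched below with an assumed penalty weight `alpha`:

```python
def cost_complexity(n_errors, n_samples, n_leaves, alpha):
    """Cost complexity of a (sub)tree: its error rate plus a penalty
    proportional to the number of leaf nodes. A subtree is replaced by a
    leaf if doing so does not increase this quantity."""
    return n_errors / n_samples + alpha * n_leaves
```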
Pruning set: a separate set of labelled tuples used to evaluate candidate prunings
How to evaluate classifier performance?
Combination (ensemble) methods for improving classifier accuracy
Combination methods include bagging, boosting, and random forests
Generate several training sets by sampling from the learning data set
Train one classifier on each training set
Each classifier makes a prediction, and a simple majority vote determines the final class label
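A tiny illustrative sketch of the majority vote over the individual predictions:

```python
from collections import Counter

def majority_vote(predictions):
    """Final class label = the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. majority_vote(["yes", "no", "yes"]) -> "yes"
```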
Why can combination methods improve classification accuracy?
Advantages of combination algorithms
1. Can significantly improve classification accuracy
2. More robust to errors and noise
3. Offsets overfitting to some extent
4. Well suited to parallel computation
Bagging (bootstrap aggregating) algorithm
Explanation: each training set is a bootstrap sample
Sampling with replacement
Bootstrap samples: see Jiawei Han's book, page 241
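A rough sketch of bagging under these assumptions: classifiers are represented as plain callables, `train_fn` is any routine that fits one classifier (for example a decision tree learner), and `majority_vote` is the helper sketched earlier; none of these names come from the original notes:

```python
import random

def bootstrap_sample(data):
    """Draw a bootstrap sample: len(data) draws with replacement."""
    return [random.choice(data) for _ in range(len(data))]

def bagging(data, train_fn, n_classifiers=10):
    """Train one classifier per bootstrap sample; `train_fn` fits a
    classifier on a data set and returns it as a callable record -> label."""
    return [train_fn(bootstrap_sample(data)) for _ in range(n_classifiers)]

def bagged_predict(classifiers, record):
    """Each classifier predicts; a simple majority vote decides the class."""
    return majority_vote([clf(record) for clf in classifiers])
```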
Advantages of the bagging algorithm
Accuracy is significantly higher than that of any single classifier in the group
Robust: performance does not degrade much even under heavy noise
Not prone to overfitting
Boosting algorithm idea
Tuples in the training set are assigned weights
The weights influence sampling: the larger a tuple's weight, the more likely it is to be drawn
Several classifiers are trained iteratively; tuples misclassified by the previous classifier get their weights increased so that later classifiers pay more "attention" to them
The final classification is again decided by a vote of all classifiers, with each classifier's voting weight depending on its accuracy
AdaBoost algorithm
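A simplified sketch of one round of AdaBoost's weight update (the course's exact variant may differ; the 0.5 * log term is the standard binary AdaBoost choice):

```python
import math

def adaboost_round(weights, correct):
    """One boosting round: `weights` are the current tuple weights and
    `correct` marks which tuples the newly trained classifier got right.
    Returns the classifier's voting weight and the updated, renormalised
    tuple weights. Assumes 0 < error < 1."""
    error = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - error) / error)          # more accurate -> bigger vote
    new_w = [w * math.exp(alpha if not ok else -alpha)   # grow wrong, shrink right
             for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]
```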
Advantages and disadvantages of boosting
Can achieve higher accuracy than bagging
Prone to overfitting
Random forest algorithm
Composed of a number of decision tree classifiers (hence the name "forest")
Each decision tree classifier is built with randomness. First, its learning set is a bootstrap sample drawn with replacement from the original training set.
Second, the variables used to construct each tree are also drawn at random, and their number is usually much smaller than the number of available variables.
Once the learning set and the candidate variables are fixed, each tree is grown with the CART algorithm and is not pruned.
The final classification is decided by a simple majority vote over all decision tree classifiers.
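A minimal usage sketch with scikit-learn's RandomForestClassifier, assuming that library is acceptable here; the data below is a toy placeholder:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy placeholder data: feature rows and their class labels.
X = [[2.0, 1.0], [2.5, 0.0], [3.5, 1.0], [4.0, 0.0]]
y = ["no", "no", "yes", "yes"]

# Each tree is grown on a bootstrap sample, considering only a random
# subset of the variables at each split ("sqrt" of the total here) and
# is left unpruned; the forest aggregates the trees' predictions.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)
print(forest.predict([[3.0, 1.0]]))
```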
Advantages of random forest algorithms
Accuracy is comparable to AdaBoost
More robust to errors and outliers
The tendency of a single decision tree to overfit is weakened as the forest grows
Fast, and performs well on large data sets
Machine Learning, Week 5 ("smelting numbers into gold" course): decision trees, combination and boosting algorithms, bagging and AdaBoost, random forests