July Edu Algorithm -- December Machine Learning Online Class, Lesson 11 Notes: Random Forest and Boosting
July Edu (julyedu.com) December Machine Learning Online Class study notes, http://www.julyedu.com
Random forest: build many trees; the key question is how to split the current node.
1. Decision Tree
Decision tree learning uses a top-down recursive method; the basic idea is to build the tree so that entropy decreases as fast as possible along it.
The entropy at the leaf nodes is zero: the instances in each leaf node all belong to the same class.
The focus below is on how to choose the split so that entropy decreases fastest.
1.2 Decision Tree Generation Algorithms
The key to building a decision tree is choosing which attribute to split on in the current state.
Depending on the objective function used, there are three main decision tree algorithms:
ID3, C4.5, and CART. The three share a similar learning idea.
1.2.1 Information gain (ID3)
1. Concept: entropy and conditional entropy estimated from the training data are called the empirical entropy and the empirical conditional entropy, respectively.
Information gain: the degree to which knowing the information of feature A reduces the uncertainty about the class X.
Definition: the information gain g(D, A) of feature A with respect to training data set D is the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given A, i.e. g(D, A) = H(D) - H(D|A).
Computing it therefore amounts to computing H(D) and H(D|A).
2. Basic notation
3. How the information gain is calculated
Compute the empirical entropy H(D) of data set D.
Traverse all the features; for each feature A:
compute the empirical conditional entropy H(D|A) of data set D given feature A;
compute the information gain of feature A: g(D, A) = H(D) - H(D|A).
H(D|A) is calculated as in the reconstructed formulas shown after this list.
Select the feature with the greatest information gain as the current splitting feature; that is, compute the gain of every feature and choose the largest.
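The formulas referenced above appeared as images in the original note and did not survive extraction. A standard reconstruction in LaTeX, where C_k is the set of samples in class k, D_i is the subset of D on which feature A takes its i-th value, and D_{ik} is the subset of D_i belonging to class k, is:

H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}

H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}

g(D, A) = H(D) - H(D \mid A)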
1.2.2 C4.5 (Information Gain Ratio)
Information gain ratio:
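The gain-ratio formula was also an image in the original; the standard C4.5 definition (a reconstruction, with H_A(D) the entropy of D with respect to the values of feature A) is:

g_R(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}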
Gini index:
Discussion of the Gini index
A second definition of the Gini index
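The Gini formulas were images in the original; the standard definitions (a reconstruction) are as follows. For a distribution over K classes with probabilities p_k:

Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2

The second, sample-based definition for a data set D (with C_k the samples of class k), together with the criterion CART uses for a binary split of D by feature A into D_1 and D_2:

Gini(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2, \qquad Gini(D, A) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)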
The greater an attribute's information gain (or gain ratio), or the larger the reduction in Gini index it produces, the stronger that attribute's ability to reduce the entropy of the samples, i.e., the stronger its ability to turn the data from uncertain into certain.
1.3 Decision tree over-fitting
A decision tree may classify the training data very well but fail to classify unseen test data well; its generalization ability is weak, which means overfitting may have occurred.
Pruning and random forests are means of preventing overfitting.
A. The Bagging strategy (adds random sampling)
1. Bootstrap aggregation.
2. Resample with replacement from the sample set to select n samples.
3. Using all attributes, build a classifier on these n samples (ID3, C4.5, CART, SVM, logistic regression, etc.).
4. Repeat the above two steps m times, obtaining m classifiers.
5. Finally, feed the data to these m classifiers and decide which class it belongs to by their votes.
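A minimal sketch of this bagging procedure, assuming scikit-learn is available and using a decision tree as the base classifier (the helper names bagging_fit and bagging_predict are illustrative, not from the original notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m=10):
    """Steps 1-4: train m classifiers, each on a bootstrap sample drawn with replacement.
    X, y are expected to be NumPy arrays."""
    n = len(X)
    classifiers = []
    for _ in range(m):
        idx = np.random.choice(n, size=n, replace=True)  # step 2: resample n samples with replacement
        clf = DecisionTreeClassifier()                   # step 3: base learner over all attributes
        clf.fit(X[idx], y[idx])
        classifiers.append(clf)
    return classifiers

def bagging_predict(classifiers, X):
    """Step 5: majority vote of the m classifiers (assumes integer class labels)."""
    votes = np.array([clf.predict(X) for clf in classifiers])  # shape (m, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```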
B. Random Forest
The difference from the bagging strategy (the split is not chosen over all attributes, which adds extra randomness):
1. Select n samples from the sample set by bootstrap sampling;
2. Randomly select k attributes out of all attributes, and choose the best splitting attribute among them as the node to build a CART decision tree;
3. Repeat the above two steps m times, i.e., build m CART decision trees;
4. These m CART trees form the random forest, which decides which class a data point belongs to by voting.
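Scikit-learn's RandomForestClassifier follows this recipe closely (bootstrap samples, a random subset of attributes at each split, CART-style trees); a short usage sketch with illustrative parameter values:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators corresponds to m trees; max_features to the k attributes tried at each split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```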
1.4 Voting mechanisms
One possible scheme:
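The details of the scheme from the original slide are not preserved; as one illustrative possibility, a simple weighted-voting rule could look like the following sketch (function name, weights, and labels are made up for the example):

```python
import numpy as np

def weighted_vote(predictions, weights):
    """predictions: (m,) array of class labels from m classifiers;
    weights: (m,) array of classifier weights.
    Returns the label with the largest total weight."""
    labels = np.unique(predictions)
    scores = [weights[predictions == lab].sum() for lab in labels]
    return labels[int(np.argmax(scores))]

# Example: predictions [0, 1, 1] with weights [0.6, 0.3, 0.2] -> class 0 wins (0.6 > 0.5),
# even though class 1 has more raw votes.
print(weighted_vote(np.array([0, 1, 1]), np.array([0.6, 0.3, 0.2])))
```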
1.5 AdaBoost
The most critical part is the coefficient of each base classifier G_m(x); points to note:
For m = 1, 2, ..., M, learn the m-th base classifier G_m(x).
A. Update the weight distribution of the training data set.
Z_m is a normalization factor.
B. Construct a linear combination of the base classifiers.
C. Obtain the final classifier.
If a sample is misclassified in this round, its weight is increased for the next round; if it is classified correctly, its weight is decreased.
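The AdaBoost formulas referenced in the steps above were images in the original note; a standard reconstruction in LaTeX (notation: e_m is the weighted error rate of G_m, and w_{m,i} are the sample weights at round m) is:

\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}

w_{m+1,i} = \frac{w_{m,i}}{Z_m} \exp\big(-\alpha_m y_i G_m(x_i)\big), \qquad Z_m = \sum_{i=1}^{N} w_{m,i} \exp\big(-\alpha_m y_i G_m(x_i)\big)

f(x) = \sum_{m=1}^{M} \alpha_m G_m(x), \qquad G(x) = \mathrm{sign}\big(f(x)\big)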