In our summary of ensemble learning principles, we noted that there are two main schools of ensemble learning. One is the boosting family, characterized by dependencies among the weak learners. The other is the bagging family, characterized by the absence of dependencies among the weak learners, so they can be fit in parallel. This post summarizes the bagging and random forest algorithms in ensemble learning.
Random forest is an algorithm that can compete with the gradient boosting tree (GBDT) within ensemble learning. In particular, it parallelizes easily during training, which is very attractive in today's era of large data samples.
1. Principle of bagging
In the summary of ensemble learning principles, we drew the following schematic diagram for bagging.
As the diagram shows, there is no boosting-style chain of dependencies between bagging's weak learners; its distinguishing feature is "random sampling". So what is random sampling?
Random sampling (bootstrap sampling) collects a fixed number of samples from the training set, but each sample is put back after it is collected. In other words, a previously collected sample may be collected again. For the bagging algorithm, we usually collect as many samples as there are in the training set. The resulting sampling set therefore has the same size as the training set, but its contents differ. If we perform T rounds of random sampling on a training set of m samples, the T sampling sets all differ because of the randomness.
Note that this differs from GBDT's subsampling. GBDT subsamples without replacement, while bagging subsamples with replacement.
For a given sample, in one round of random sampling over a training set of m samples, the probability of being picked on each draw is $\frac{1}{m}$, so the probability of not being picked is $1-\frac{1}{m}$. The probability of never being picked in m draws is $(1-\frac{1}{m})^m$. As $m \to \infty$, $(1-\frac{1}{m})^m \to \frac{1}{e} \simeq 0.368$. In other words, in bagging's random sampling, about 36.8% of the training set is never collected into the sampling set.
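As a quick sanity check on the 36.8% figure, here is a minimal simulation sketch (my own illustration, assuming NumPy is available): draw m indices with replacement from m samples and count how many indices never appear.

```python
import numpy as np

# Simulate bootstrap sampling: draw m indices with replacement from m samples
# and measure the fraction of samples that are never drawn (the out-of-bag fraction).
rng = np.random.default_rng(0)
m, repeats = 10000, 100

oob_fractions = []
for _ in range(repeats):
    drawn = rng.integers(0, m, size=m)          # m draws with replacement
    never_drawn = m - np.unique(drawn).size     # samples that were never selected
    oob_fractions.append(never_drawn / m)

print(np.mean(oob_fractions))   # close to 1/e, i.e. about 0.368
```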
These roughly 36.8% of samples that are never picked are often called out-of-bag (OOB) data. They are not used to fit the model, and can therefore be used to estimate the model's generalization ability.
Bagging places no restriction on the weak learner, just like AdaBoost, but the most commonly used weak learners are decision trees and neural networks.
Bagging's aggregation strategy is also relatively simple. For classification problems, simple majority voting is usually used: the category that receives the most votes from the T weak learners (or one of them, in case of a tie) is the final model output. For regression problems, simple averaging is usually used: the regression outputs of the T weak learners are arithmetically averaged to obtain the final model output.
Because bagging trains each model on a resampled subset, its generalization ability is strong, which helps reduce the variance of the model. Of course, the fit to the training set becomes somewhat worse, i.e., the bias of the model increases.
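As a concrete illustration of these ideas (a hedged sketch, not code from the original post), scikit-learn's BaggingClassifier bags decision trees by default, and its oob_score option uses the out-of-bag data discussed above to estimate generalization accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Toy data; BaggingClassifier uses a decision tree as its default weak learner.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

bag = BaggingClassifier(
    n_estimators=100,    # T weak learners
    bootstrap=True,      # sampling with replacement, as described above
    oob_score=True,      # evaluate each learner on its own out-of-bag samples
    random_state=0,
)
bag.fit(X, y)
print(bag.oob_score_)    # OOB estimate of generalization accuracy
```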
2. Bagging algorithm Flow
In the previous section we summarized the principles of bagging; in this section we summarize the flow of the bagging algorithm. Relative to boosting-series algorithms such as AdaBoost and GBDT, bagging is much simpler.
Input: sample set $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$, a weak learner algorithm, and the number of weak learner iterations T.
Output: the final strong classifier $f(x)$.
1) For $t=1,2,\ldots,T$:
a) Perform the t-th round of random sampling on the training set, drawing m times in total, to obtain a sampling set $D_t$ containing m samples.
b) Train the t-th weak learner $G_t(x)$ on the sampling set $D_t$.
2) For classification, the class that receives the most votes from the T weak learners is the final prediction. For regression, the regression outputs of the T weak learners are arithmetically averaged to give the final model output.
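A minimal from-scratch sketch of this flow (my own illustration, using scikit-learn decision trees as the weak learners and assuming non-negative integer class labels) might look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    """Step 1: for t = 1..T, draw a bootstrap sample D_t and fit a weak learner G_t."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)              # step 1a: m draws with replacement
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # step 1b
    return learners

def bagging_predict(learners, X):
    """Step 2: majority vote over the T weak learners (use the mean for regression)."""
    votes = np.stack([g.predict(X) for g in learners])        # shape (T, n_samples)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```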
3. Random Forest algorithms
With the bagging algorithm understood, random forest (hereinafter RF) is easy to understand. It is an evolved version of bagging: the underlying idea is still bagging, but with some unique improvements. Let's look at what the RF algorithm improves.
First, RF uses CART decision trees as its weak learners, which is reminiscent of the gradient boosting tree GBDT. Second, RF modifies the decision tree itself. An ordinary decision tree chooses an optimal feature among all n features at a node to split on. RF instead randomly selects a subset of the features at each node, of size $n_{sub}$ with $n_{sub} < n$, and then chooses the optimal feature among these $n_{sub}$ randomly selected features to split the node into left and right subtrees. This further improves the generalization ability of the model.
If $n_{sub} = n$, then RF's CART decision trees are no different from ordinary CART decision trees. The smaller $n_{sub}$ is, the more robust the model becomes, but of course the fit to the training set gets worse. In other words, a smaller $n_{sub}$ lowers the model's variance but raises its bias. In practice, a suitable value of $n_{sub}$ is usually found by cross-validation.
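The paragraph above suggests choosing $n_{sub}$ by cross-validation. A hedged sketch with scikit-learn (where $n_{sub}$ corresponds to the forest's max_features parameter) could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Search over candidate n_sub values (max_features) with 5-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_features": [2, 4, 6, 8, "sqrt"]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```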
Apart from these two points, RF is no different from the ordinary bagging algorithm. Below is a brief summary of the RF algorithm.
Input: sample set $D=\{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\}$ and the number of weak classifier iterations T.
Output: the final strong classifier $f(x)$.
1) For $t=1,2,\ldots,T$:
a) Perform the t-th round of random sampling on the training set, drawing m times in total, to obtain a sampling set $D_t$ containing m samples.
b) Train the t-th decision tree model $G_t(x)$ on the sampling set $D_t$. When training each node of the decision tree, randomly select a subset of the features available at that node, and choose the optimal feature among this random subset to split the node into left and right subtrees.
2) For classification, the class that receives the most votes from the T weak learners is the final prediction. For regression, the regression outputs of the T weak learners are arithmetically averaged to give the final model output.
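For reference, a minimal usage sketch of this flow with scikit-learn's random forest estimators (my own illustration; the library's forests build CART trees on bootstrap samples and restrict each split to a random feature subset via max_features):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: the trees' predictions are combined into a single class
# (scikit-learn averages class probabilities rather than taking a hard vote).
Xc, yc = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(Xc, yc)
print(clf.predict(Xc[:5]))

# Regression: the trees' outputs are averaged.
Xr, yr = make_regression(n_samples=500, n_features=10, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(Xr, yr)
print(reg.predict(Xr[:5]))
```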
4. Extensions of random forests
Because RF behaves well in practice, many variant algorithms have been built on top of it, with a wide range of applications: not only classification and regression, but also feature transformation, anomaly detection, and so on. Below is a summary of typical algorithms in the RF family.
4.1 Extra Trees
Extra Trees is a variant of RF whose principle is almost identical to RF's. The only differences are:
1) When deciding how to split a decision tree node, RF selects the splitting feature from a randomly chosen subset of features, while Extra Trees, more in line with the original bagging tradition, selects the splitting feature from all features.
2) After the splitting feature is selected, RF's decision tree chooses an optimal split point for that feature's values based on criteria such as information gain, the Gini index, or mean squared error, just like a traditional decision tree. Extra Trees is more aggressive: it chooses a feature value at random as the split point.
As the second point shows, because the split point is chosen at random rather than optimally, the resulting decision trees are generally larger than those generated by RF. In other words, the variance of the model is further reduced relative to RF, while the bias is further increased. In some cases, Extra Trees generalizes better than RF.
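A hedged comparison sketch (not from the original post): scikit-learn provides ExtraTreesClassifier, which uses the random split points described in point 2); max_features=None is set explicitly below to match point 1)'s "use all features" behaviour, since the library's default is a feature subset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
et = ExtraTreesClassifier(n_estimators=200, max_features=None, random_state=0)

# Compare the generalization of the two ensembles with 5-fold cross-validation.
print("RF :", cross_val_score(rf, X, y, cv=5).mean())
print("ET :", cross_val_score(et, X, y, cv=5).mean())
```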
4.2 Totally Random Trees Embedding
Totally Random Trees Embedding (hereinafter TRTE) is an unsupervised method for transforming data. It maps a low-dimensional dataset into a high-dimensional space, so that the mapped data can be used more effectively by classification and regression models. We know that support vector machines use kernel methods to map low-dimensional data into a high-dimensional space; TRTE provides another way to do this.
During the data transformation, TRTE builds T decision trees in an RF-like manner to fit the data. Once the trees are built, the position of each data point in the dataset is determined by the leaf node it falls into in each of the T trees. For example, suppose we have 3 decision trees, each with 5 leaf nodes, and a data point $x$ falls into the 2nd leaf of the first tree, the 3rd leaf of the second tree, and the 5th leaf of the third tree. Then the feature encoding of $x$ is (0,1,0,0,0, 0,0,1,0,0, 0,0,0,0,1), a 15-dimensional high-dimensional feature vector. Spaces are added between the groups to emphasize the sub-encoding contributed by each of the three decision trees.
After mapping to the high-dimensional features, any classification or regression algorithm can be used for supervised learning.
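A sketch of this idea (my own illustration) with scikit-learn's RandomTreesEmbedding: each sample is encoded by the leaves it falls into across the trees, producing a sparse high-dimensional one-hot representation that can then feed any supervised model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 3 completely random trees with at most 5 leaves each, mirroring the example above:
# the output has at most 3 * 5 = 15 one-hot columns.
trte = RandomTreesEmbedding(n_estimators=3, max_leaf_nodes=5, random_state=0)
X_high = trte.fit_transform(X)        # sparse matrix of leaf indicators
print(X_high.shape)

# The transformed features can feed an ordinary supervised model.
model = make_pipeline(RandomTreesEmbedding(random_state=0), LogisticRegression(max_iter=1000))
model.fit(X, y)
```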
4.3 Isolation Forest
Isolation Forest (hereinafter IForest) is an anomaly detection method. It also uses an RF-like approach to detect outliers.
When building the T decision trees, IForest also randomly samples the training set, but the number of samples drawn does not need to be the same as in RF. In RF, the sampling set is as large as the training set, but IForest does not need to sample that much; in general, the number of samples drawn is far smaller than the size of the training set. Why? Because the goal is anomaly detection, and a small subset of the samples is generally enough to distinguish the anomalies.
When building each decision tree, IForest randomly selects a splitting feature and then randomly selects a splitting threshold for it. This also differs from RF.
In addition, IForest usually chooses a relatively small maximum tree depth max_depth, for the same reason as the small sample size: anomaly detection generally does not need large-scale decision trees.
To judge whether a test point $x$ is anomalous, we pass $x$ through the T decision trees and compute the depth $h_t(x)$ of the leaf node that $x$ reaches in each tree, from which the average depth $h(x)$ can be calculated. We then use the following formula to compute the anomaly score of the sample point $x$: $$s(x,m) = 2^{-\frac{h(x)}{c(m)}}$$
where m is the number of samples, and the expression for $c(m)$ is $$c(m) = 2\ln(m-1) + \xi - 2\frac{m-1}{m}$$ where $\xi$ is Euler's constant.
The value of $s(x,m)$ ranges over [0,1]; the closer it is to 1, the more likely the point is an anomaly.
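A hedged sketch (not from the original post) with scikit-learn's IsolationForest, which implements this scheme: small random sub-samples, random split features and thresholds, and an anomaly score derived from the average path length $h(x)$.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 2))      # bulk of the data
X_outliers = rng.uniform(-6.0, 6.0, size=(20, 2))   # scattered anomalies
X = np.vstack([X_normal, X_outliers])

iforest = IsolationForest(
    n_estimators=100,
    max_samples=256,    # sub-sample size, far smaller than the training set
    random_state=0,
)
iforest.fit(X)
scores = iforest.score_samples(X)   # lower (more negative) means more anomalous
labels = iforest.predict(X)         # -1 marks predicted anomalies, +1 normal points
```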
5. Random Forest Summary
That concludes the principles of the RF algorithm. As a highly parallelizable algorithm, RF is promising in the big data era. Here is a summary of the advantages and disadvantages of the standard random forest algorithm.
The main advantages of RF are:
1) Training can be highly parallelized, which gives it a speed advantage for training on large samples in the big data era. Personally, I think this is its most important advantage.
2) Because the features used to split tree nodes are chosen at random, the model can still be trained efficiently when the feature dimension of the samples is very high.
3) After training, the model can report the importance of each feature to the prediction (see the short sketch after this list).
4) Thanks to random sampling, the trained model has low variance and strong generalization ability.
5) Compared with boosting-series algorithms such as AdaBoost and GBDT, RF is relatively simple to implement.
6) It is insensitive to some features being missing.
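A short sketch of advantage 3) (my own illustration): after fitting, scikit-learn's forests expose a per-feature importance via the feature_importances_ attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# One importance value per feature; higher means a larger contribution to the splits.
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: importance {imp:.3f}")
```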
The main disadvantages of RF are:
1) On sample sets with a lot of noise, RF models are prone to overfitting.
2) Features with many distinct values (i.e., with many possible split points) tend to have a larger influence on RF's decisions, which can affect the quality of the fitted model.