1. The difference between RF (random forest) and GBDT
Similarities:
1) Both are ensembles made up of many trees.
2) The final prediction is determined by combining the outputs of multiple trees.
Differences:
1) The trees in a random forest can be either classification trees or regression trees, while GBDT consists only of regression trees.
2) The trees of a random forest can be built in parallel, whereas the trees of GBDT can only be built sequentially.
3) A random forest combines its trees by voting (or averaging), while GBDT sums the outputs of all its trees.
4) Random forests are not sensitive to outliers, while GBDT is sensitive to outliers.
5) Random forests improve performance by reducing the variance of the model, while GBDT improves performance by reducing the bias of the model.
6) Random forests do not require data preprocessing such as normalization, while GBDT requires normalized features. (A minimal comparison follows this list.)
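As a rough illustration of points 2) and 3), here is a minimal sketch assuming scikit-learn is installed; the dataset and hyperparameters are made up for the example, not taken from these notes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

# Random forest: trees are independent, so they can be built in parallel (n_jobs=-1);
# the prediction is the average (or vote) over all trees.
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0).fit(X, y)

# GBDT: trees are built one after another, each correcting the previous ones;
# the prediction is the shrunken sum of all tree outputs.
gbdt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0).fit(X, y)

print(rf.predict(X[:3]), gbdt.predict(X[:3]))
```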
2. Difference between classification tree and regression tree
1) A classification tree splits nodes using information gain or the gain ratio; each leaf node is assigned the majority category of its samples, which determines the predicted category of a test sample.
2) A regression tree splits nodes by minimizing the mean squared error; the mean of the samples in each leaf node is used as the regression prediction for a test sample. (A small sketch follows this list.)
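A minimal sketch assuming scikit-learn; the synthetic datasets only illustrate the two split criteria and prediction rules described above.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
Xr, yr = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# Classification tree: splits by information gain (entropy); a leaf predicts by majority vote.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(Xc, yc)

# Regression tree: splits by minimizing squared error; a leaf predicts the mean of its samples.
reg = DecisionTreeRegressor(criterion="squared_error", random_state=0).fit(Xr, yr)
```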
3. The core of GBDT
1) The core of GBDT is that each tree is fit to the residual of the sum of all previously learned trees; this residual is the amount which, when added to the current prediction, yields the true value. (A toy sketch follows.)
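A toy sketch of this residual-fitting idea, assuming scikit-learn and NumPy and a squared-error loss; it is a simplification, not the library implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=0.5, random_state=0)

learning_rate, n_trees = 0.1, 50
prediction = np.full_like(y, y.mean(), dtype=float)  # initial constant model
trees = []
for _ in range(n_trees):
    residual = y - prediction                        # what must be added to reach the true values
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    prediction += learning_rate * tree.predict(X)    # add the new tree's (shrunken) contribution
    trees.append(tree)
```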
4. The differences between XGBoost and GBDT
1) Traditional GBDT uses CART as the base learner; XGBoost also supports linear base learners, in which case XGBoost is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization terms.
2) The node-splitting criterion differs: GBDT uses the Gini index, while XGBoost uses a gain formula obtained from an optimized derivation of its objective.
3) Traditional GBDT uses only first-order derivative information in the optimization, while XGBoost applies a second-order Taylor expansion to the cost function and uses both the first and second derivatives.
4) XGBoost adds a regularization term to the cost function to control model complexity and reduce the risk of overfitting. The regularization term includes the number of leaf nodes in the tree and the L2 norm (sum of squares) of the scores output on each leaf node.
5) Shrinkage, equivalent to the learning rate (eta in XGBoost). At the end of each iteration, XGBoost multiplies the leaf weights by this coefficient, mainly to weaken the influence of each individual tree and leave more room for later trees to learn. (GBDT also has a learning rate.)
6) Column sampling. XGBoost borrows this practice from random forests; it not only helps prevent overfitting but also reduces computation.
7) Handling of missing values. For samples with missing feature values, XGBoost can automatically learn the splitting direction.
8) The XGBoost tool supports parallelism. Note that XGBoost's parallelism is not at the granularity of trees but at the granularity of features. One of the most time-consuming steps in building a decision tree is sorting the feature values (needed to determine the best split point). XGBoost pre-sorts the data before training and stores it in a block structure, which is reused in subsequent iterations to reduce computation. This block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the maximum gain is chosen for the split, so the gain computations for the different features can run in multiple threads. (Several of these points map to XGBoost parameters; see the sketch after this list.)
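A minimal sketch, assuming the xgboost Python package, that maps several of the points above onto XGBoost parameters; the data and parameter values are illustrative, not recommendations.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = (X[:, 0] > 0.5).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # missing values: XGBoost learns a default direction, point 7)

model = xgb.XGBClassifier(
    booster="gbtree",      # "gblinear" would give a regularized linear base learner, point 1)
    learning_rate=0.1,     # shrinkage / eta, point 5)
    reg_lambda=1.0,        # L2 regularization on leaf scores, point 4)
    colsample_bytree=0.8,  # column sampling borrowed from random forests, point 6)
    n_jobs=-1,             # feature-level parallelism, point 8)
    n_estimators=100,
)
model.fit(X, y)
```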
Why does XGBoost use a Taylor expansion, and where is the advantage?
XGBoost uses first- and second-order partial derivatives, and the second derivative helps the optimization converge faster and more accurately. Using a Taylor expansion to obtain a second-order form of the objective in the leaf values means the leaf splits can be optimized using only the derivative values computed on the input data, without depending on the specific form of the loss function. In essence this decouples the choice of loss function from the optimization of the model and its parameters. This decoupling increases the applicability of XGBoost: the loss function can be chosen on demand, so it can be used for classification as well as for regression.
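A minimal sketch of this decoupling, assuming the xgboost Python package: the training routine only needs the loss's first and second derivatives, supplied here as a custom objective (the squared-error example and its names are made up for illustration).

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)
dtrain = xgb.DMatrix(X, label=y)

def squared_error(preds, dtrain):
    """Custom loss expressed only through its first and second derivatives."""
    labels = dtrain.get_label()
    grad = preds - labels          # first derivative of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)     # second derivative
    return grad, hess

booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain, num_boost_round=50, obj=squared_error)
```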
5. How does GBDT set the conditions for a single tree to stop growing?
1) A minimum number of samples required to split a node
2) A maximum depth of the tree
3) A maximum number of leaf nodes
4) The loss satisfies a constraint, e.g., the improvement falls below a threshold (see the parameter sketch after this list)
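A minimal sketch assuming scikit-learn's GradientBoostingRegressor; the values are illustrative, and each parameter corresponds to one of the four conditions above.

```python
from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(
    min_samples_split=20,        # 1) minimum number of samples required to split a node
    max_depth=4,                 # 2) maximum depth of each tree
    max_leaf_nodes=31,           # 3) maximum number of leaf nodes
    min_impurity_decrease=1e-4,  # 4) stop splitting when the loss improvement is too small
)
```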
6. How does GBDT evaluate feature weights (importance)?
1) Compute the information gain of each feature on the training set, and use the ratio of each feature's information gain to the sum of all features' information gains as its weight (a brief example follows this list).
2) Borrow the idea of a voting mechanism: using the same GBDT parameters, train a model for each feature, count the number of correctly classified samples under that model, and use the ratio of each feature's correct-classification count to the sum of all correct-classification counts as its weight.
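A minimal sketch assuming scikit-learn: its built-in feature_importances_ is a gain/impurity-based importance normalized to sum to one, in the spirit of method 1).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)
print(gbdt.feature_importances_)  # one weight per feature, normalized to sum to 1
```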
7. When the number of samples increases, does GBDT's training time increase linearly?
No, because the time to generate a single decision tree, i.e., to find the loss-minimizing splits, is not linearly related to the number of samples n.
8. When the number of trees increases, does the training time increase linearly?
No, because the time needed to generate each tree differs, so the total training time is not simply proportional to the number of trees.
9. When the number of leaves per tree increases, does the training time increase linearly?
No, because the time complexity of generating a tree is not proportional to its number of leaf nodes.
10. What information is stored on each node?
An internal node stores the splitting feature and its split value; a leaf node stores the predicted probability of each category. (A small inspection example follows.)
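A minimal sketch assuming scikit-learn, inspecting what a fitted tree actually stores per node:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y).tree_

print(tree.feature)    # split feature index per node (-2 marks a leaf)
print(tree.threshold)  # split value per internal node
print(tree.value)      # per-node class distribution, from which leaf probabilities are derived
```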
11. How to prevent overfitting?
1) Increase the number of samples and remove noise.
2) Reduce the number of features, keeping only the important ones.
3) Subsample the training data: when building each tree, use a randomly selected subset of the samples rather than all of them.
4) Subsample the features, analogous to sample subsampling: each tree uses only a subset of the features when splitting (see the parameter sketch after this list).
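A minimal sketch assuming scikit-learn: points 3) and 4) correspond directly to row and column subsampling parameters.

```python
from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(
    subsample=0.8,     # 3) each tree is fit on a random 80% subset of the samples
    max_features=0.5,  # 4) each split considers only half of the features
)
```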
12. Which parts of GBDT can be parallelized?
1) Computing the negative gradient of each sample.
2) Selecting the best splitting feature and split point: the corresponding errors and means for the different features can be computed in parallel.
3) Updating the negative gradient of each sample.
4) In the final prediction step, accumulating the outputs of all previous trees for each sample.
13. If a malformed (severely unbalanced) tree is created during tree building, what harm does it bring, and how can it be prevented?
Add a constraint on tree imbalance during tree construction. The constraint can be user-defined; for example, penalize splits that send almost all samples to one child node and leave very few in the other. (A parameter sketch follows.)
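The answer describes a user-defined imbalance penalty; a simpler built-in approximation, sketched here assuming scikit-learn, is to require each child of a split to keep a minimum number or share of samples.

```python
from sklearn.ensemble import GradientBoostingRegressor

gbdt = GradientBoostingRegressor(
    min_samples_leaf=20,            # a split may not create a child with fewer than 20 samples
    min_weight_fraction_leaf=0.05,  # or: each child must keep at least 5% of the total sample weight
)
```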
RF, GBDT, and XGBoost: a compilation of common interview questions.