Using sklearn for Ensemble Learning: Practice

Series

  • Using sklearn for Ensemble Learning: Theory
  • Using sklearn for Ensemble Learning: Practice
Contents

1. The parameters of Random Forest and Gradient Tree Boosting in detail
2. How to tune parameters?
2.1 Tuning objective: balancing bias and variance
2.2 How parameters affect overall model performance
2.3 A simple scheme: greedy coordinate descent
2.3.1 Random Forest tuning case: Digit Recognizer
2.3.1.1 Tuning the process-influence parameters
2.3.1.2 Tuning the submodel-influence parameters
2.3.2 Gradient Tree Boosting tuning case: Hackathon3.x
2.3.2.1 Tuning the process-influence parameters
2.3.2.2 Tuning the submodel-influence parameters
2.3.2.3 Doubling back
2.4 Local optima (Tip: there is an easter egg here!)
3. Conclusion
4. References

1. The parameters of Random Forest and Gradient Tree Boosting in detail

In the sklearn.ensemble library, we find the Random Forest implementations for classification and regression, RandomForestClassifier and RandomForestRegressor, and the Gradient Tree Boosting implementations for classification and regression, GradientBoostingClassifier and GradientBoostingRegressor. Now that we have these models, how do we get started? Let me first go over the problems that often come up when using them:

  • The model seems well tuned, but the results are a little different from what I expected? -- The first step of model training is to set the goal; pushing hard in the wrong direction only takes you backwards.
  • I adjusted a parameter on intuition, but it had no effect, or even backfired? -- Once the goal is set, work out which parameters affect it, whether their influence is positive or negative, and how large it is.
  • Training feels nowhere near finished. Is sklearn just a toy for small data? -- Although sklearn is not designed for distributed computing, some strategies can still improve training efficiency.
  • Training has started, but what step is it on? -- Even once goals, performance, and efficiency are satisfied, we sometimes have further pursuits, such as logging the training process and computing the out-of-bag score (see the sketch below).
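
As a quick taste of those "additional" knobs, a minimal sketch, assuming sklearn's built-in digits data as a stand-in for a real dataset:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

# verbose=1 prints progress while fitting; oob_score=True computes the
# out-of-bag estimate as a free validation signal (requires bootstrap=True).
clf = RandomForestClassifier(n_estimators=100, oob_score=True, verbose=1)
clf.fit(X, y)
print('out-of-bag score:', clf.oob_score_)
```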

Summarizing these common problems, we can divide the model parameters into four classes: target, performance, efficiency, and additional. The listing below details the meaning of each parameter in the four models:

loss (Target) -- GradientBoostingClassifier, GradientBoostingRegressor
Loss function.
Classifier: ● 'exponential': the model becomes equivalent to AdaBoost ★'deviance': the same loss function as Logistic Regression
Regressor: ★'ls' ● 'lad' ● 'huber' ● 'quantile'

alpha (Target) -- GradientBoostingRegressor
When the loss function is 'huber' or 'quantile', alpha is the corresponding parameter of that loss function.

class_weight (Target) -- RandomForestClassifier
Class weights.

n_estimators (Performance) -- all four models
Number of submodels. ● int: the number ★10 (Random Forest default) ★100 (Gradient Tree Boosting default)

learning_rate (Performance) -- GradientBoostingClassifier, GradientBoostingRegressor
Learning rate (shrinkage).

criterion (Performance) -- RandomForestClassifier, RandomForestRegressor
Criterion used to decide whether a node keeps splitting. Classifier: ● 'entropy' ★'gini'. Regressor: ★'mse'

max_features (Performance) -- all four models
Maximum number of features considered when splitting a node. ● int: the number ● float: a fraction of all features ● 'auto': square root of the number of features ● 'sqrt': square root of the number of features ● 'log2': log2 of the number of features ● None: equal to the number of all features. Defaults: ★'auto' (Random Forest) ★None (Gradient Tree Boosting)

max_depth (Performance) -- all four models
Maximum tree depth; ignored if max_leaf_nodes is specified. ● int: the depth ● None: grow until every leaf is pure or holds fewer than min_samples_split samples. Defaults: ★None (Random Forest) ★3 (Gradient Tree Boosting)

min_samples_split (Performance) -- all four models
Minimum number of samples required to split a node. ● int: the number of samples ★2: default value

min_samples_leaf (Performance) -- all four models
Minimum number of samples required at a leaf node. ● int: the number of samples ★1: default value

min_weight_fraction_leaf (Performance) -- all four models
Minimum fraction of the total sample weight required at a leaf node. ● float: fraction of the total weight ★0: default value

max_leaf_nodes (Performance) -- all four models
Maximum number of leaf nodes. ● int: the number ★None: unlimited

bootstrap (Performance) -- RandomForestClassifier, RandomForestRegressor
Whether to draw bootstrap samples. ● False: every submodel sees the same samples, so the submodels are strongly correlated ★True: default value

subsample (Performance) -- GradientBoostingClassifier, GradientBoostingRegressor
Subsampling rate. ● float: the sampling rate ★1.0: default value

init (Performance) -- GradientBoostingClassifier, GradientBoostingRegressor
Initial submodel.

n_jobs (Efficiency) -- RandomForestClassifier, RandomForestRegressor
Number of parallel jobs. ● int: the number ● -1: one per CPU core ★1: default value

warm_start (Efficiency) -- all four models
Whether to warm-start: if True, the next fit appends trees to the existing ensemble. ● bool ★False: default value

presort (Efficiency) -- GradientBoostingClassifier, GradientBoostingRegressor
Whether to presort the data; presorting speeds up the search for the best split point but does not apply to sparse data. ● bool ★'auto': presort dense data, do not presort sparse data

oob_score (Additional) -- RandomForestClassifier, RandomForestRegressor
Whether to compute the out-of-bag score. ★False: default value

random_state (Additional) -- all four models
Random state (seed) object.

verbose (Additional) -- all four models
Verbosity of logging. ● int: the verbosity level ★0: no output during training ● 1: occasional output ● >1: output for every submodel

★: default value

It is not difficult to see that the bagging-based Random Forest model and the boosting-based Gradient Tree Boosting model share many parameters, yet the default values of some differ considerably. In "Using sklearn for Ensemble Learning: Theory" we took a first look at the two ensemble techniques. Every submodel of a Random Forest has low bias, and training the overall model is a matter of reducing variance; hence it needs only a modest number of submodels (n_estimators defaults to 10), the submodels must not be weak (max_depth defaults to None), and lowering the correlation between submodels further reduces the overall variance (max_features defaults to 'auto'). Conversely, every submodel of Gradient Tree Boosting has low variance, and training the overall model is a matter of reducing bias; hence it needs many submodels (n_estimators defaults to 100), the submodels must be weak (max_depth defaults to 3), and lowering the correlation between submodels does little to reduce the overall variance (max_features defaults to None).
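
A small sketch for checking these defaults on your own installation; note that the numbers quoted in this post are from the sklearn release it was written against, and newer releases have changed some of them:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Print the defaults discussed above. The exact values depend on your sklearn
# version: e.g. RandomForestClassifier's n_estimators default was 10 when this
# post was written but is 100 in modern releases, and 'auto' has been removed.
for model in (RandomForestClassifier(), GradientBoostingClassifier()):
    params = model.get_params()
    print(type(model).__name__,
          {k: params.get(k) for k in ('n_estimators', 'max_depth', 'max_features')})
```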

 

2. How to tune parameters?

Smart readers will ask: "Blogger, even if you list the meaning of every parameter, that is still just armchair strategy! I still don't know what to actually do!"

The point of classifying the parameters is to narrow down the tuning scope. First, specify the training goal and set the target-class parameters. Next, depending on the dataset size, consider whether to adopt strategies that improve training efficiency; otherwise a single training run may grind on for three days and three nights. Then comes the most important step: tuning the parameters that affect overall model performance.

2.1 Tuning objective: balancing bias and variance

As discussed in "Using sklearn for Ensemble Learning: Theory", bias and variance jointly determine model performance (accuracy). The goal of tuning is nothing more than reaching a harmony of bias and variance for the overall model! Going further, the performance parameters fall into two groups: process-influence and submodel-influence. With the submodels held fixed, some parameters change model performance by altering the training process, for example the number of submodels (n_estimators) and the learning rate (learning_rate). Other parameters change model performance by altering the submodels themselves, for example the maximum depth (max_depth) and the split criterion (criterion). Because bagging's training process aims at reducing variance and boosting's aims at reducing bias, the process-influence parameters can move overall performance substantially; on that basis, we then fine-tune the submodel-influence parameters to squeeze out further gains (the grouping is written out below).
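
For reference, the grouping just described can be written out directly; this is the post's framing for tuning order, not anything sklearn itself enforces:

```python
# Process-influence parameters: change how the ensemble is trained.
process_influence = {
    'Random Forest': ['n_estimators'],
    'Gradient Tree Boosting': ['n_estimators', 'learning_rate'],
}

# Submodel-influence parameters: change the individual trees.
submodel_influence = [
    'max_depth', 'max_leaf_nodes', 'criterion', 'max_features',
    'min_samples_split', 'min_samples_leaf', 'min_weight_fraction_leaf',
]
```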

2.2 How parameters affect overall model performance

Assume the model is a multivariate function F whose output is the model's accuracy. We can fix the other parameters and analyze how a single parameter affects overall model performance: is the influence positive or negative, and is it monotonic?

For Random Forest, increasing the number of submodels (n_estimators) markedly lowers the variance of the overall model without affecting the bias or variance of any submodel. Accuracy rises as submodels are added, but since only the second term of the overall-variance formula shrinks, the improvement has an upper limit. The split criterion (criterion) affects accuracy differently in different scenarios, so this parameter needs to be chosen flexibly in practice. Adjusting either the maximum number of leaf nodes (max_leaf_nodes) or the maximum depth (max_depth) tunes the tree structure at coarse granularity: more leaves or deeper trees mean submodels with lower bias and higher variance. Meanwhile, the minimum samples to split (min_samples_split), the minimum samples per leaf (min_samples_leaf), and the minimum leaf weight fraction (min_weight_fraction_leaf) tune the tree structure at finer granularity: the fewer samples required to split or to form a leaf, the more complex the submodel. Bootstrap resampling is generally kept on to lower the correlation between submodels and thereby the overall variance. Moderately reducing max_features injects further randomness into the submodels and lowers their correlation, but reducing it blindly backfires: with fewer candidate features per split, submodel bias rises. A figure in the original post visualizes how these parameters affect the overall performance of Random Forest; the effect of n_estimators can also be checked directly, as sketched below.
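
The ceiling effect of n_estimators can be eyeballed with a validation curve; a sketch, assuming sklearn's built-in digits data as a stand-in:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = load_digits(return_X_y=True)

# Accuracy should climb as submodels are added and then flatten out:
# extra submodels only shrink the variance term, so the gain has a ceiling.
param_range = np.arange(10, 201, 20)
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name='n_estimators', param_range=param_range, cv=5)
for n, score in zip(param_range, test_scores.mean(axis=1)):
    print(f'n_estimators={n:3d}  mean CV accuracy={score:.4f}')
```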

For Gradient Tree Boosting, the number of submodels (n_estimators) and the learning rate (learning_rate) must be tuned jointly to push accuracy as high as possible. Imagine scheme A takes four steps of 3 meters each while scheme B takes five steps of 2 meters each: which lands closer to a finish line 10 meters away? As before, the more complex the submodels, the lower the bias and the higher the variance of the overall model, so the structural parameters max_leaf_nodes and max_depth behave just as they do for Random Forest. Lowering the subsampling rate (subsample), like lowering max_features, reduces the correlation between submodels and hence the overall variance, but once the rate drops too far, submodel bias grows and overall accuracy falls. Do you still remember the initial model (init)? Each loss function defines its own initial model, generally a weak one that predicts from "average" behavior; a custom initial model is supported, but in most cases the default should be kept. A figure in the original post visualizes how these parameters affect the overall performance of Gradient Tree Boosting; the n_estimators/learning_rate trade-off can be probed as sketched below.
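
The trade-off can be probed by holding the product of the two parameters roughly constant while varying the split between them; a sketch on stand-in data:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Fewer large steps versus more small steps: n_estimators * learning_rate is
# held roughly constant while the split between the two parameters varies.
for n, lr in [(50, 0.2), (100, 0.1), (200, 0.05)]:
    score = cross_val_score(
        GradientBoostingClassifier(n_estimators=n, learning_rate=lr,
                                   random_state=0),
        X, y, cv=3).mean()
    print(f'n_estimators={n:<3}  learning_rate={lr:<4}  accuracy={score:.4f}')
```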

2.3 A simple scheme: greedy coordinate descent

By now we finally know which parameters need tuning and, for any single parameter, how to move it to improve performance. But the model function F is not one-dimensional: the parameters interact, and reaching the global optimum requires tuning them together. So why not just hand everything to a tuning algorithm such as grid search? On a small dataset we can still afford to be that capricious, but the number of parameter combinations explodes, and on a large dataset perhaps my grandson will live to see the training finish. In fact, grid search does not necessarily find the global optimum either, and other researchers have instead approached tuning from the angle of solving an optimization problem.

Coordinate descent is an optimization algorithm whose biggest advantage is that it does not need the gradient of the objective function. The simple method we most readily come up with resembles coordinate descent, with one difference: instead of cycling through the parameters in a fixed order, it selects, at each round, the parameter currently having the greatest impact on overall model performance. Since that impact changes dynamically, every round performs a line search along each remaining coordinate: first find the parameters that can still improve overall performance, then require the improvement to be monotonic or approximately so. This guarantees that a selected parameter's positive effect is genuine rather than an accident of training randomness, which produces small, non-monotonic fluctuations. Finally, fix the most influential of the screened parameters at its best value.

But can overall model performance even be quantified, so that parameters can be compared by their impact on it? Indeed, we have no exact way to measure it; we can only approximate it with cross-validation, which is itself random. If we take the mean accuracy over the validation folds as the overall model accuracy, we must also watch the coefficient of variation across folds: if it is too large, the mean is not a trustworthy stand-in for overall performance. In the case studies below, "overall model performance" means this mean accuracy; please keep the caveat in mind.
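
Putting the pieces together, here is a minimal sketch of the greedy coordinate-descent loop described above, assuming toy grids and sklearn's digits data in place of a real problem:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Candidate values per coordinate (toy grids, for illustration only).
grids = {
    'n_estimators': [10, 50, 100, 200],
    'max_features': ['sqrt', 'log2', None],
    'min_samples_leaf': [1, 2, 4],
}

def cv_mean(params):
    """Mean CV accuracy as the proxy for overall model performance.
    Before trusting it, one can also inspect scores.std() / scores.mean(),
    the coefficient of variation mentioned above."""
    scores = cross_val_score(RandomForestClassifier(random_state=0, **params),
                             X, y, cv=5)
    return scores.mean()

best_params, best_score = {}, cv_mean({})
while grids:
    # Line-search every remaining coordinate and record its best value.
    round_best = {}
    for name, values in grids.items():
        scored = [(cv_mean({**best_params, name: v}), v) for v in values]
        round_best[name] = max(scored, key=lambda t: t[0])
    # Greedy step: fix the single coordinate that improves the score the most.
    name, (score, value) = max(round_best.items(), key=lambda kv: kv[1][0])
    if score <= best_score:      # no remaining coordinate helps: stop early
        break
    best_params[name], best_score = value, score
    del grids[name]              # this coordinate is now fixed

print(best_params, round(best_score, 4))
```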

2.3.1 Random Forest tuning case: Digit Recognizer

Here we take Digit Recognizer, one of the "101" getting-started competitions on Kaggle, as the case for demonstrating the tuning of RandomForestClassifier. Of course, we should not set parameters by hand and retrain over and over: with the GridSearchCV class from the sklearn.grid_search library, we can not only sweep parameters automatically but also cross-validate every parameter combination to compute its mean accuracy.

2.3.1.1 Tuning the process-influence parameters

First we tune the process-influence parameters, and Random Forest has only one: the number of submodels (n_estimators). Its default value is 10; taking that as a starting point, we sweep it in steps of 10 over the range 1 to 201. The results appear in a figure in the original post; the sweep itself is sketched below.
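
A sketch of that sweep, assuming the modern import path (the post uses the old sklearn.grid_search module) and sklearn's built-in digits data as a stand-in for the competition data:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
# The post imports GridSearchCV from sklearn.grid_search; in modern sklearn
# the class lives in sklearn.model_selection.
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': list(range(1, 202, 10))},
                      cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```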

2.3.1.2 Tuning the submodel-influence parameters

The minimum number of samples per leaf (min_samples_leaf) is swept in steps of 1 over the range 1 to 10; the tuning results appear in a figure in the original post.

Another parameter is swept in steps of 100 over the range 2500 to 3400; the results again appear in a figure in the original post. The round of tuning ends with the following summary:

Parameter            Accuracy at default    Best accuracy after tuning    Increase
criterion            0.964023809524         0.964023809524                0
max_features         0.963380952381         0.964428571429                0.00104762
max_depth            jitter
min_samples_split    0.963976190476         0.963976190476                0
min_samples_leaf     0.963595238095         0.963595238095                0
max_leaf_nodes       jitter

Next, we fix the maximum number of features at split (max_features) to 38 and submit the result on Kaggle: 0.96671, which is 0.00171 better than before this round of tuning, an improvement roughly in line with expectations.

Do we need another round of coordinate descent? Generally there is little point: in this round two parameters merely jittered, and tuning the others brought no improvement. It comes down to a hard truth: data and features determine the upper limit of machine learning, and models and algorithms merely approach that limit. In the DR competition, mining more valuable features, or using a model with built-in feature learning, is a better bet than squeezing RandomForestClassifier further (image classification is better suited to neural networks). Still, we can confidently say that greedy coordinate descent took us closer to the optimal solution than plain grid search would have.

2.3.2 Gradient Tree Boosting tuning case: Hackathon3.x

Here we take Hackathon3.x on Analytics Vidhya as the case for demonstrating the tuning of GradientBoostingClassifier.

2.3.2.1 Tuning the process-influence parameters

GradientBoostingClassifier's process-influence parameters are the number of submodels (n_estimators) and the learning rate (learning_rate). We could search for the joint optimum of both with GridSearchCV, but slow down! There is a big trap here: the performance gains these two parameters bring are not evenly distributed, being relatively large early in tuning and smaller later, so locking both at their "optimum" right away makes it easy to fall into a local optimum. A figure in the original post shows the search results for the pair; a staged search that sidesteps the trap is sketched below.
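
A minimal sketch of the staged approach, assuming illustrative grids and sklearn's digits data in place of the Hackathon3.x data:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

# Stage 1: pin a moderate learning rate and search only n_estimators,
# deliberately NOT optimizing the pair jointly up front (the trap above).
stage1 = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, random_state=0),
    param_grid={'n_estimators': [40, 60, 80, 100]}, cv=3)
stage1.fit(X, y)
print('stage 1:', stage1.best_params_)
# Stage 2 would then tune the submodel-influence parameters with
# n_estimators and learning_rate held fixed.
```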

2.3.2.2 Tuning the submodel-influence parameters

After this round of tuning, the parameters settle at the following values:

Number of submodels (n_estimators)                       60
Learning rate (learning_rate)                            0.1
Minimum number of samples per leaf (min_samples_leaf)    12
Maximum depth (max_depth)                                4
Subsampling rate (subsample)                             0.77
Maximum number of features at split (max_features)       10

By now the overall model performance reaches 0.8313, about 0.006 higher than the baseline (0.8253).

2.3.2.3 Doubling back

Remember the n_estimators and learning_rate we deliberately shelved at the beginning? Now we can circle back and tune them. The method: scale the number of submodels up by some factor while scaling the learning rate down correspondingly (sketched below). This step improves overall model performance by about 0.002.
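
A sketch of the rescaling trick, assuming stand-in data; the case study's other tuned parameters are omitted for brevity:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# "Doubling back": scale n_estimators up by k and learning_rate down by k,
# keeping their product roughly constant, and check whether accuracy improves.
base = dict(n_estimators=60, learning_rate=0.1)  # values from the case study
for k in (1, 2, 4):
    params = dict(n_estimators=base['n_estimators'] * k,
                  learning_rate=base['learning_rate'] / k)
    score = cross_val_score(
        GradientBoostingClassifier(random_state=0, **params), X, y, cv=3).mean()
    print(params, round(score, 4))
```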

2.4 Local optima

Some empirical rules are widely used in tuning today. Aarshay Jain has summarized a tuning procedure for Gradient Tree Boosting. Its core idea: first tune the process-influence parameters, since they affect overall model performance the most; then, drawing on experience, tune the remaining parameters in order of their impact on overall performance. The crux of the method is thus ranking the parameters by influence before tuning, then tuning them in that order. How should influence be measured? Based on experience, Aarshay offers his view: the maximum number of leaf nodes (max_leaf_nodes) and the maximum depth (max_depth) matter more than the minimum samples to split (min_samples_split), the minimum samples per leaf (min_samples_leaf), and the minimum leaf weight fraction (min_weight_fraction_leaf), while max_features matters least.

The biggest difference between Aarshay's method and greedy coordinate descent is that the former fixes the parameter order by estimated influence before tuning begins, while the latter is a "natural" greedy process. Remember the n_estimators/learning_rate trap from section 2.3.2.1? In just the same way, greedy coordinate descent easily falls into local optima. For Random Forest tuning this matters less: once the number of submodels is tuned to its best, often only low-impact parameters such as max_features remain adjustable. When tuning Gradient Tree Boosting, however, hitting a local optimum is far more likely.

Aarshay also ran a tuning experiment on Hackathon3.x. Because of differences in feature extraction, his overall model performance differs from this article's by about 0.007 even at the same parameter values (alas, I must say it again: feature engineering really matters). First, for the process-influence parameters, Aarshay's method and greedy coordinate descent both arrive at 60 submodels with a learning rate of 0.1. Then Aarshay tuned the remaining parameters in his predefined order of influence. After fixing the submodel-influence parameters, his method had improved overall performance by about 0.008, slightly better than greedy coordinate descent's 0.006. But after revisiting the number of submodels and the learning rate, Aarshay gained about another 0.01, far better than greedy coordinate descent's 0.002.

So where did Aarshay's method and greedy coordinate descent part ways? The former tunes max_depth before min_samples_leaf; the latter does the opposite!

"Ah! Ah! Ah! Young hero, please stay your hand! Why, then, does this blog introduce this greedy, 'useless' coordinate descent method at all?" First, the method is the one intuition suggests: people spend a long time working out what the model's parameters mean and how they affect overall performance, and then many naturally reach for the most intuitive, greedy, coordinate-wise scheme. Working through an example makes its limitations easy to remember. And beyond serving as a cautionary tale, is greedy coordinate descent really useless? Not quite: Aarshay's method can itself borrow from it. Even when tuning parameters in a fixed order, we should still analyze each parameter's "dynamic" influence the way greedy coordinate descent does; if that influence merely jitters and is dispensable, the parameter need not be tuned at all.

 

3. Conclusion

In this post I spent most of the ink experimenting with, and illustrating, a flawed scheme. A large share of the methods and tricks used in data mining work have not been rigorously proved, and newcomers (myself included) can easily mistake the field for metaphysics. Yet even without rigorous proof, we can still test and analyze a method, especially by comparing it against existing ones, and thereby arrive at a reasonably rational argument.

Finally, do you have any unique tuning methods of your own? Don't be stingy: unleash all your signature moves on me in the comments, and criticize ruthlessly!

 

4. References

 
