In the post summarizing the principles of the AdaBoost algorithm in ensemble learning, we went over how AdaBoost works. Here we look at AdaBoost from a practical point of view and summarize how to use the AdaBoost classes in the scikit-learn library, focusing on the points that need attention.
1. AdaBoost Class Library Overview
The AdaBoost classes in scikit-learn are AdaBoostClassifier and AdaBoostRegressor. As the names suggest, AdaBoostClassifier is used for classification and AdaBoostRegressor for regression.
AdaBoostClassifier implements two AdaBoost classification algorithms, SAMME and SAMME.R. AdaBoostRegressor implements the AdaBoost regression algorithm discussed in the principles post, namely AdaBoost.R2.
When tuning AdaBoost we mainly need to adjust two parts: the first is the parameters of the AdaBoost framework itself, and the second is the parameters of the weak learner we choose. The two complement each other. Below we look at the two AdaBoost classes, AdaBoostClassifier and AdaBoostRegressor, from these two angles.
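For orientation, here is a minimal sketch (not from the original post) of how the two classes are typically instantiated with their default CART weak learners; the training data names in the comments are placeholders you would supply yourself:
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

# With no arguments, AdaBoostClassifier wraps a shallow DecisionTreeClassifier
# and AdaBoostRegressor wraps a DecisionTreeRegressor by default.
clf = AdaBoostClassifier()
reg = AdaBoostRegressor()
# clf.fit(X_train, y_train)       # X_train, y_train: placeholder classification data
# reg.fit(X_train_r, y_train_r)   # placeholder regression data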
2. AdaBoostClassifier and AdaBoostRegressor Framework Parameters
Let's first look at the framework parameters of AdaBoostClassifier and AdaBoostRegressor. Most of the framework parameters of the two classes are the same, so we discuss them together and point out where the two classes differ.
1) base_estimator: Present in both AdaBoostClassifier and AdaBoostRegressor; this is our weak classifier or weak regressor. In theory you can choose any classification or regression learner, but it must support sample weights. We commonly use a CART decision tree or a neural network (MLP). The default is a decision tree: AdaBoostClassifier uses the CART classification tree DecisionTreeClassifier by default, and AdaBoostRegressor uses the CART regression tree DecisionTreeRegressor by default. Another point to be aware of: if the algorithm chosen for AdaBoostClassifier is SAMME.R, then the weak classifier must also support probability predictions, that is, in scikit-learn the weak learner must provide a predict_proba method in addition to predict.
2) algorithm: Only AdaBoostClassifier has this parameter. The reason is that scikit-learn implements two AdaBoost classification algorithms, SAMME and SAMME.R. The main difference between the two is how the weight of a weak learner is measured: SAMME uses the extension of the binary-classification AdaBoost algorithm from the principles post, that is, it weights the weak learner by its classification performance on the sample set, while SAMME.R weights the weak learner by the predicted class probabilities on the sample set. Since SAMME.R uses continuous probability values, it generally iterates faster than SAMME, so the default value of algorithm in AdaBoostClassifier is SAMME.R. We generally use the default SAMME.R, but note that when SAMME.R is used, the weak classifier parameter base_estimator must be restricted to classifiers that support probability predictions. The SAMME algorithm has no such restriction.
3) loss: Only AdaBoostRegressor has this parameter; it is needed by the AdaBoost.R2 algorithm. There are three choices, linear 'linear', square 'square' and exponential 'exponential'. The default is linear, which is generally sufficient unless you suspect this parameter is causing a poor fit. The meaning of this value was also discussed in the principles post; it corresponds to how we measure the error of the i-th sample under the k-th weak learner: for the linear error, $e_{ki}= \frac{|y_i-G_k(x_i)|}{E_k}$; for the squared error, $e_{ki}= \frac{(y_i-G_k(x_i))^2}{E_k^2}$; and for the exponential error, $e_{ki}= 1-\exp\left(\frac{-|y_i-G_k(x_i)|}{E_k}\right)$, where $E_k$ is the maximum error on the training set: $E_k= \max|y_i-G_k(x_i)|,\; i=1,2,\ldots,m$.
4) n_estimators: Present in both AdaBoostClassifier and AdaBoostRegressor; it is the maximum number of iterations of our weak learners, or equivalently the maximum number of weak learners. Generally speaking, if n_estimators is too small the model tends to underfit, and if it is too large the model tends to overfit, so a moderate value is usually chosen. The default is 50. In practice, we usually tune n_estimators together with the parameter described next, learning_rate.
5) learning_rate: Present in both AdaBoostClassifier and AdaBoostRegressor; it is the weight shrinkage coefficient $\nu$ of each weak learner. As discussed in the regularization part of the principles post, with regularization the iteration formula of our strong learner becomes $f_{k}(x) = f_{k-1}(x) + \nu\alpha_k G_k(x)$, where the range of $\nu$ is $0 < \nu \leq 1$. To reach the same fit on the training set, a smaller $\nu$ means we need more weak learner iterations. Usually we use the step size and the maximum number of iterations together to control the fitting of the algorithm, so the two parameters n_estimators and learning_rate should be tuned jointly. In general you can start from a smaller $\nu$; the default is 1. A small sketch illustrating these framework parameters follows this list.
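The following sketch (not from the original post) simply shows the framework parameters above being set explicitly on both classes; the values are arbitrary examples, and note that newer scikit-learn versions rename base_estimator to estimator:
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=2),  # weak classifier
    algorithm="SAMME.R",      # requires predict_proba on the weak classifier
    n_estimators=100,         # maximum number of weak learners
    learning_rate=0.5)        # shrinkage coefficient nu, tuned jointly with n_estimators

reg = AdaBoostRegressor(
    base_estimator=DecisionTreeRegressor(max_depth=3),
    loss="linear",            # AdaBoost.R2 error measure: linear / square / exponential
    n_estimators=100,
    learning_rate=0.5)
# clf.fit(X, y) and reg.fit(X_r, y_r) would then be called on your own data.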
3. AdaBoostClassifier and AdaBoostRegressor Weak Learner Parameters
Here we discuss the weak learner parameters of AdaBoostClassifier and AdaBoostRegressor. Since different weak learners can be used, the corresponding weak learner parameters differ. Here we only discuss the parameters of the default decision tree weak learners, that is, the CART classification tree DecisionTreeClassifier and the CART regression tree DecisionTreeRegressor.
The parameters of DecisionTreeClassifier and DecisionTreeRegressor are basically similar, and they were explained in detail in the post summarizing the use of the scikit-learn decision tree class library. Here we only pick out the most important parameters that deserve attention and go through them again:
1) max_features, the maximum number of features considered when splitting: Many types of values can be used. The default is "None", which means all features are considered when splitting; "log2" means at most $\log_2 N$ features are considered at a split; "sqrt" or "auto" means at most $\sqrt{N}$ features are considered. If it is an integer, it is the absolute number of features considered; if it is a float, it is the fraction of features considered, i.e. the (rounded) fraction times N, where N is the total number of features of the sample. In general, if the sample does not have many features, say fewer than 50, the default "None" is fine; if the number of features is very large, we can flexibly use the other values just described to control the maximum number of features considered at a split and thus control the tree-building time.
2) max_depth, the maximum depth of the decision tree: By default this can be left unspecified; if it is not specified, the depth of the subtrees is not limited when the tree is built. In general, you can ignore this value when there are few samples or features. If the model has a large sample size and many features, it is recommended to limit the maximum depth; the best value depends on the distribution of the data. Commonly used values lie between 10 and 100.
3) min_samples_split, the minimum number of samples required to split an internal node: This value restricts the condition under which a subtree continues to be split. If a node has fewer samples than min_samples_split, it will not try to select an optimal feature to split on. The default is 2. If the sample size is small, you do not need to worry about this value; if the sample size is very large, it is recommended to increase it.
4) min_samples_leaf, the minimum number of samples in a leaf node: This value limits the minimum number of samples in a leaf; if a leaf has fewer samples than this, it is pruned together with its sibling nodes. The default is 1. You can enter an integer for the minimum number of samples, or a float for the minimum fraction of the total number of samples. If the sample size is small, you do not need to worry about this value; if the sample size is very large, it is recommended to increase it.
5) min_weight_fraction_leaf, the minimum weighted fraction of samples in a leaf node: This value limits the minimum of the sum of the sample weights in a leaf node; if a leaf falls below this value, it is pruned together with its sibling nodes. The default is 0, which means sample weights are not considered. In general, if many samples have missing values, or if the class distribution of the classification tree samples is highly skewed, sample weights will be introduced, and then we should pay attention to this value.
6) max_leaf_nodes, the maximum number of leaf nodes: Limiting the maximum number of leaf nodes can prevent overfitting. The default is "None", that is, the number of leaf nodes is not limited. If a limit is set, the algorithm builds the optimal decision tree within that number of leaf nodes. If there are not many features, you can ignore this value; but if there are many features, it can be restricted, and a specific value can be obtained by cross-validation. A sketch combining these weak learner parameters with the AdaBoost wrapper follows this list.
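To make the connection between the weak learner parameters and the AdaBoost framework concrete, here is a small sketch (not from the original post); the numeric values are illustrative examples, not tuned recommendations:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

weak_learner = DecisionTreeClassifier(
    max_features=None,            # consider all features at each split
    max_depth=3,                  # limit the depth of each weak tree
    min_samples_split=20,         # do not split nodes with fewer than 20 samples
    min_samples_leaf=5,           # each leaf must keep at least 5 samples
    min_weight_fraction_leaf=0.0,
    max_leaf_nodes=None)          # no limit on the number of leaves

bdt = AdaBoostClassifier(weak_learner, algorithm="SAMME.R",
                         n_estimators=100, learning_rate=0.5)
# bdt.fit(X, y)   # X, y: your own training data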
4. AdaBoostClassifier in Practice
Here we use a concrete example to illustrate the usage of AdaBoostClassifier.
First we load the required libraries:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles
Then we generate some random data for binary classification. If you are not familiar with how to generate random data, another post on random data generation for machine learning gives a more detailed introduction.
# Generate a 2-D Gaussian distribution split into two classes by quantiles: 500 samples, 2 features, covariance coefficient 2
X1, y1 = make_gaussian_quantiles(cov=2.0, n_samples=500, n_features=2, n_classes=2, random_state=1)
# Generate another 2-D Gaussian split into two classes by quantiles: 400 samples, 2 features with mean 3, covariance coefficient 1.5
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5, n_samples=400, n_features=2, n_classes=2, random_state=1)
# Merge the two sets of data into one
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))
Let's visualize our classification data; it has two features and two output classes, distinguished by color.
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)
The output is:
You can see that the data is somewhat mixed together. We now use AdaBoost based on decision trees to fit and classify it.
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                         algorithm="SAMME", n_estimators=200, learning_rate=0.8)
bdt.fit(X, y)
Here we choose the SAMME algorithm, at most 200 weak classifiers, and a step size of 0.8. In practice you may need to select the best parameters by cross-validation (a grid search sketch is given at the end of this section). After fitting, we use a grid to look at the regions it has fitted.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y)
plt.show()
The graph of the output is as follows:
As you can see, AdaBoost fits the data well. Now let's look at the fitting score:
print "Score:", bdt.score(X, y)
The output is:
Score: 0.913333333333
That is to say, the fit on the training set data scores well. Of course, a high score is not necessarily good, because it may be a sign of overfitting.
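One way to check this, not done in the original post, is to hold out part of the data and compare the training and test scores; a hedged sketch:
# Sketch: hold out a test set to see whether the high training score generalizes.
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
bdt_check = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                               algorithm="SAMME", n_estimators=200, learning_rate=0.8)
bdt_check.fit(X_tr, y_tr)
print("train score: %f" % bdt_check.score(X_tr, y_tr))
print("test score: %f" % bdt_check.score(X_te, y_te))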
Now let's increase the maximum number of weak classifiers from 200 to 300 and look at the fitting score again.
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                         algorithm="SAMME", n_estimators=300, learning_rate=0.8)
bdt.fit(X, y)
print "Score:", bdt.score(X, y)
The output at this time is:
Score: 0.962222222222
This confirms what we said earlier: the more weak classifiers there are, the better the fit, and of course the easier it is to overfit.
Now let's reduce the step size from 0.8 to 0.5 and look at the fitting score again.
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                         algorithm="SAMME", n_estimators=300, learning_rate=0.5)
bdt.fit(X, y)
print "Score:", bdt.score(X, y)
The output at this time is:
Score: 0.894444444444
With the same weak classifiers, the fitting effect decreases when the step size is reduced.
Finally, let's look at the case where the number of weak classifiers is 700 and the step size is 0.7:
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                         algorithm="SAMME", n_estimators=600, learning_rate=0.7)
bdt.fit(X, y)
print "Score:", bdt.score(X, y)
The output at this time is:
Score: 0.961111111111
At this point the fitting score is about the same as with our earlier 300 weak classifiers and 0.8 step size. That is to say, in our example, if the step size drops from 0.8 to 0.7, the number of weak classifiers has to be increased from 300 to 700 to achieve a similar fitting effect.
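The experiments above varied n_estimators and learning_rate by hand; as mentioned earlier, a cross-validated grid search can automate this choice. A small sketch (not from the original post, with an illustrative parameter grid) might look like this; in very old scikit-learn releases GridSearchCV lives in sklearn.grid_search instead of sklearn.model_selection:
# Sketch: cross-validated search over the AdaBoost framework parameters.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.5, 0.7, 0.8, 1.0],
}
search = GridSearchCV(
    AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
                       algorithm="SAMME"),
    param_grid, cv=5)
search.fit(X, y)
print("best parameters: %s" % search.best_params_)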
The above is a summary of using the scikit-learn AdaBoost class library; I hope it helps.
(Reprinting is welcome; please indicate the source. Comments and questions are welcome at [email protected].)