Sklearn Ensemble Methods (Part I)

Source: Internet
Author: User

1.11 Ensemble Methods

The purpose of ensemble methods is to combine the predictions of several base estimators in order to improve the generalization ability and robustness of a single model.

▲ Two families of ensemble methods:

• Averaging methods: the driving principle is to build several independent prediction models and average their predictions. Because the combined estimator mainly reduces variance, it is usually better than any single base estimator. Examples: Bagging, Random Forests, and so on.

• Boosting methods: the main purpose of this family is to reduce bias. Base estimators are built sequentially, and a strong learner is produced from a set of weak learners. Examples: AdaBoost, Gradient Tree Boosting, and so on. (A minimal sketch contrasting the two families follows.)
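
The following sketch (not from the original text; the make_classification toy data set, the 50-estimator ensembles, and the depth-1 trees are illustrative assumptions) contrasts an averaging ensemble with a boosting ensemble:

# Minimal sketch: an averaging ensemble (bagging of deep trees, variance
# reduction) versus a boosting ensemble (sequential shallow trees, bias
# reduction). Data set and parameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Averaging: independent, fully grown trees whose predictions are combined.
averaging = BaggingClassifier(DecisionTreeClassifier(),
                              n_estimators=50, random_state=0)

# Boosting: a sequence of weak learners (decision stumps) built one after
# another, each focusing on the errors of its predecessors.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0)

print("bagging :", cross_val_score(averaging, X, y).mean())
print("boosting:", cross_val_score(boosting, X, y).mean())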

1.11.1 Bagging Meta-Estimator (Bagging meta-estimator)

▲ This family of algorithms draws several random subsets from the original data set, trains a base learner on each subset, and then aggregates the individual learners' results to obtain the final prediction. These methods mainly aim to reduce variance. In most cases, bagging improves on a single model in a simple way, without requiring any change to the underlying base algorithm. Because it reduces overfitting, bagging works well when integrating strong and complex models, whereas boosting performs best when integrating weak learners.

▲ Depending on how the random training subsets are drawn, the bagging family can be divided into different variants (a usage sketch follows this list):

Pasting: each training subset is drawn as a random subset of the samples, without replacement.

Bagging: each training subset is drawn from the samples with replacement (bootstrap sampling).

Random Subspaces: each training subset is drawn as a random subset of the features.

Random Patches: each training subset is drawn as a random subset of both the samples and the features.
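
The following sketch (the DecisionTreeClassifier base estimator and the 0.5 fractions are illustrative assumptions, not from the original text) shows how these four strategies map onto BaggingClassifier parameters:

# Sketch: mapping the four subset-drawing strategies to BaggingClassifier
# parameters. Base estimator and subset fractions are illustrative.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier()

# Pasting: random subsets of the samples, drawn WITHOUT replacement.
pasting = BaggingClassifier(base, max_samples=0.5, bootstrap=False)

# Bagging: random subsets of the samples, drawn WITH replacement.
bagging = BaggingClassifier(base, max_samples=0.5, bootstrap=True)

# Random Subspaces: all samples, random subsets of the features.
subspaces = BaggingClassifier(base, max_features=0.5,
                              bootstrap=False, bootstrap_features=False)

# Random Patches: random subsets of both samples and features.
patches = BaggingClassifier(base, max_samples=0.5, max_features=0.5)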

▲ In scikit-learn, the bagging method is provided by BaggingClassifier (and its regression counterpart BaggingRegressor), which takes a user-specified base estimator and the subset-drawing strategy as parameters. max_samples and max_features control the size of the subsets, while bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement. Setting oob_score=True allows the generalization accuracy to be estimated from the left-out (out-of-bag) samples (see the sketch after the example). The following example builds an ensemble of KNeighborsClassifier estimators with the bagging method, each trained on random subsets of 50% of the samples and 50% of the features.

>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5)
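
As a follow-up sketch (the make_classification toy data set is an assumption, not part of the original example), oob_score=True can be exercised like this; the attribute oob_score_ then holds the out-of-bag accuracy estimate:

>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X, y = make_classification(n_samples=500, random_state=0)
>>> bagging = BaggingClassifier(KNeighborsClassifier(),
...                             max_samples=0.5, max_features=0.5,
...                             bootstrap=True, oob_score=True)
>>> bagging = bagging.fit(X, y)
>>> print(bagging.oob_score_)  # out-of-bag accuracy estimate (data dependent)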

The figure in the following example compares the bias-variance decomposition of a single model (a decision tree regressor) with that of its bagging ensemble; a more detailed analysis can be found at: http://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html#sphx-glr-auto-examples-ensemble-plot-bias-variance-py


The relevant code is:

# -*- coding: utf-8 -*-
"""
Created on Sat Jan 14:37:54

@author: ZQ
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Settings (values as in the linked scikit-learn example)
n_repeat = 50        # number of iterations for computing expectations
n_train = 50         # size of the training set
n_test = 1000        # size of the test set
noise = 0.1          # standard deviation of the noise
np.random.seed(0)

estimators = [("Tree", DecisionTreeRegressor()),
              ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]
n_estimators = len(estimators)

# Generate data
def f(x):
    x = x.ravel()
    return np.exp(-x ** 2) + 1.5 * np.exp(-(x - 2) ** 2)

def generate(n_samples, noise, n_repeat=1):
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)
    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)  # normal noise
    else:
        y = np.zeros((n_samples, n_repeat))
        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)
    X = X.reshape((n_samples, 1))
    return X, y

X_train = []
y_train = []
for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)
X_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)

# Loop over the estimators to compare them
for n, (name, estimator) in enumerate(estimators):
    # Compute predictions over the repeated training sets
    y_predict = np.zeros((n_test, n_repeat))
    for i in range(n_repeat):
        estimator.fit(X_train[i], y_train[i])
        y_predict[:, i] = estimator.predict(X_test)

    # Decompose the mean squared error into bias^2 + variance + noise
    y_error = np.zeros(n_test)
    for i in range(n_repeat):
        for j in range(n_repeat):
            y_error += (y_test[:, j] - y_predict[:, i]) ** 2
    y_error /= (n_repeat * n_repeat)

    y_noise = np.var(y_test, axis=1)
    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
    y_var = np.var(y_predict, axis=1)

    # Upper row: the predictions
    plt.subplot(2, n_estimators, n + 1)
    plt.plot(X_test, f(X_test), "b", label="$f(x)$")
    plt.plot(X_train[0], y_train[0], ".b", label=r"LS ~ $y = f(x)+noise$")
    for i in range(n_repeat):
        if i == 0:
            plt.plot(X_test, y_predict[:, i], "r", label=r"$\^y(x)$")
        else:
            plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)
    plt.plot(X_test, np.mean(y_predict, axis=1), "c",
             label=r"$\mathbb{E}_{LS} \^y(x)$")
    plt.xlim([-5, 5])
    plt.title(name)
    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})

    # Lower row: the error decomposition
    plt.subplot(2, n_estimators, n_estimators + n + 1)
    plt.plot(X_test, y_error, "r", label="$error(x)$")
    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$")
    plt.plot(X_test, y_var, "g", label="$variance(x)$")
    plt.plot(X_test, y_noise, "c", label="$noise(x)$")
    plt.xlim([-5, 5])
    plt.ylim([0, 0.1])
    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})

plt.show()

1.11.2 Forests of Randomized Trees

▲ The sklearn.ensemble module contains two averaging ensemble algorithms based on randomized decision trees: Random Forests (RandomForest) and Extremely Randomized Trees (Extra-Trees). Both algorithms are designed specifically for tree models, which means randomness is injected while constructing the base learners; the final prediction is then obtained by averaging the predictions of the individual base models.

▲ As with other classifiers, forest classifiers must be fitted with two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array Y of size [n_samples] holding the labels of the training samples. The interface also extends to multi-class problems.

>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)  # n_estimators is the number of decision trees
>>> clf = clf.fit(X, Y)

1.11.2.1 Random Forests (RandomForest)

▲ In a random forest, each decision tree is fitted on a bootstrap sample drawn with replacement from the training set. In addition, when splitting a node during tree construction, the chosen split is no longer the best split among all features: it is the best split within a random subset of the features. Because of this randomness, the bias of the forest usually increases slightly, but thanks to averaging the variance decreases by more than enough to compensate, so the overall model is better. In contrast to the original publication, scikit-learn combines the classifiers by averaging their predicted probabilities rather than by letting each classifier vote for a single class. A short sketch follows.
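
As a minimal sketch (the iris data set and the parameter values are illustrative assumptions), the two sources of randomness and the probability averaging described above can be made explicit as follows:

# Sketch: bootstrap resampling plus per-split feature subsampling, with the
# final prediction obtained by averaging the trees' class probabilities.
# Data set and parameter values are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100,
                                max_features="sqrt",  # features tried per split
                                bootstrap=True,       # resample with replacement
                                random_state=0)
forest.fit(X, y)

# predict_proba averages the probabilistic predictions of the individual
# trees; predict takes the class with the highest averaged probability.
print(forest.predict_proba(X[:2]))
print(forest.predict(X[:2]))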

1.11.2.2 Extremely Randomized Trees (Extra-Trees)

▲ In extremely randomized trees, randomness goes one step further in the way node splits are computed. As in random forests, a random subset of candidate features is used, but instead of searching for the most discriminative threshold, a threshold is drawn at random for each candidate feature and the best of these randomly generated thresholds is chosen as the splitting rule. This usually reduces the variance of the model a bit more, at the cost of a slightly larger increase in bias.

>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.datasets import make_blobs
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.tree import DecisionTreeClassifier

>>> X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
...                   random_state=0)

>>> clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
...                              random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean()
0.97

>>> clf = RandomForestClassifier(n_estimators=10, max_depth=None,
...                              min_samples_split=2, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean()
0.999

>>> clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
...                            min_samples_split=2, random_state=0)
>>> scores = cross_val_score(clf, X, y)
>>> scores.mean() > 0.999
True
 

1.11.2.3 Parameters

▲ The main parameters to adjust are n_estimators and max_features. The former is the number of trees in the forest: the larger the better, but the computation time also grows, and beyond a certain value the results stop improving. The latter is the size of the random feature subset considered when splitting a node: the lower it is, the greater the reduction in variance but also the greater the increase in bias. Empirically, max_features=n_features works well for regression problems and max_features=sqrt(n_features) for classification problems (where n_features is the number of features in the data set). Setting max_depth=None together with min_samples_split=2 (i.e. fully developing the trees) often gives good results, but such settings are usually not optimal and may consume a lot of memory. The best parameters are usually found by cross-validation, as sketched below. Note also that the bootstrap parameter differs between the two methods: it defaults to True in random forests and to False in extra-trees.
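
A minimal cross-validation sketch (the toy data set and the grid values are illustrative assumptions) for tuning these two parameters:

# Sketch: tuning n_estimators and max_features by cross-validation.
# Data set and parameter grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", 0.5, None],  # None means use all features
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)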

1.11.2.4 Parallelization

▲ This module allows both the construction of the trees and the computation of predictions to be parallelized through the n_jobs parameter. With n_jobs=k, the computation is divided over k cores; with n_jobs=-1, all available cores are used. The speed-up obtained with k jobs may be less than k-fold, but it is still significant when building a large number of trees or processing large amounts of data, as in the sketch below.
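
A short sketch (the synthetic data and the number of trees are illustrative assumptions) of the n_jobs parameter in use:

# Sketch: parallel tree construction and prediction via n_jobs.
# Data set and n_estimators are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

forest = RandomForestClassifier(n_estimators=500,
                                n_jobs=-1,        # use all available cores
                                random_state=0)
forest.fit(X, y)         # trees are built in parallel
forest.predict(X[:10])   # prediction is parallelized as well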

1.11.2.5 Feature Importance Evaluation

▲ In a decision tree, the depth at which a feature is used as a split node can be used to assess the relative importance of that feature. Features used near the top of the tree contribute to the final prediction of a larger fraction of the input samples, so the expected fraction of samples they contribute to can serve as an estimate of their relative importance. The following example shows a color-coded map of the importance of each pixel in a face recognition task, using an ExtraTreesClassifier model.
Related code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 22:52:31

@author: ZQ
"""
from time import time

import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

# Number of cores to use
n_jobs = 1

# Load the faces data
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target

# Keep only the first five classes
mask = y < 5
X = X[mask]
y = y[mask]

# Compute the pixel importances
print("Fitting ExtraTreesClassifier on faces data with %d cores..." % n_jobs)
t0 = time()
forest = ExtraTreesClassifier(n_estimators=1000,
                              max_features=128,
                              n_jobs=n_jobs,
                              random_state=0)
forest.fit(X, y)
print("done in %0.3fs" % (time() - t0))

importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)

# Plot the pixel importances as a heat map
plt.matshow(importances, cmap=plt.cm.hot)
plt.title('Pixel importances with forests of trees')
plt.show()
 

1.11.2.6 Totally Random Trees Embedding

▲ RandomTreesEmbedding implements an unsupervised transformation of the data. Using a forest of completely random trees, it encodes each data point by the indices of the leaves it ends up in, i.e. as a one-of-K (one-hot) coding, mapping the data to a high-dimensional, sparse binary representation. This coding can be computed efficiently and can then serve as the basis for other learning tasks. The size and sparsity of the code can be influenced by choosing the number of trees and their maximum depth. For each tree in the ensemble, the coding contains one entry per leaf, so the maximum size of the coding is n_estimators * 2**max_depth, the maximum number of leaves in the forest (a small check of this bound is sketched after the code). Since neighboring data points are likely to fall into the same leaf, the transformation performs an implicit, non-parametric density estimate. The following example performs a hashing transformation with completely random trees: http://scikit-learn.org/stable/auto_examples/ensemble/plot_random_forest_embedding.html#sphx-glr-auto-examples-ensemble-plot-random-forest-embedding-py
Related code:

# -*- coding: utf-8 -*-
"""
Created on Wed Jan 15:34:53

@author: ZQ
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding, ExtraTreesClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import BernoulliNB

# Create a toy data set
X, y = make_circles(factor=0.5, random_state=0, noise=0.05)

# Transform the data with RandomTreesEmbedding
hasher = RandomTreesEmbedding(n_estimators=10, random_state=0, max_depth=3)
X_transformed = hasher.fit_transform(X)

# Reduce the dimensionality of the transformed data with truncated SVD
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X_transformed)

# Fit a naive Bayes classifier on the transformed data
nb = BernoulliNB()
nb.fit(X_transformed, y)

# Learn an ExtraTreesClassifier on the original data for comparison
trees = ExtraTreesClassifier(max_depth=3, n_estimators=10, random_state=0)
trees.fit(X, y)

# Scatter plots of the original and reduced data
fig = plt.figure(figsize=(9, 8))

ax = plt.subplot(221)
ax.scatter(X[:, 0], X[:, 1], c=y, s=50)
ax.set_title("Original Data (2d)")
ax.set_xticks(())
ax.set_yticks(())

ax = plt.subplot(222)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, s=50)
ax.set_title("Truncated SVD reduction (2d) of transformed data (%dd)" %
             X_transformed.shape[1])
ax.set_xticks(())
ax.set_yticks(())

# Prepare a grid for the color plots
h = .01
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

transformed_grid = hasher.transform(np.c_[xx.ravel(), yy.ravel()])
y_grid_pred = nb.predict_proba(transformed_grid)[:, 1]

ax = plt.subplot(223)
ax.set_title("Naive Bayes on Transformed data")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50)
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
# Remove the axis ticks
ax.set_xticks(())
ax.set_yticks(())

y_grid_pred = trees.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

ax = plt.subplot(224)
ax.set_title("ExtraTrees predictions")
ax.pcolormesh(xx, yy, y_grid_pred.reshape(xx.shape))
ax.scatter(X[:, 0], X[:, 1], c=y, s=50)
ax.set_ylim(-1.4, 1.4)
ax.set_xlim(-1.4, 1.4)
ax.set_xticks(())
ax.set_yticks(())

plt.tight_layout()
plt.show()
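
As a small follow-up sketch (reusing the same toy data; not part of the original example), the bound on the coding size stated above can be checked directly:

# Sketch: the one-hot coding produced by RandomTreesEmbedding has one column
# per leaf, so its width is at most n_estimators * 2**max_depth.
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

X, y = make_circles(factor=0.5, random_state=0, noise=0.05)
hasher = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_transformed = hasher.fit_transform(X)

print(X_transformed.shape[1])  # number of leaves actually grown
print(10 * 2 ** 3)             # upper bound: n_estimators * 2**max_depth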





