Rotation Forest Algorithm


When there is a nonlinear relationship in the input data, models based on linear regression fail, while tree-based algorithms are unaffected by nonlinearity. The difficulty with tree-based methods is pruning the tree to avoid overfitting: a large tree tends to fit the noise latent in the data, giving low bias but high variance (overfitting), while an overly small tree gives high bias (underfitting). However, if we generate a large number of trees and take the average of the outputs produced by all of them as the final prediction, the variance problem can be avoided.
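As a small illustration of that averaging effect, here is a sketch (not from the original article; it assumes scikit-learn and uses a synthetic dataset) that trains many fully grown trees on bootstrap samples and averages their votes:

# a minimal bagging sketch: average many high-variance trees
# (synthetic data and parameter values are illustrative assumptions)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = make_classification(n_samples=500, n_features=20, random_state=7)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
                                                    random_state=7)

rng = np.random.RandomState(7)
single = DecisionTreeClassifier(random_state=7).fit(x_train, y_train)
votes = []
for _ in range(50):
    idx = rng.randint(0, len(x_train), len(x_train))  # bootstrap sample
    tree = DecisionTreeClassifier()  # fully grown: low bias, high variance
    votes.append(tree.fit(x_train[idx], y_train[idx]).predict(x_test))

# majority vote: the averaged ensemble is usually more accurate
ensemble = (np.mean(votes, axis=0) > 0.5).astype(int)
print("single tree:", single.score(x_test, y_test))
print("averaged trees:", (ensemble == y_test).mean())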


1. Random forest: an ensemble technique that models with a large number of trees, built so that the trees are decorrelated from one another: rather than offering every attribute, a random subset of the attributes is given to each tree when it splits. The trees are grown to maximum depth so that they fit their bootstrap samples well, which yields low bias at the cost of high variance; building a large number of trees and taking the average of their outputs as the final prediction then resolves the variance problem. (A minimal scikit-learn sketch appears after this list.)

2. Extremely randomized trees (extra-trees): more randomized than a random forest, which attacks the variance problem more effectively and at slightly lower computational cost. A random forest gives each tree a bootstrap sample, whereas extra-trees use the complete training set; and besides randomly selecting k attributes to consider at a given node, extra-trees also choose the cut point at random instead of searching for the best threshold under a Gini-impurity or entropy criterion, as a random forest does. This additional randomization reduces variance further, and because splits are not optimized against a criterion, no time is spent identifying the most appropriate attribute and threshold for partitioning the dataset. (See the sketch after this list.)

3. Rotation forest: the first two methods need a large ensemble of trees to achieve good results, whereas a rotation forest can obtain the same or even better results with fewer trees. The setting is again a voting ensemble: the attributes are divided into K non-overlapping subsets of equal size, PCA is run on each subset, and the resulting principal components are assembled into a rotation matrix that transforms the data before each tree is trained.
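For the first two ensembles, scikit-learn ships ready-made estimators; the following is a minimal sketch (parameter values are illustrative assumptions, not from the original article) that contrasts them on synthetic data like that used below:

# sketch: random forest vs. extra-trees in scikit-learn
# (dataset and parameters are illustrative only)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=500, n_features=50, random_state=7)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
                                                    random_state=7)

# random forest: bootstrap samples, random feature subset per split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, random_state=7)
print("random forest:", rf.fit(x_train, y_train).score(x_test, y_test))

# extra-trees: full training set, randomly drawn cut points per split
et = ExtraTreesClassifier(n_estimators=100, max_features="sqrt",
                          bootstrap=False, random_state=7)
print("extra-trees:", et.fit(x_train, y_train).score(x_test, y_test))

The only switches flipped between the two are the bootstrap sampling and the randomized cut points, which is exactly the contrast drawn in points 1 and 2.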

Here's the code:

# -*- coding: utf-8 -*-
"""
Created on Wed Apr 11 17:01:22 2018
@author: Alvin AI
"""
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
import numpy as np


# load the data
def get_data():
    no_features = 50
    redundant_features = int(0.1 * no_features)
    informative_features = int(0.6 * no_features)
    repeated_features = int(0.1 * no_features)
    x, y = make_classification(n_samples=500, n_features=no_features,
                               flip_y=0.03, n_informative=informative_features,
                               n_redundant=redundant_features,
                               n_repeated=repeated_features, random_state=7)
    return x, y


# split the feature indices into random, non-overlapping subsets of size k
def get_random_subset(iterable, k):
    subsets = []
    iteration = 0
    np.random.shuffle(iterable)  # shuffle the feature indices
    limit = len(iterable) // k
    while iteration < limit:
        if k <= len(iterable):
            subset = k
        else:
            subset = len(iterable)
        subsets.append(iterable[-subset:])
        del iterable[-subset:]
        iteration += 1
    return subsets


# build the rotation forest model: d trees, feature subsets of size k
def build_rotationtree_model(x_train, y_train, d, k):
    models = []           # the decision trees
    r_matrices = []       # one rotation matrix per tree
    feature_subsets = []  # the feature subsets used in each iteration
    for i in range(d):
        x, _, _, _ = train_test_split(x_train, y_train,
                                      test_size=0.3, random_state=7)
        # indices of the features
        feature_index = list(range(x.shape[1]))
        # get subsets of the features: with 50 features and k=5,
        # this yields 10 subsets of 5 indices each
        random_k_subset = get_random_subset(feature_index, k)
        feature_subsets.append(random_k_subset)
        # rotation matrix for this tree
        r_matrix = np.zeros((x.shape[1], x.shape[1]), dtype=float)
        for each_subset in random_k_subset:
            pca = PCA()
            x_subset = x[:, each_subset]  # the columns in this subset
            pca.fit(x_subset)             # principal component analysis
            for ii in range(0, len(pca.components_)):
                for jj in range(0, len(pca.components_)):
                    r_matrix[each_subset[ii], each_subset[jj]] = \
                        pca.components_[ii, jj]
        x_transformed = x_train.dot(r_matrix)

        model = DecisionTreeClassifier()
        model.fit(x_transformed, y_train)
        models.append(model)
        r_matrices.append(r_matrix)
    return models, r_matrices, feature_subsets


# evaluate the ensemble by majority vote
def model_worth(models, r_matrices, x, y):
    predicted_ys = []
    for i, model in enumerate(models):
        x_mod = x.dot(r_matrices[i])
        predicted_y = model.predict(x_mod)
        predicted_ys.append(predicted_y)

    predicted_matrix = np.asmatrix(predicted_ys)  # shape (25, n_samples)

    final_prediction = []
    for i in range(len(y)):
        # flatten the column of per-tree predictions for sample i
        pred_from_all_models = np.ravel(predicted_matrix[:, i])
        # np.nonzero returns the indices of the non-zero (class 1) votes
        non_zero_pred = np.nonzero(pred_from_all_models)[0]
        # predict 1 if more than half the trees voted 1
        is_one = int(len(non_zero_pred) > len(models) / 2)
        final_prediction.append(is_one)
    print(classification_report(y, final_prediction))
    return predicted_matrix


# main
if __name__ == "__main__":
    x, y = get_data()
    # split the data set
    x_train, x_test_all, y_train, y_test_all = train_test_split(
        x, y, test_size=0.3, random_state=9)
    x_dev, x_test, y_dev, y_test = train_test_split(
        x_test_all, y_test_all, test_size=0.3, random_state=9)
    # 25 trees, feature subsets of size 5
    models, r_matrices, features = build_rotationtree_model(x_train, y_train, 25, 5)
    predicted_matrix1 = model_worth(models, r_matrices, x_train, y_train)
    predicted_matrix2 = model_worth(models, r_matrices, x_dev, y_dev)

See 80090175 for details

