Kaggle Data Mining Competition Preliminary -- Titanic (Random Forest & Feature Importance) __ Data Mining


The previous three posts covered a fairly complete round of feature engineering: analyzing string variables to derive new variables, normalizing numeric variables, generating derived attributes, and reducing dimensionality. Now that we have a feature set, we can train a model.

Because this is a classification problem, we can use classification algorithms such as an L1-regularized SVM or a random forest. Random forest is a very simple and practical classification model with few tunable parameters. The most important one is the number of trees: past a certain size, adding trees increases training time without improving accuracy much.
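That diminishing return is easy to observe directly. Below is a minimal sketch (my addition, using synthetic data from sklearn's make_classification rather than the Titanic set) that prints out-of-bag accuracy and fit time as n_estimators grows:

    import time
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # synthetic stand-in for the Titanic features, just to keep the sketch self-contained
    X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

    for n_trees in [10, 100, 1000]:
        clf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=0)
        start = time.time()
        clf.fit(X_demo, y_demo)
        # fit time keeps growing with n_trees while the OOB score flattens out
        print("n_estimators=%4d  oob_score=%.4f  fit_time=%.2fs"
              % (n_trees, clf.oob_score_, time.time() - start))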

After the previous feature engineering there are now 237 features, and too many features can overfit the model. Fortunately, a trained random forest can report the importance of every feature, so we can set a threshold on these importances and keep only the attributes that help model training the most. The random forest here uses its default parameters.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier

    X = input_df.values[:, 1::]
    y = input_df.values[:, 0]
    survived_weight = .75  # NOTE: the weight value was garbled in the source; any value < 1 down-weights that class
    y_weights = np.array([survived_weight if s == 0 else 1 for s in y])

    print "Rough fitting a RandomForest to determine feature importance..."
    forest = RandomForestClassifier(oob_score=True, n_estimators=10000)
    forest.fit(X, y, sample_weight=y_weights)

    # scale importances to percentages of the largest one
    feature_importance = forest.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())

    # keep only features above the importance threshold
    fi_threshold = 18
    important_idx = np.where(feature_importance > fi_threshold)[0]
    important_features = features_list[important_idx]  # features_list comes from the earlier posts
    print "\n", important_features.shape[0], "important features (>", fi_threshold, "% of max importance)...\n"  # , important_features
    sorted_idx = np.argsort(feature_importance[important_idx])[::-1]

    # plot the important features
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.title('Feature Importance')
    plt.barh(pos, feature_importance[important_idx][sorted_idx[::-1]], color='r', align='center')
    plt.yticks(pos, important_features[sorted_idx[::-1]])
    plt.xlabel('Relative Importance')
    plt.draw()
    plt.show()

The code is a bit long, but it does two things: it trains the model, and it then uses the trained importances to screen out the important features and plot them.

The attributes whose relative importance exceeds the threshold of 18 are shown in the following figure:

The three attributes Title_Mr, Title_Id, and Gender stand out as important. The title-related attributes come from our analysis of the Name field, which shows that string attributes can hide very important information; in feature engineering we should pay attention to them rather than discard them. Because our original attributes are few, most of the important attributes produced here are mathematical combinations of the original ones. Whether such combinations are necessary depends mainly on the model, but most of the time deriving variables is harmless. Random forests fit training data easily, so directly using some of the original attributes might also work quite well; as a learning exercise, though, it is worth trying every approach and accumulating experience.

For how a random forest computes feature importance, see the scikit-learn example: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
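As a rough illustration of what that example shows: each tree records how much every feature decreases node impurity, and the forest averages these per-tree scores (their spread across trees gives an error bar). A minimal self-contained sketch along those lines, again on synthetic data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                         n_informative=3, random_state=0)
    clf = RandomForestClassifier(n_estimators=250, random_state=0)
    clf.fit(X_demo, y_demo)

    importances = clf.feature_importances_  # forest-level average over trees
    std = np.std([tree.feature_importances_ for tree in clf.estimators_], axis=0)
    for rank, idx in enumerate(np.argsort(importances)[::-1]):
        print("%d. feature %d (%f +/- %f)" % (rank + 1, idx, importances[idx], std[idx]))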

Of course, after obtaining the important features, we remove the unimportant ones to speed up model training (the threshold can be adjusted slightly to retain more features):

    X = X[:, important_idx][:, sorted_idx]
    submit_df = submit_df.iloc[:, important_idx].iloc[:, sorted_idx]

Now we have the final dataset, and we can formally train the model on it.

The section above used the random forest's default parameters, but model parameters are tunable, and tuning them gives better training. Scikit-learn provides two methods for parameter search, both of which are common in other tools as well: GridSearchCV and RandomizedSearchCV. In both cases you specify a range of values for each parameter in a dictionary and hand that dictionary to the search method, which fits the model over combinations of the specified values. GridSearchCV tries every possible combination of parameters; RandomizedSearchCV lets you specify how many combinations to test and then samples combinations at random. When a model has many key parameters, RandomizedSearchCV is useful for saving time.

    import numpy as np
    from operator import itemgetter
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

    sqrtfeat = int(np.sqrt(X.shape[1]))
    minsampsplit = int(X.shape[0] * 0.015)

    # (adapted from http://scikit-learn.org/stable/auto_examples/randomized_search.html)
    def report(grid_scores, n_top=5):
        params = None
        top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
        for i, score in enumerate(top_scores):
            print("Parameters with rank: {0}".format(i + 1))
            print("Mean validation score: {0:.4f} (std: {1:.4f})".format(
                  score.mean_validation_score, np.std(score.cv_validation_scores)))
            print("Parameters: {0}".format(score.parameters))
            print("")
            if params is None:
                params = score.parameters
        return params

    # simple grid test
    grid_test1 = {"n_estimators": [1000, 2500, 5000],
                  "criterion": ["gini", "entropy"],
                  "max_features": [sqrtfeat - 1, sqrtfeat, sqrtfeat + 1],
                  "max_depth": [5],
                  "min_samples_split": [2, 5, 10, minsampsplit]}

    forest = RandomForestClassifier(oob_score=True)
    print "Hyperparameter optimization using GridSearchCV..."
    grid_search = GridSearchCV(forest, grid_test1, n_jobs=-1, cv=10)
    grid_search.fit(X, y)
    best_params_from_grid_search = report(grid_search.grid_scores_)
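For completeness, here is a hedged sketch of the RandomizedSearchCV alternative described earlier; it is not part of the original post's code, and the parameter ranges are illustrative only. Depending on the scikit-learn version, the class lives in sklearn.grid_search (old) or sklearn.model_selection (new):

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in newer versions

    # distributions instead of fixed lists; n_iter caps how many combinations are tried
    param_dist = {"n_estimators": randint(1000, 5000),
                  "criterion": ["gini", "entropy"],
                  "max_features": randint(max(1, sqrtfeat - 1), sqrtfeat + 2),
                  "min_samples_split": randint(2, minsampsplit + 1)}

    random_search = RandomizedSearchCV(RandomForestClassifier(oob_score=True),
                                       param_dist, n_iter=20, cv=10, n_jobs=-1)
    random_search.fit(X, y)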

The tuned parameters come out as params_score = {"n_estimators": 10000, "max_features": sqrtfeat, "min_samples_split": minsampsplit}, which agrees well with what experience would predict.
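Plugging the tuned parameters back into a final model is then straightforward; a short sketch, reusing the variable names from the snippets above:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=10000,
                                    max_features=sqrtfeat,
                                    min_samples_split=minsampsplit,
                                    oob_score=True)
    forest.fit(X, y, sample_weight=y_weights)
    print("OOB score: %.4f" % forest.oob_score_)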

How do we evaluate the model after training? Learning curves. They are covered in the book "Machine Learning" and also in Andrew Ng's open course. The key ideas are the bias-variance tradeoff and the relationship between test error and training error: we should adjust the model to minimize the test error. The sklearn.learning_curve module provides this functionality. In the end, the learning curve shows that our model would benefit from more training data.
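A minimal sketch of drawing such a curve with the sklearn.learning_curve module named above (in newer scikit-learn releases the same function lives in sklearn.model_selection):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.learning_curve import learning_curve  # sklearn.model_selection in newer versions

    train_sizes, train_scores, test_scores = learning_curve(
        RandomForestClassifier(n_estimators=100), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    # a persistent gap between the two curves that narrows with more data
    # suggests the model would benefit from additional training examples
    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='training score')
    plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label='cross-validation score')
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.legend(loc='best')
    plt.show()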

During training, note that because the number of survivors is relatively small, the data is imbalanced; a balanced training sample can be obtained by oversampling or undersampling, or by adjusting the sample weights.
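A hedged sketch of those two balancing options (it assumes label 1, survived, is the minority class, and reuses X and y from above):

    import numpy as np

    # Option 1: oversample the minority class until the classes are the same size
    minority_idx = np.where(y == 1)[0]
    majority_idx = np.where(y == 0)[0]
    resampled = np.random.choice(minority_idx, size=len(majority_idx), replace=True)
    balanced_idx = np.concatenate([majority_idx, resampled])
    X_bal, y_bal = X[balanced_idx], y[balanced_idx]

    # Option 2: per-sample weights, as already done above with sample_weight=y_weights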

Then we can use the trained forest to predict on the test set. Bingo!
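A short closing sketch of that last step; the PassengerId handling is an assumption (the post never shows it), with passenger_ids standing for the ID column saved aside before it was dropped during feature engineering:

    import pandas as pd

    predictions = forest.predict(submit_df.values).astype(int)
    submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})
    submission.to_csv("submission.csv", index=False)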

