The previous three posts covered feature engineering fairly thoroughly: parsing string variables to derive new variables, normalizing numeric variables, generating derived features, and reducing dimensionality. Now that we have a feature set, we can train a model.
Because this is a classification problem, we can use classification algorithms such as L1, SVM, or random forest. Random forest is a simple and practical classification model with few tunable parameters. One very important parameter is the number of trees: beyond a certain point, adding more trees increases training time without improving accuracy much.
After the earlier feature engineering we now have 237 features, and too many features can cause the model to overfit. Fortunately, a trained random forest reports the importance of each feature in the dataset. We can use these importances to set a threshold and select the features that help model training the most. The random forest parameters used here are the defaults.
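As a rough illustration of the point about tree count, the sketch below fits forests of increasing size and prints the out-of-bag accuracy of each. It uses synthetic data from make_classification rather than the Titanic features, and the tree counts and the names X_demo, y_demo and oob_by_trees are assumptions made for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 500 samples, 20 features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0)

# Fit forests of increasing size and record the out-of-bag accuracy.
oob_by_trees = {}
for n_trees in [25, 50, 100, 200]:
    forest = RandomForestClassifier(
        n_estimators=n_trees, oob_score=True, random_state=0)
    forest.fit(X_demo, y_demo)
    oob_by_trees[n_trees] = forest.oob_score_
    print("{0:4d} trees: OOB accuracy = {1:.3f}".format(
        n_trees, forest.oob_score_))
```

On most datasets the printed scores level off as the forest grows while fit time keeps rising, which is the tradeoff described above; the exact numbers depend on the data.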
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier

    # Column 0 is the label (Survived); the remaining columns are features.
    X = input_df.values[:, 1::]
    y = input_df.values[:, 0]

    # Weight samples to compensate for the class imbalance.
    survived_weight = ...  # the decimal value is truncated in the source
    y_weights = np.array([survived_weight if s == 0 else 1 for s in y])

    print("Rough fitting a RandomForest to determine feature importance...")
    forest = RandomForestClassifier(oob_score=True, n_estimators=10000)
    forest.fit(X, y, sample_weight=y_weights)

    # Scale importances so that the most important feature is 100.
    feature_importance = forest.feature_importances_
    feature_importance = 100.0 * (feature_importance / feature_importance.max())

    # Keep only the features above the importance threshold.
    fi_threshold = 18
    important_idx = np.where(feature_importance > fi_threshold)[0]
    important_features = features_list[important_idx]
    print("\n{0} important features (> {1}% of max importance)...\n".format(
        important_features.shape[0], fi_threshold))
    sorted_idx = np.argsort(feature_importance[important_idx])[::-1]

    # Plot the important features as a horizontal bar chart.
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.title('Feature Importance')
    plt.barh(pos, feature_importance[important_idx][sorted_idx[::-1]],
             color='r', align='center')
    plt.yticks(pos, important_features[sorted_idx[::-1]])
    plt.xlabel('Relative Importance')
    plt.draw()
    plt.show()
The code is a bit long, but it has two main parts: the first trains the model, and the second uses the resulting importances to select the important features and plot them.
The features whose relative importance exceeds 18 are shown in the following figure:
The figure shows that three features, Title_Mr, Title_id and Gender, are the most important. The title-related features come from our analysis of the name field, which shows that string attributes can hide very important information; in real projects, pay attention to them rather than discarding them. Because we started with very few original attributes, most of the important features are mathematical combinations of the original ones. These may not be strictly necessary, which depends mainly on the model, but derived variables are harmless most of the time. For a model as easy to train as random forest, feeding some of the original attributes in directly might also work very well, but as a learning exercise it is worth trying every approach to build experience.
For how a random forest computes feature importances, see the official scikit-learn documentation: scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py
Of course, once we have the important features, we drop the unimportant ones to speed up model training (the threshold can be adjusted slightly to retain more features):
    X = X[:, important_idx][:, sorted_idx]
    submit_df = submit_df.iloc[:, important_idx].iloc[:, sorted_idx]
Now we have the final dataset, which can finally be formally used to train the model.
The section above used the default random forest parameters, but the model's parameters are adjustable, and we need to tune them to train a better model. Scikit-learn provides two methods for parameter optimization, both of which are also common in other tools: GridSearchCV and RandomizedSearchCV. In both cases you specify a range of values for each parameter in a dictionary and pass that dictionary to the search method, which fits the model on combinations of the specified values. GridSearchCV tests every possible combination of parameters. RandomizedSearchCV lets you specify how many combinations to test and then samples that many at random. If the model has many key parameters to tune, RandomizedSearchCV is a useful way to save time.
    from operator import itemgetter

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    # Older scikit-learn; later versions use sklearn.model_selection.
    from sklearn.grid_search import GridSearchCV

    sqrtfeat = int(np.sqrt(X.shape[1]))
    minsampsplit = int(X.shape[0] * 0.015)

    # (adapted from http://scikit-learn.org/stable/auto_examples/randomized_search.html)
    def report(grid_scores, n_top=5):
        """Print the n_top best parameter sets and return the best one."""
        params = None
        top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
        for i, score in enumerate(top_scores):
            print("Parameters with rank: {0}".format(i + 1))
            print("Mean validation score: {0:.4f} (std: {1:.4f})".format(
                score.mean_validation_score, np.std(score.cv_validation_scores)))
            print("Parameters: {0}".format(score.parameters))
            print("")
            if params is None:
                params = score.parameters
        return params

    # simple grid test
    grid_test1 = {"n_estimators": [1000, 2500, 5000],
                  "criterion": ["gini", "entropy"],
                  "max_features": [sqrtfeat - 1, sqrtfeat, sqrtfeat + 1],
                  "max_depth": [5],
                  "min_samples_split": [2, 5, 10, minsampsplit]}

    forest = RandomForestClassifier(oob_score=True)
    print("Hyperparameter optimization using GridSearchCV...")
    grid_search = GridSearchCV(forest, grid_test1, n_jobs=-1, cv=10)
    grid_search.fit(X, y)
    # grid_scores_ exists only in older scikit-learn (replaced by cv_results_).
    # The original post calls this helper as scorereport.report.
    best_params_from_grid_search = report(grid_search.grid_scores_)
The best parameters found are params_score = {"n_estimators": 10000, "max_features": sqrtfeat, "min_samples_split": minsampsplit}, which agrees well with what experience would predict.
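The code above only exercises the grid search. The randomized variant described earlier could look roughly like the sketch below. It runs on synthetic data, and the parameter ranges, n_iter=5 and cv=3 are illustrative assumptions rather than the settings used in this post. It also imports from sklearn.model_selection, which is where current scikit-learn versions keep the search classes.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption), not the Titanic feature matrix.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=5, random_state=0)

# Lists are sampled uniformly; scipy distributions are sampled via rvs().
param_dist = {
    "n_estimators": [50, 100, 200],
    "criterion": ["gini", "entropy"],
    "max_features": randint(2, 6),        # integers 2 to 5
    "min_samples_split": randint(2, 11),  # integers 2 to 10
}

# n_iter fixes how many random combinations are tried.
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5, cv=3, random_state=0)
random_search.fit(X_demo, y_demo)

print("Best parameters:", random_search.best_params_)
print("Best mean CV score: {0:.4f}".format(random_search.best_score_))
```

Only five combinations are fitted here, whereas an exhaustive grid over the same ranges would fit many more, which is where the time saving comes from.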
How do we evaluate the model after training? With learning curves. They are covered in the book "Machine Learning", and Andrew Ng also discusses them in his open course. They mainly show the bias-variance tradeoff and the relationship between test error and training error. We should tune the model to achieve the minimum test error. The sklearn.learning_curve module provides this functionality. The learning curves ultimately show that our model needs more training data.
During training, note that the survivors are a relatively small share of the samples, so the data is imbalanced. A balanced training sample can be obtained by oversampling or undersampling, or by adjusting the sample weights.
Then we can use the forest to predict on the test set. Bingo!
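A minimal sketch of computing learning-curve data follows. It assumes a recent scikit-learn, where the learning_curve function lives in sklearn.model_selection (the standalone sklearn.learning_curve module was removed in version 0.20), and it uses synthetic data with arbitrary sizes instead of the Titanic set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=16, n_informative=5, random_state=0)

# Score the model on 5 increasing training-set sizes with 5-fold CV.
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Convert mean accuracy per size into error rates.
train_err = 1.0 - train_scores.mean(axis=1)
test_err = 1.0 - test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, train_err, test_err):
    print("n={0:4d}  train error={1:.3f}  test error={2:.3f}".format(n, tr, te))
```

If the test error is still falling at the largest training size, with a visible gap to the training error, that is the usual sign that more data would help.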
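The weighting option can be sketched as follows. This version uses scikit-learn's compute_sample_weight helper with the "balanced" heuristic instead of a hand-picked weight, and it runs on random synthetic data; both choices are assumptions for illustration, not what this post did.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (assumption): roughly 30% positives.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (rng.uniform(size=200) < 0.3).astype(int)

# "balanced" gives each sample the weight n_samples / (n_classes * class_count),
# so both classes end up with the same total weight.
weights = compute_sample_weight("balanced", y_demo)
class_totals = [weights[y_demo == c].sum() for c in (0, 1)]
print("Total weight per class:", class_totals)

# Pass the per-sample weights to fit, as in the feature-importance step.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo, sample_weight=weights)
```

Passing class_weight="balanced" to RandomForestClassifier applies the same heuristic without building the weight array by hand.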