Titanic is a just-for-fun Kaggle competition: there are no prizes, but the data is clean and well organized, which makes it ideal for practice.
Using the Titanic data, this article walks through the data-processing workflow with a simple decision tree.
Note that the purpose of this article is to help you get started with data mining and become familiar with the steps and process of working with data.
The decision tree is a simple, easy-to-use non-parametric classifier. It requires no prior assumptions about the data, is fast to compute, produces results that are easy to interpret, and is robust, being insensitive to noisy and missing data. The following example builds a decision tree classifier on the data set from the Kaggle competition Titanic; the target variable is Survived.
Reading data
import numpy as np
import pandas as pd

df = pd.read_csv('train.csv', header=0)
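Before cleaning, it helps to check the shape of the data and which columns contain missing values. A minimal sketch, assuming the standard Kaggle train.csv columns (Survived, Pclass, Sex, Age); this check is not part of the original code:

# Quick inspection of the raw data (sketch; assumes the standard Kaggle
# Titanic train.csv column names).
print(df.shape)
print(df[['Survived', 'Pclass', 'Sex', 'Age']].isnull().sum())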
Data preparation
- Keep only three independent variables: Pclass, Sex, and Age
- Fill in missing Age values
- Convert the Pclass variable into three dummy variables
- Convert Sex into a 0-1 variable
subdf = df[['Pclass', 'Sex', 'Age']]
y = df.Survived
# sklearn's imputer could also be used here
age = subdf['Age'].fillna(value=subdf.Age.mean())
# sklearn's OneHotEncoder could also be used here
pclass = pd.get_dummies(subdf['Pclass'], prefix='Pclass')
sex = (subdf['Sex'] == 'male').astype('int')
X = pd.concat([pclass, age, sex], axis=1)
X.head()
Output results
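As the comments in the code above note, the same preprocessing could also be done with scikit-learn transformers. The following is a minimal sketch, assuming scikit-learn >= 0.20 (where SimpleImputer lives in sklearn.impute); it is an alternative illustration, not the author's original code:

# Alternative preprocessing with scikit-learn transformers (sketch only).
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

age_alt = SimpleImputer(strategy='mean').fit_transform(subdf[['Age']])
pclass_alt = OneHotEncoder().fit_transform(subdf[['Pclass']]).toarray()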
Building a model
- Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
- Observe the decision tree's performance on the test set
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf = clf.fit(X_train, y_train)
print("Accuracy: {:.2f}".format(clf.score(X_test, y_test)))
The output is as follows
Accuracy: 0.83
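One of the decision tree's advantages is interpretability, so the learned rules can also be printed directly. A minimal sketch, assuming scikit-learn >= 0.21 where export_text is available; this step is not in the original article:

# Print the learned tree as text rules (sketch; requires scikit-learn >= 0.21).
from sklearn.tree import export_text

print(export_text(clf, feature_names=list(X_train.columns)))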
- Observe the importance of each variable
clf.feature_importances_
The output is as follows
array([ 0.08398076, 0. , 0.23320717, 0.10534824, 0.57746383])
import matplotlib.pyplot as plt

feature_importance = clf.feature_importances_
important_features = X_train.columns.values
# Scale importances relative to the largest one
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[::-1]
pos = np.arange(sorted_idx.shape[0]) + 0.5

plt.title('Feature Importance')
plt.barh(pos, feature_importance[sorted_idx[::-1]], color='r', align='center')
# Reorder the labels so they match the sorted bars
plt.yticks(pos, important_features[sorted_idx[::-1]])
plt.xlabel('Relative Importance')
plt.draw()
plt.show()
For how random forests compute variable importance, see the official scikit-learn documentation.
Of course, once we know which features matter, we can drop the unimportant ones to speed up model training.
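For example, a minimal sketch of dropping low-importance columns; the 0.05 threshold is illustrative and not from the original article:

# Keep only columns whose importance exceeds an illustrative threshold (sketch).
keep = X_train.columns[clf.feature_importances_ > 0.05]
X_train_reduced = X_train[keep]
X_test_reduced = X_test[keep]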
Finally, we evaluate the model more thoroughly.
- Use cross-validation to evaluate the model
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

scores1 = cross_val_score(clf, X, y, cv=10)
scores1
The output results are as follows:
array([ 0.82222222, 0.82222222, 0.7752809 , 0.87640449, 0.82022472, 0.76404494, 0.7752809 , 0.76404494, 0.83146067, 0.78409091])
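To summarize the ten fold scores in a single number, a small sketch (not in the original):

# Mean and standard deviation of the 10-fold scores (sketch).
print("CV accuracy: {:.3f} +/- {:.3f}".format(scores1.mean(), scores1.std()))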
- Use more metrics to evaluate models
from sklearn import metrics

def measure_performance(X, y, clf, show_accuracy=True,
                        show_classification_report=True,
                        show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

measure_performance(X_test, y_test, clf,
                    show_classification_report=True,
                    show_confusion_matrix=True)
The output is as follows; in addition to accuracy, you can see metrics such as precision and recall:
Accuracy:0.834

Classification report
             precision    recall  f1-score   support

          0       0.85      0.88      0.86       134
          1       0.81      0.76      0.79        89

avg / total       0.83      0.83      0.83       223

Confusion matrix
[[118  16]
 [ 21  68]]
Comparison with random forest
from sklearn.ensemble import RandomForestClassifier

clf2 = RandomForestClassifier(n_estimators=1000, random_state=33)
clf2 = clf2.fit(X_train, y_train)
scores2 = cross_val_score(clf2, X, y, cv=10)
clf2.feature_importances_
scores2.mean(), scores1.mean()
Accuracy output (the average over 10-fold cross-validation is used here)
(0.81262938372488946, 0.80352769265690616)
You can see that the random forest scores about 0.01 higher than the decision tree.
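To see where the two models differ, their feature importances can be compared side by side; a small sketch, not in the original article:

# Compare feature importances of the decision tree and the random forest (sketch).
comparison = pd.DataFrame({'decision_tree': clf.feature_importances_,
                           'random_forest': clf2.feature_importances_},
                          index=X_train.columns)
print(comparison)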
Summary
Through the above analysis, we have walked through all the steps a data scientist takes from getting the data to reaching a conclusion:
- Reading in data
- Data cleanup
- Feature engineering
- Building a model
- Model evaluation
- Parameter adjustment (see the sketch after this list)
- Model comparison
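Parameter adjustment is listed as a step but not shown above; a minimal sketch using GridSearchCV, with an illustrative parameter grid that is not from the original article:

# Hypothetical hyperparameter search for the decision tree (sketch only;
# grid values are illustrative).
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(criterion='entropy'), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)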
This article is not about the results; it is meant to help you understand the process and steps of working with data.
The remaining details are left for you to explore, improve, and innovate on your own.
Reference links
Python's decision tree and random forest