Titanic is a just-for-fun Kaggle competition: there are no prizes, but the data is clean and well organized, which makes it ideal for practice.
Using the Titanic data, this article walks through the data-processing workflow with a simple decision tree.
Note that the purpose of this article is to help you get started with data mining and become familiar with the steps and process of working with data.
The decision tree is a simple, easy-to-use non-parametric classifier. It requires no prior assumptions about the data, is fast to compute, produces results that are easy to interpret, and is robust, being insensitive to noisy and missing data. The following example builds a decision tree classifier on the data set from the Kaggle competition Titanic; the target variable is Survived.
Reading data
import numpy as np
import pandas as pd

df = pd.read_csv('train.csv', header=0)
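Before cleaning, it helps to check the shape of the data and which columns contain missing values. A minimal sketch, assuming the standard Kaggle train.csv columns (Survived, Pclass, Sex, Age); this check is not part of the original code:

# Quick inspection of the raw data (sketch; assumes the standard Kaggle
# Titanic train.csv column names).
print(df.shape)
print(df[['Survived', 'Pclass', 'Sex', 'Age']].isnull().sum())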
Data preparation
- Keep only three independent variables: Pclass, Sex, and Age
- Fill in missing Age values
- Convert the Pclass variable into three dummy variables
- Convert Sex into a 0-1 variable
subdf = df[['Pclass', 'Sex', 'Age']]
y = df.Survived
# sklearn's imputer could also be used here
age = subdf['Age'].fillna(value=subdf.Age.mean())
# sklearn's OneHotEncoder could also be used here
pclass = pd.get_dummies(subdf['Pclass'], prefix='Pclass')
sex = (subdf['Sex'] == 'male').astype('int')
X = pd.concat([pclass, age, sex], axis=1)
X.head()
Output results
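As the comments in the code above note, the same preprocessing could also be done with scikit-learn transformers. The following is a minimal sketch, assuming scikit-learn >= 0.20 (where SimpleImputer lives in sklearn.impute); it is an alternative illustration, not the author's original code:

# Alternative preprocessing with scikit-learn transformers (sketch only).
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

age_alt = SimpleImputer(strategy='mean').fit_transform(subdf[['Age']])
pclass_alt = OneHotEncoder().fit_transform(subdf[['Pclass']]).toarray()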
Building a model
- Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
- Observe the decision tree's performance on the test set
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf = clf.fit(X_train, y_train)
print("Accuracy: {:.2f}".format(clf.score(X_test, y_test)))
The output is as follows
Accuracy: 0.83
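One of the decision tree's advantages is interpretability, so the learned rules can also be printed directly. A minimal sketch, assuming scikit-learn >= 0.21 where export_text is available; this step is not in the original article:

# Print the learned tree as text rules (sketch; requires scikit-learn >= 0.21).
from sklearn.tree import export_text

print(export_text(clf, feature_names=list(X_train.columns)))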
- Observe the importance of each variable
clf.feature_importances_
The output is as follows
array([ 0.08398076, 0. , 0.23320717, 0.10534824, 0.57746383])
import matplotlib.pyplot as plt

feature_importance = clf.feature_importances_
important_features = X_train.columns.values
# Scale importances relative to the largest one
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[::-1]
pos = np.arange(sorted_idx.shape[0]) + 0.5

plt.title('Feature Importance')
plt.barh(pos, feature_importance[sorted_idx[::-1]], color='r', align='center')
# Reorder the labels so they match the sorted bars
plt.yticks(pos, important_features[sorted_idx[::-1]])
plt.xlabel('Relative Importance')
plt.draw()
plt.show()
For how random forests compute variable importance, see the official scikit-learn documentation.
Of course, once we know which features matter, we can drop the unimportant ones to speed up model training.
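For example, a minimal sketch of dropping low-importance columns; the 0.05 threshold is illustrative and not from the original article:

# Keep only columns whose importance exceeds an illustrative threshold (sketch).
keep = X_train.columns[clf.feature_importances_ > 0.05]
X_train_reduced = X_train[keep]
X_test_reduced = X_test[keep]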
Finally, we evaluate the model more thoroughly.
- Use cross-validation to evaluate the model
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

scores1 = cross_val_score(clf, X, y, cv=10)
scores1
The output results are as follows:
array([ 0.82222222, 0.82222222, 0.7752809 , 0.87640449, 0.82022472, 0.76404494, 0.7752809 , 0.76404494, 0.83146067, 0.78409091])
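To summarize the ten fold scores in a single number, a small sketch (not in the original):

# Mean and standard deviation of the 10-fold scores (sketch).
print("CV accuracy: {:.3f} +/- {:.3f}".format(scores1.mean(), scores1.std()))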
- Use more metrics to evaluate models
from sklearn import metrics

def measure_performance(X, y, clf, show_accuracy=True,
                        show_classification_report=True,
                        show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

measure_performance(X_test, y_test, clf,
                    show_classification_report=True,
                    show_confusion_matrix=True)
The output is as follows; in addition to accuracy, you can see metrics such as precision and recall:
Accuracy:0.834

Classification report
             precision    recall  f1-score   support

          0       0.85      0.88      0.86       134
          1       0.81      0.76      0.79        89

avg / total       0.83      0.83      0.83       223

Confusion matrix
[[118  16]
 [ 21  68]]
Comparison with random forest
from sklearn.ensemble import RandomForestClassifier

clf2 = RandomForestClassifier(n_estimators=1000, random_state=33)
clf2 = clf2.fit(X_train, y_train)
scores2 = cross_val_score(clf2, X, y, cv=10)
clf2.feature_importances_
scores2.mean(), scores1.mean()
Accuracy output (the average over 10-fold cross-validation is used here)
(0.81262938372488946, 0.80352769265690616)
You can see that the random forest scores about 0.01 higher than the decision tree.
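To see where the two models differ, their feature importances can be compared side by side; a small sketch, not in the original article:

# Compare feature importances of the decision tree and the random forest (sketch).
comparison = pd.DataFrame({'decision_tree': clf.feature_importances_,
                           'random_forest': clf2.feature_importances_},
                          index=X_train.columns)
print(comparison)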
Summary
Through the above analysis, we have walked through all the steps a data scientist takes from getting the data to reaching a conclusion:
- Reading in data
- Data cleanup
- Feature engineering
- Building a model
- Model evaluation
- Parameter adjustment (see the sketch after this list)
- Model comparison
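Parameter adjustment is listed as a step but not shown above; a minimal sketch using GridSearchCV, with an illustrative parameter grid that is not from the original article:

# Hypothetical hyperparameter search for the decision tree (sketch only;
# grid values are illustrative).
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(criterion='entropy'), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)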
This article is not about the results; it is meant to help you understand the process and steps of working with data.
The remaining details are left for you to explore, improve, and innovate on your own.
Reference links
Python's decision tree and random forest