Kaggle Data Mining -- Take Titanic as an Example to Introduce the General Steps of Data Processing

Titanic is a just-for-fun competition on Kaggle: there is no prize money, but the data is tidy, which makes it ideal for practice.

This article uses the Titanic data and a simple decision tree to introduce the general process and steps of data processing.

Note: The purpose of this article is to help you get started with Data Mining and familiarize yourself with data processing steps and procedures.

The decision tree model is a simple, easy-to-use non-parametric classifier. It needs no prior assumptions about the data, computes quickly, produces results that are easy to interpret, and is robust to noisy and missing data. The following example runs a decision tree classification on the Titanic data set from the Kaggle competition; the target variable is Survived (whether the passenger survived).

Read data
import numpy as np
import pandas as pd

df = pd.read_csv('train.csv', header=0)
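
Before cleaning anything, it is worth checking what the file actually contains. A minimal sketch, assuming the standard Kaggle train.csv (891 rows; columns include Survived, Pclass, Sex, and Age):

print(df.shape)                   # (891, 12) for the standard training file
print(df['Age'].isnull().sum())   # 177 missing Age values motivate the imputation below
print(df.head())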
Data Preparation
  • Keep only three independent variables
  • Fill in the missing Age values
  • Convert the Pclass variable into three dummy variables
  • Convert Sex into a 0-1 variable
subdf = df[['Pclass', 'Sex', 'Age']]
y = df.Survived
# sklearn's Imputer could also be used here
age = subdf['Age'].fillna(value=subdf.Age.mean())
# sklearn's OneHotEncoder could also be used here
pclass = pd.get_dummies(subdf['Pclass'], prefix='Pclass')
sex = (subdf['Sex'] == 'male').astype('int')
X = pd.concat([pclass, age, sex], axis=1)
X.head()

Output: X.head() shows the first rows of the prepared feature matrix (three Pclass dummy columns, plus Age and Sex).
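
As the comments in the code above note, the same preprocessing can be done inside scikit-learn. A minimal sketch, assuming the old Imputer/OneHotEncoder API that matches the sklearn.cross_validation import used below (in scikit-learn 0.20+ use sklearn.impute.SimpleImputer instead):

from sklearn.preprocessing import Imputer, OneHotEncoder

# mean-impute Age, equivalent to the fillna call above
age_filled = Imputer(strategy='mean').fit_transform(subdf[['Age']])

# one-hot encode the integer Pclass column, equivalent to pd.get_dummies
pclass_onehot = OneHotEncoder(sparse=False).fit_transform(subdf[['Pclass']])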

Model Creation
  • Split data into train and test
from sklearn.cross_validation import train_test_split
# note: in scikit-learn >= 0.18 this lives in sklearn.model_selection

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
  • Observe the performance of the decision tree on the test set
from sklearn import tree

clf = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
clf = clf.fit(X_train, y_train)
print("Accuracy: {:.2f}".format(clf.score(X_test, y_test)))

The output result is as follows:

Accuracy: 0.83
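
One reason to like decision trees is that the fitted model itself can be read. As a side note not in the original article, scikit-learn's export_graphviz can dump the tree for inspection:

from sklearn.tree import export_graphviz

# writes a Graphviz .dot description of the fitted tree;
# render it with, e.g., `dot -Tpng tree.dot -o tree.png`
export_graphviz(clf, out_file='tree.dot', feature_names=X_train.columns)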
  • Observe the importance of each variable
clf.feature_importances_

Output:

array([ 0.08398076,  0.        ,  0.23320717,  0.10534824,  0.57746383])
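
The five values line up with the columns of X: the three Pclass dummies, Age, and Sex. A small helper, not in the original, to print names and importances side by side:

# pair each column name with its importance, most important first
for name, imp in sorted(zip(X_train.columns, clf.feature_importances_),
                        key=lambda pair: pair[1], reverse=True):
    print("{:12s} {:.3f}".format(name, imp))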
  • Plot the feature importances
import matplotlib.pyplot as plt

feature_importance = clf.feature_importances_
important_features = X_train.columns.values[0::]
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)[::-1]
pos = np.arange(sorted_idx.shape[0]) + .5

plt.title('Feature Importance')
plt.barh(pos, feature_importance[sorted_idx[::-1]], color='r', align='center')
# label the bars in the same (sorted) order as the plotted values
plt.yticks(pos, important_features[sorted_idx[::-1]])
plt.xlabel('Relative Importance')
plt.draw()
plt.show()

For how to obtain variable importances from a random forest, see the official scikit-learn documentation.

Of course, once we know which features matter, we can drop the unimportant ones to speed up model training.
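
A minimal sketch of that idea, using a hypothetical cutoff of 5% importance (the threshold is illustrative, not from the original):

from sklearn import tree

# keep only columns whose importance exceeds the illustrative cutoff
keep = X_train.columns[clf.feature_importances_ > 0.05]
clf_small = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                        min_samples_leaf=5).fit(X_train[keep], y_train)
print("Accuracy on reduced features: {:.2f}".format(clf_small.score(X_test[keep], y_test)))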

Finally

  • Use cross-validation to evaluate the model
from sklearn import cross_validation

scores1 = cross_validation.cross_val_score(clf, X, y, cv=10)
scores1

The output result is as follows:

array([ 0.82222222,  0.82222222,  0.7752809 ,  0.87640449,  0.82022472,    0.76404494,  0.7752809 ,  0.76404494,  0.83146067,  0.78409091])
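
The ten fold scores are easier to compare as a single summary; the mean and standard deviation can be computed directly:

# average accuracy across the 10 folds, plus its spread
print("CV accuracy: {:.3f} +/- {:.3f}".format(scores1.mean(), scores1.std()))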
  • Use more metrics to evaluate the model
from sklearn import metrics

def measure_performance(X, y, clf, show_accuracy=True,
                        show_classification_report=True,
                        show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print("Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n")
    if show_classification_report:
        print("Classification report")
        print(metrics.classification_report(y, y_pred), "\n")
    if show_confusion_matrix:
        print("Confusion matrix")
        print(metrics.confusion_matrix(y, y_pred), "\n")

measure_performance(X_test, y_test, clf, show_classification_report=True, show_confusion_matrix=True)

The output result is as follows. We can see additional metrics such as precision and recall.

Accuracy:0.834

Classification report
             precision    recall  f1-score   support

          0       0.85      0.88      0.86       134
          1       0.81      0.76      0.79        89
avg / total       0.83      0.83      0.83       223

Confusion matrix
[[118  16]
 [ 21  68]]
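
In scikit-learn's confusion matrix, rows are the true classes and columns the predicted ones: 118 non-survivors and 68 survivors are classified correctly, while 16 non-survivors are predicted as survivors and 21 survivors as non-survivors.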
  • Compare with a random forest
from sklearn.ensemble import RandomForestClassifier

clf2 = RandomForestClassifier(n_estimators=1000, random_state=33)
clf2 = clf2.fit(X_train, y_train)
scores2 = cross_validation.cross_val_score(clf2, X, y, cv=10)
clf2.feature_importances_
scores2.mean(), scores1.mean()

Accuracy output (here we use the mean of the 10-fold cross-validation scores):

(0.81262938372488946, 0.80352769265690616)

The accuracy of the random forest is about 0.01 higher (roughly one percentage point) than that of the decision tree.

Summary

With the above analysis, we have walked through all the steps a data scientist takes from getting the data to reaching a conclusion.

What matters in this article is not the result; it is meant to help you understand the general process and steps of data processing.

The remaining details are left to your imagination, improvement, and innovation.

Reference

Decision tree and random forest in Python

Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
