The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Title Link: https://www.kaggle.com/c/titanic-gettingStarted
Description: The sinking of the Titanic killed 1502 passengers and crew. An important cause of this tragedy was that there were not nearly enough lifeboats. Although luck played a part, some groups, such as women, children and the upper class, were more likely to survive. In this question, we want you to analyze who was more likely to survive.
From prior knowledge such as books and movies, we know that women and children had priority. We can use the training data to calculate the survival rate of women:
#!/usr/bin/env python#coding:utf-8 "Created on November 25, 2014 @author:zhaohf" ' Import pandas as Pddf = Pd.read_csv (' ... /data/train.csv ', header=0) female_tourist = len (df[df[' sex '] = = ' female ')) female_survived = Len (df[(df[' sex ') = = ' Female ') & (df[' survived ') = = 1)] Print female_survived * 1.0/female_tourist
The result is: 0.742038216561.
Therefore, a crude rule of thumb is that being a woman was almost enough to survive. Let's use this simple rule to predict the test data:
import csv
import pandas as pd

tf = pd.read_csv('../data/test.csv', header=0)
ntf = tf.iloc[:, [0, 3]]  # keep only PassengerId and Sex
ntf['Gender'] = ntf['Sex'].map({'female': 1, 'male': 0}).astype(int)
ids = ntf['PassengerId'].values
predicts = ntf['Gender'].values
predictions_file = open("../submissions/gender_submission.csv", "wb")
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
open_file_object.writerows(zip(ids, predicts))
predictions_file.close()
The score after submitting the results is 0.76555. That accuracy is only so-so; my guess is that adding features such as social class and age would make it more accurate.
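As a quick sanity check of that guess, here is a minimal sketch (assuming the same train.csv and column names as above, and not part of the original code) that groups the training data by class and sex and compares survival rates:

import pandas as pd

df = pd.read_csv('../data/train.csv', header=0)
# survival rate broken down by passenger class and sex
print df.groupby(['Pclass', 'Sex'])['Survived'].mean()

If first-class women survive at a much higher rate than third-class men, that suggests Pclass is indeed worth keeping as a feature.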
Next comes some data cleaning, because many values are non-numeric or missing. The output below shows the number of non-null values per column: the training data has 891 rows in total, and Age and Cabin in particular have many missing values.
import pandas as pd

df = pd.read_csv('../data/train.csv', header=0)
print df.info()

Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
None
Name, Ticket and the other non-numeric columns will not be used in the subsequent processing, so those columns are dropped, and the NaN values are handled: a NaN age is replaced with the average age.
import numpy as np

df = df.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
m = np.ma.masked_array(df['Age'], np.isnan(df['Age']))
mean = np.mean(m).astype(int)
df['Age'] = df['Age'].map(lambda x: mean if np.isnan(x) else x)
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)
The same preprocessing is required for the test file before the model can be applied to it.
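Just as an illustration (the helper name preprocess is mine, not from the original code), the cleaning steps could be wrapped in a function and applied to both files; note that test.csv also has a missing Fare value, which the full code below fills separately:

import numpy as np
import pandas as pd

def preprocess(frame):
    # drop the non-numeric columns and encode Sex, mirroring the steps above
    frame = frame.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
    masked_age = np.ma.masked_array(frame['Age'], np.isnan(frame['Age']))
    mean_age = np.mean(masked_age).astype(int)
    frame['Age'] = frame['Age'].map(lambda x: mean_age if np.isnan(x) else x)
    frame['Sex'] = frame['Sex'].map({'female': 1, 'male': 0}).astype(int)
    return frame

df = preprocess(pd.read_csv('../data/train.csv', header=0))
tf = preprocess(pd.read_csv('../data/test.csv', header=0))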
The following trains the model using a decision tree.
from sklearn import tree
from sklearn import cross_validation

X = df.values
y = df['Survived'].values
X = np.delete(X, 1, axis=1)  # remove the Survived column from the feature matrix
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3, random_state=0)
dt = tree.DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
print dt.score(X_test, y_test)
If the score looks good enough, the model is used to predict the test file.
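A single 70/30 split can be a noisy estimate. As an optional check, a minimal sketch using cross_val_score from the same sklearn.cross_validation module already imported above (newer sklearn versions moved it to model_selection) averages the accuracy over several splits:

from sklearn import cross_validation

# dt, X and y come from the training code above; scores holds 5 fold accuracies
scores = cross_validation.cross_val_score(dt, X, y, cv=5)
print scores.mean(), scores.std()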
All the code:
#!/usr/bin/env python#coding:utf-8 "Created on November 25, 2014 @author:zhaohf" "Import pandas as Pdimport NumPy as Npfrom SK Learn import treefrom sklearn import Cross_validationimport CSVDF = Pd.read_csv ('.. /data/train.csv ', header=0) df = Df.drop ([' Ticket ', ' Name ', ' Cabin ', ' embarked '],axis=1) m = Np.ma.masked_array (df[' age ' ], Np.isnan (df[' age ')) mean = Np.mean (m). Astype (int.) df[' age ' = df[' age '].map (lambda X:mean if Np.isnan (x) Else x) df[' Se X '] = df[' Sex '].map ({' Female ': 1, ' Male ': 0}). astype (int) X = Df.valuesy = df[' survived '].valuesx = Np.delete (X,1,axis=1) X_train, X_test, y_train, y_test = Cross_validation.train_test_split (x,y,test_size=0.3,random_state=0) dt = tree. Decisiontreeclassifier (max_depth=5) dt.fit (X_train, y_train) print Dt.score (x_test,y_test) test = Pd.read_csv (' ... /data/test.csv ', header=0) tf = Test.drop ([' Ticket ', ' Name ', ' Cabin ', ' embarked '],axis=1) m = Np.ma.masked_array (tf[') Age '], Np.isnan (tf[' age ')) mean = Np.mean (m). Astype (int) tf["Age" = tf[' age '].map (lambda X:mean if np.iSnan (x) Else int (x)) tf[' sex ' = tf[' sex '].map ({' Female ': 1, ' Male ': 0}). astype (int) tf[' Fare '] = tf[' Fare '].map (lambda x : 0 if Np.isnan (x) Else int (x)). Astype (int) predicts = dt.predict (tf) ids = tf[' passengerid '].valuespredictions_file = Open (".. /submissions/dt_submission.csv "," WB ") Open_file_object = Csv.writer (predictions_file) Open_file_object.writerow ([" Passengerid "," survived "]) open_file_object.writerows (Zip (IDs, predicts)) Predictions_file.close ()
The following are the feature importances of the trained decision tree (dt.feature_importances_). The third value corresponds to Sex, and it accounts for more than half of the total, which shows how significant gender was for survival in this disaster.
[0.06664883 0.14876052 0.52117953 0.10608185 0.08553209 0.00525581 0.06654137]
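To see which number belongs to which column, a small sketch (assuming the df and dt from the code above, where Survived at index 1 was removed from the feature matrix) can pair the importances with the remaining column names:

# the feature matrix holds every column except Survived
feature_names = [c for c in df.columns if c != 'Survived']
for name, importance in zip(feature_names, dt.feature_importances_):
    print name, importance

With the columns kept above, the order is PassengerId, Pclass, Sex, Age, SibSp, Parch, Fare, which puts Sex in the third position.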
The final score is 0.78469. That is still a fairly low score; some people have managed to get it 100% right!
Kaggle contest: Titanic: Machine Learning from Disaster