Kaggle Contest Title -- Titanic: Machine Learning from Disaster


The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life is that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Title Link: https://www.kaggle.com/c/titanic-gettingStarted

Description: The sinking of the Titanic killed roughly 1,500 passengers and crew. An important cause of this tragedy is that there were far too few lifeboats. Although luck was a factor, some groups, such as women, children and the upper class, were more likely to survive. In this problem, we want you to analyze who was more likely to survive.

From prior knowledge gained from books, movies and so on, we know that women and children had priority. The training data can be used to calculate the survival rate of women.

#!/usr/bin/env python
# coding: utf-8
"""Created on November 25, 2014  @author: zhaohf"""
import pandas as pd

df = pd.read_csv('../data/train.csv', header=0)
female_tourist = len(df[df['Sex'] == 'female'])
female_survived = len(df[(df['Sex'] == 'female') & (df['Survived'] == 1)])
print female_survived * 1.0 / female_tourist
The result is: 0.742038216561.

Therefore, we can crudely assume that any female passenger basically survived.

Let's use the simple rule above to predict on the test data:

import csv

tf = pd.read_csv('../data/test.csv', header=0)
ntf = tf.iloc[:, [0, 3]]   # keep PassengerId and Sex
ntf['Gender'] = ntf['Sex'].map({'female': 1, 'male': 0}).astype(int)
ids = ntf['PassengerId'].values
predicts = ntf['Gender'].values
predictions_file = open("../submissions/gender_submission.csv", "wb")
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
open_file_object.writerows(zip(ids, predicts))
predictions_file.close()
The score after submitting the results is 0.76555.

The accuracy is only so-so; presumably adding social class and age as features would make it more accurate.
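
One quick way to check that intuition against the training data is to group by Pclass and Sex and look at the mean of Survived. A minimal sketch (column names follow the Kaggle train.csv; the exact rates are not reproduced here):

import pandas as pd

df = pd.read_csv('../data/train.csv', header=0)
# survival rate for each combination of passenger class and sex
print df.groupby(['Pclass', 'Sex'])['Survived'].mean()

If upper-class women survive at a much higher rate than third-class men, that supports adding Pclass as a feature.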

Next comes some data cleaning, because many values are non-numeric or missing. The output below shows the number of non-null values in each column: the training data has 891 rows in total, and Age and Cabin are missing in many of them.

import pandas as pd

df = pd.read_csv('../data/train.csv', header=0)
print df.info()

Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
None

Non-numeric columns such as Name and Ticket are not used in the subsequent processing, so those columns are dropped, and the NaN values are handled: a missing Age is replaced with the mean age.

import numpy as np

df = df.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
# compute the mean age while ignoring NaN values, then fill NaNs with it
m = np.ma.masked_array(df['Age'], np.isnan(df['Age']))
mean_age = np.mean(m).astype(int)
df['Age'] = df['Age'].map(lambda x: mean_age if np.isnan(x) else x)
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)

The same preprocessing is required for the test file before it can be applied to the model.
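
As a sketch of that step (it mirrors the test-set handling in the full listing below; note that test.csv also contains a missing Fare value, which is simply filled with 0):

import pandas as pd
import numpy as np

tf = pd.read_csv('../data/test.csv', header=0)
tf = tf.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
# fill missing ages with the integer mean age, as was done for the training set
m = np.ma.masked_array(tf['Age'], np.isnan(tf['Age']))
mean_age = np.mean(m).astype(int)
tf['Age'] = tf['Age'].map(lambda x: mean_age if np.isnan(x) else int(x))
tf['Sex'] = tf['Sex'].map({'female': 1, 'male': 0}).astype(int)
# the test set has a missing Fare, filled with 0 here
tf['Fare'] = tf['Fare'].map(lambda x: 0 if np.isnan(x) else int(x)).astype(int)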

The following trains a model using a decision tree.

from sklearn import tree
from sklearn import cross_validation

X = df.values
y = df['Survived'].values
X = np.delete(X, 1, axis=1)   # remove the Survived column from the feature matrix
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3, random_state=0)
dt = tree.DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
print dt.score(X_test, y_test)
If the score looks good enough, the model is used to make predictions on the test file.

All the code:

#!/usr/bin/env python
# coding: utf-8
"""Created on November 25, 2014  @author: zhaohf"""
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn import cross_validation
import csv

# training data: drop unused columns, fill missing ages, encode sex as 0/1
df = pd.read_csv('../data/train.csv', header=0)
df = df.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
m = np.ma.masked_array(df['Age'], np.isnan(df['Age']))
mean_age = np.mean(m).astype(int)
df['Age'] = df['Age'].map(lambda x: mean_age if np.isnan(x) else x)
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)

# train a decision tree and score it on a 70/30 split of the training data
X = df.values
y = df['Survived'].values
X = np.delete(X, 1, axis=1)   # remove the Survived column from the feature matrix
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3, random_state=0)
dt = tree.DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
print dt.score(X_test, y_test)

# test data: apply the same preprocessing, then predict and write the submission
test = pd.read_csv('../data/test.csv', header=0)
tf = test.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
m = np.ma.masked_array(tf['Age'], np.isnan(tf['Age']))
mean_age = np.mean(m).astype(int)
tf['Age'] = tf['Age'].map(lambda x: mean_age if np.isnan(x) else int(x))
tf['Sex'] = tf['Sex'].map({'female': 1, 'male': 0}).astype(int)
tf['Fare'] = tf['Fare'].map(lambda x: 0 if np.isnan(x) else int(x)).astype(int)
predicts = dt.predict(tf)
ids = tf['PassengerId'].values
predictions_file = open("../submissions/dt_submission.csv", "wb")
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
open_file_object.writerows(zip(ids, predicts))
predictions_file.close()

The following are the feature importances of the resulting decision tree (the impurity-based values exposed as dt.feature_importances_). The third value corresponds to Sex, and it accounts for more than half of the total importance, which shows how significant gender was for survival in this disaster.

[0.06664883 0.14876052 0.52117953 0.10608185 0.08553209 0.00525581 0.06654137]
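
To make that vector easier to read, the importances can be printed next to the feature names. After dropping Ticket, Name, Cabin and Embarked and removing Survived, the feature order is PassengerId, Pclass, Sex, Age, SibSp, Parch, Fare; a small sketch using the fitted dt model from the listing above:

# feature order of X after the column drops and the removal of Survived
feature_names = ['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']
for name, importance in zip(feature_names, dt.feature_importances_):
    print name, importance
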
The final score is 0.78469. That is still not a very high score; some people manage to get it 100% right!
