The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Title Link: https://www.kaggle.com/c/titanic-gettingStarted
Description: The sinking of the Titanic killed 1502 passengers and crew. An important cause of this tragedy was that there were not nearly enough lifeboats. Although luck played a part, some groups, such as women, children and the upper class, were more likely to survive. In this question, we want you to analyze who was more likely to survive.
From prior knowledge such as books and movies, we know that women and children had priority. We can use the training data to calculate the survival rate of women:
#!/usr/bin/env python#coding:utf-8 "Created on November 25, 2014 @author:zhaohf" ' Import pandas as Pddf = Pd.read_csv (' ... /data/train.csv ', header=0) female_tourist = len (df[df[' sex '] = = ' female ')) female_survived = Len (df[(df[' sex ') = = ' Female ') & (df[' survived ') = = 1)] Print female_survived * 1.0/female_tourist
The result is: 0.742038216561.
Therefore, a crude rule of thumb is that being a woman was almost enough to survive. Let's use this simple rule to predict the test data:
import csv
import pandas as pd

tf = pd.read_csv('../data/test.csv', header=0)
ntf = tf.iloc[:, [0, 3]]  # keep only PassengerId and Sex
ntf['Gender'] = ntf['Sex'].map({'female': 1, 'male': 0}).astype(int)
ids = ntf['PassengerId'].values
predicts = ntf['Gender'].values
predictions_file = open("../submissions/gender_submission.csv", "wb")
open_file_object = csv.writer(predictions_file)
open_file_object.writerow(["PassengerId", "Survived"])
open_file_object.writerows(zip(ids, predicts))
predictions_file.close()
The score after submitting the results is 0.76555. That accuracy is only so-so; my guess is that adding features such as social class and age would make it more accurate.
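As a quick sanity check of that guess, here is a minimal sketch (assuming the same train.csv and column names as above, and not part of the original code) that groups the training data by class and sex and compares survival rates:

import pandas as pd

df = pd.read_csv('../data/train.csv', header=0)
# survival rate broken down by passenger class and sex
print df.groupby(['Pclass', 'Sex'])['Survived'].mean()

If first-class women survive at a much higher rate than third-class men, that suggests Pclass is indeed worth keeping as a feature.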
Next comes some data cleaning, because many values are non-numeric or missing. The output below shows the number of non-null values per column: the training data has 891 rows in total, and Age and Cabin in particular have many missing values.
import pandas as pd

df = pd.read_csv('../data/train.csv', header=0)
print df.info()

Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
None
Name, Ticket and the other non-numeric columns will not be used in the subsequent processing, so those columns are dropped, and the NaN values are handled: a NaN age is replaced with the average age.
import numpy as np

df = df.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
m = np.ma.masked_array(df['Age'], np.isnan(df['Age']))
mean = np.mean(m).astype(int)
df['Age'] = df['Age'].map(lambda x: mean if np.isnan(x) else x)
df['Sex'] = df['Sex'].map({'female': 1, 'male': 0}).astype(int)
The same preprocessing is required for the test file before the model can be applied to it.
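Just as an illustration (the helper name preprocess is mine, not from the original code), the cleaning steps could be wrapped in a function and applied to both files; note that test.csv also has a missing Fare value, which the full code below fills separately:

import numpy as np
import pandas as pd

def preprocess(frame):
    # drop the non-numeric columns and encode Sex, mirroring the steps above
    frame = frame.drop(['Ticket', 'Name', 'Cabin', 'Embarked'], axis=1)
    masked_age = np.ma.masked_array(frame['Age'], np.isnan(frame['Age']))
    mean_age = np.mean(masked_age).astype(int)
    frame['Age'] = frame['Age'].map(lambda x: mean_age if np.isnan(x) else x)
    frame['Sex'] = frame['Sex'].map({'female': 1, 'male': 0}).astype(int)
    return frame

df = preprocess(pd.read_csv('../data/train.csv', header=0))
tf = preprocess(pd.read_csv('../data/test.csv', header=0))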
The following trains the model using a decision tree.
from sklearn import tree
from sklearn import cross_validation

X = df.values
y = df['Survived'].values
X = np.delete(X, 1, axis=1)  # remove the Survived column from the feature matrix
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.3, random_state=0)
dt = tree.DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
print dt.score(X_test, y_test)
If the score looks good enough, the model is used to predict the test file.
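A single 70/30 split can be a noisy estimate. As an optional check, a minimal sketch using cross_val_score from the same sklearn.cross_validation module already imported above (newer sklearn versions moved it to model_selection) averages the accuracy over several splits:

from sklearn import cross_validation

# dt, X and y come from the training code above; scores holds 5 fold accuracies
scores = cross_validation.cross_val_score(dt, X, y, cv=5)
print scores.mean(), scores.std()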
All the code:
#!/usr/bin/env python#coding:utf-8 "Created on November 25, 2014 @author:zhaohf" "Import pandas as Pdimport NumPy as Npfrom SK Learn import treefrom sklearn import Cross_validationimport CSVDF = Pd.read_csv ('.. /data/train.csv ', header=0) df = Df.drop ([' Ticket ', ' Name ', ' Cabin ', ' embarked '],axis=1) m = Np.ma.masked_array (df[' age ' ], Np.isnan (df[' age ')) mean = Np.mean (m). Astype (int.) df[' age ' = df[' age '].map (lambda X:mean if Np.isnan (x) Else x) df[' Se X '] = df[' Sex '].map ({' Female ': 1, ' Male ': 0}). astype (int) X = Df.valuesy = df[' survived '].valuesx = Np.delete (X,1,axis=1) X_train, X_test, y_train, y_test = Cross_validation.train_test_split (x,y,test_size=0.3,random_state=0) dt = tree. Decisiontreeclassifier (max_depth=5) dt.fit (X_train, y_train) print Dt.score (x_test,y_test) test = Pd.read_csv (' ... /data/test.csv ', header=0) tf = Test.drop ([' Ticket ', ' Name ', ' Cabin ', ' embarked '],axis=1) m = Np.ma.masked_array (tf[') Age '], Np.isnan (tf[' age ')) mean = Np.mean (m). Astype (int) tf["Age" = tf[' age '].map (lambda X:mean if np.iSnan (x) Else int (x)) tf[' sex ' = tf[' sex '].map ({' Female ': 1, ' Male ': 0}). astype (int) tf[' Fare '] = tf[' Fare '].map (lambda x : 0 if Np.isnan (x) Else int (x)). Astype (int) predicts = dt.predict (tf) ids = tf[' passengerid '].valuespredictions_file = Open (".. /submissions/dt_submission.csv "," WB ") Open_file_object = Csv.writer (predictions_file) Open_file_object.writerow ([" Passengerid "," survived "]) open_file_object.writerows (Zip (IDs, predicts)) Predictions_file.close ()
The following are the feature importances of the trained decision tree (dt.feature_importances_). The third value corresponds to Sex, and it accounts for more than half of the total, which shows how significant gender was for survival in this disaster.
[0.06664883 0.14876052 0.52117953 0.10608185 0.08553209 0.00525581 0.06654137]
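To see which number belongs to which column, a small sketch (assuming the df and dt from the code above, where Survived at index 1 was removed from the feature matrix) can pair the importances with the remaining column names:

# the feature matrix holds every column except Survived
feature_names = [c for c in df.columns if c != 'Survived']
for name, importance in zip(feature_names, dt.feature_importances_):
    print name, importance

With the columns kept above, the order is PassengerId, Pclass, Sex, Age, SibSp, Parch, Fare, which puts Sex in the third position.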
The final score is 0.78469. That is still a fairly low score; some people have managed to get it 100% right!
Kaggle contest: Titanic: Machine Learning from Disaster