1 Data exploration
A holistic understanding of the data
1.1 Viewing the data: what are its characteristics?
import pandas as pd
import seaborn as sns

titanic = pd.read_csv('g:\\titanic\\train.csv')
titanic.sample(10)
Sampling 10 rows gives a first impression of how the data is composed; the Age and Cabin columns clearly contain missing values. Before processing the data, look at its summary statistics. The minimum and maximum of each feature look reasonable, with no obvious outliers.
print(titanic.describe())
# View the common summary statistics
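The missing-value and range checks described above can also be done numerically. This is a minimal sketch on a toy DataFrame; the values are made up for illustration and `toy` is a hypothetical stand-in for the real `train.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Minimum and maximum of each numeric column, as a quick sanity check for outliers
value_ranges = toy.describe().loc[["min", "max"]]
print(value_ranges)
```

On the real data, the same two calls on `titanic` would show which columns need filling and whether any value falls outside a plausible range.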
2 Data analysis / processing
Based on general knowledge, Name and Ticket are unlikely to be related to whether a passenger survives, so these two features are ignored for now. Cabin has many missing values and therefore little reference value, so it is also set aside for now.
2.1 Sex feature Processing
Sex takes the values female and male, but some models only accept numeric input, so they are encoded as 0 and 1 respectively.
titanic.Sex = titanic.Sex.replace("male", 1)
titanic.Sex = titanic.Sex.replace("female", 0)
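The same encoding can be written as a single dictionary lookup. This is a sketch of an alternative, not the original code; `toy` is a hypothetical DataFrame with made-up rows.

```python
import pandas as pd

# Toy column with the two categories used in the text (hypothetical rows)
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Encode both categories in one step: female -> 0, male -> 1
toy["Sex"] = toy["Sex"].map({"female": 0, "male": 1})
print(toy["Sex"].tolist())
```

Note that `map` turns any value missing from the dictionary into NaN, which makes unexpected categories easy to spot.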
2.2 Age feature processing
Age has missing values: only 714 rows have an age recorded. The mean age is used to fill the missing values.
titanic.Age = titanic['Age'].fillna(titanic.Age.mean())
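To make the effect of mean imputation concrete, here is a small sketch on three made-up ages (`toy` and its values are hypothetical). The mean is computed over the non-missing entries only, then written into the gaps.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing entry (hypothetical values)
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

# mean() skips NaN by default, so this averages only the recorded ages
mean_age = toy["Age"].mean()

# Fill the gap with that mean
toy["Age"] = toy["Age"].fillna(mean_age)
print(toy["Age"].tolist())
```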
2.3 Embarked feature processing
Replace the Embarked values S, C, and Q with 0, 1, and 2 respectively.
titanic.Embarked = titanic.Embarked.replace("S", 0)
titanic.Embarked = titanic.Embarked.replace("C", 1)
titanic.Embarked = titanic.Embarked.replace("Q", 2)
The statistics for Embarked show that it also has missing values. These are filled with the most frequent value (S, encoded as 0).
titanic.Embarked = titanic["Embarked"].fillna(0)
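Rather than hard-coding the majority value, it can be computed from the column. The sketch below assumes filling before encoding, which is a different order from the code above; `toy` and its rows are hypothetical.

```python
import pandas as pd

# Toy column with one missing port of embarkation (hypothetical rows)
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})

# mode() ignores missing entries; take the first (most frequent) value
most_common = toy["Embarked"].mode()[0]

# Fill the gap with the majority value, then encode S, C, Q as 0, 1, 2
toy["Embarked"] = toy["Embarked"].fillna(most_common)
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})
print(toy["Embarked"].tolist())
```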
3 Feature Engineering
A heatmap is used to observe the correlation between each feature and Survived.
info = ["Survived", "PassengerId", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
sns.heatmap(titanic[info].corr(), annot=True, vmin=0, vmax=1)
According to the heatmap, Pclass, Sex, Fare, and Embarked correlate relatively strongly with Survived, so these features are used in the machine learning models.
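The ranking read off the heatmap can also be produced as a sorted list. This is a minimal sketch on a toy table with made-up rows (`toy` is hypothetical); it ranks features by the absolute value of their correlation with Survived, so strong negative correlations count as well.

```python
import pandas as pd

# Toy table with already-encoded features (hypothetical rows)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Sex": [1, 0, 0, 1, 0, 1],
    "PassengerId": [1, 2, 3, 4, 5, 6],
})

# Pearson correlation of every feature with the target column
corr_with_target = toy.corr()["Survived"].drop("Survived")

# Rank by strength, ignoring the sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

With `vmin=0` in the heatmap call, all negative correlations are drawn in the same color as zero, so a ranking by absolute value is a useful cross-check.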
4 Model Learning/evaluation
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

x = titanic[["Pclass", "Sex", "Age", "Fare", "Embarked"]]
y = titanic["Survived"]
Cross-validation is used to evaluate each model, and the mean of the fold scores is reported.
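As a sketch of what this evaluation returns: `cross_val_score` yields one score per fold, and the mean of those scores is the reported figure. The data here is synthetic, and `cv=5` is an assumption for illustration, not a value taken from the original code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and labels (not the Titanic data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# One accuracy score per fold; cv=5 is an illustrative choice
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
mean_accuracy = np.mean(scores)

print(scores)
print(mean_accuracy)
```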
4.1 Logistic regression
lm = linear_model.LogisticRegression()
scores = cross_val_score(lm, x, y, scoring="accuracy")
print(np.mean(scores))
4.2 k Nearest Neighbor
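from sklearn import neighbors

knn = neighbors.KNeighborsClassifier(weights="uniform")
scores = cross_val_score(knn, x, y, scoring="accuracy")
print(np.mean(scores))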
4.3 Decision Tree
from sklearn import tree

dt = tree.DecisionTreeClassifier()
scores = cross_val_score(dt, x, y, scoring="accuracy")
print(np.mean(scores))
4.4 Random Forest
from sklearn import ensemble

rf = ensemble.RandomForestClassifier()
scores = cross_val_score(rf, x, y, scoring="accuracy")
print(np.mean(scores))
4.5 GBDT
gbdt = ensemble.GradientBoostingClassifier()
scores = cross_val_score(gbdt, x, y, scoring="accuracy")
print(np.mean(scores))
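The five evaluations in this section follow the same pattern, so they can be run in one loop. This is a sketch under stated assumptions: the data comes from `make_classification` rather than the Titanic file, and `cv=5`, `max_iter=1000`, and `random_state=0` are illustrative choices, not settings from the original code.

```python
import numpy as np
from sklearn import ensemble, linear_model, neighbors, tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared features and labels (not the real data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The same five model families compared in this section, with default settings
models = {
    "logistic regression": linear_model.LogisticRegression(max_iter=1000),
    "k nearest neighbor": neighbors.KNeighborsClassifier(),
    "decision tree": tree.DecisionTreeClassifier(random_state=0),
    "random forest": ensemble.RandomForestClassifier(random_state=0),
    "GBDT": ensemble.GradientBoostingClassifier(random_state=0),
}

# Mean cross-validated accuracy for each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = float(np.mean(scores))

# Print the models from best to worst mean accuracy
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Swapping `X` and `y` for the `x` and `y` built from the Titanic data would give a side-by-side comparison of the models on the same folds.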
5 Summary
After data exploration, data processing, and a comparison of common machine learning models, GBDT turns out to perform best on Titanic survival prediction, reaching an accuracy of more than 82%.
Titanic Survival Prediction (Python)