Recently I have been working, on and off, through the Titanic survival prediction exercise from Kaggle. Many people have shared solutions to this contest online; the approaches are mature, and some write-ups are very detailed. Building on the work of those experts, I mainly want to organize my thinking around the data mining process and practice each step, to become familiar with how Python is used for data mining.
The general process of data mining is: data preview -> data preprocessing (missing values, discrete values, etc.) -> variable transformation (constructing new derived variables) -> data exploration (extracting features) -> training.
1. Data Preview
1.1 head()
Preview the first few rows of the data set to see what the values of each field look like.
1.2 info()
You can see how many non-null values each field has and what type of field it is.
1.3 describe()
This gives a rough description of the distribution of each integer or floating-point field. Looking at the minimum, maximum, and quartiles gives an overview of how the data is skewed.
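The three preview calls above can be sketched as follows. This is a minimal example on a tiny made-up frame with Titanic-style column names; in practice the frame would presumably come from something like `pd.read_csv("train.csv")`, where the file name is an assumption.

```python
import pandas as pd

# Tiny made-up sample with Titanic-style columns (not the real data).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, None, 35.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

print(df.head(3))        # first three rows
df.info()                # non-null count and dtype of every column
summary = df.describe()  # count, mean, std, min, quartiles, max
print(summary)
```

Here `info()` reports 3 non-null values for `Age`, which is how the missing fields are spotted in the preview.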
2. Data preprocessing
The data preview above shows missing values in age (Age), cabin number (Cabin), and port of embarkation (Embarked).
The data exploration below shows that the port of embarkation takes only 3 values and that very few of them are missing, so the missing entries are filled with the most common value.
Only 204 records have a cabin number. In general, a feature with such a large proportion of missing values could be discarded. Here, however, it occurs to me that a missing value may mean the passenger's ticket simply had no cabin number, just as some tickets we buy have no seat number. So for now the missing values are filled with 0.
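The two fills described above can be sketched like this, on made-up values. The string "0" is used rather than the integer 0 only to keep the Cabin column a single type; that choice is an assumption, not something the original states.

```python
import pandas as pd

# Made-up sample; the real columns come from the Kaggle training data.
df = pd.DataFrame({
    "Embarked": ["S", "C", "S", None, "Q", "S", None],
    "Cabin": ["C85", None, "E46", None, None, "G6", None],
})

# Fill the port of embarkation with the most frequent value (the mode).
most_common = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common)

# Treat a missing cabin as "no cabin number" and mark it with "0".
df["Cabin"] = df["Cabin"].fillna("0")

print(df)
```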
The age field also has missing values. In general, the old, the young, and the weak are given special care, so age should be an important feature. Because it is a continuous value, the missing ages are filled with values predicted by an algorithm.
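The text does not say which algorithm is used for the prediction. As one possible illustration, here is a minimal sketch that fits an ordinary least-squares line on the rows where age is known and predicts the rest. The choice of linear regression, the predictor columns (`Pclass`, `Fare`), and the sample values are all assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample; the last two ages are missing.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "Fare": [80.0, 70.0, 30.0, 26.0, 8.0, 7.5, 7.9, 75.0],
    "Age": [45.0, 40.0, 30.0, 28.0, 20.0, 22.0, np.nan, np.nan],
})
features = ["Pclass", "Fare"]  # assumed predictors

known = df[df["Age"].notna()]
unknown = df[df["Age"].isna()]

def design(frame):
    """Feature matrix with a leading column of ones for the intercept."""
    X = frame[features].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])

# Fit on rows with a known age, then predict the missing ones.
coef, *_ = np.linalg.lstsq(design(known), known["Age"].to_numpy(), rcond=None)
df.loc[unknown.index, "Age"] = design(unknown) @ coef

print(df)
```

A tree-based regressor would follow the same split, fit, predict, and write-back pattern; only the model in the middle changes.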
Finally, let's take a look at the data after filling.
3. Data Exploration
3.1 Distribution of individual field values
Look at the code first:
The first part consists of canvas-related settings.
subplots_adjust() adjusts the spacing between the subplots on the canvas.
The rest of the code draws each subplot at its position on the canvas. The graphs are as follows:
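The original listing is not reproduced here. A minimal sketch of a canvas laid out this way might look like the following; the figure size, grid shape, spacing values, and choice of fields are assumptions, and the sample data is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})

# Canvas-related settings: overall size and spacing between subplots.
fig = plt.figure(figsize=(10, 6))
fig.subplots_adjust(wspace=0.4, hspace=0.5)

# Draw each subplot at its position on a 2x2 grid.
ax1 = plt.subplot2grid((2, 2), (0, 0))
df["Survived"].value_counts().sort_index().plot(kind="bar", ax=ax1)
ax1.set_title("Survived (1 = yes)")

ax2 = plt.subplot2grid((2, 2), (0, 1))
df["Pclass"].value_counts().sort_index().plot(kind="bar", ax=ax2)
ax2.set_title("Passenger class")

ax3 = plt.subplot2grid((2, 2), (1, 0), colspan=2)  # spans the bottom row
df["Age"].plot(kind="hist", bins=5, ax=ax3)
ax3.set_title("Age")
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` displays the figure.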
3.2 Exploring the relationship between each field and survival to find features useful for the model
3.2.1 The relationship between passenger class and survival
The higher the class, the greater the proportion of survivors. The proportion of passengers who were not rescued is markedly higher in class 3. This indicates that passenger class is related to survival.
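A comparison like this can be computed with a cross-tabulation and a grouped mean. The sketch below uses made-up rows, so the numbers it prints are illustrative and are not the real Titanic figures.

```python
import pandas as pd

# Made-up sample: passenger class and survival flag (1 = survived).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Counts of non-survivors (0) and survivors (1) per class.
counts = pd.crosstab(df["Pclass"], df["Survived"])

# Because Survived is 0/1, its mean per class is the survival rate.
rates = df.groupby("Pclass")["Survived"].mean()

print(counts)
print(rates)
```

The same two lines work for sex or port of embarkation by swapping the grouping column.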
3.2.2 The relationship between sex and survival
The data show that a very high proportion of women were rescued. The film also mentions "women first", so sex is strongly related to survival as well.
3.2.3 The relationship between age and survival
First look at the distribution of age and how spread out the values are.
Most passengers are between 20 and 50 years old, and the box plot shows an average age of close to 30.
Because age is a continuous value, we examine its relationship with survival by grouping ages into bins and comparing the statistics for each bin.
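Binning a continuous field like this is commonly done with `pd.cut`. The following is a minimal sketch; the bin edges, the group labels, and the sample rows are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up sample of ages and survival flags.
df = pd.DataFrame({
    "Age": [4, 9, 17, 25, 31, 38, 45, 52, 63, 70],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0, 0, 0],
})

# Assumed bin edges; each bin is open on the left and closed on the right.
bins = [0, 12, 18, 40, 60, 100]
labels = ["child", "teen", "adult", "middle-aged", "senior"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

# Survival rate per age group.
rate_by_group = df.groupby("AgeGroup", observed=False)["Survived"].mean()
print(rate_by_group)
```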
The data suggest that the odds of survival change with age. Survival rates differ significantly between age groups, indicating that age is related to survival.
3.2.4 The relationship between number of siblings and survival
The data show that passengers with 1-2 siblings aboard have the highest survival rate.
3.2.5 The relationship between number of parents/children and survival
The data show that passengers with 1-3 parents or children aboard have the highest survival rate; beyond that, the survival rate falls as the number grows.
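For count fields such as siblings and parents/children, it helps to show the group size next to the survival rate, since groups with large counts usually contain few passengers. A minimal sketch follows; the helper name `survival_table` is hypothetical, `SibSp` and `Parch` are the column names used in the Kaggle data, and the rows are made up.

```python
import pandas as pd

# Made-up sample with the Kaggle column names SibSp and Parch.
df = pd.DataFrame({
    "SibSp": [0, 0, 1, 1, 2, 4, 0, 1],
    "Parch": [0, 1, 0, 2, 1, 2, 0, 5],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0],
})

def survival_table(frame, column):
    """Passenger count and survival rate for each value of `column`."""
    return frame.groupby(column)["Survived"].agg(
        passengers="count", survival_rate="mean"
    )

sibsp_table = survival_table(df, "SibSp")
parch_table = survival_table(df, "Parch")
print(sibsp_table)
print(parch_table)
```

A rate computed from one or two passengers should be read with caution, which is why the count column is kept.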
3.2.6 The relationship between port of embarkation and survival
The data show that the survival rate differs by port, with one port clearly higher than the others. It may be that some ports were intermediate stops where some passengers disembarked.
This article references: Mr. Big Tree's Blog
Using Python for Titanic survival predictions-data exploration and analysis