Using Python for Titanic Survival Prediction: Data Exploration and Analysis


I have recently been working, on and off, through the Titanic survival prediction exercise. Many people have shared their solutions to this Kaggle competition online; the approaches are mature, and some write-ups are very detailed. Building on the work of those experts, I mainly follow the data mining process to organize my thinking, then practice each step to become familiar with how Python is used for data mining.

The general process of data mining is: data preview -> data preprocessing (missing values, discrete values, etc.) -> variable transformation (constructing new derived variables) -> data exploration (extracting features) -> training.

1. Data Preview

1.1 head()

Preview the first few rows of the data set to see what the values of each field look like.

1.2 info()

This shows how many non-null values each field has and what its data type is.

1.3 describe()

This gives a rough description of the numerical distribution of each integer or floating-point field. Looking at the minimum, maximum, and quartiles gives an overview of how skewed the data is.
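The three preview calls in this section can be sketched as follows. This is a minimal sketch, not the original code: the rows are a handful of illustrative values laid out like the Kaggle train.csv file, and the variable name data_train is an assumption.

```python
import io
import pandas as pd

# Illustrative rows only, in the column layout of the Kaggle train.csv file.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
1,0,3,male,22,1,0,7.25,,S
2,1,1,female,38,1,0,71.28,C85,C
3,1,3,female,26,0,0,7.93,,S
4,1,1,female,35,1,0,53.10,C123,S
5,0,3,male,,0,0,8.05,,S
6,0,3,male,,0,0,8.46,,Q
"""
# With the real file this would be pd.read_csv("train.csv").
data_train = pd.read_csv(io.StringIO(csv_text))

print(data_train.head())      # first five rows
data_train.info()             # non-null count and dtype per column (prints directly)
print(data_train.describe())  # count, mean, std, min, quartiles, max of numeric columns
```

Empty fields in the CSV are read as NaN, which is why info() reports fewer non-null values for Age and Cabin than for the other columns.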

2. Data Preprocessing

The data preview above shows missing values in age (Age), cabin number (Cabin), and port of embarkation (Embarked).

The data exploration below shows that the port of embarkation takes only 3 values, and only a few records are missing it, so the missing values are filled with the most frequent value.
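Filling with the most frequent value might look like the sketch below. The series is made up for illustration; only the idea of using the mode comes from the text.

```python
import numpy as np
import pandas as pd

# Made-up values; S, C and Q are the three port codes in the Kaggle data.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

most_common = embarked.mode()[0]              # most frequent non-null value
embarked_filled = embarked.fillna(most_common)

print(embarked.value_counts())                # counts per port, NaN excluded
print(embarked_filled.tolist())
```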

Only 204 records have a cabin number. In general, a feature with such a large proportion of missing values can be considered for discarding. Here, however, a missing value may mean that the passenger's ticket simply had no cabin number, just as some tickets we buy have no seat number. So the missing values are filled with 0 for now.
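A minimal sketch of that fill, on made-up cabin values. It uses the string "0" rather than the number 0 so the column stays a single type, and the has_cabin flag is a hypothetical extra that does not appear in the article.

```python
import numpy as np
import pandas as pd

# Made-up cabin numbers; NaN stands for "no cabin number recorded".
cabin = pd.Series(["C85", np.nan, "C123", np.nan, np.nan], name="Cabin")

# Treat a missing cabin as "no cabin assigned" and mark it with "0".
cabin_filled = cabin.fillna("0")

# Hypothetical alternative encoding: 1 if a cabin number exists, else 0.
has_cabin = cabin.notna().astype(int)

print(cabin_filled.tolist())
print(has_cabin.tolist())
```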

The Age field also has missing values. In general, vulnerable groups such as the old, the young, and the sick are given special care, so age should be a fairly important feature. Because it is a continuous value, the missing ages are filled by predicting them with an algorithm.
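The text does not say which algorithm is used, so the sketch below stands in a plain least-squares linear fit (NumPy only) for whatever model the author chose. The rows and the choice of predictor columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; Age is missing for two passengers.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "SibSp":  [1, 0, 0, 1, 0, 2, 0, 0],
    "Fare":   [71.3, 53.1, 13.0, 26.0, 7.9, 21.1, 8.1, 30.0],
    "Age":    [38.0, 35.0, 27.0, 30.0, 22.0, np.nan, np.nan, 54.0],
})
predictors = ["Pclass", "SibSp", "Fare"]  # assumed predictor columns

known = df["Age"].notna().to_numpy()      # True where Age is present

# Design matrix with an intercept column in front of the predictors.
X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy(dtype=float)])
y = df.loc[known, "Age"].to_numpy()

# Fit on the rows with a known age, then predict the rows without one.
coef, *_ = np.linalg.lstsq(X[known], y, rcond=None)
df.loc[~known, "Age"] = X[~known] @ coef

print(df)
```

Any regressor with a fit/predict split would slot into the same pattern: train on the rows where Age is known and write predictions into the rows where it is not.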

Finally, let's take a look at the data after the missing values have been filled.

3. Data Exploration

3.1 Distribution of individual field values

Look at the code first:

These are canvas-related settings

subplots_adjust() is used to adjust the spacing between the subplots on the canvas.
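The original code is not reproduced here, so the following is only a sketch of what such a canvas setup could look like. The sample rows, figure size, spacing values, and choice of four fields are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample in the layout of the Titanic data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 2, 3, 2],
    "Age":      [22, 38, 26, 35, 54, 2, 27, 14],
    "Embarked": ["S", "C", "S", "S", "Q", "S", "S", "C"],
})

fig = plt.figure(figsize=(10, 6))            # canvas size in inches
fig.subplots_adjust(wspace=0.4, hspace=0.4)  # horizontal and vertical spacing

ax1 = fig.add_subplot(2, 2, 1)               # 2 x 2 grid, first position
df["Survived"].value_counts().sort_index().plot(kind="bar", ax=ax1)
ax1.set_title("Survived (1 = yes)")

ax2 = fig.add_subplot(2, 2, 2)
df["Pclass"].value_counts().sort_index().plot(kind="bar", ax=ax2)
ax2.set_title("Passenger class")

ax3 = fig.add_subplot(2, 2, 3)
df["Age"].plot(kind="hist", bins=8, ax=ax3)
ax3.set_title("Age")

ax4 = fig.add_subplot(2, 2, 4)
df["Embarked"].value_counts().plot(kind="bar", ax=ax4)
ax4.set_title("Port of embarkation")

# Call fig.savefig("some_name.png") or plt.show() to view the result.
```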

The code above draws each subplot in its position on the canvas. The graphs are as follows:

3.2 Exploring the relationships between fields and survival to find features that are useful for the model

3.2.1 The relationship between passenger class and survival

The higher the class, the greater the proportion of survivors. The proportion of passengers who were not rescued is markedly higher in class 3. This indicates that passenger class is related to survival.
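A comparison like this can be computed with a cross-tabulation and a grouped mean. The sketch below uses made-up rows, so its numbers illustrate the technique and are not the results from the real data.

```python
import pandas as pd

# Made-up rows: 3 passengers each in classes 1 and 2, 4 in class 3.
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Passenger counts per class, split into not rescued (0) and rescued (1).
counts = pd.crosstab(df["Pclass"], df["Survived"])

# Survived is 0/1, so its mean within a class is that class's survival rate.
rate = df.groupby("Pclass")["Survived"].mean()

print(counts)
print(rate)
```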

3.2.2 The relationship between sex and survival

From the data, the proportion of women rescued is very high. The film also shows women being given priority, so sex is strongly related to survival as well.
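To read proportions directly instead of raw counts, a cross-tabulation can be normalized by row. This is a sketch on made-up rows, not the real figures.

```python
import pandas as pd

# Made-up rows: 4 women and 6 men.
df = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 6,
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# normalize="index" divides each row by its total, so each row sums to 1.
share = pd.crosstab(df["Sex"], df["Survived"], normalize="index")

print(share)
```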

3.2.3 The relationship between age and survival

First look at the distribution of age and the spread of its values.

Most passengers are between 20 and 50 years old, and the box plot shows a typical age of nearly 30 (the center line of a box plot marks the median rather than the mean).

Because age is a continuous value, we examine the relationship between age and survival by grouping age into bins and showing the statistics for each bin.
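Binning a continuous column is commonly done with pd.cut. In the sketch below the cut points, the group labels, and the rows are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up ages and outcomes.
df = pd.DataFrame({
    "Age":      [2, 9, 15, 22, 27, 31, 38, 45, 54, 63, 71],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
})

bins = [0, 12, 18, 40, 60, 100]  # assumed cut points, right edge inclusive
labels = ["child", "teen", "adult", "middle-aged", "senior"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

# Passenger count and survival rate for each age group.
summary = df.groupby("AgeGroup", observed=False)["Survived"].agg(["count", "mean"])

print(summary)
```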

The data show that the odds of survival change with age. Survival rates differ significantly between age groups, which indicates that age is related to survival.

3.2.4 The relationship between number of siblings and survival

From the data, passengers with 1-2 siblings have the highest survival rate.

3.2.5 The relationship between number of parents and children and survival

The data show that passengers with 1-3 parents and children have the highest survival rate. Beyond that, the survival rate falls as the number increases.
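The per-count tallies in 3.2.4 and 3.2.5 follow the same pattern, so one small helper can cover both. The function name survival_by is hypothetical and the rows are made up; SibSp and Parch are the Kaggle column names for siblings/spouses and parents/children aboard.

```python
import pandas as pd

def survival_by(df, column):
    """Return passenger count and survival rate for each value of `column`."""
    return df.groupby(column)["Survived"].agg(
        passengers="count", survival_rate="mean"
    )

# Made-up rows for illustration only.
df = pd.DataFrame({
    "SibSp":    [0, 0, 0, 1, 1, 2, 3, 0, 1, 4],
    "Parch":    [0, 0, 1, 1, 2, 0, 2, 0, 0, 5],
    "Survived": [0, 0, 1, 1, 1, 1, 0, 0, 1, 0],
})

for col in ["SibSp", "Parch"]:
    print(survival_by(df, col))
```

Showing the passenger count next to the rate helps when reading the result, because a rate computed from only one or two passengers says very little. The same helper would work for a categorical column such as Embarked.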

3.2.6 The relationship between port of embarkation and survival

The data show that the survival rate for one port is significantly higher than for the others. It may be that the ship called at some ports along the way and some passengers disembarked there.

This article references: Mr. Big Tree's Blog
