Using Python to Understand Data: A Visual Analysis of a House Price Prediction Kernel


Original kernel link: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

The competition is a regression task: predicting house prices.

Prologue: the hardest thing in life is to understand yourself.

The kernel covers four areas:

1. Understanding the problem: examine each variable and study its meaning and importance to the problem
2. Univariate study: focus on the target variable of this competition (the house price to be predicted)
3. Multivariate study: analyze the relationships between the independent variables and the target
4. Data cleaning: handle missing values, outliers, and categorical attributes

Note: among the imported packages, the seaborn library is particularly useful, especially for visual analysis of variables.

Part One: Understanding the Problem and Univariate Study

The problem is to predict house prices from a set of property attributes. We first need to look at the distribution of house prices and the range over which they vary. For a problem like this that we can reason about intuitively, we can also try to guess which variables may be strongly correlated with the target (the house price) using our own subjective judgment. Of course, subjective analysis is not rationally persuasive, and in many tasks intuition is inaccurate or simply unavailable.

Subjective analysis can help us get a feel for the problem, but a better approach is to analyze the correlations between variables from the data itself.

Let's analyze the target first: the house price, SalePrice. Start with describe() to get a sense of the overall distribution; you can also use sns.distplot (sns is the conventional abbreviation for seaborn) to view the price distribution, and further compute skewness and kurtosis, all of which the pandas DataFrame data structure supports very well.
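
A minimal sketch of this step, assuming the competition's train.csv has been downloaded locally (the file path is an assumption; adjust it to your setup):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training data (path is an assumption)
df_train = pd.read_csv('train.csv')

# Summary statistics of the target variable
print(df_train['SalePrice'].describe())

# Visualize the price distribution. The kernel used sns.distplot,
# which is deprecated in recent seaborn; histplot with a KDE overlay
# is the modern equivalent.
sns.histplot(df_train['SalePrice'], kde=True)
plt.show()

# Skewness and kurtosis come straight from pandas
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
```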


After studying the house price itself, we turn to the attributes that may relate to it. First look at which attribute columns the data contains; from them we might guess that OverallQual, YearBuilt, TotalBsmtSF, and GrLivArea are strongly correlated with the predicted price. For numeric attributes, a scatter plot of attribute versus price makes it easy to see whether a guessed attribute is linearly correlated with the price. For categorical attributes, a box plot describes the relationship between the category and the price. (seaborn really is a superb plotting tool.)
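
A sketch of both plot types, continuing from the snippet above (GrLivArea and OverallQual are column names from the competition data):

```python
# Scatter plot: a numeric attribute vs. SalePrice
data = pd.concat([df_train['SalePrice'], df_train['GrLivArea']], axis=1)
data.plot.scatter(x='GrLivArea', y='SalePrice', ylim=(0, 800000))
plt.show()

# Box plot: a categorical/ordinal attribute vs. SalePrice
data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=data, ax=ax)
plt.show()
```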



Part Two: Multivariate Study

So far we have analyzed the target (house price) on its own and formed subjective guesses about which variables may be related to it; scatter plots and box plots make it easy to check those conjectures visually.

But data analysis should let the data speak rather than rely on human experience; experience certainly helps, but only in an auxiliary role.

Below we use computation on the data to determine which attributes are strongly correlated with the house price, and how they are related.

To study the correlations between variables (including between each attribute and the house price), we can use the correlation matrix.

DataFrame.corr() makes it easy to compute the correlation coefficients, and a heatmap then visualizes the correlations at a glance. (Nice!)
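
Continuing the session above, a sketch of the full correlation heatmap:

```python
# Correlation matrix over the numeric attributes
# (numeric_only avoids errors on mixed-type frames in recent pandas)
corrmat = df_train.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.8, square=True, ax=ax)
plt.show()
```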

OK, now let's pick out the 10 attributes most correlated with the house price and draw a heatmap of just those.
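
A sketch of the top-10 heatmap, in the spirit of the kernel (it reuses corrmat from the previous snippet):

```python
# Select the 10 attributes most correlated with SalePrice
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)

fig, ax = plt.subplots(figsize=(9, 9))
sns.heatmap(cm, annot=True, square=True, fmt='.2f',
            yticklabels=cols.values, xticklabels=cols.values, ax=ax)
plt.show()
```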

Nice! From the heatmap above it is easy to see which attributes are linearly correlated with the house price, and also which attributes are highly correlated with each other, which suggests an approach to feature selection.

Finally, seaborn has one more killer feature: directly visualizing the pairwise distributions among all the selected attributes. It is perfect for this.
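
A sketch using sns.pairplot; the column list here is an illustrative selection of the strongly correlated attributes discussed above:

```python
# Pairwise scatter plots among the most relevant attributes
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)
plt.show()
```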

Part Three: Data Cleaning

Handling missing data

For missing data we need to consider two questions: first, how the missing data is distributed; second, whether the missingness is random or follows a pattern, and how the affected attribute columns correlate with the target.

With these questions in mind, let's first look at the distribution of missing data.
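
A sketch of the missing-data summary, following the kernel's approach:

```python
# Count and percentage of missing values per column, sorted descending
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(20))
```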

Looking at the result, some columns have missing rates above 90%; for such severely missing attribute columns, the simplest treatment is to drop the column entirely.

We also consider the correlation between the missing-data attribute columns and the target. From the correlation-matrix heatmap we know the Garage* attributes are highly correlated with one another, and their missing rates here are nearly identical, so we can keep the most relevant of them and drop the remaining few.
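
A sketch of one concrete strategy along these lines, mirroring the original kernel's choice: drop every column with more than one missing value, then drop the single row missing 'Electrical':

```python
# Drop heavily missing columns, then the one row missing 'Electrical'
df_train = df_train.drop(missing_data[missing_data['Total'] > 1].index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
print(df_train.isnull().sum().max())  # expect 0: no missing data remains
```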

Handling outliers

Outlier handling consists of two steps: discovering the outliers and then analyzing them.

To discover them, you can use scatter plots together with sorted-value analysis.
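
A sketch of both techniques, continuing the session (StandardScaler is from scikit-learn, as in the kernel):

```python
from sklearn.preprocessing import StandardScaler

# Sorted-value analysis: standardize SalePrice and inspect the extremes
saleprice_scaled = StandardScaler().fit_transform(df_train[['SalePrice']])
print('outer range (low):', np.sort(saleprice_scaled, axis=0)[:10].ravel())
print('outer range (high):', np.sort(saleprice_scaled, axis=0)[-10:].ravel())

# Scatter plot to spot bivariate outliers against GrLivArea
df_train.plot.scatter(x='GrLivArea', y='SalePrice', ylim=(0, 800000))
plt.show()
```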

From the figure above we can spot a few suspected outliers: two in the lower right and two in the upper right. The two in the lower right are clearly problematic (very large living area but low price) and can be removed. The two in the upper right, although far from the main cluster, still follow the general linear trend between the attribute and the price, so they can be kept.
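
A sketch of the removal; it assumes (as in the kernel) that the two observations with the largest GrLivArea are exactly the two lower-right points:

```python
# Remove the two lower-right outliers: largest GrLivArea, low SalePrice
outlier_idx = df_train.sort_values(by='GrLivArea', ascending=False)[:2].index
df_train = df_train.drop(outlier_idx)
```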

The last topics to mention are normality, homoscedasticity, linearity, and the absence of correlated errors.

I still don't fully understand these concepts from the original text, but the kernel's author proposes using a log transform to correct the data distribution so that it becomes normal; apparently this processing makes the data easier to analyze and more standardized. (I don't quite understand this part.)


Here's an example showing how the log transform normalizes the distribution. The original GrLivArea vs. SalePrice scatter (figure below) shows an obvious cone-shaped region.

After the log transform, the cone shape is gone and the data distribution looks more natural and linear.
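
A sketch of the transform and the re-plot (np.log assumes both columns are strictly positive, which holds for these two):

```python
# Apply log transforms to correct skewness and the conic shape
df_train['SalePrice'] = np.log(df_train['SalePrice'])
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])

# Re-plot: the cone shape should be gone
plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])
plt.show()
```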

The final step is to one-hot encode the categorical data.
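
In pandas this is a one-liner via get_dummies, as the kernel does:

```python
# Convert categorical columns into one-hot (dummy) indicator columns
df_train = pd.get_dummies(df_train)
```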
