Kernel original link: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
The competition is a regression problem: predicting house prices.
Prologue: "The most difficult thing in life is to know yourself."
The kernel covers four areas:
1. Understanding the problem: examine each variable and study its meaning and relevance to the problem.
2. Univariate study: focus on the target variable of this competition (SalePrice, the house price to predict).
3. Multivariate study: analyze how the independent variables relate to each other and to the target.
4. Data cleaning: handle missing values, outliers, and categorical attributes.
Note: Among the imported packages, the Seaborn library is particularly useful, especially for visually analyzing the variables.
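Something like the following gets the analysis going (a minimal setup sketch; the file name train.csv assumes the Kaggle competition data sits in the working directory):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the training data (file name is an assumption based on the Kaggle layout)
df_train = pd.read_csv('train.csv')
print(df_train.columns)  # list the attribute columns
```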
One: Understanding the Problem and Univariate Study
The task is to predict house prices from a set of property attributes. For this problem we first need to look at how the prices are distributed and over what range they vary. Since housing is a domain most of us have intuitions about, we can also guess subjectively which variables are likely to be strongly correlated with the target (the house price). Of course, subjective analysis is not rigorously persuasive, and in many tasks intuition is inaccurate or simply unavailable.
Subjective reasoning can get us started on the problem, but a better approach is to measure the correlations between variables from the data itself.
Let's analyze the target first: the house price. Start with describe() to see the general distribution; you can also plot the distribution with sns.distplot (sns is the conventional seaborn alias), and further compute skewness and kurtosis, all of which the pandas DataFrame supports very well.
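A sketch of those three steps, using the df_train loaded above:

```python
# Summary statistics of the target
print(df_train['SalePrice'].describe())

# Distribution plot (sns.histplot(..., kde=True) replaces distplot in newer seaborn)
sns.distplot(df_train['SalePrice'])
plt.show()

# Shape measures, supported directly by pandas
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
```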
After studying the house price, we turn to the attributes related to it. In this example, first look at which attribute columns the data contains; from those columns, OverallQual, YearBuilt, TotalBsmtSF, and GrLivArea feel strongly correlated with the predicted price. For numeric attributes, a scatter plot of the attribute against the price makes it easy to see whether the attributes we guessed really do vary linearly with the price. For categorical attributes, a box plot of the price per category describes the relationship well. (Seaborn really is a plotting gem.)
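For example (a sketch using one numeric and one ordinal attribute from the guesses above):

```python
# Scatter plot: numeric attribute vs. price
data = pd.concat([df_train['SalePrice'], df_train['GrLivArea']], axis=1)
data.plot.scatter(x='GrLivArea', y='SalePrice')
plt.show()

# Box plot: price per category of an ordinal attribute
data = pd.concat([df_train['SalePrice'], df_train['OverallQual']], axis=1)
sns.boxplot(x='OverallQual', y='SalePrice', data=data)
plt.show()
```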
Two: Multivariate Study
Well, we've done a univariate analysis of the target (house price) and formed a subjective sense of which variables are likely related to it. Scatter plots and box plots make it easy to verify those conjectures.
But data analysis needs to let the data speak, not rely on human experience alone; experience certainly plays a supporting role in the analysis.
Below we use computation on the data to determine which attributes are strongly correlated with the house price, and how exactly they are related.
To study the correlations between variables (including between each attribute and the house price), we can use the correlation matrix.
DataFrame.corr() computes the correlation coefficients easily, and a heatmap then makes the correlations easy to visualize (nice).
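A sketch (the numeric_only flag is needed on newer pandas, where corr() no longer silently skips string columns):

```python
# Correlation matrix over the numeric columns, rendered as a heatmap
corrmat = df_train.corr(numeric_only=True)
plt.figure(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.8, square=True)
plt.show()
```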
OK, now let's sift out the 10 attributes most correlated with the house price and draw a heatmap of just those.
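One way to do it (a sketch building on corrmat from above):

```python
# Zoomed heatmap of the k attributes most correlated with SalePrice
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.heatmap(cm, annot=True, square=True, fmt='.2f',
            yticklabels=cols.values, xticklabels=cols.values)
plt.show()
```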
Nice. From the heatmap above it's easy to see which attributes are linearly correlated with the house price, and also which attributes are highly correlated with each other, which suggests an approach to feature selection.
Finally, seaborn has one more killer feature: sns.pairplot directly visualizes the pairwise distributions between all the attributes. It's perfect.
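A sketch, restricted to a handful of key attributes so the grid stays readable (the column selection here is an example, not prescribed):

```python
# Pairwise scatter plots between selected attributes and the target
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars',
        'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)
plt.show()
```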
Three: Data Cleaning
Handling missing data
For missing data, we need to consider two questions. First: how is the missing data distributed across the columns? Second: is the data missing at random or does it follow a pattern, and how correlated is the affected attribute column with the target?
With these questions in mind, let's first look at the distribution of missing data.
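A compact way to tabulate it (a sketch):

```python
# Missing-value count and rate per column, worst first
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum() / df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data.head(20))
```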
Looking at the result, some columns have a missing rate above 90%; for such severely missing attribute columns, simply dropping the column is a reasonable choice.
For the rest, we consider how the missing attribute columns correlate with the target. From the correlation-matrix heatmap, we know the Garage* attributes are highly correlated with each other, and their missing rates are nearly identical, so we can keep the most relevant one and delete the remaining few.
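A sketch of that cleanup, using the missing_data table from above (the "more than one missing value" cutoff mirrors the original kernel's choice; adjust to taste):

```python
# Drop every column with more than one missing value...
cols_to_drop = missing_data[missing_data['Total'] > 1].index
df_train = df_train.drop(columns=cols_to_drop)

# ...then drop the few remaining rows that still contain a gap
df_train = df_train.dropna()
```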
Handling outliers
Handling outliers means first finding them, then analyzing them.
To find them, you can use scatter plots and inspect the sorted values.
A few suspected outliers can be spotted in the figure above: two in the lower-right corner and two in the upper-right. The two in the lower-right are clearly problematic (very large living area but low price) and can be removed. The two in the upper-right, although far from the crowd, still follow the overall linear trend between the attribute and the price, so they can be kept.
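A sketch of the removal (the rows to drop are identified by inspecting the scatter plot; here we assume the two largest GrLivArea values are the lower-right offenders and drop them via the Id column):

```python
# Inspect the extreme points first
print(df_train.sort_values(by='GrLivArea', ascending=False)[:2])

# Drop the two suspected outliers by Id
outlier_ids = df_train.sort_values(by='GrLivArea', ascending=False)['Id'][:2]
df_train = df_train[~df_train['Id'].isin(outlier_ids)]
```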
The last things to mention are normality, homoscedasticity, linearity, and the absence of correlated errors.
I still don't fully understand these concepts from the original text, but the author proposes using a log transform to correct the data distribution toward normality; apparently this makes the data easier to analyze and process, and more standardized. (I don't quite understand this part.)
Here's an example of how the log transform makes a distribution more normal. The original GrLivArea vs. SalePrice scatter is shown below; note the obvious cone shape.
After the log transform, the cone shape is gone and the data distribution looks more natural and linear.
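The transform itself is one line per column (np.log1p is a safe alternative when a column can contain zeros):

```python
# Log-transform the skewed variables
df_train['SalePrice'] = np.log(df_train['SalePrice'])
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])

# Re-plot to confirm the cone shape is gone
plt.scatter(df_train['GrLivArea'], df_train['SalePrice'])
plt.show()
```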
The final thing to mention: categorical data should be one-hot encoded.
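pandas makes this a one-liner:

```python
# One-hot encode all categorical (object/category) columns
df_train = pd.get_dummies(df_train)
```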