An overview of exploratory data analysis (EDA)

Last Update:2018-10-05 Source: Internet

Author: User

Table of Contents

1. Steps and preparations for data exploration
2. Missing value handling
Why do I need to deal with missing values?
Why does data have missing values?
Techniques for missing value processing
3. Outlier detection and processing
What is an outlier?
What are the types of outliers?
What are the causes of outliers?
What is the impact of outliers on a dataset?
How to detect outliers?
How to remove outliers?
4. The Art of feature engineering
What is feature engineering?
The process of feature engineering
What is variable transformation?
When to use variable transformation
Common methods of variable transformation
What is feature extraction and what are its benefits?

1. Steps and preparations for data exploration

Garbage in, Garbage out

The steps involved in understanding and preparing data:

1. Variable identification
2. Univariate analysis
3. Bivariate analysis
4. Missing value treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation

Finally, to get a better model, you need to iterate over steps 4-7 several times.

Variable identification

Identify the predictor (input) variables and the target (output) variable
Identify the data type and the category of each variable

Variables can be defined as different categories:

Univariate analysis

At this stage, analyze the variables one at a time. The analysis method depends on whether the variable is discrete (categorical) or continuous.

Continuous variables: For a continuous variable, you need to know its central tendency and its spread (dispersion). Common statistical indicators include the mean, median, standard deviation, and interquartile range; histograms and box plots are the usual visualizations.

Discrete variables: For a discrete variable, you usually need the frequency of each value, as a count and as a percentage of the total (count%). A bar chart is a convenient way to visualize it.
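The two kinds of univariate summary can be sketched with the Python standard library. This is a minimal illustration, not a prescribed workflow: the `ages` and `genders` samples are made up, and the choice of indicators is an assumption.

```python
import statistics
from collections import Counter

# Hypothetical sample data, made up for illustration.
ages = [23, 25, 31, 35, 35, 41, 48, 52, 60]
genders = ["F", "M", "F", "F", "M", "F", "M", "F", "F"]

# Continuous variable: central tendency and spread.
q1, _, q3 = statistics.quantiles(ages, n=4)
summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "min": min(ages),
    "max": max(ages),
    "range": max(ages) - min(ages),
    "iqr": q3 - q1,
    "stdev": statistics.stdev(ages),
}

# Discrete variable: count and count% of each level.
counts = Counter(genders)
frequency = {
    level: (n, 100 * n / len(genders)) for level, n in counts.items()
}

for name, value in summary.items():
    print(f"{name}: {value}")
for level, (n, pct) in frequency.items():
    print(f"{level}: count={n}, count%={pct:.1f}")
```

The dictionary of counts is exactly the data a bar chart would display.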

Bivariate analysis

Bivariate analysis looks for a relationship between two variables at a predefined significance level. A pair of variables can be discrete vs. discrete, continuous vs. continuous, or discrete vs. continuous, and the analysis method depends on the combination.

Continuous & Continuous
To analyze the relationship between two continuous variables, start with a scatter plot. The pattern in the plot indicates the relationship between the variables, which can be linear or nonlinear.

A scatter plot shows that a relationship exists but not how strong it is. For that, use the correlation coefficient, which ranges from -1 to +1:

-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no linear correlation

The correlation can be calculated as follows:
Correlation = Cov(X, Y) / sqrt(Var(X) * Var(Y))
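The formula can be written out directly. The sketch below assumes sample (n - 1) variance and covariance; the helper names and the data are made up for illustration.

```python
import math

def variance(values):
    """Sample variance (divides by n - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Correlation = Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # perfect positive linear relationship
falling = [10, 8, 6, 4, 2]   # perfect negative linear relationship
peaked = [1, 3, 5, 3, 1]     # symmetric peak: clearly related, but not linearly

print(correlation(x, rising))
print(correlation(x, falling))
print(correlation(x, peaked))
```

The `peaked` case shows why a correlation of 0 only rules out a linear relationship: the two variables are clearly related, yet the coefficient is (numerically) zero.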

Categorical & Categorical

Two-way table: the rows and columns represent the categories of the two variables, and each cell holds a count or a count%
Stacked column chart: a visual form of the two-way table
Chi-square test
The chi-square test is used to assess the statistical significance of the relationship between the variables, that is, whether the evidence in the sample is strong enough to generalize the relationship to the whole population. It is based on the difference between the expected and the observed frequencies in the cells of the two-way table, and it returns the probability of the computed chi-square statistic under the chi-square distribution with the given degrees of freedom
Probability 0: indicates that the two variables are dependent
Probability 1: indicates that the two variables are independent
(To be Continued ...)
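As a minimal sketch of the chi-square test on a two-way table, the code below computes the statistic by hand for a 2x2 table. The counts are made up, no continuity correction is applied, and the p-value shortcut via `math.erfc` is valid only for 1 degree of freedom, which is why the example is restricted to a 2x2 table.

```python
import math

# Hypothetical 2x2 two-way table (made-up counts):
# rows = gender (F, M), columns = plays cricket (yes, no).
observed = [
    [30, 10],
    [15, 25],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Sum over all cells of (observed - expected)^2 / expected, where the
# expected count assumes the two variables are independent.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (obs - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.4f}, dof = {degrees_of_freedom}, p = {p_value:.5f}")
```

A p-value this close to 0 suggests that the two variables in the made-up table are dependent.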

Categorical & Continuous
You can visually compare a continuous variable across the levels of a discrete variable with box plots, one box per level.
However, when the discrete variable has only a few levels, the plot does not show whether the differences are statistically significant. To test statistical significance, you can use a Z-test, t-test, or ANOVA.

Z-test/t-test: Evaluate whether the difference between the means of two groups is statistically significant

The smaller the probability associated with Z, the more significant the difference between the two means. The t-test is similar to the Z-test, but it is used when each group has fewer than 30 observations
ANOVA: Evaluate whether the means of more than two groups differ in a statistically significant way
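A two-sample Z-test can be sketched as follows. The function name and the two groups are hypothetical; the sketch assumes independent groups that are large enough (here 40 observations each) for the normal approximation, and it uses the sample variances in place of the population variances.

```python
import math
import statistics
from statistics import NormalDist

def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for the difference between two means."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    z = (statistics.mean(a) - statistics.mean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical scores of two groups, generated deterministically.
group_a = [50 + (i % 7) for i in range(40)]
group_b = [53 + (i % 7) for i in range(40)]

z, p_value = two_sample_z_test(group_a, group_b)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

Comparing a group with itself gives z = 0 and a p-value of 1, the case of no difference at all.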

2. Missing value handling

Why do I need to deal with missing values

Missing values in the training data can reduce the generalization ability of a model or produce a biased model, because the relationships between variables can no longer be analyzed accurately. This can lead to wrong predictions or classifications.

In the original figure, the left side shows the data with missing values left unhandled and the right side shows the data after handling them; the two cases lead to completely different conclusions.

Why missing values appear in the dataset

Missing values arise in two phases of data analysis:

Data extraction: Errors in the extraction process can produce missing values. This type of error is usually easy to detect and can be fixed by correcting the extraction procedure.
Data collection: Errors made while the data is collected are harder to correct. They fall into the following four categories:
- Missing completely at random
  The probability that a value is missing is the same for all observations. For example, respondents in a survey decide purely at random whether to answer a question.
- Missing at random
  The value is missing at random, but the missing rate differs across the values of other input variables. For example, in a survey that collects age, women may leave the age field empty more often than men.
- Missing values that depend on unobserved variables
  In this case, the missingness is not random but depends on variables that we have not observed. For example, if a particular diagnostic procedure in a medical study causes physical discomfort, patients are more likely to drop out. These missing values are not random unless "physical discomfort" is recorded as an input variable.
- Missing values that depend on the missing value itself
  For example, people with very high or very low incomes often do not report their income.

How to handle missing values

Deletion
- Delete every sample that contains a missing value
- Drop only the missing values themselves and analyze each variable with the samples where it is present, so that different variables may have different sample sizes

Deletion is worth considering when the values are missing completely at random.
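The two deletion strategies (often called list-wise and pair-wise deletion) can be sketched like this. The records are made up, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 3000},
    {"age": None, "income": 4200},
    {"age": 41, "income": None},
    {"age": 33, "income": 5100},
]

# List-wise deletion: drop every sample that has any missing value.
complete_rows = [
    row for row in rows if all(value is not None for value in row.values())
]

# Pair-wise deletion: keep every value that is present, per variable.
def available(rows, column):
    """Return the non-missing values of one column."""
    return [row[column] for row in rows if row[column] is not None]

ages = available(rows, "age")
incomes = available(rows, "income")

print(len(complete_rows), len(ages), len(incomes))
```

List-wise deletion keeps 2 of the 4 samples here, while pair-wise deletion keeps 3 values for each variable, at the price of each variable being analyzed on a different subset of samples.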

Mean/mode/median imputation
This method fills the missing values with the mean, mode, or median. The goal is to estimate the missing values from relationships that can be identified in the valid values of the dataset.

Overall imputation: Fill all missing values of a variable with one statistic (mean, median, etc.) computed over its known values
Similar-case imputation: Compute the statistic separately for groups of similar samples and fill each missing value with the statistic of its group. For example, fill missing values with a different statistic for each gender
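Both variants can be sketched in a few lines. The function names and the records are hypothetical, `None` is assumed to mark a missing value, and the sketch assumes every group has at least one known value.

```python
import statistics

# Hypothetical records; None marks a missing height.
rows = [
    {"gender": "F", "height": 160.0},
    {"gender": "F", "height": 164.0},
    {"gender": "F", "height": None},
    {"gender": "M", "height": 176.0},
    {"gender": "M", "height": 180.0},
    {"gender": "M", "height": None},
]

def overall_fill(rows, column, stat=statistics.mean):
    """Fill missing values with one statistic computed over all known values."""
    fill = stat([r[column] for r in rows if r[column] is not None])
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def similar_fill(rows, column, group_by, stat=statistics.mean):
    """Fill missing values with the statistic of the sample's own group."""
    groups = {}
    for r in rows:
        if r[column] is not None:
            groups.setdefault(r[group_by], []).append(r[column])
    fills = {key: stat(values) for key, values in groups.items()}
    return [
        {**r, column: fills[r[group_by]] if r[column] is None else r[column]}
        for r in rows
    ]

overall = overall_fill(rows, "height")
by_gender = similar_fill(rows, "height", group_by="gender")
print(overall[2]["height"], by_gender[2]["height"], by_gender[5]["height"])
```

Passing `stat=statistics.median` or `stat=statistics.mode` switches the statistic without changing the rest of the code.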

Using predictive models
This method estimates missing values with a predictive model.
The samples without missing values serve as the training set, the samples with missing values serve as the test set, and the variable with missing values is the target variable. You can use regression, ANOVA, logistic regression, and other methods to make the predictions. This approach has two drawbacks:
1. The model estimates are usually more well-behaved than the true values
2. If the variable has no relationship with the other variables in the dataset, the estimates will not be accurate
KNN imputation
The missing values are filled with the attribute values of the samples most similar to the sample that contains the missing value. Similarity is measured with a distance function.
- Advantages
  - KNN can predict both qualitative and quantitative attributes
  - No predictive model needs to be built for each attribute with missing data
  - Attributes with multiple missing values are easy to handle
  - The correlation structure of the data is taken into account
- Disadvantages
  - KNN is slow on large datasets, because it searches the entire dataset for the samples closest to the target
  - The choice of k has a large effect on the result
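A minimal sketch of KNN imputation for a single numeric attribute follows. It rests on several assumptions that a real dataset may not satisfy: only the target attribute has missing values, all other attributes are numeric and on comparable scales, distance is Euclidean, and the estimate is the plain mean of the k nearest neighbors. The function name and the data are made up.

```python
import math

def knn_impute(rows, target, k=2):
    """Fill missing `target` values with the mean of the k nearest complete rows."""
    features = [column for column in rows[0] if column != target]
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        point = [r[f] for f in features]
        # Scan every complete row; this full scan is what makes KNN slow
        # on large datasets.
        nearest = sorted(
            complete,
            key=lambda c: math.dist(point, [c[f] for f in features]),
        )[:k]
        estimate = sum(n[target] for n in nearest) / len(nearest)
        filled.append({**r, target: estimate})
    return filled

# Hypothetical records; None marks the missing income.
rows = [
    {"age": 25, "years_worked": 2, "income": 30},
    {"age": 27, "years_worked": 3, "income": 34},
    {"age": 50, "years_worked": 25, "income": 80},
    {"age": 52, "years_worked": 28, "income": 90},
    {"age": 26, "years_worked": 2, "income": None},
]

filled = knn_impute(rows, target="income", k=2)
print(filled[-1]["income"])
```

With k=2 the estimate comes from the two youngest complete rows; with k=4 it would be pulled toward the much higher incomes of the older rows, which illustrates how strongly the result depends on k.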

3. Outlier detection and processing

What's an outlier?

An outlier is a sample that lies far away from the overall pattern of the data.

Types of outliers

Univariate outliers: can be found by looking at the distribution of a single variable
Multivariate outliers: outliers in an n-dimensional space

An example of a multivariate outlier:

Causes of outliers

Whenever outliers are encountered, the ideal first step is to find out why they occur, because the way to handle them depends on the cause. Causes fall into two broad groups:

Artificial (human or measurement error)
Natural

The specific causes of outliers:

Measurement error: A faulty measuring instrument produces an outlier. For example: there are 10 scales, one of which is faulty ...
Experimental error
Intentional outlier: This is common in self-reported measures. For example: teenagers usually under-report how much alcohol they drink, so the few who report truthfully may look like outliers ...
Data processing error
Sampling error
Natural outlier: These outliers are real and are not produced by any error

Effects of outliers

Outliers can significantly alter the results of data analysis and statistical modeling:

They increase the error variance and reduce the power of statistical tests
If the outliers are not randomly distributed, they can reduce normality
They can bias estimates computed from the dataset
They can violate the basic assumptions of regression analysis, ANOVA, and other statistical models

Examples of outliers affecting statistical results:
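As a small illustration with made-up numbers, the sketch below compares the summary statistics of a sample before and after a single extreme value (300) is added.

```python
import statistics

# Hypothetical sample; the second list adds one extreme value.
without_outlier = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7]
with_outlier = without_outlier + [300]

for label, data in (("without", without_outlier), ("with", with_outlier)):
    print(
        f"{label:8s} mean={statistics.mean(data):.2f} "
        f"median={statistics.median(data)} "
        f"mode={statistics.mode(data)} "
        f"stdev={statistics.stdev(data):.2f}"
    )
```

The mean and the standard deviation change drastically, while the median and the mode barely move, which is why they are considered robust to outliers.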

Outlier detection

The most common way to detect outliers is visualization: box plot, histogram, scatter plot
Other common rules:

Any value outside the range Q1 - 1.5 x IQR to Q3 + 1.5 x IQR
Capping method: any value below the 5th percentile or above the 95th percentile is considered an outlier
Data points three or more standard deviations away from the mean
Outlier detection is a special case of examining the data for influential data points, and it also depends on an understanding of the business problem
Multivariate outliers are usually identified with an index of influence, leverage, or distance; commonly used indices are Mahalanobis distance and Cook's D
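Two of the univariate rules above, the 1.5 x IQR fences and the three-standard-deviation rule, can be sketched as follows. The function names and the data are made up, and the quartiles use the default method of `statistics.quantiles`; other quartile definitions give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def sigma_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 10, 100]

print(iqr_outliers(data))
print(sigma_outliers(data))
```

Both rules flag the value 100 here. Note that the standard-deviation rule is itself affected by the outlier, because the extreme value inflates both the mean and the standard deviation it is compared against.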

Outlier handling

Most ways of handling outliers are similar to those for missing values: deleting the samples, transforming the values, binning them, treating them as a separate group, imputing values, or applying other statistical methods.

Deleting outlier samples
Consider deleting outliers if they are caused by data entry errors or data processing errors, or if there are very few of them

Variable transformation and binning
Taking the logarithm of a variable reduces the variation caused by extreme values
Binning is also a form of variable transformation; decision trees handle outliers well because they bin the variables
Different weights can be assigned to different samples

Imputation
As with missing values, outliers can be replaced with an imputed value such as the mean, median, or mode

Treating separately
If there is a significant number of outliers, treat them as a separate group in the statistical model. One approach is to build a separate model for each group and then combine the outputs

4. The Art of feature engineering

What is feature engineering?

Feature engineering is the science and art of extracting more information from existing data. No new data is added, but the existing data becomes more useful.
For example, the day of the week and the month can be derived from a date column in the dataset, which may improve the model.

The process of feature engineering

Before feature engineering, you need to complete 5 exploratory data analysis steps:

Variable identification
Univariate analysis
Bivariate analysis
Missing value handling
Outlier handling

Feature engineering can be divided into two steps:

Variable transformation
Feature extraction

Both steps are important in exploratory data analysis and have a significant impact on the prediction results

Variable transformation

A variable transformation replaces a variable with a function of itself, which changes its distribution or its relationship to other variables.
For example, use the square root √x or the logarithm log(x) instead of x.

When to use variable transformation

When we need to change the range of values of a variable or standardize it in order to understand the variable better
When we can transform a complex nonlinear relationship into a linear one. Linear relationships are easier to understand than nonlinear or curved ones
When a symmetric distribution is preferable to a skewed one, because it is easier to interpret and to use for inference. Some modeling techniques require variables to be normally distributed, so consider a variable transformation to reduce the skewness of a skewed variable.
For a right-skewed distribution, take the square root, cube root, or logarithm
For a left-skewed distribution, take the square, cube, or exponential
When the implementation calls for it. For example, depending on the use case, age can be divided into 3 more meaningful groups; this method is called binning (discretization).

Common methods of variable transformation

Logarithm
The logarithm is commonly used to reduce right skewness, but it cannot be applied to zero or negative values
Square root / cube root
Binning (discretization)
Binning groups the values of a variable into categories. It can be applied to raw values, percentiles, or frequencies, and the choice of categories usually depends on the specific problem. Bins can also be defined from the values of more than one variable.
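The transformations above can be sketched on a small right-skewed sample. The data, the `skewness` helper (the standard moment-based coefficient), and the age cut points in `bin_age` are all assumptions chosen for illustration.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over stdev cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed variable (all values strictly positive,
# so the logarithm is defined).
incomes = [20, 25, 30, 35, 40, 60, 90, 400]

sqrt_incomes = [math.sqrt(v) for v in incomes]
log_incomes = [math.log(v) for v in incomes]

print(f"raw:  {skewness(incomes):.2f}")
print(f"sqrt: {skewness(sqrt_incomes):.2f}")
print(f"log:  {skewness(log_incomes):.2f}")

def bin_age(age):
    """Bin age into 3 groups; the cut points are arbitrary examples."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

print([bin_age(a) for a in (10, 30, 70)])
```

On this sample the square root reduces the skewness somewhat and the logarithm reduces it further, although neither makes the distribution fully symmetric.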

Feature extraction and its significance

Feature extraction is the process of creating new variables/features from existing ones. For example, several time-related features, such as the day, the month, or the year, can be created from a single time variable.

Common feature extraction methods include:

Create derived variables
Create indicator (dummy) variables
More methods: 5 Simple manipulations to extract maximum information out of your data
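Both methods can be sketched together. The records, the feature names, and the choice of one indicator column per gender level are assumptions made for illustration.

```python
from datetime import date

WEEKDAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical records with a date and a categorical variable.
rows = [
    {"date": date(2018, 10, 5), "gender": "F"},
    {"date": date(2018, 10, 6), "gender": "M"},
]

def extract_features(row, gender_levels=("F", "M")):
    """Add derived date features and indicator (dummy) variables to a row."""
    d = row["date"]
    features = dict(row)
    # Derived variables computed from the date.
    features["month"] = d.month
    features["day_of_week"] = WEEKDAY_NAMES[d.weekday()]
    features["is_weekend"] = d.weekday() >= 5
    # Indicator variables: one 0/1 column per level of the category.
    for level in gender_levels:
        features[f"gender_{level}"] = int(row["gender"] == level)
    return features

enriched = [extract_features(row) for row in rows]
for row in enriched:
    print(row)
```

Many linear models expect one indicator column fewer than the number of levels, because the last column is implied by the others; the sketch keeps all columns for readability.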

Resources

A Comprehensive Guide to Data exploration
5 Simple manipulations to extract maximum information out of your data
Dummy Variable (statistics)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
