An overview of exploratory data analysis (EDA)

Source: Internet
Author: User


Table of Contents

1. Steps and preparations for data exploration

2. Missing value handling
    • Why handle missing values?
    • Why does data have missing values?
    • Techniques for handling missing values

3. Outlier detection and handling
    • What is an outlier?
    • What are the types of outliers?
    • What causes outliers?
    • What is the impact of outliers on a dataset?
    • How to detect outliers
    • How to handle outliers

4. The art of feature engineering
    • What is feature engineering?
    • The process of feature engineering
    • What is variable transformation?
    • When to use variable transformation
    • Common methods of variable transformation
    • What is feature extraction and what are its benefits?

1. Steps and preparations for data exploration

Garbage in, Garbage out


Data understanding and preparation involves the following steps:


    1. Variable identification
    2. Univariate analysis
    3. Bivariate analysis
    4. Missing value treatment
    5. Outlier treatment
    6. Variable transformation
    7. Variable creation

Finally, to get a better model, you usually need to iterate over steps 4-7 several times.

Variable identification


First identify the features (predictors, inputs) and the target (output). Then identify each variable's data type and category.

Variables can therefore be classified by role (predictor or target), by data type, and by category (categorical or continuous).


Univariate analysis


At this stage, each variable is analyzed on its own. The method depends on whether the variable is discrete (categorical) or continuous.



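Continuous variables: For a continuous variable, you need to know its central tendency and its spread. Typical statistics are the mean, median, and mode for central tendency, and the range, interquartile range (IQR), variance, and standard deviation for spread. Histograms and box plots are the usual visualizations.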
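As a rough illustration of summarizing a continuous variable, the sketch below computes common central-tendency and spread statistics with Python's standard `statistics` module. The `ages` sample and the `summarize` helper are made up for this example and are not part of the original article.

```python
import statistics

# Hypothetical sample of a continuous variable (ages); the values are made up.
ages = [23, 25, 31, 35, 35, 41, 44, 52, 58, 67]

def summarize(values):
    """Return central-tendency and spread statistics for a numeric sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance
        "stdev": statistics.stdev(values),        # sample standard deviation
    }

summary = summarize(ages)
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```

A histogram or box plot of the same values would complement these numbers; plotting is left out to keep the sketch dependency-free.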



Discrete variables: For a discrete variable, you usually need the frequency of each value, both as a count and as a percentage of the total. A bar chart is a good way to visualize it.
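A minimal sketch of the count and count% table for a discrete variable, using `collections.Counter`. The `colors` data and the `frequency_table` helper are hypothetical, and the text bar chart merely stands in for a real plot.

```python
from collections import Counter

# Hypothetical categorical variable; the values are made up.
colors = ["red", "blue", "red", "green", "blue", "red", "red", "green"]

def frequency_table(values):
    """Return (category, count, percent) rows sorted by descending count."""
    counts = Counter(values)
    total = len(values)
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]

table = frequency_table(colors)
for cat, n, pct in table:
    # A crude text bar chart: one '#' per observation.
    print(f"{cat:<6} {n:>3} {pct:5.1f}%  {'#' * n}")
```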


Bivariate analysis


Bivariate analysis looks for relationships between two variables at a predefined significance level. A pair can be discrete vs. discrete, continuous vs. continuous, or discrete vs. continuous, and the analysis method depends on which combination it is.



Continuous & Continuous
To analyze the relationship between two continuous variables, look at their scatter plot. It is a convenient way to spot a relationship, because the pattern of points indicates whether the variables are related and whether the relationship is linear or nonlinear.



A scatter plot shows that a relationship exists, but not how strong it is. The correlation coefficient measures that strength and ranges from -1 to +1:


    • -1: perfect negative linear correlation
    • +1: perfect positive linear correlation
    • 0: no linear correlation


The correlation is calculated as follows:
Correlation = Cov(X, Y) / sqrt(Var(X) * Var(Y))
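The formula can be sketched directly in Python. The `pearson` function and the sample lists below are illustrative assumptions. The third example shows that a correlation of 0 only rules out a linear relationship: there y depends on x exactly, but through a symmetric curve.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson(x, [2, 4, 6, 8, 10])   # points on an increasing line
r_neg = pearson(x, [10, 8, 6, 4, 2])   # points on a decreasing line
r_zero = pearson(x, [4, 1, 0, 1, 4])   # y = (x - 3)^2, a symmetric U shape

print(r_pos, r_neg, r_zero)
```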



Categorical & categorical


    • Two-way table: the rows and columns represent the categories of the two variables, and each cell holds a count or a percentage
    • Stacked column chart: essentially a visual form of the two-way table
    • Chi-square test
      The chi-square test measures the statistical significance of the relationship between two categorical variables, that is, whether the evidence in the sample is strong enough to generalize to the population. It compares the expected and observed frequencies in the cells of the two-way table and returns a probability from the chi-square distribution with the appropriate degrees of freedom.
      Probability close to 0: the two variables are dependent
      Probability close to 1: the data are consistent with the two variables being independent
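To make the chi-square computation concrete, here is a minimal sketch that derives the statistic and degrees of freedom from a two-way table. The `chi_square` helper and the 2x2 counts are made up for illustration. The p-value is computed only for one degree of freedom, where a closed form exists; for larger tables a statistics library would normally be used.

```python
import math

def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a two-way frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table: rows = two groups, columns = yes / no.
observed = [[30, 10],
            [20, 40]]
stat, dof = chi_square(observed)

# With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2)) if dof == 1 else None
print(f"chi2={stat:.3f} dof={dof} p={p_value}")
```

A small p-value here suggests the two variables are dependent, in line with the interpretation of the test's probability.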


Categorical & Continuous
Box plots, one per category, let you compare a continuous variable across the levels of a discrete variable.
However, a box plot does not tell you whether the differences are statistically significant, especially when there are only a few levels. To test significance, you can use a z-test, a t-test, or ANOVA.


    • z-test/t-test: evaluates whether the difference between the means of two groups is statistically significant

      The smaller the probability associated with z, the more significant the difference between the two means. The t-test is similar to the z-test, but it is used when each of the two groups has fewer than 30 observations

    • ANOVA: evaluates whether the means of more than two groups differ significantly
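A minimal sketch of the two-sample z-test, assuming both groups are large enough (here 40 observations each) for the normal approximation. The `two_sample_z` helper and the generated samples are hypothetical. For small samples the same statistic would be compared against a t distribution instead, and ANOVA is not shown.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z(a, b):
    """Return (z, two-sided p-value) for the difference between two means."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Made-up samples: group_b is shifted up by 3 units relative to group_a.
group_a = [50 + (i % 7) for i in range(40)]
group_b = [53 + (i % 7) for i in range(40)]

z, p = two_sample_z(group_a, group_b)
print(f"z={z:.2f} p={p:.4g}")
```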

2. Missing value handling

Why handle missing values


Missing values in the training data can reduce the model's ability to generalize or lead to a biased model, because the relationships between variables cannot be analyzed correctly. The result can be wrong predictions or classifications.

The original illustration contrasted the same data with and without missing-value handling, and the two cases led to completely different conclusions.


Why missing values appear in the dataset


Missing values arise at two stages:


    1. Data extraction: Errors in the extraction process can produce missing values. This kind of error is usually easy to detect and to correct by fixing the process.
    2. Data collection: Errors made during collection are harder to correct. They fall into four categories:
      • Missing completely at random
        The probability of being missing is the same for all observations. For example, respondents in a survey decide purely at random whether to answer a question.
      • Missing at random
        The value is missing at random, but the missing rate differs between groups defined by other variables. For example, when collecting age, female respondents may have more missing values than male respondents.
      • Missing values that depend on unobserved variables
        The missingness is not random but depends on (input) variables that were not recorded. For example, in a medical study, a diagnostic procedure that causes physical discomfort may make patients more likely to drop out. Unless "physical discomfort" is recorded as an input variable, these missing values look random even though they are not.
      • Missing values that depend on the missing value itself
        For example, people with very high or very low incomes often do not disclose their income.
How to handle missing values
    1. Deletion
      • List-wise deletion: delete every sample that contains a missing value
      • Pair-wise deletion: ignore only the missing value and analyze the sample's remaining data, so that different variables may end up with different sample sizes


Deletion is worth considering when the data is missing completely at random.
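The two deletion strategies can be sketched as follows. The records, the field names, and both helper functions are made up for this example; `None` marks a missing value.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 3000, "score": 0.7},
    {"age": None, "income": 4200, "score": 0.4},
    {"age": 41, "income": None, "score": 0.9},
    {"age": 33, "income": 5100, "score": 0.5},
]

def listwise_delete(records):
    """Keep only records that have no missing value in any variable."""
    return [r for r in records if all(v is not None for v in r.values())]

def pairwise_values(records, *columns):
    """Keep records that are complete on the requested columns only."""
    return [tuple(r[c] for c in columns)
            for r in records
            if all(r[c] is not None for c in columns)]

complete = listwise_delete(rows)
age_score = pairwise_values(rows, "age", "score")
income_score = pairwise_values(rows, "income", "score")

print(len(complete), len(age_score), len(income_score))
```

Deleting whole samples leaves fewer rows than deleting per analysis, which keeps every row that is complete for the variables in question. The price is that each analysis then runs on a different sample size.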


    2. Mean/mode/median imputation
      Fill in the missing values with the mean, mode, or median of the variable. The goal is to estimate missing values from relationships that can be identified in the valid values of the dataset.
      • Generalized imputation: fill every missing value with one overall statistic (mean, median, etc.)
      • Similar-case imputation: compute the statistic separately for groups of similar samples and fill each missing value from its own group. For example, compute the statistic per gender and fill in accordingly
    3. Prediction model
      Estimate the missing values with a predictive model.
      The samples without missing values form the training set, the samples with missing values form the test set, and the variable with missing values is the target. Regression, ANOVA, logistic regression, and other methods can be used for the prediction. This approach has two drawbacks:
      1. Model estimates are usually more well-behaved than the true values
      2. If the variable has no relationship with the other variables, the predicted values will be inaccurate
    4. KNN imputation
      Fill in each missing value using the attribute values of the samples most similar to the sample containing it. Similarity is measured with a distance function.
      • Advantages
        • KNN can predict both qualitative and quantitative attributes
        • No separate predictive model has to be built for each attribute
        • Attributes with multiple missing values are easy to handle
        • The correlation structure of the data is taken into account
      • Disadvantages
        • KNN is slow on large datasets, because it searches the entire dataset for the samples closest to the target
        • The choice of k has a large effect on the result
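As a small sketch of two of the filling strategies above, the code below fills a missing weight first with the mean of the sample's own gender group, then with the average of its k nearest neighbours. The data, the helper names, and the choice of height as the only distance feature are all assumptions made for illustration; a real KNN fill would typically use several scaled features.

```python
from statistics import mean

# Hypothetical samples: (gender, height_cm, weight_kg); None marks a missing weight.
samples = [
    ("F", 160, 55), ("F", 165, 60), ("F", 170, None),
    ("M", 175, 75), ("M", 180, 80), ("M", 185, None),
]

def fill_by_group_mean(data):
    """Replace a missing weight with the mean weight of the same gender."""
    group_means = {
        g: mean(w for gg, _, w in data if gg == g and w is not None)
        for g in {row[0] for row in data}
    }
    return [(g, h, group_means[g] if w is None else w) for g, h, w in data]

def fill_by_knn(data, k=2):
    """Replace a missing weight with the mean weight of the k samples
    whose height is closest (absolute difference as the distance)."""
    complete = [(h, w) for _, h, w in data if w is not None]
    filled = []
    for g, h, w in data:
        if w is None:
            nearest = sorted(complete, key=lambda hw: abs(hw[0] - h))[:k]
            w = mean(weight for _, weight in nearest)
        filled.append((g, h, w))
    return filled

group_filled = fill_by_group_mean(samples)
knn_filled = fill_by_knn(samples, k=2)
print(group_filled[2], knn_filled[2])
```

The two methods disagree on the third sample because KNN ignores gender here and averages across the nearest heights, which illustrates how much the choice of distance and k matters.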
3. Outlier detection and handling

What is an outlier


An outlier is a sample that lies far away from the overall pattern of the data.


Types of outliers
    • Univariate outliers: visible in the distribution of a single variable
    • Multivariate outliers: outliers in an n-dimensional space, visible only when several variables are considered together


Figure: an example of a multivariate outlier.


Causes of outliers


Whenever you encounter outliers, the ideal approach is to find out what caused them, because the right way to handle them depends on the cause. The causes fall into two broad groups:


    1. Artificial (errors) / non-natural
    2. Natural


The specific causes are:

    • Measurement error: a faulty measuring instrument produces outliers. For example, one of 10 weighing scales is faulty
    • Experimental error
    • Intentional outlier: common in self-reported measures. For example, teenagers typically under-report their alcohol consumption, so the few who report truthfully may look like outliers
    • Data processing error
    • Sampling error
    • Natural outlier: the value is real and not the product of any error
Effects of outliers


Outliers can significantly change the results of data analysis and statistical modeling:


    • They increase the error variance and reduce the power of statistical tests
    • If they are not randomly distributed, they can reduce normality
    • They can bias or distort estimates that are of substantive interest
    • They can violate the basic assumptions of regression, ANOVA, and other statistical models


Figure: an example of outliers affecting statistical results.
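As a small numeric illustration with made-up values, the sketch below compares summary statistics for the same sample with and without a single extreme value. The mean and standard deviation shift dramatically while the median barely moves.

```python
from statistics import mean, median, stdev

# Made-up sample; 300 plays the role of the outlier.
clean = [4, 4, 5, 5, 7, 7, 8, 8]
with_outlier = clean + [300]

for label, data in (("without outlier", clean), ("with outlier", with_outlier)):
    print(f"{label:>16}: mean={mean(data):.2f} "
          f"median={median(data):.1f} stdev={stdev(data):.2f}")
```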


Outlier detection


The most common way to detect outliers is visualization: box plots, histograms, and scatter plots.
Other common rules of thumb:


    • Any value outside the range Q1 - 1.5 x IQR to Q3 + 1.5 x IQR
    • Capping method: any value below the 5th percentile or above the 95th percentile is considered an outlier
    • Data points three or more standard deviations away from the mean
    • Outlier detection is a special case of examining influential data points, and it also depends on an understanding of the business problem
    • Multivariate outliers are usually identified with measures of influence, leverage, or distance. Commonly used indices are the Mahalanobis distance and Cook's D
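The three univariate rules (IQR fences, percentile cut-offs, and distance from the mean in standard deviations) can be sketched with the standard library. The function names, the sample, and the default thresholds are assumptions for illustration. Quartile and percentile values also depend on the interpolation method, so other tools may give slightly different cut points.

```python
from statistics import mean, quantiles, stdev

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def percentile_outliers(values, lower=5, upper=95):
    """Values below the lower or above the upper percentile."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower - 1], cuts[upper - 1]
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > threshold * s]

# Made-up sample with one extreme value.
data = [10, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 95]

print(iqr_outliers(data))
print(percentile_outliers(data))
print(zscore_outliers(data))
```

Note that the percentile rule always flags some points at both ends, even in a sample without extreme values, so it is the most aggressive of the three.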
Handling outliers


Most ways of handling outliers resemble those for missing values: deleting samples, transforming or binning variables, imputing values, treating the outliers as a separate group, or applying other statistical methods.



Deleting outlier samples
Consider deleting outliers if they are caused by data entry or data processing errors, or if there are very few of them.



Variable transformation and binning
Taking the logarithm of a variable reduces the variation caused by extreme values.
Binning is also a form of variable transformation. Decision trees handle outliers well because they effectively bin variables.
Different weights can also be assigned to different samples.



Imputation
As with missing values, outliers can be replaced with a statistic such as the mean, median, or mode.



Separate treatment
If there is a significant number of outliers, treat them as a separate group in the statistical model. One approach is to build a separate model for each group and then combine the outputs.
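A minimal sketch of two of these handling options, deleting outliers and replacing them with a fill value. It assumes outliers are defined by the 1.5 x IQR rule and that the fill value is the median of the remaining points; both are illustrative choices, and the helper names and data are made up.

```python
from statistics import median, quantiles

def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values):
    """Delete every value outside the IQR fences."""
    low, high = iqr_bounds(values)
    return [v for v in values if low <= v <= high]

def impute_outliers(values):
    """Replace each outlier with the median of the non-outlying values."""
    low, high = iqr_bounds(values)
    center = median(v for v in values if low <= v <= high)
    return [v if low <= v <= high else center for v in values]

# Made-up sample with one extreme value.
data = [10, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 95]

dropped = drop_outliers(data)
imputed = impute_outliers(data)
print(len(dropped), imputed[-1])
```

Deletion shrinks the sample, while filling keeps its size. Which one is appropriate depends on why the outlier occurred.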


4. The art of feature engineering

What is feature engineering


Feature engineering is the science and art of extracting more information from existing data. No new data is added, but the existing data becomes more useful.
For example, the week and the month can be derived from a date in the dataset, which may improve the model.
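The date example can be sketched with the standard `datetime` module. The `date_features` helper, the chosen feature names, and the example date are assumptions for illustration.

```python
from datetime import date

def date_features(d):
    """Derive calendar features from a single date value."""
    _, iso_week, iso_weekday = d.isocalendar()
    return {
        "year": d.year,
        "month": d.month,
        "week_of_year": iso_week,
        "day_of_week": iso_weekday,      # 1 = Monday ... 7 = Sunday
        "is_weekend": iso_weekday >= 6,
    }

features = date_features(date(2024, 3, 15))  # an arbitrary example date
print(features)
```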


The process of feature engineering


Before feature engineering, complete the first five steps of exploratory data analysis:


    • Variable identification
    • Univariate analysis
    • Bivariate analysis
    • Missing value handling
    • Outlier handling


Feature engineering itself consists of two steps:


    • Variable transformation
    • Feature extraction


Both steps are important in exploratory data analysis and have a significant impact on predictive power.


Variable transformation

Variable transformation means replacing a variable with a function of itself, which changes its distribution or its relationship with other variables.
For example, use √x or log x instead of x.


When to use variable transformation
    • When you need to change the scale of a variable or standardize it in order to understand it better
    • When a complex nonlinear relationship can be transformed into a linear one. Linear relationships are easier to understand than nonlinear or curved ones
    • When the distribution is skewed. A symmetric distribution is preferred over a skewed one because it is easier to interpret and to draw inferences from. Some modeling techniques require normally distributed variables, so consider a transformation that reduces the skewness.
      For a right-skewed distribution, take the square root, cube root, or logarithm
      For a left-skewed distribution, take the square, cube, or exponential
    • When the implementation calls for it. For example, depending on the actual situation, age can be divided into three more meaningful groups. This is discretization (binning of variables).
Common methods of variable transformation
    • Logarithm
      The log is commonly used to reduce right skewness, but it cannot be applied to zero or negative values

    • Square root / cube root
    • Discretization (binning)
      Binning groups the values of a variable into categories. It can be applied to raw values, percentiles, or frequencies, and the choice of categories depends on the specific problem. Bins can also be defined jointly on several variables.
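The effect of these transformations can be sketched on a made-up right-skewed sample. The `skewness` helper uses the moment-based definition (third central moment over the 1.5 power of the second), and the age cut points in `bin_age` are arbitrary; all names and values are assumptions for illustration.

```python
import math
from statistics import mean

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0 for a symmetric sample)."""
    m = mean(values)
    m2 = mean((v - m) ** 2 for v in values)
    m3 = mean((v - m) ** 3 for v in values)
    return m3 / m2 ** 1.5

def bin_age(age):
    """Discretize age into three groups; the cut points are arbitrary."""
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"

# Made-up right-skewed sample (e.g. incomes in thousands).
incomes = [12, 15, 18, 20, 22, 25, 30, 45, 80, 250]

log_incomes = [math.log(v) for v in incomes]    # requires v > 0
sqrt_incomes = [math.sqrt(v) for v in incomes]  # requires v >= 0

print(f"raw:  {skewness(incomes):.2f}")
print(f"sqrt: {skewness(sqrt_incomes):.2f}")
print(f"log:  {skewness(log_incomes):.2f}")
print([bin_age(a) for a in (10, 30, 70)])
```

On this sample both transformations reduce the skewness, and the logarithm reduces it more than the square root.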

Feature extraction and its significance


Feature extraction is the process of creating new variables (features) from existing ones. For example, from a single time variable you can derive other time representations, such as the week or the month.



Common feature extraction methods include:


    • Create derived variables
    • Create indicator (dummy) variables
    • More methods: 5 simple manipulations to extract maximum information out of your data
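A minimal sketch of the first two methods: a derived variable (a title extracted from a name) and indicator variables (one 0/1 column per category). The records, field names, and helper functions are hypothetical.

```python
# Hypothetical records; field names and values are made up.
people = [
    {"name": "Mr. Smith", "gender": "male"},
    {"name": "Mrs. Jones", "gender": "female"},
    {"name": "Miss Lee", "gender": "female"},
]

def with_title(record):
    """Derived variable: the salutation at the start of the name."""
    title = record["name"].split()[0].rstrip(".")
    return {**record, "title": title}

def with_dummies(record, column, categories):
    """Indicator variables: one 0/1 column per category of `column`."""
    out = dict(record)
    for cat in categories:
        out[f"{column}_{cat}"] = 1 if record[column] == cat else 0
    return out

categories = sorted({p["gender"] for p in people})
encoded = [with_dummies(with_title(p), "gender", categories) for p in people]

for row in encoded:
    print(row)
```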
Resources
    1. A Comprehensive Guide to Data exploration
    2. 5 simple manipulations to extract maximum information out of your data
    3. Dummy Variable (statistics)

