Table of Contents
1. Steps and preparations for data exploration
2. Missing value treatment
3. Outlier detection and treatment
4. The art of feature engineering
1. Steps and preparations for data exploration
Garbage in, Garbage out
The steps involved in data understanding and processing:
 Variable identification
 Univariate analysis
 Bivariate analysis
 Missing values treatment
 Outlier treatment
 Variable transformation
 Variable creation
Finally, to get a better model, you need to iterate over steps 4-7 several times
Variable Identification
Identify the features (predictors / inputs) and the target (output)
Identify the data type and category of each variable
Variables can thus be classified along several dimensions: by role (predictor vs. target), by data type (character vs. numeric), and by category (categorical vs. continuous)
Univariate Analysis
At this stage, each variable is analyzed on its own. The specific analysis method depends on whether the variable is categorical or continuous.
Continuous variables : For a continuous variable, you need to know its central tendency and spread. Typical tools are summary statistics (mean, median, mode, min/max, quartiles, standard deviation) and visualizations such as histograms and box plots.
Discrete (categorical) variables : For a categorical variable, you usually want the frequency of each value (count and count %). A bar chart is a good way to visualize this.
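As a minimal sketch of this step, the snippet below summarizes one continuous and one categorical variable with pandas; the dataset and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data (invented for illustration).
df = pd.DataFrame({
    "age":    [23, 35, 31, 52, 46, 29, 41, 38],           # continuous
    "gender": ["F", "M", "F", "M", "M", "F", "M", "F"],   # categorical
})

# Continuous variable: central tendency and spread.
print(df["age"].describe())  # count, mean, std, min, quartiles, max

# Categorical variable: frequency of each value (count and count %).
counts = df["gender"].value_counts()
percents = df["gender"].value_counts(normalize=True) * 100
print(counts)
print(percents)
```

For the visual side, `df["age"].plot.hist()` or `df["age"].plot.box()` covers the continuous case and `counts.plot.bar()` the categorical one.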
Bivariate Analysis
Look for relationships between pairs of variables at a predefined level of significance. A pair can be categorical vs. categorical, continuous vs. continuous, or categorical vs. continuous; the analysis method depends on the combination of types.
Continuous & Continuous
To analyze the relationship between two continuous variables, you can directly inspect their scatter plot. This is a convenient way to look for relationships between continuous variables, because the pattern shown in a scatter plot reveals the relationship, which can be linear or nonlinear.
A scatter plot shows whether variables are related, but not the strength of the relationship. For that, the correlation coefficient (Correlation) is used:
 -1: perfect linear negative correlation
 +1: perfect linear positive correlation
 0: no linear correlation
The correlation coefficient is computed as:
Correlation = Cov(X, Y) / sqrt(Var(X) * Var(Y))
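This formula can be checked with a small pure-Python sketch that follows the Cov/Var definitions term by term (population variants; the data points are invented):

```python
import math

def correlation(x, y):
    """Pearson correlation: Cov(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, positive -> 1.0
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly linear, negative -> -1.0
```

In practice `pandas.DataFrame.corr()` or `numpy.corrcoef` compute the same quantity.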
Categorical & Categorical
 Two-way table: rows and columns represent the two variables, and the cell values show counts or frequencies
 Stacked column chart: essentially a visual form of the two-way table
 Chi-square test
The chi-square test is usually used to assess the statistical significance of the relationship between two categorical variables; it examines whether the pattern seen in the sample is strong enough to reflect the population. Based on the two-way table, it compares the expected frequencies with the observed frequencies for the categories of the variables, and returns a probability (p-value) under a chi-square distribution with the given degrees of freedom:
 Probability close to 0: the two variables are likely dependent (related)
 Probability close to 1: the two variables are likely independent (unrelated)
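As a sketch, `scipy.stats.chi2_contingency` runs this test directly on a two-way table; the counts below are invented for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical two-way table: rows = gender, columns = product preference.
table = np.array([[30, 10],
                  [15, 25]])

# Returns the chi-square statistic, the p-value, the degrees of freedom,
# and the expected frequencies under independence.
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")

# A small p-value suggests the two categorical variables are dependent.
```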
Categorical & Continuous
You can visually analyze a categorical vs. continuous pair with a box plot (one box per category).
However, if the categorical variable has only a few observations per level, the plot will not establish statistical significance. To test statistical significance, you can use a Z-test, t-test, or ANOVA.

Z-test/t-test : Assess whether the difference between the means of two groups is statistically significant.
If the probability (p-value) of the Z statistic is small, the difference between the two means is significant. The t-test is similar to the Z-test, but it is used when the number of samples in either category is less than 30.

ANOVA : Assess whether the differences between the means of more than two groups are statistically significant
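Both tests are available in `scipy.stats`; the group samples below are invented for illustration.

```python
from scipy.stats import ttest_ind, f_oneway

# Hypothetical continuous measurements for three categories.
group_a = [5.1, 4.9, 5.4, 5.0, 5.2]
group_b = [6.0, 6.3, 5.9, 6.1, 6.2]
group_c = [7.0, 6.8, 7.2, 7.1, 6.9]

# t-test: is the difference between the means of two groups significant?
t_stat, p_two = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.3f}, p = {p_two:.6f}")

# ANOVA: are the differences between the means of more than two groups significant?
f_stat, p_many = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_many:.6f}")
```

A small p-value in either test indicates that at least one group mean differs significantly from the others.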
2. Missing Value Treatment
Why handle missing values
Missing values in the training data can reduce the generalization ability of the model, or produce a biased model, because we can no longer accurately analyze the relationships between variables. This can lead to wrong predictions or classifications.
An example comparing the two cases, missing values left unhandled versus handled, shows that they can lead to completely different conclusions.
Why missing values appear in the dataset
Missing values arise in two phases of data analysis:
 Data extraction : Errors in the data extraction process tend to produce missing values, but this type of error is easy to detect and fix by correcting the process.
 Data collection : Errors introduced while the data is being collected are harder to correct, and can be grouped into the following four categories:
 Missing completely at random
The probability of a value being missing is the same for all observations. For example, respondents randomly skip questions during a survey.
 Missing at random
A variable's values are missing at random, but the missing rate differs across the values of other variables. For example, age may be missing more often in female samples than in male samples.
 Missing values that depend on unobserved variables
Here the missingness is not random but depends on some (input) variable we have not observed. For example, "physical discomfort" may cause certain measurements to be skipped; unless we can add "physical discomfort" to the input variables, the missingness will look random but is not.
 Missing values that depend on the missing value itself
For example, people with very high or very low incomes usually decline to report their income.
How to handle missing values
 Deletion
 Listwise deletion: delete every sample that contains a missing value
 Pairwise deletion: delete only the missing entries and train on the remaining data of each sample, so different variables may end up with different sample sizes
Deletion is a reasonable option when the values are missing completely at random.
 Mean/mode/median imputation
Fill the missing values with the mean, mode, or median. The goal is to estimate missing values from relationships that can be identified among the valid values in the dataset.
 Generalized imputation : Fill all missing values of a variable with a single statistic (mean/median, etc.) computed over the whole column
 Similar-case imputation : For samples that are similar in other dimensions, fill with the statistic of that group. For example, fill missing values with separate statistics for each gender.
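Both strategies can be sketched with pandas; the gender/age values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "age":    [25.0, 40.0, np.nan, 38.0, 31.0, np.nan],
})

# Generalized imputation: one statistic for the whole column.
overall = df["age"].fillna(df["age"].median())

# Similar-case imputation: a separate statistic per group (here, per gender).
by_group = df.groupby("gender")["age"].transform(lambda s: s.fillna(s.mean()))

print(overall.tolist())   # missing ages replaced by the overall median
print(by_group.tolist())  # missing ages replaced by the per-gender mean
```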
 Prediction model
A method that estimates missing values with a predictive model.
Use the rows without missing values as the training set, the rows with missing values as the test set, and the variable with missing values as the target. You can then use regression, ANOVA, logistic regression, or other methods to predict the missing values. This approach has two drawbacks:
 The model estimates are usually better behaved (well-behaved) than the actual values
 If there is no relationship between the variable with missing values and the other variables, the predicted values may be inaccurate
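A minimal sketch of regression-based imputation with scikit-learn's `LinearRegression`; the height/weight data is invented and deliberately simple.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'height' predicts 'weight'; one weight is missing.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190, 165],
    "weight": [50.0, 58.0, 66.0, 74.0, 82.0, np.nan],
})

# Rows without missing values form the training set; the rest form the test set.
known = df[df["weight"].notna()]
unknown = df[df["weight"].isna()]

# The variable with missing values is the target of the regression.
model = LinearRegression().fit(known[["height"]], known["weight"])
df.loc[df["weight"].isna(), "weight"] = model.predict(unknown[["height"]])
print(df)
```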
 KNN imputation
Fill the missing values with attribute values from the samples most similar to the sample containing the gap, where similarity is measured by a distance metric.
 Advantages
 KNN can predict both qualitative and quantitative attributes
 No need to build a separate predictive model for each variable with missing data
 Samples with multiple missing values are easy to handle
 The correlation structure between attributes is taken into account
 Disadvantages
 KNN is time-consuming on large datasets, since it traverses the whole dataset to find the samples closest to the target
 The choice of k has a large effect on the result
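scikit-learn provides a ready-made `KNNImputer` for this; a sketch on an invented numeric matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data; the last row has a missing entry.
X = np.array([
    [1.00, 2.0],
    [1.10, 2.1],
    [8.00, 9.0],
    [1.05, np.nan],
])

# Fill the gap with the mean of the 2 nearest rows (Euclidean distance
# computed on the features that are present).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)  # the NaN becomes the mean of 2.0 and 2.1
```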
3. Outlier Detection and Treatment
What is an outlier
An outlier is a sample that lies far away from the overall pattern of the data.
Types of outliers
 Univariate outliers: can be seen by observing the distribution of a single variable
 Multivariate outliers: outliers in n-dimensional space, found by looking at several variables together
Causes of outliers
Whatever the circumstances, the ideal treatment is to find the cause of the outliers, since how they should be handled depends on why they occurred. There are usually two main causes:
 Artificial (non-natural) errors
 Natural variation
The specific causes of outliers include:
 Measurement error: a faulty measurement tool produces outliers. For example: among 10 weighing scales, one is broken ...
 Experimental error
 Intentional outlier: common in self-reported measures. For example: teenagers usually under-report how much alcohol they drink, so those who report truthfully may look like outliers ...
 Data processing error
 Sampling error
 Natural outlier: these outliers are genuine values, not produced by any error
Effects of outliers
Outliers can significantly alter the results of data analysis and statistical modeling:
 They increase the error variance and reduce the power of statistical tests
 If the outliers are not randomly distributed, they reduce normality
 They can distort basic estimates that describe the dataset
 They can violate the underlying assumptions of regression, ANOVA, and other statistical models
Outlier detection
The most common method for outlier detection is data visualization: box plot, histogram, scatter plot.
Other common rules of thumb:
 Any value outside the range Q1 - 1.5 x IQR to Q3 + 1.5 x IQR
 Capping: any value below the 5th percentile or above the 95th percentile is treated as an outlier
 Any data point more than three standard deviations from the mean
 Outlier detection is a special case of finding influential data points, and relies on a concrete understanding of the business problem
 Multivariate outliers are usually identified with measures of influence, leverage, or distance; commonly used indices are Mahalanobis distance and Cook's D
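The 1.5 x IQR rule above can be sketched in a few lines; the data values are invented.

```python
import numpy as np

def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # 102 looks suspicious
print(iqr_outliers(data))  # -> [102]
```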
Outlier treatment
Most outliers are handled in ways similar to missing values: deleting them, transforming or binning them, treating them as a separate group, imputing them, or applying other statistical treatments.
Deleting outlier samples
Consider deleting outliers when they are caused by data-entry or data-processing errors, or when the outlier samples are very few
Transformation and binning
Taking the logarithm can reduce the variance caused by extreme values
Binning enables decision trees to handle outliers better
Different weights can be assigned to different samples
Imputation
Treating outliers as a separate group
If there are enough outliers, they should be treated as a separate group in statistical modeling. One approach is to split the data into groups, model each group separately, and then merge the results.
4. The Art of Feature Engineering
Feature engineering is the science (and art) of extracting more information from existing data. No new data is added, but the existing data is made more useful.
For example, from the date information in a dataset you can derive the corresponding weekday and month, which may make the model more effective.
The process of characteristic engineering
Before feature engineering, you need to complete 5 exploratory data analysis steps:
 Variable identification
 Univariate analysis
 Bivariate analysis
 Missing value treatment
 Outlier treatment
Feature engineering can be divided into two steps:
 Variable transformation
 Feature extraction (variable creation)
Both steps are important parts of exploratory data analysis and have a significant impact on prediction quality
Variable transformation
A variable transformation changes the distribution of a variable, or its relationship to other variables, by replacing the variable with a function of it.
For example, replacing a variable x with √x or log x.
When is a variable transformation needed?
 When we want to change the scale of a variable or standardize it in order to understand it better
 When we can transform a complex nonlinear relationship into a linear relationship, since linear relationships are easier to understand than nonlinear or curved ones
 Symmetric distributions are preferred over skewed distributions because they are easier to interpret and to draw inferences from. Some modeling techniques require variables to be normally distributed, so when a distribution is skewed, a transformation can be used to reduce the skewness:
For a right-skewed distribution, take the square root, cube root, or logarithm
For a left-skewed distribution, take the square, cube, or exponential
 Variable transformation can also be driven by implementation considerations. For example, age can be discretized (binned) into a few more meaningful groups.
Common variable transformation methods
 Logarithm
The log transform is commonly used to correct right skew, but it cannot be applied to values that are zero or negative
 Square root / cube root
 Discretization (binning)
Binning is used to categorize variables. It can be applied to raw values, percentiles, or frequencies, and the choice of categories usually depends on the specific problem. Bins can also be built from several variables jointly.
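Both methods can be sketched with NumPy and pandas; the income and age values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable (e.g. income with one extreme value).
income = pd.Series([20_000, 25_000, 30_000, 45_000, 60_000, 250_000])

# The log transform compresses the long right tail (values must be positive).
log_income = np.log(income)
print(log_income.round(2).tolist())

# Discretization (binning): age grouped into a few meaningful categories.
age = pd.Series([8, 16, 25, 37, 45, 70])
age_group = pd.cut(age, bins=[0, 18, 60, 100],
                   labels=["child", "adult", "senior"])
print(age_group.tolist())
```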
Feature extraction and its significance
Feature extraction is the process of creating new variables/features from existing ones. For example, from a single time variable you can derive other representations of it, such as weekday, month, or hour.
Common feature extraction methods include:
 Create a derived variable
 Create an indicator variable (dummy variables)
 More methods: 5 simple manipulations to extract maximum information out of your data
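Both kinds of created features can be produced with pandas; the dates and cities below are invented for illustration.

```python
import pandas as pd

# Hypothetical data with a date column and a categorical column.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-05", "2021-06-15", "2021-12-25"]),
    "city": ["Paris", "Tokyo", "Paris"],
})

# Derived variables: new representations of the single date column.
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()

# Indicator (dummy) variables: one 0/1 column per category level.
dummies = pd.get_dummies(df["city"], prefix="city")

print(df[["month", "weekday"]])
print(dummies)
```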
Resources
 A Comprehensive Guide to Data Exploration
 5 Simple manipulations to extract maximum information out of your data
 Dummy variable (statistics)
 An overview of exploratory data analysis (EDA)