3.1 Data preprocessing
Three elements of data quality: accuracy, completeness, and consistency.
3.1.2 Main tasks of data preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies, thereby "cleaning up" the data.
Data integration: merge data from multiple sources into a coherent data store.
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces (almost) the same analytical results.
3.2 Data Cleansing
3.2.1 Missing values
1. Ignore the tuple
2. Fill in the missing value manually
3. Fill in the missing value with a global constant
4. Fill in the missing value with a measure of the attribute's central tendency, such as the mean or median
5. Use the attribute mean or median of all samples belonging to the same class as the given tuple
6. Fill in the missing value with the most probable value, predicted by regression, Bayesian inference, or a decision tree
Strategy 6 is the most popular.
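Strategies 4 and 5 can be sketched in a few lines of Python; the toy records and the column names (`class`, `income`) below are made up for illustration:

```python
# Sketch of strategies 4 and 5: fill missing values with the attribute
# median, or with the median of tuples belonging to the same class.
# The dataset and the column names are invented for this example.
from statistics import median

rows = [
    {"class": "A", "income": 30}, {"class": "A", "income": None},
    {"class": "A", "income": 50}, {"class": "B", "income": 80},
    {"class": "B", "income": None}, {"class": "B", "income": 100},
]

def fill_with_median(rows, attr):
    """Strategy 4: one global median for the whole attribute."""
    m = median(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: r[attr] if r[attr] is not None else m})
            for r in rows]

def fill_with_class_median(rows, attr, cls="class"):
    """Strategy 5: a separate median per class label."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[cls], []).append(r[attr])
    meds = {c: median(v) for c, v in by_class.items()}
    return [dict(r, **{attr: r[attr] if r[attr] is not None else meds[r[cls]]})
            for r in rows]

print(fill_with_median(rows, "income")[1]["income"])        # global median: 65
print(fill_with_class_median(rows, "income")[1]["income"])  # class-A median: 40
```

The class-conditional fill (strategy 5) gives a different value per class: the class-B gap would be filled with 90 rather than the global 65.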
3.2.2 Noisy data
Noise: random error or variance in a measured variable.
Data smoothing techniques:
Binning: smooths sorted data values by consulting their "neighborhood", i.e., the values around them. The sorted values are distributed into a number of buckets, or bins. Because binning consults the neighborhood of values, it performs local smoothing.
Example: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-frequency bins of 3 values each:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum; the wider the bin, the stronger the smoothing):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
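Both smoothing variants on the example data can be reproduced directly:

```python
# Binning the example data 4,8,15,21,21,24,25,28,34 into equal-frequency
# bins of 3 values, then smoothing by bin means and by bin boundaries.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

def smooth_by_means(bins):
    # Replace every value in a bin with the bin's (rounded) mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value with the closer of the bin's min and max.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```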
Regression: smooths data by fitting it to a function. Linear regression finds the "best" line to fit two attributes, so that one attribute can be used to predict the other.
Outlier analysis: outliers may be detected by clustering; values that fall far outside all clusters are treated as outliers.
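A minimal sketch of clustering-based outlier detection, here a tiny 1-D k-means followed by a distance check; the data, the initial centers, and the threshold of 20 are all illustrative choices, not prescribed by the text:

```python
# Sketch: cluster 1-D data with a tiny k-means, then flag points that
# lie far from every cluster center. Data, initial centers, and the
# distance threshold are assumptions made for this example.
def kmeans_1d(data, centers, iters=20):
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for x in data:
            nearest = min(centers, key=lambda c: abs(x - c))
            groups[nearest].append(x)
        # Move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return centers

data = [4, 5, 6, 25, 26, 27, 90]      # 90 falls far from both clusters
centers = kmeans_1d(data, [4.0, 27.0])
outliers = [x for x in data
            if min(abs(x - c) for c in centers) > 20]  # threshold assumed
print(outliers)  # [90]
```

Note that a gross outlier pulls its cluster center toward itself (here the second center drifts up to 42), which is why the threshold has to be fairly generous.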
3.2.3 Data cleansing as a process
3.3 Data Integration
3.3.1 Entity identification problem
3.3.2 Redundancy and correlation analysis
Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
For nominal data we use the chi-square (χ²) test; for numeric attributes we use the correlation coefficient and covariance. All of these assess how one attribute's values vary with respect to another's.
1. Chi-square correlation test for nominal data
The chi-square test's null hypothesis is that A and B are independent.
Example 3.1: correlation analysis of nominal attributes using chi-square
Suppose 1500 people are surveyed and, for each person, we record the gender and whether their preferred reading material is fiction, giving two attributes: gender and preferred reading.
            | Man      | Woman      | Total
Fiction     | 250 (90) | 200 (360)  | 450
Non-fiction | 50 (210) | 1000 (840) | 1050
Total       | 300      | 1200       | 1500
(Expected frequencies are shown in parentheses.)
The expected frequency of cell (man, fiction) is:
e11 = count(man) × count(fiction) / n = 300 × 450 / 1500 = 90
By the chi-square formula,
χ² = Σ_{i,j} (o_ij − e_ij)² / e_ij
   = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
The degrees of freedom are (2−1)(2−1) = 1. For 1 degree of freedom, the χ² value needed to reject the independence hypothesis at the 0.001 significance level is 10.828; since 507.93 far exceeds this, gender and preferred reading are strongly correlated.
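The expected counts and the chi-square statistic for this 2×2 table can be verified in a few lines. Summing the terms at full precision gives 507.94; the textbook's 507.93 comes from rounding each term to two decimals before summing:

```python
# Expected counts and chi-square statistic for the 2x2 gender/fiction
# contingency table: e_ij = row_total_i * col_total_j / n.
observed = [[250, 200],    # fiction:     man, woman
            [50, 1000]]    # non-fiction: man, woman

n = sum(sum(row) for row in observed)           # 1500
row_tot = [sum(row) for row in observed]        # [450, 1050]
col_tot = [sum(col) for col in zip(*observed)]  # [300, 1200]

expected = [[r * c / n for c in col_tot] for r in row_tot]
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

print(expected)         # [[90.0, 360.0], [210.0, 840.0]]
print(round(chi2, 2))   # 507.94
```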
2. Correlation coefficient for numeric data
For numeric data, the correlation between two attributes A and B can be estimated by computing the (Pearson) correlation coefficient
r_{A,B} = Σ_i (a_i − Ā)(b_i − B̄) / (n · σ_A · σ_B)
where n is the number of tuples, Ā and B̄ are the means, and σ_A and σ_B the standard deviations of A and B.
The value lies between −1 and +1. If r_{A,B} > 0, A and B are positively correlated: the values of A increase as the values of B increase. The larger the value, the stronger the correlation; a high r_{A,B} may therefore indicate that A (or B) can be removed as a redundancy.
If r_{A,B} = 0, there is no (linear) correlation between A and B.
If r_{A,B} < 0, A and B are negatively correlated: the values of one attribute increase as the values of the other decrease.
Note that correlation does not imply causality: if A and B are correlated, it does not mean that A causes B or that B causes A.
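The definition translates directly to code (using the population standard deviation; the sample values are reused from the stock-price example that follows, purely for illustration):

```python
# Pearson correlation coefficient computed from its definition:
# r = sum((a_i - mean_A)(b_i - mean_B)) / (n * sigma_A * sigma_B).
from math import sqrt

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = sqrt(sum((x - ma) ** 2 for x in a) / n)  # population std dev
    sb = sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb)

a = [6, 5, 4, 3, 2]       # illustrative series (stock prices below)
b = [20, 10, 14, 5, 5]
print(round(pearson_r(a, b), 3))  # ≈ 0.867, a strong positive correlation
```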
3. Covariance of numeric data
Correlation and covariance are two similar measures that assess how two attributes change together. The means of A and B, Ā = E(A) and B̄ = E(B), are also called their expected values. The covariance of A and B is defined as
Cov(A, B) = E[(A − Ā)(B − B̄)] = E(A·B) − Ā·B̄
For two attributes A and B that tend to change together, if A is greater than Ā, then B is likely to be greater than B̄; the covariance of A and B is then positive. Conversely, if one attribute tends to be below its expected value when the other is above its expected value, the covariance of A and B is negative.
If A and B are independent, then E(A·B) = E(A)·E(B) and the covariance is 0. The converse does not hold: a covariance of 0 does not necessarily imply independence.
Example: covariance analysis of numeric attributes
Time point | AllElectronics | HighTech
t1         | 6              | 20
t2         | 5              | 10
t3         | 4              | 14
t4         | 3              | 5
t5         | 2              | 5
E(AllElectronics) = (6+5+4+3+2)/5 = 4
E(HighTech) = (20+10+14+5+5)/5 = 10.8
Cov = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 − 4 × 10.8 = 50.2 − 43.2 = 7
The covariance is positive, indicating that the two companies' stock prices tend to rise together.
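The arithmetic can be checked directly with the shortcut formula Cov(A, B) = E(A·B) − Ā·B̄:

```python
# Covariance of the two stock-price series at the five time points,
# via Cov(A, B) = E(A*B) - mean_A * mean_B.
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

n = len(all_electronics)
mean_a = sum(all_electronics) / n  # 4.0
mean_b = sum(high_tech) / n        # 10.8
e_ab = sum(a * b for a, b in zip(all_electronics, high_tech)) / n  # 50.2
cov = e_ab - mean_a * mean_b
print(round(cov, 2))  # 7.0
```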
3.3.3 Tuple duplication
3.3.4 Detection and resolution of data value conflicts
Reading notes on Data Mining: Concepts and Techniques (III): data preprocessing