Data Mining: Concepts and Techniques reading notes (III): data preprocessing

Source: Internet
Author: User

3.1 Data preprocessing

Three elements of data quality: accuracy, completeness, and consistency.

3.1.2 Main tasks of data preprocessing

Data cleansing: "Clean up" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.

Data integration:

Data reduction:

3.2 Data Cleansing

3.2.1 Missing values

1. Ignore the tuple

2. Fill in the missing value manually

3. Fill in the missing value with a global constant

4. Fill in the missing value with a measure of the attribute's central tendency: the mean or the median

5. Use the attribute mean or median of all samples that belong to the same class as the given tuple

6. Fill in the missing value with the most probable value, determined by regression, Bayesian inference, or decision tree induction

Method 6 is the most popular strategy.
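As a rough illustration of methods 3, 4, and 5, the sketch below fills missing values in a small list. The function names, the use of `None` to mark a missing value, and the sample `incomes` and `risk` data are all assumptions made for this example; method 6 needs a trained model and is not shown.

```python
from collections import defaultdict
from statistics import mean, median


def fill_with_constant(values, constant="Unknown"):
    """Method 3: replace every missing value (None) with one global constant."""
    return [constant if v is None else v for v in values]


def fill_with_center(values, center=median):
    """Method 4: replace missing values with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    fill = center(known)
    return [fill if v is None else v for v in values]


def fill_with_class_center(values, labels, center=mean):
    """Method 5: replace missing values with the center of the tuple's own class.

    Assumes every class has at least one known value.
    """
    by_class = defaultdict(list)
    for v, label in zip(values, labels):
        if v is not None:
            by_class[label].append(v)
    fills = {label: center(vs) for label, vs in by_class.items()}
    return [fills[label] if v is None else v for v, label in zip(values, labels)]


# Hypothetical data: income (in thousands) and a credit-risk class label.
incomes = [30, None, 50, 70, None, 90]
risk = ["low", "low", "low", "high", "high", "high"]

print(fill_with_constant(incomes))
print(fill_with_center(incomes))            # median of 30, 50, 70, 90 is 60
print(fill_with_class_center(incomes, risk))  # class means: low 40, high 80
```

Filling by class (method 5) keeps the filled value closer to similar tuples than a single global mean or median would.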

3.2.2 Noisy data

Noise: a random error or variance in a measured variable.

Data smoothing techniques:

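Binning: Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning consults only the neighborhood of each value, it performs local smoothing.

Example (sorted data): 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into equal-frequency bins of 3 values each:

Bin 1: 4, 8, 15

Bin 2: 21, 21, 24

Bin 3: 25, 28, 34

Smoothing by bin means: each value is replaced by the mean of its bin.

Bin 1: 9, 9, 9

Bin 2: 22, 22, 22

Bin 3: 29, 29, 29

Smoothing by bin boundaries: each value is replaced by the closer of its bin's minimum and maximum. In general, the wider the bins, the stronger the smoothing effect.

Bin 1: 4, 4, 15

Bin 2: 21, 21, 24

Bin 3: 25, 25, 34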
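The binning example can be reproduced with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, the input is assumed to be sorted already, and a value equally far from both boundaries is assigned to the lower one (an arbitrary choice).

```python
def equal_frequency_bins(sorted_values, size):
    """Split already-sorted values into consecutive bins of `size` values each."""
    return [sorted_values[i:i + size] for i in range(0, len(sorted_values), size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Ties go to the lower boundary (an assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


values = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(values, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```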

Regression: Smooth the data by fitting it to a function. Linear regression finds the best line through two attributes so that one attribute can be used to predict the other.
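A minimal sketch of smoothing by linear regression follows. It fits a least-squares line and replaces each observed value with the value on the line. The function name `fit_line` and the sample data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements of an attribute y against an attribute x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each noisy y with the fitted value on the line.
smoothed = [slope * x + intercept for x in xs]
print(round(slope, 2), round(intercept, 2))
print([round(v, 2) for v in smoothed])
```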

Outlier analysis: Outliers can be detected by clustering. Values that fall outside every cluster can be treated as outliers.

3.2.3 Data cleansing as a process

3.3 Data Integration

3.3.1 Entity identification problem

3.3.2 Redundancy and correlation analysis

Some redundancy can be detected by correlation analysis. Given two attributes, this analysis measures how strongly one attribute implies the other, based on the available data.

For nominal data, use the chi-square (χ²) test; for numeric attributes, use the correlation coefficient and covariance. All of these assess how one attribute's values vary with those of another.

1. Chi-square (χ²) correlation test for nominal data

The chi-square test checks the hypothesis that A and B are independent.

Example 3.1 Correlation analysis of nominal attributes using chi-square

Suppose 1500 people are surveyed, and for each person we record their gender and whether they prefer to read fiction or non-fiction. This gives two attributes: gender and preferred reading.

Preferred reading   Male        Female        Total
Fiction             250 (90)    200 (360)     450
Non-fiction         50 (210)    1000 (840)    1050
Total               300         1200          1500

The numbers in parentheses are expected frequencies. The expected frequency of the cell (male, fiction) is:

E11 = count(male) × count(fiction) / n = 300 × 450 / 1500 = 90

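The chi-square value is computed as:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where O_ij is the observed frequency and E_ij is the expected frequency of cell (i, j). For this table:

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} \approx 507.94$$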

The degrees of freedom are (2-1) × (2-1) = 1.
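The calculation for Example 3.1 can be checked with a short script. This is a sketch under stated assumptions: the function name `chi_square` is made up for illustration, and the critical value 10.828 is the standard table value for 1 degree of freedom at the 0.001 significance level.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `observed` is a list of rows of observed counts, without the totals.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency of cell (i, j)
            chi2 += (o - e) ** 2 / e

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof


# Rows: fiction, non-fiction. Columns: male, female.
observed = [[250, 200],
            [50, 1000]]

chi2, dof = chi_square(observed)
print(round(chi2, 2), dof)

# Standard chi-square table value for 1 degree of freedom at the 0.001 level.
CRITICAL_VALUE = 10.828
print("reject independence" if chi2 > CRITICAL_VALUE else "cannot reject independence")
```

Because the statistic is far above the critical value, the hypothesis that gender and preferred reading are independent would be rejected for this data.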

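2. Correlation coefficient for numeric data

For numeric data, the correlation between two attributes A and B can be estimated by computing the correlation coefficient r_{A,B}:

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{n\,\sigma_A\,\sigma_B}$$

where n is the number of tuples, a_i and b_i are the values of A and B in tuple i, \bar{A} and \bar{B} are the means of A and B, and \sigma_A and \sigma_B are their standard deviations.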

The value lies between -1 and +1. If r_{A,B} is greater than 0, A and B are positively correlated: the values of A increase as the values of B increase. The larger the value, the stronger the correlation, so a high r_{A,B} indicates that A or B may be removed as redundant.

If the value is 0, A and B are uncorrelated: there is no linear relationship between them.

If the value is less than 0, A and B are negatively correlated: the values of one attribute increase as the values of the other decrease.

Note that correlation does not imply causality: if A and B are correlated, it does not follow that A causes B or that B causes A.
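To make the three cases concrete, here is a small sketch that computes the correlation coefficient from its definition, using population standard deviations (dividing by n). The function name `pearson_r` and the sample data are hypothetical.

```python
from math import sqrt


def pearson_r(a, b):
    """Correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by n, not n - 1).
    std_a = sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    std_b = sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (std_a * std_b)


# Hypothetical attributes chosen to show each case.
a = [1, 2, 3, 4, 5]
rises_with_a = [2, 4, 6, 8, 10]   # perfectly positively correlated with a
falls_with_a = [10, 8, 6, 4, 2]   # perfectly negatively correlated with a
unrelated = [1, 4, 2, 4, 1]       # no linear relationship with a

print(pearson_r(a, rises_with_a))
print(pearson_r(a, falls_with_a))
print(pearson_r(a, unrelated))
```

In the first case one of the two attributes carries no extra information and could be dropped as redundant.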

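3. Covariance of numeric data

Correlation and covariance are two similar measures of how much two attributes change together. The mean values of A and B are also known as their expected values, E(A) and E(B).

The covariance of A and B is defined as:

$$\mathrm{Cov}(A,B) = E\big((A - E(A))(B - E(B))\big) = \frac{1}{n}\sum_{i=1}^{n}(a_i - E(A))(b_i - E(B))$$

which simplifies to:

$$\mathrm{Cov}(A,B) = E(A \cdot B) - E(A)\,E(B)$$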

For two attributes A and B that tend to change together, if A is greater than E(A), then B is likely to be greater than E(B), so the covariance of A and B is positive. Conversely, if one attribute tends to be above its expected value when the other is below its expected value, the covariance of A and B is negative.

If A and B are independent, then E(A·B) = E(A)·E(B), so the covariance is 0. The converse does not hold: a covariance of 0 does not necessarily mean that the attributes are independent.

Example: Covariance analysis of numeric attributes

Time point   AllElectronics   HighTech
T1           6                20
T2           5                10
T3           4                14
T4           3                5
T5           2                5

E(AllElectronics) = (6 + 5 + 4 + 3 + 2) / 5 = 20 / 5 = 4

E(HighTech) = (20 + 10 + 14 + 5 + 5) / 5 = 54 / 5 = 10.8

Cov(AllElectronics, HighTech) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5) / 5 - 4 × 10.8 = 50.2 - 43.2 = 7

The covariance is positive, so the two companies' stock prices tend to rise together.
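The same result can be checked numerically. The sketch below computes the covariance both from the definition and from the shortcut E(A·B) - E(A)·E(B); the function names are chosen for illustration only.

```python
def covariance_definition(a, b):
    """Cov(A, B) as the mean of (a_i - E(A)) * (b_i - E(B))."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def covariance_shortcut(a, b):
    """Cov(A, B) as E(A * B) - E(A) * E(B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    mean_ab = sum(x * y for x, y in zip(a, b)) / n
    return mean_ab - mean_a * mean_b


# Values from the table above, time points T1 to T5.
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

print(round(covariance_definition(all_electronics, high_tech), 6))
print(round(covariance_shortcut(all_electronics, high_tech), 6))
```

Both forms give 7 (up to floating-point rounding), matching the hand calculation.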

3.3.3 Tuple duplication

3.3.4 Detection and resolution of data value conflicts
