3.1 Data preprocessing
Three elements of data quality: accuracy, completeness, and consistency.
3.1.2 Main tasks of data preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies, thereby "cleaning up" the data.
Data integration: merge data from multiple sources into a coherent data store.
Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces (almost) the same analytical results.
3.2 Data Cleansing
3.2.1 Missing values
1. Ignore the tuple
2. Fill in the missing value manually
3. Fill in the missing value with a global constant
4. Fill in the missing value with a measure of the attribute's central tendency, such as the mean or median
5. Use the attribute mean or median of all samples belonging to the same class as the given tuple
6. Fill in the missing value with the most probable value, predicted by regression, Bayesian inference, or a decision tree
Strategy 6 is the most popular.
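Strategies 4 and 5 can be sketched in a few lines of Python; the toy records and the column names (`class`, `income`) below are made up for illustration:

```python
# Sketch of strategies 4 and 5: fill missing values with the attribute
# median, or with the median of tuples belonging to the same class.
# The dataset and the column names are invented for this example.
from statistics import median

rows = [
    {"class": "A", "income": 30}, {"class": "A", "income": None},
    {"class": "A", "income": 50}, {"class": "B", "income": 80},
    {"class": "B", "income": None}, {"class": "B", "income": 100},
]

def fill_with_median(rows, attr):
    """Strategy 4: one global median for the whole attribute."""
    m = median(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: r[attr] if r[attr] is not None else m})
            for r in rows]

def fill_with_class_median(rows, attr, cls="class"):
    """Strategy 5: a separate median per class label."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[cls], []).append(r[attr])
    meds = {c: median(v) for c, v in by_class.items()}
    return [dict(r, **{attr: r[attr] if r[attr] is not None else meds[r[cls]]})
            for r in rows]

print(fill_with_median(rows, "income")[1]["income"])        # global median: 65
print(fill_with_class_median(rows, "income")[1]["income"])  # class-A median: 40
```

The class-conditional fill (strategy 5) gives a different value per class: the class-B gap would be filled with 90 rather than the global 65.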
3.2.2 Noisy data
Noise: random error or variance in a measured variable.
Data smoothing techniques:
Binning: smooths sorted data values by consulting their "neighborhood", i.e., the values around them. The sorted values are distributed into a number of buckets, or bins. Because binning consults the neighborhood of values, it performs local smoothing.
Example: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-frequency bins of 3 values each:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum; the wider the bin, the stronger the smoothing):
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
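Both smoothing variants on the example data can be reproduced directly:

```python
# Binning the example data 4,8,15,21,21,24,25,28,34 into equal-frequency
# bins of 3 values, then smoothing by bin means and by bin boundaries.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

def smooth_by_means(bins):
    # Replace every value in a bin with the bin's (rounded) mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace every value with the closer of the bin's min and max.
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```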
Regression: smooths data by fitting it to a function. Linear regression finds the "best" line to fit two attributes, so that one attribute can be used to predict the other.
Outlier analysis: outliers may be detected by clustering; values that fall far outside all clusters are treated as outliers.
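A minimal sketch of clustering-based outlier detection, here a tiny 1-D k-means followed by a distance check; the data, the initial centers, and the threshold of 20 are all illustrative choices, not prescribed by the text:

```python
# Sketch: cluster 1-D data with a tiny k-means, then flag points that
# lie far from every cluster center. Data, initial centers, and the
# distance threshold are assumptions made for this example.
def kmeans_1d(data, centers, iters=20):
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for x in data:
            nearest = min(centers, key=lambda c: abs(x - c))
            groups[nearest].append(x)
        # Move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return centers

data = [4, 5, 6, 25, 26, 27, 90]      # 90 falls far from both clusters
centers = kmeans_1d(data, [4.0, 27.0])
outliers = [x for x in data
            if min(abs(x - c) for c in centers) > 20]  # threshold assumed
print(outliers)  # [90]
```

Note that a gross outlier pulls its cluster center toward itself (here the second center drifts up to 42), which is why the threshold has to be fairly generous.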
3.2.3 Data cleansing as a process
3.3 Data Integration
3.3.1 Entity identification problem
3.3.2 Redundancy and correlation analysis
Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data.
For nominal data we use the chi-square (χ²) test; for numeric attributes we use the correlation coefficient and covariance. All of these assess how one attribute's values vary with respect to another's.
1. Chi-square correlation test for nominal data
The chi-square test's null hypothesis is that A and B are independent.
Example 3.1: correlation analysis of nominal attributes using chi-square
Suppose 1500 people are surveyed and, for each person, we record the gender and whether their preferred reading material is fiction, giving two attributes: gender and preferred reading.
            | Man      | Woman      | Total
Fiction     | 250 (90) | 200 (360)  | 450
Non-fiction | 50 (210) | 1000 (840) | 1050
Total       | 300      | 1200       | 1500
(Expected frequencies are shown in parentheses.)
The expected frequency of cell (man, fiction) is:
e11 = count(man) × count(fiction) / n = 300 × 450 / 1500 = 90
By the chi-square formula,
χ² = Σ_{i,j} (o_ij − e_ij)² / e_ij
   = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
The degrees of freedom are (2−1)(2−1) = 1. For 1 degree of freedom, the χ² value needed to reject the independence hypothesis at the 0.001 significance level is 10.828; since 507.93 far exceeds this, gender and preferred reading are strongly correlated.
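The expected counts and the chi-square statistic for this 2×2 table can be verified in a few lines. Summing the terms at full precision gives 507.94; the textbook's 507.93 comes from rounding each term to two decimals before summing:

```python
# Expected counts and chi-square statistic for the 2x2 gender/fiction
# contingency table: e_ij = row_total_i * col_total_j / n.
observed = [[250, 200],    # fiction:     man, woman
            [50, 1000]]    # non-fiction: man, woman

n = sum(sum(row) for row in observed)           # 1500
row_tot = [sum(row) for row in observed]        # [450, 1050]
col_tot = [sum(col) for col in zip(*observed)]  # [300, 1200]

expected = [[r * c / n for c in col_tot] for r in row_tot]
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

print(expected)         # [[90.0, 360.0], [210.0, 840.0]]
print(round(chi2, 2))   # 507.94
```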
2. Correlation coefficient for numeric data
For numeric data, the correlation between two attributes A and B can be estimated by computing the (Pearson) correlation coefficient
r_{A,B} = Σ_i (a_i − Ā)(b_i − B̄) / (n · σ_A · σ_B)
where n is the number of tuples, Ā and B̄ are the means, and σ_A and σ_B the standard deviations of A and B.
The value lies between −1 and +1. If r_{A,B} > 0, A and B are positively correlated: the values of A increase as the values of B increase. The larger the value, the stronger the correlation; a high r_{A,B} may therefore indicate that A (or B) can be removed as a redundancy.
If r_{A,B} = 0, there is no (linear) correlation between A and B.
If r_{A,B} < 0, A and B are negatively correlated: the values of one attribute increase as the values of the other decrease.
Note that correlation does not imply causality: if A and B are correlated, it does not mean that A causes B or that B causes A.
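The definition translates directly to code (using the population standard deviation; the sample values are reused from the stock-price example that follows, purely for illustration):

```python
# Pearson correlation coefficient computed from its definition:
# r = sum((a_i - mean_A)(b_i - mean_B)) / (n * sigma_A * sigma_B).
from math import sqrt

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = sqrt(sum((x - ma) ** 2 for x in a) / n)  # population std dev
    sb = sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb)

a = [6, 5, 4, 3, 2]       # illustrative series (stock prices below)
b = [20, 10, 14, 5, 5]
print(round(pearson_r(a, b), 3))  # ≈ 0.867, a strong positive correlation
```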
3. Covariance of numeric data
Correlation and covariance are two similar measures that assess how two attributes change together. The means of A and B, Ā = E(A) and B̄ = E(B), are also called their expected values. The covariance of A and B is defined as
Cov(A, B) = E[(A − Ā)(B − B̄)] = E(A·B) − Ā·B̄
For two attributes A and B that tend to change together, if A is greater than Ā, then B is likely to be greater than B̄; the covariance of A and B is then positive. Conversely, if one attribute tends to be below its expected value when the other is above its expected value, the covariance of A and B is negative.
If A and B are independent, then E(A·B) = E(A)·E(B) and the covariance is 0. The converse does not hold: a covariance of 0 does not necessarily imply independence.
Example: covariance analysis of numeric attributes
Time point | AllElectronics | HighTech
t1         | 6              | 20
t2         | 5              | 10
t3         | 4              | 14
t4         | 3              | 5
t5         | 2              | 5
E(AllElectronics) = (6+5+4+3+2)/5 = 4
E(HighTech) = (20+10+14+5+5)/5 = 10.8
Cov = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 − 4 × 10.8 = 50.2 − 43.2 = 7
The covariance is positive, indicating that the two companies' stock prices tend to rise together.
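The arithmetic can be checked directly with the shortcut formula Cov(A, B) = E(A·B) − Ā·B̄:

```python
# Covariance of the two stock-price series at the five time points,
# via Cov(A, B) = E(A*B) - mean_A * mean_B.
all_electronics = [6, 5, 4, 3, 2]
high_tech = [20, 10, 14, 5, 5]

n = len(all_electronics)
mean_a = sum(all_electronics) / n  # 4.0
mean_b = sum(high_tech) / n        # 10.8
e_ab = sum(a * b for a, b in zip(all_electronics, high_tech)) / n  # 50.2
cov = e_ab - mean_a * mean_b
print(round(cov, 2))  # 7.0
```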
3.3.3 Tuple duplication
3.3.4 Detection and resolution of data value conflicts
Reading notes on Data Mining: Concepts and Techniques (III): data preprocessing