Data Mining Notes (III): Data Preprocessing


1. Problems with raw data: inconsistency, duplication, noise, and high dimensionality.

2. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction.

3. Principles of data used in Data Mining

Appropriate attributes should be selected from the raw data as data mining attributes. During selection, give each attribute name and attribute value as clear a meaning as possible; unify the attribute-value encoding across multiple data sources; remove unique attributes, remove duplicates, remove negligible fields, and choose associated fields sensibly.

4. Methods for handling missing values: ignore the record, remove the attribute, enter the missing value manually, use a default value, use the attribute mean, use the mean of similar samples, or predict the most likely value.
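A minimal sketch of three of these strategies; the records and values are hypothetical:

```python
# Hypothetical attribute values; None marks a missing value.
records = [1200, None, 1500, 1800, None, 2000]

# Strategy 1: ignore records that have a missing value.
complete = [v for v in records if v is not None]

# Strategy 2: fill missing values with a default value.
filled_default = [v if v is not None else 0 for v in records]

# Strategy 3: fill missing values with the attribute's mean over known values.
mean = sum(complete) / len(complete)
filled_mean = [v if v is not None else mean for v in records]
```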

5. Noise data processing methods: binning, clustering, combined computer and human inspection, and regression.

6. Binning: binning is a simple and common preprocessing method that determines a final value by examining neighboring data. A "bin" is a subinterval of the attribute's value range: if an attribute value falls within a subinterval, it is placed in the bin that subinterval represents. The data to be processed (the values of one attribute) are distributed into bins according to some rule, the data in each bin are examined, and each bin's data are then processed by some method. Binning raises two main questions: how to partition the data into bins, and how to smooth the data within each bin.

There are four binning methods: equal-depth binning, equal-width binning, the minimum-entropy method, and user-defined intervals.

Equal-depth (equal-frequency) binning partitions the dataset into bins by record count, so that each bin holds the same number of records; that count is called the depth of the bin. This is the simplest binning method.

Equal-width binning divides the entire range of the attribute value into intervals of equal size, so each bin spans a constant interval called the bin width.

User-defined intervals: intervals can be customized as needed, which makes it easy to observe the data distribution within particular intervals.

For example, the values of the customer income attribute, after sorting (RMB): 800 1000 1200 1500 1500 1800 2000 2300 2500 2800 3000 3500 4000 4500 4800 5000. The binning results are as follows.

Equal depth: set the depth to 4; after binning:

Bin 1: 800 1000 1200 1500

Bin 2: 1500 1800 2000 2300

Bin 3: 2500 2800 3000 3500

Bin 4: 4000 4500 4800 5000

Equal width: set the bin width to 1000 yuan; after binning:

Bin 1: 800 1000 1200 1500 1500 1800

Bin 2: 2000 2300 2500 2800 3000

Bin 3: 3500 4000 4500

Bin 4: 4800 5000

User-defined: for example, divide customer income into groups of less than 1000 yuan, 1000~2000, 2000~3000, 3000~4000, and 4000 yuan or more; after binning:

Bin 1: 800

Bin 2: 1000 1200 1500 1500 1800 2000

Bin 3: 2300 2500 2800 3000

Bin 4: 3500 4000

Bin 5: 4500 4800 5000
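The three schemes above can be sketched in Python. The interval conventions (open vs. closed boundaries) are assumptions, so equal-width and user-defined boundaries may fall slightly differently than in the worked example:

```python
incomes = [800, 1000, 1200, 1500, 1500, 1800, 2000, 2300,
           2500, 2800, 3000, 3500, 4000, 4500, 4800, 5000]

def equal_depth(values, depth):
    """Equal-depth binning: consecutive groups of `depth` sorted records."""
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def equal_width(values, width):
    """Equal-width binning: half-open intervals of constant `width`."""
    lo = min(values)
    bins = [[] for _ in range((max(values) - lo) // width + 1)]
    for v in values:
        bins[(v - lo) // width].append(v)
    return [b for b in bins if b]

def user_defined(values, edges):
    """User-defined intervals given by the sorted lower edges of each group."""
    bins = [[] for _ in range(len(edges) + 1)]
    for v in values:
        bins[sum(v >= e for e in edges)].append(v)
    return bins
```

Here `equal_depth(incomes, 4)` reproduces the equal-depth result above, while `equal_width(incomes, 1000)` groups by half-open 1000-yuan intervals starting at 800.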

7. Data smoothing methods: by bin means, by bin boundaries, and by bin medians.

(1) Smoothing by bin means

Calculate the mean of the values in a bin and replace every value in the bin with that mean.

(2) Smoothing by bin boundaries

Replace each value in the bin with the nearer of the bin's minimum and maximum (boundary) values.

(3) Smoothing by bin medians

Replace every value in the bin with the bin's median.
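A minimal sketch of the three smoothing rules applied to a single bin (the bin values are taken from the equal-depth example above):

```python
from statistics import median

bin_values = [1500, 1800, 2000, 2300]

# (1) Smooth by bin means: every value becomes the bin's mean.
mean = sum(bin_values) / len(bin_values)
by_mean = [mean] * len(bin_values)

# (2) Smooth by bin boundaries: each value moves to the nearer boundary.
lo, hi = min(bin_values), max(bin_values)
by_boundary = [lo if v - lo <= hi - v else hi for v in bin_values]

# (3) Smooth by bin medians: every value becomes the bin's median.
by_median = [median(bin_values)] * len(bin_values)
```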

8. Clustering: groups a set of physical or abstract objects into classes composed of similar objects.

Values that fall outside the clusters (isolated points) are identified and removed; these isolated points are treated as noise.
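A minimal sketch of the idea, using hypothetical 1-D data, a single cluster, and an assumed distance threshold:

```python
points = [1.0, 1.2, 0.9, 1.1, 5.0]   # hypothetical data; 5.0 is isolated

centroid = sum(points) / len(points)  # center of the (single) cluster
threshold = 2.0                       # assumed cutoff distance
outliers = [p for p in points if abs(p - centroid) > threshold]
cleaned = [p for p in points if abs(p - centroid) <= threshold]
```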

9. Regression: seeks the pattern of change between two related variables. By fitting the data to a function, regression smooths the data and builds a mathematical model that can predict the next value; it includes linear regression and nonlinear regression.
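A minimal least-squares sketch of linear-regression smoothing; the sample points are made up and sit near y = 2x:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # noisy observations around y = 2x

# Ordinary least squares for y = a*x + b.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

smoothed = [a * x + b for x in xs]   # replace noisy ys with fitted values
next_pred = a * 6.0 + b              # the model can also predict the next value
```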

10. Data integration: combines heterogeneous data from multiple files or databases into a consistent data store. The following issues must be considered: (1) schema matching; (2) data redundancy; (3) data value conflicts.

11. Data transformation: (1) smoothing; (2) aggregation; (3) data generalization; (4) normalization: min-max, zero-mean (z-score), and decimal scaling; (5) attribute construction.


13. Data reduction: obtains a dataset for data mining that is much smaller than the original data yet preserves data integrity, so that mining the reduced dataset yields the same results as mining the original data.

Data reduction methods: (1) data cube aggregation: aggregation operations are applied to the data in a data cube; (2) dimension reduction: detect and delete irrelevant, weakly relevant, or redundant attributes; (3) data compression: encode and compress the dataset; (4) numerosity reduction: represent the data in a smaller form, such as shorter data units or a data model; (5) discretization and concept hierarchy generation: discretize continuous data, replacing original values with a fixed, finite set of interval labels; a concept hierarchy replaces low-level concepts with high-level concepts to reduce the number of distinct values.

14. Data cube aggregation: a multidimensional model and representation of the data, composed of dimensions and facts.

Dimension reduction: remove irrelevant attributes to reduce the volume of data processed by data mining.

The basic methods of attribute subset selection are: (1) stepwise forward selection; (2) stepwise backward elimination; (3) a combination of forward selection and backward elimination; (4) decision tree induction; (5) reduction based on statistical analysis.

Data compression: there are two types of methods, lossless compression and lossy compression.

Common methods of numerosity reduction: (1) histograms; (2) clustering; (3) sampling: simple random sampling without replacement, simple random sampling with replacement, cluster sampling, and stratified sampling; (4) linear regression; (5) nonlinear regression.
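A minimal sketch of three of the sampling variants, with a fixed seed so the run is reproducible; the data and strata are hypothetical:

```python
import random

data = list(range(1, 101))   # hypothetical records 1..100
rng = random.Random(42)      # fixed seed for reproducibility

# Simple random sampling without replacement: no record drawn twice.
srs_wor = rng.sample(data, 10)

# Simple random sampling with replacement: a record may repeat.
srs_wr = [rng.choice(data) for _ in range(10)]

# Stratified sampling: draw from each stratum separately.
strata = {"low": data[:50], "high": data[50:]}
stratified = {name: rng.sample(group, 5) for name, group in strata.items()}
```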

15. Data transformation involves the following aspects: (1) smoothing; (2) aggregation; (3) data generalization; (4) normalization: (a) min-max normalization, (b) zero-mean normalization, (c) decimal scaling normalization; (5) attribute construction.

* Normalization: (1) Min-max normalization. The original value range is [old_min, old_max], and the normalized range is [new_min, new_max]:

x' = (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

where x is the actual attribute value and x' is the normalized value.

For example, the customer income attribute in the "customer background data" table must be normalized to [0.0, 1.0]. Applying the formula to the attribute value 73600 yields:

x' = (73600 - old_min) / (old_max - old_min) * (1.0 - 0) + 0 = 0.716

Keeping decimal places according to the required precision (assume 0.01), the final value 0.72 is the normalization of the attribute value 73600.
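The worked example does not preserve the attribute's original range; a minimal sketch assuming the range was [12000, 98000], which reproduces the stated result 0.716:

```python
def min_max(x, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [old_min, old_max] onto [new_min, new_max]."""
    return (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# Assumed range [12000, 98000]; the target range [0.0, 1.0] is from the text.
v = min_max(73600, 12000, 98000)
```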

(2) Zero-mean normalization (z-score normalization) is based on the mean and standard deviation of the attribute values:

x' = (x - mean) / std

where mean is the average of all sample values of the attribute and std is the sample standard deviation. This method can be used when the attribute's actual value range is unknown.

For example, if an attribute has mean 80 and standard deviation 25, the zero-mean normalization of 66 is x' = (66 - 80) / 25 = -0.56.
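The same computation as the example, as a small sketch:

```python
def z_score(x, mean, std):
    """Zero-mean (z-score) normalization: (x - mean) / std."""
    return (x - mean) / std

v = z_score(66, 80, 25)   # the example: mean 80, standard deviation 25
```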

(3) Decimal scaling normalization: normalizes by moving the decimal point of the attribute values:

x' = x / 10^j

where j is the smallest integer such that max(|x'|) < 1.

For example, suppose an attribute's value range before normalization is [-120, 110] and decimal scaling is used to normalize 66. The maximum absolute value of the attribute is 120, so the condition max(|x'|) < 1 gives j = 3. Therefore 66 normalizes to x' = 66 / 10^3 = 0.066.
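A minimal sketch of decimal scaling, reproducing the example:

```python
def decimal_scaling(x, max_abs):
    """Decimal scaling: divide by 10**j, j smallest with max_abs / 10**j < 1."""
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return x / 10 ** j, j

# Range [-120, 110] -> maximum absolute value 120 -> j = 3.
v, j = decimal_scaling(66, 120)
```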
