Data Mining Notes (III): Data Preprocessing


1. Problems with raw data: inconsistency, duplication, noise, and high dimensionality.

2. Data preprocessing includes data cleaning, data integration, data transformation, and data reduction.

3. Principles of data used in Data Mining

Appropriate attributes should be selected from the raw data as data mining attributes. During selection, give each attribute name and attribute value as clear a meaning as possible; unify the attribute-value encoding across multiple data sources; remove unique attributes, remove duplicates, remove negligible fields, and choose associated fields sensibly.

4. Methods for handling missing values: ignore the record, remove the attribute, enter the missing value manually, use a default value, use the attribute mean, use the mean of similar samples, or predict the most likely value.
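A minimal sketch of three of these strategies; the records and values are hypothetical:

```python
# Hypothetical attribute values; None marks a missing value.
records = [1200, None, 1500, 1800, None, 2000]

# Strategy 1: ignore records that have a missing value.
complete = [v for v in records if v is not None]

# Strategy 2: fill missing values with a default value.
filled_default = [v if v is not None else 0 for v in records]

# Strategy 3: fill missing values with the attribute's mean over known values.
mean = sum(complete) / len(complete)
filled_mean = [v if v is not None else mean for v in records]
```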

5. Noise data processing methods: binning, clustering, combined computer and human inspection, and regression.

6. Binning: binning is a simple and common preprocessing method that determines a final value by examining neighboring data. A "bin" is a subinterval of the attribute's value range: if an attribute value falls within a subinterval, it is placed in the bin that subinterval represents. The data to be processed (the values of one attribute) are distributed into bins according to some rule, the data in each bin are examined, and each bin's data are then processed by some method. Binning raises two main questions: how to partition the data into bins, and how to smooth the data within each bin.

There are four binning methods: equal-depth binning, equal-width binning, the minimum-entropy method, and user-defined intervals.

Equal-depth (equal-frequency) binning partitions the dataset into bins by record count, so that each bin holds the same number of records; that count is called the depth of the bin. This is the simplest binning method.

Equal-width binning divides the entire range of the attribute value into intervals of equal size, so each bin spans a constant interval called the bin width.

User-defined intervals: intervals can be customized as needed, which makes it easy to observe the data distribution within particular intervals.

For example, the values of the customer income attribute, after sorting (RMB): 800 1000 1200 1500 1500 1800 2000 2300 2500 2800 3000 3500 4000 4500 4800 5000. The binning results are as follows.

Equal depth: set the depth to 4; after binning:

Bin 1: 800 1000 1200 1500

Bin 2: 1500 1800 2000 2300

Bin 3: 2500 2800 3000 3500

Bin 4: 4000 4500 4800 5000

Equal width: set the bin width to 1000 yuan; after binning:

Bin 1: 800 1000 1200 1500 1500 1800

Bin 2: 2000 2300 2500 2800 3000

Bin 3: 3500 4000 4500

Bin 4: 4800 5000

User-defined: for example, divide customer income into groups of less than 1000 yuan, 1000~2000, 2000~3000, 3000~4000, and 4000 yuan or more; after binning:

Bin 1: 800

Bin 2: 1000 1200 1500 1500 1800 2000

Bin 3: 2300 2500 2800 3000

Bin 4: 3500 4000

Bin 5: 4500 4800 5000
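The three schemes above can be sketched in Python. The interval conventions (open vs. closed boundaries) are assumptions, so equal-width and user-defined boundaries may fall slightly differently than in the worked example:

```python
incomes = [800, 1000, 1200, 1500, 1500, 1800, 2000, 2300,
           2500, 2800, 3000, 3500, 4000, 4500, 4800, 5000]

def equal_depth(values, depth):
    """Equal-depth binning: consecutive groups of `depth` sorted records."""
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def equal_width(values, width):
    """Equal-width binning: half-open intervals of constant `width`."""
    lo = min(values)
    bins = [[] for _ in range((max(values) - lo) // width + 1)]
    for v in values:
        bins[(v - lo) // width].append(v)
    return [b for b in bins if b]

def user_defined(values, edges):
    """User-defined intervals given by the sorted lower edges of each group."""
    bins = [[] for _ in range(len(edges) + 1)]
    for v in values:
        bins[sum(v >= e for e in edges)].append(v)
    return bins
```

Here `equal_depth(incomes, 4)` reproduces the equal-depth result above, while `equal_width(incomes, 1000)` groups by half-open 1000-yuan intervals starting at 800.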

7. Data smoothing methods: by bin means, by bin boundaries, and by bin medians.

(1) Smoothing by bin means

Calculate the mean of the values in a bin and replace every value in the bin with that mean.

(2) Smoothing by bin boundaries

Replace each value in the bin with the nearer of the bin's minimum and maximum (boundary) values.

(3) Smoothing by bin medians

Replace every value in the bin with the bin's median.
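A minimal sketch of the three smoothing rules applied to a single bin (the bin values are taken from the equal-depth example above):

```python
from statistics import median

bin_values = [1500, 1800, 2000, 2300]

# (1) Smooth by bin means: every value becomes the bin's mean.
mean = sum(bin_values) / len(bin_values)
by_mean = [mean] * len(bin_values)

# (2) Smooth by bin boundaries: each value moves to the nearer boundary.
lo, hi = min(bin_values), max(bin_values)
by_boundary = [lo if v - lo <= hi - v else hi for v in bin_values]

# (3) Smooth by bin medians: every value becomes the bin's median.
by_median = [median(bin_values)] * len(bin_values)
```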

8. Clustering: groups a set of physical or abstract objects into classes composed of similar objects.

Values that fall outside the clusters (isolated points) are identified and removed; these isolated points are treated as noise.
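A minimal sketch of the idea, using hypothetical 1-D data, a single cluster, and an assumed distance threshold:

```python
points = [1.0, 1.2, 0.9, 1.1, 5.0]   # hypothetical data; 5.0 is isolated

centroid = sum(points) / len(points)  # center of the (single) cluster
threshold = 2.0                       # assumed cutoff distance
outliers = [p for p in points if abs(p - centroid) > threshold]
cleaned = [p for p in points if abs(p - centroid) <= threshold]
```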

9. Regression: seeks the pattern of change between two related variables. By fitting the data to a function, regression smooths the data and builds a mathematical model that can predict the next value; it includes linear regression and nonlinear regression.
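A minimal least-squares sketch of linear-regression smoothing; the sample points are made up and sit near y = 2x:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # noisy observations around y = 2x

# Ordinary least squares for y = a*x + b.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

smoothed = [a * x + b for x in xs]   # replace noisy ys with fitted values
next_pred = a * 6.0 + b              # the model can also predict the next value
```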

10. Data integration: combines heterogeneous data from multiple files or databases into a consistent data store. The following issues must be considered: (1) schema matching; (2) data redundancy; (3) data value conflicts.

11. Data transformation: (1) smoothing; (2) aggregation; (3) data generalization; (4) normalization: min-max, zero-mean (z-score), and decimal scaling; (5) attribute construction.


13. Data reduction: obtains a dataset for data mining that is much smaller than the original data yet preserves data integrity, so that mining the reduced dataset yields the same results as mining the original data.

Data reduction methods: (1) data cube aggregation: aggregation operations are applied to the data in a data cube; (2) dimension reduction: detect and delete irrelevant, weakly relevant, or redundant attributes; (3) data compression: encode and compress the dataset; (4) numerosity reduction: represent the data in a smaller form, such as shorter data units or a data model; (5) discretization and concept hierarchy generation: discretize continuous data, replacing original values with a fixed, finite set of interval labels; a concept hierarchy replaces low-level concepts with high-level concepts to reduce the number of distinct values.

14. Data cube aggregation: a multidimensional model and representation of the data, composed of dimensions and facts.

Dimension reduction: remove irrelevant attributes to reduce the volume of data processed by data mining.

The basic methods of attribute subset selection are: (1) stepwise forward selection; (2) stepwise backward elimination; (3) a combination of forward selection and backward elimination; (4) decision tree induction; (5) reduction based on statistical analysis.

Data compression: there are two types of methods, lossless compression and lossy compression.

Common methods of numerosity reduction: (1) histograms; (2) clustering; (3) sampling: simple random sampling without replacement, simple random sampling with replacement, cluster sampling, and stratified sampling; (4) linear regression; (5) nonlinear regression.
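A minimal sketch of three of the sampling variants, with a fixed seed so the run is reproducible; the data and strata are hypothetical:

```python
import random

data = list(range(1, 101))   # hypothetical records 1..100
rng = random.Random(42)      # fixed seed for reproducibility

# Simple random sampling without replacement: no record drawn twice.
srs_wor = rng.sample(data, 10)

# Simple random sampling with replacement: a record may repeat.
srs_wr = [rng.choice(data) for _ in range(10)]

# Stratified sampling: draw from each stratum separately.
strata = {"low": data[:50], "high": data[50:]}
stratified = {name: rng.sample(group, 5) for name, group in strata.items()}
```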

15. Data transformation involves the following aspects: (1) smoothing; (2) aggregation; (3) data generalization; (4) normalization: (a) min-max normalization, (b) zero-mean normalization, (c) decimal scaling normalization; (5) attribute construction.

* Normalization: (1) Min-max normalization. The original value range is [old_min, old_max], and the normalized range is [new_min, new_max]:

x' = (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

where x is the actual attribute value and x' is the normalized value.

For example, the customer income attribute in the "customer background data" table must be normalized to [0.0, 1.0]. Applying the formula to the attribute value 73600 yields:

x' = (73600 - old_min) / (old_max - old_min) * (1.0 - 0) + 0 = 0.716

Keeping decimal places according to the required precision (assume 0.01), the final value 0.72 is the normalization of the attribute value 73600.
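The worked example does not preserve the attribute's original range; a minimal sketch assuming the range was [12000, 98000], which reproduces the stated result 0.716:

```python
def min_max(x, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [old_min, old_max] onto [new_min, new_max]."""
    return (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# Assumed range [12000, 98000]; the target range [0.0, 1.0] is from the text.
v = min_max(73600, 12000, 98000)
```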

(2) Zero-mean normalization (z-score normalization) is based on the mean and standard deviation of the attribute values:

x' = (x - mean) / std

where mean is the average of all sample values of the attribute and std is the sample standard deviation. This method can be used when the attribute's actual value range is unknown.

For example, if an attribute has mean 80 and standard deviation 25, the zero-mean normalization of 66 is x' = (66 - 80) / 25 = -0.56.
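The same computation as the example, as a small sketch:

```python
def z_score(x, mean, std):
    """Zero-mean (z-score) normalization: (x - mean) / std."""
    return (x - mean) / std

v = z_score(66, 80, 25)   # the example: mean 80, standard deviation 25
```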

(3) Decimal scaling normalization: normalizes by moving the decimal point of the attribute values:

x' = x / 10^j

where j is the smallest integer such that max(|x'|) < 1.

For example, suppose an attribute's value range before normalization is [-120, 110] and decimal scaling is used to normalize 66. The maximum absolute value of the attribute is 120, so the condition max(|x'|) < 1 gives j = 3. Therefore 66 normalizes to x' = 66 / 10^3 = 0.066.
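A minimal sketch of decimal scaling, reproducing the example:

```python
def decimal_scaling(x, max_abs):
    """Decimal scaling: divide by 10**j, j smallest with max_abs / 10**j < 1."""
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return x / 10 ** j, j

# Range [-120, 110] -> maximum absolute value 120 -> j = 3.
v, j = decimal_scaling(66, 120)
```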
