Concepts and Techniques of Data Mining - Chapter 3: Data Preprocessing

I. Data preprocessing
1. Data is of high quality if it satisfies the requirements of its intended applications. Data quality involves many factors: accuracy, completeness, consistency, timeliness, believability, and interpretability.

2. The main data preprocessing tasks are data cleaning, data integration, data reduction, and data transformation.
II. Data cleaning: attempts to fill in missing values, smooth out noise, identify outliers, and correct inconsistencies in the data.
1. Handling missing values:
1) Ignore the tuple: usually done when the class label is missing. However, the remaining attributes of the ignored tuple are lost as well, even if they are useful.
2) Fill in the missing value manually: very time-consuming, and may not be feasible when the data set is large and many values are missing.
3) Use a global constant to fill in the missing value: replace all missing values of the attribute with the same constant. (Simple but unreliable.)
4) Use a measure of central tendency of the attribute (mean or median) to fill in the missing value: the mean can be used for normal (symmetric) data distributions; the median should be used for skewed (asymmetric) data.
5) Use the attribute mean or median of all samples belonging to the same class as the given tuple: group the data by the class attribute and compute the fill-in value (mean or median) within that class.
6) Use the most probable value to fill in the missing value: determine it by induction with regression, Bayesian formalism, or a decision-tree-based inference tool.
Methods 3) through 6) bias the data, since the filled-in value may not be correct; even so, 6) is the most popular strategy. In addition, a missing value does not necessarily mean the data is erroneous.
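As an illustration of strategies 4) and 5), here is a minimal sketch using pandas; the column names income and customer_class and the values are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical data set with missing values in the numeric attribute "income".
df = pd.DataFrame({
    "customer_class": ["A", "A", "B", "B", "B"],
    "income": [50000, np.nan, 32000, np.nan, 40000],
})

# Strategy 4): fill with the attribute mean (use the median for skewed data).
df["income_global_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 5): fill with the mean of all samples in the same class as the given tuple.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("customer_class")["income"].transform("mean")
)

print(df)
```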

2. Noise: random error or variance in a measured variable. Data smoothing techniques:
1) Binning: sort the data and partition the sorted values into (for example, equal-frequency) bins, and then smooth them by one of the following (a sketch follows after this list):
Smoothing by bin means: each value in a bin is replaced by the mean of the bin.
Smoothing by bin medians: each value in a bin is replaced by the bin median.
Smoothing by bin boundaries: the minimum and maximum values in a given bin are taken as the bin boundaries, and each value in the bin is replaced by the closer boundary value.
2) Regression: fit the data to a function in order to smooth it. 3) Outlier analysis: use clustering to detect outliers. Many data smoothing methods are also used for data discretization (a form of data transformation) and data reduction.
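A minimal sketch of equal-frequency binning with smoothing by bin means; the attribute values are made up for illustration:

```python
import numpy as np

# Made-up sorted attribute values (e.g., prices).
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

n_bins = 3
bins = np.array_split(np.sort(data), n_bins)  # equal-frequency (equi-depth) bins

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```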
3. Data cleaning as a process: 1) Step 1: discrepancy detection. Many factors can cause discrepancies, such as erroneous data entry, deliberate errors, data decay (outdated data), inconsistent encodings, device errors, and system errors. How can discrepancies be detected?
Use any knowledge already available about the properties of the data: metadata (data about the data), basic statistical descriptions of the data (mean, median, mode, variance, standard deviation, etc.), uniqueness rules, consecutive rules, and null rules.

III. Data integration: data mining often requires data integration, that is, merging data from multiple data stores.
(1) Entity identification: schema integration and object matching can be tricky. For example, how can a computer know that customer_id in one database and cust_number in another database refer to the same attribute? Metadata is used: the metadata for each attribute includes its name, meaning, data type, allowed range of values, and rules for handling null values. Such metadata can be used to help avoid errors in schema integration and can also help in transforming the data. (2) Redundancy and correlation analysis: an attribute may be redundant if it can be "derived" from another attribute or set of attributes; inconsistencies in attribute or dimension naming can also cause redundancy in the resulting data set. Solution: some redundancies can be detected by correlation analysis. For nominal data, use the χ2 (chi-square) test. For numeric attributes, use the correlation coefficient and covariance to assess how the values of one attribute vary with those of another.
1) χ2 correlation test for nominal data. Suppose A and B are two nominal attributes; the values of A label the columns and the values of B label the rows of a contingency table. Then (Ai, Bj) denotes the joint event that attribute A takes its i-th value and attribute B takes its j-th value. The χ2 value is computed as

χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij

where o_ij is the observed frequency (actual count) of the joint event (Ai, Bj) and e_ij is its expected frequency, computed as

e_ij = count(A = a_i) × count(B = b_j) / n

where n is the number of data tuples and count(A = a_i) is the number of tuples having value a_i for A. The χ2 test is based on the hypothesis that A and B are independent; if this hypothesis can be rejected, we say that A and B are statistically correlated.
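A minimal sketch of the χ2 computation with numpy; the contingency table counts are made up, and scipy's chi2_contingency is shown only as a cross-check:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up contingency table: rows are values of B, columns are values of A.
observed = np.array([
    [250, 200],    # e.g., B = "fiction":     A = "male", A = "female"
    [50, 1000],    # e.g., B = "non-fiction"
])

n = observed.sum()
# Expected frequencies under independence: e_ij = row_total * col_total / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi2 = ((observed - expected) ** 2 / expected).sum()
print(chi2)

# Cross-check with scipy (also returns the p-value and degrees of freedom).
chi2_scipy, p_value, dof, _ = chi2_contingency(observed, correction=False)
print(chi2_scipy, p_value)
```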

2) Correlation coefficient for numeric data
The correlation between two numeric attributes A and B can be estimated by computing their correlation coefficient r_A,B (also known as Pearson's product moment coefficient):

r_A,B = Σ_{i=1}^{n} (a_i − Ā)(b_i − B̄) / (n·σ_A·σ_B)

where n is the number of tuples, Ā and B̄ are the mean values of A and B, σ_A and σ_B are their standard deviations, and a_i, b_i are the values of A and B in the i-th tuple. Note that −1 ≤ r_A,B ≤ +1. If r_A,B > 0, A and B are positively correlated, meaning that the values of A increase as the values of B increase; the larger the value, the stronger the correlation. Hence a high r_A,B may indicate that A (or B) can be removed as a redundancy. If r_A,B = 0, A and B are uncorrelated (there is no linear relationship between them). If r_A,B < 0, A and B are negatively correlated: the values of one attribute increase as the values of the other decrease, meaning each attribute discourages the other.
Note: correlation does not imply causality. That is, if A and B are correlated, it does not follow that A causes B or that B causes A.
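A minimal numpy sketch of the correlation coefficient; the attribute values are made up:

```python
import numpy as np

a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])     # hypothetical attribute A
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])  # hypothetical attribute B

# Pearson correlation coefficient r_{A,B} computed from the definition.
n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / (n * a.std() * b.std())
print(r)

# np.corrcoef gives the same value (off-diagonal entry of the 2x2 matrix).
print(np.corrcoef(a, b)[0, 1])
```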
3) Covariance of numeric data:
The means of A and B, also called the expected values of A and B, are

E(A) = Ā = (Σ_{i=1}^{n} a_i) / n,   E(B) = B̄ = (Σ_{i=1}^{n} b_i) / n

The covariance of A and B is defined as

Cov(A, B) = E((A − Ā)(B − B̄)) = Σ_{i=1}^{n} (a_i − Ā)(b_i − B̄) / n

The relationship between covariance and correlation is

r_A,B = Cov(A, B) / (σ_A·σ_B),   and   Cov(A, B) = E(A·B) − Ā·B̄

For attributes A and B that tend to change together: if A is larger than its expected value, then B is likely to be larger than its expected value as well, so the covariance of A and B is positive. Conversely, if one attribute tends to be below its expected value when the other is above its expected value, the covariance of A and B is negative.
If A and B are independent (i.e., they have no correlation), then E(A·B) = E(A)·E(B), and hence Cov(A, B) = E(A·B) − E(A)·E(B) = 0. The converse, however, is not true: some random variables (attributes) may have a covariance of 0 without being independent.
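A minimal numpy sketch of the covariance computation, using the same made-up attribute values as above:

```python
import numpy as np

a = np.array([6.0, 5.0, 4.0, 3.0, 2.0])     # hypothetical attribute A
b = np.array([20.0, 10.0, 14.0, 5.0, 5.0])  # hypothetical attribute B

# Covariance from the definition: E(A*B) - E(A)*E(B).
cov_ab = (a * b).mean() - a.mean() * b.mean()
print(cov_ab)  # positive => A and B tend to rise together

# np.cov with bias=True uses the same 1/n normalization.
print(np.cov(a, b, bias=True)[0, 1])
```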
(3) Tuple duplication. (4) Detection and resolution of data value conflicts. For example, one school may grade on a letter scale of A-F while another uses a numeric scale of 1-10; it is difficult to devise precise grade conversion rules between the two schools, which makes information exchange difficult.

IV. Data reduction: data reduction techniques can be used to obtain a reduced representation of a data set that is much smaller in volume yet closely maintains the integrity of the original data. (1) Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression. 1) Dimensionality reduction: reduce the number of random variables or attributes under consideration. Methods include wavelet transforms and principal component analysis, which transform or project the original data onto a smaller space. Attribute subset selection is also a dimensionality reduction method, in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed. 2) Numerosity reduction: replace the original data with a smaller alternative representation of the data. 3) Data compression: use transformations to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the reduction is lossless; if only an approximation of the original data can be reconstructed, it is lossy.
(2) Discrete wavelet transform (DWT): a linear signal processing technique that transforms a data vector X into a numerically different vector X' of wavelet coefficients.
(3) Principal component analysis (PCA), also known as the Karhunen-Loeve (K-L) method, searches for k n-dimensional orthogonal vectors that can best represent the data, where k <= n.
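A minimal sketch of PCA with scikit-learn; the random data and the choice of k = 2 are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 tuples with n = 5 numeric attributes

pca = PCA(n_components=2)              # keep k = 2 orthogonal component vectors
X_reduced = pca.fit_transform(X)       # project the data onto the smaller space

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```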
(4) Attribute subset selection: reduces the data size by removing irrelevant or redundant attributes (dimensions). The goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. In addition, mining on a reduced attribute set reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand. (5) Regression and log-linear models: parametric data reduction; regression and log-linear models can be used to approximate the given data.
Log-linear models approximate discrete multidimensional probability distributions. Given a set of tuples in n dimensions, we can consider each tuple as a point in an n-dimensional space. For a set of discretized attributes, log-linear models can be used to estimate the probability of each point in the multidimensional space from a smaller subset of dimensional combinations.
(6) Histograms use binning to approximate the data distribution and are a popular form of data reduction.
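A minimal numpy sketch of a histogram as a reduced representation; the price values are made up:

```python
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 15, 15, 18, 20, 21, 25, 28, 30, 30])

# Approximate the distribution with 3 equal-width buckets: only the bucket
# boundaries and counts need to be stored instead of all the raw values.
counts, edges = np.histogram(prices, bins=3)
print(counts)  # tuples per bucket
print(edges)   # bucket boundaries
```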

(7) Clustering: clustering techniques treat data tuples as objects and partition the objects into groups, or clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in other clusters. Similarity is commonly defined in terms of a distance function.
(8) Sampling: sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (a sketch follows after the list below).
Simple random sample without replacement (SRSWOR) of size s: s tuples are drawn from the N tuples of the data set (s < N), where the probability of drawing any particular tuple is 1/N.
Simple random sample with replacement (SRSWR) of size s: after a tuple is drawn, it is placed back into the data set so that it may be drawn again.
Cluster sampling: the tuples of the data set are grouped into M mutually disjoint "clusters"; then a simple random sample (SRS) of s clusters can be obtained, with s < M.
Stratified sampling: the data set is divided into mutually disjoint parts called "strata". This helps ensure a representative sample, especially when the data are skewed (e.g., stratified by age).
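A minimal pandas/numpy sketch of SRSWOR, SRSWR, and stratified sampling; the data, the column age_group, and the sample sizes are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age_group": ["youth"] * 60 + ["adult"] * 30 + ["senior"] * 10,
    "value": rng.normal(size=100),
})

s = 10
srswor = df.sample(n=s, replace=False, random_state=0)  # without replacement
srswr = df.sample(n=s, replace=True, random_state=0)    # with replacement

# Stratified sampling: draw 10% from each stratum (age_group).
stratified = df.groupby("age_group", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
print(len(srswor), len(srswr), len(stratified))
```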

V. Data transformation and data discretization. (1) Data transformation strategies: 1) Smoothing: remove noise from the data; techniques include binning, regression, and clustering.
2) Attribute construction (feature construction): new attributes are constructed from the given attributes and added to the attribute set to help the mining process.
3) Aggregation: summarization or aggregation of the data.
4) Normalization: the attribute data are scaled so as to fall within a specified interval.
5) Discretization: the raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20) or by conceptual labels (e.g., youth, adult, senior); see the sketch after this list.
6) Concept hierarchy generation from nominal data: an attribute (e.g., street) is generalized to a higher-level concept (e.g., city).
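A minimal pandas sketch of discretizing a numeric attribute into interval labels and concept labels; the cut points and label names are illustrative:

```python
import pandas as pd

ages = pd.Series([3, 15, 22, 37, 45, 61, 70, 85])

# Interval labels: replace raw ages by range labels such as (0, 10], (10, 20], ...
ranges = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 60, 120])

# Concept labels: replace raw ages by higher-level concepts.
concepts = pd.cut(ages, bins=[0, 17, 65, 120], labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "range": ranges, "concept": concepts}))
```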
(2) Normalization techniques. Min-max normalization: performs a linear transformation on the original data, mapping a value v of attribute A to v' in the interval [new_min_A, new_max_A]:

v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
Z-score normalization (zero-mean normalization): the values of attribute A are normalized based on the mean Ā and standard deviation σ_A of A:

v' = (v − Ā) / σ_A
Normalization by decimal scaling: normalizes by moving the decimal point of the values of attribute A:

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
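A minimal numpy sketch of the three normalization techniques; the attribute values and the target interval [0, 1] are illustrative:

```python
import numpy as np

v = np.array([-986.0, 200.0, 340.0, 730.0, 917.0])

# Min-max normalization to the interval [new_min, new_max] = [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score (zero-mean) normalization.
zscore = (v - v.mean()) / v.std()

# Normalization by decimal scaling: divide by 10^j so that max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")
```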
(3) Discretization by binning: binning does not use class information and is therefore an unsupervised discretization technique. It is sensitive to the user-specified number of bins and susceptible to outliers.
(4) Discretization by histogram analysis: like binning, histogram analysis is an unsupervised discretization technique because it does not use class information.
(5) Discretization by clustering, decision tree, and correlation analysis: clustering can discretize a numeric attribute by partitioning its values into clusters or groups. Techniques for generating classification decision trees can also be applied to discretization; decision-tree approaches to discretization are supervised, since they make use of class labels.
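A minimal sketch of clustering-based discretization with scikit-learn's KMeans; the values and the choice of 3 clusters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([3, 8, 15, 22, 25, 37, 45, 52, 61, 70, 85], dtype=float).reshape(-1, 1)

# Partition the numeric values into 3 clusters; the cluster index becomes
# the discrete label for each value.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)
print(km.labels_)           # discrete label per tuple
print(km.cluster_centers_)  # one representative value per cluster
```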
(6) Concept hierarchy generation for nominal data: nominal attributes have a possibly large number of distinct values with no ordering among them, such as geographic location and product type.
Four methods for generating concept hierarchies for nominal data:
1) Specification of a partial ordering of attributes explicitly at the schema level by users or experts. (A concept hierarchy for a nominal attribute or dimension typically involves a group of attributes; a user or expert can easily define the concept hierarchy by specifying a partial or total ordering of these attributes.)
2) Specification of a portion of the hierarchy by explicit data grouping: for a small portion of the data, the groupings are stated explicitly.
3) Specification of a set of attributes but not of their partial ordering: the user states that a set of attributes forms a concept hierarchy without explicitly describing their partial ordering; the system can then try to generate the attribute ordering automatically so as to construct a meaningful concept hierarchy.
A useful heuristic: an attribute at a higher conceptual level generally has fewer distinct values, while an attribute at a lower level has more (a sketch of this heuristic follows after the list). This does not hold in every case; for example, a year contains 365 days, a month about 30 days, and a week only 7 days.
4) Specification of only a partial set of attributes: the user specifies only a subset of the relevant attributes for the hierarchy.
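A minimal pandas sketch of the distinct-value heuristic from method 3): attributes are ordered from fewest distinct values (most general) to most distinct values (most specific). The location data are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "street":   ["Main St", "Oak Ave", "Pine Rd", "Elm St", "2nd Ave"],
    "city":     ["Springfield", "Springfield", "Shelbyville", "Shelbyville", "Ogdenville"],
    "province": ["IL", "IL", "IL", "IL", "KY"],
    "country":  ["USA", "USA", "USA", "USA", "USA"],
})

# Heuristic: fewer distinct values => higher (more general) conceptual level.
hierarchy = df.nunique().sort_values().index.tolist()
print(hierarchy)  # ['country', 'province', 'city', 'street']
```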
