Original: http://dataunion.org/5009.html
One: Why preprocess data?
(1) Real-world data is dirty (incomplete, contains noise, inconsistent)
(2) Without high-quality data there can be no high-quality mining results (high-quality decisions depend on high-quality data; data warehouses require the consistent integration of high-quality data)
(3) Problems in the original data:
Inconsistency -- the data contains discrepancies
Duplication -- the data contains repeated records
Incompleteness -- attributes of interest are missing or lack values
Noise -- the data contains errors or outliers (deviations from the expected values)
High dimensionality
Two: Methods of data preprocessing
(1) Data cleaning -- remove noise and irrelevant data
(2) Data integration -- combine data from multiple data sources into a consistent data store
(3) Data transformation -- transform the raw data into a form suitable for data mining
(4) Data reduction -- the main methods include data cube aggregation, dimensionality reduction, data compression, numerosity reduction, discretization, and concept hierarchy generation
(5) A diagram illustrating the above (figure in the original post)
Three: Reference principles for data selection
(1) Make the meaning of attribute names and attribute values as explicit as possible
(2) Unify the attribute encoding across multiple data sources
(3) Remove unique attributes (attributes whose values are distinct for every record; see the sketch after this list)
(4) Remove duplicate attributes
(5) Remove ignorable fields
(6) Select the relevant fields sensibly
(7) Further processing:
Handle noise, null values, missing values, and inconsistent data by filling in missing data, eliminating anomalous data, smoothing noisy data, and correcting the inconsistencies
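As an illustration, here is a minimal pandas sketch of principles (3)-(5); the DataFrame and its column names are invented for demonstration and are not from the original post:

```python
import pandas as pd

# Hypothetical raw data; the column names and values are made up for illustration.
df = pd.DataFrame({
    "record_id": [101, 102, 103, 104],          # unique for every row, e.g. an ID
    "age":       [25, 32, 47, 32],
    "age_copy":  [25, 32, 47, 32],              # duplicates the "age" attribute
    "constant":  ["x", "x", "x", "x"],          # a single value -> ignorable
    "income":    [3000, 4500, 5200, 4500],
})

# (3) Remove unique attributes: columns whose values are distinct for every record.
unique_cols = [c for c in df.columns if df[c].nunique() == len(df)]

# (4) Remove duplicate attributes: columns that repeat an earlier column.
dup_cols = [c2 for i, c1 in enumerate(df.columns)
            for c2 in df.columns[i + 1:] if df[c1].equals(df[c2])]

# (5) Remove ignorable fields: here, columns with only one distinct value.
ignorable_cols = [c for c in df.columns if df[c].nunique() <= 1]

df_selected = df.drop(columns=list(set(unique_cols + dup_cols + ignorable_cols)))
print(df_selected.columns.tolist())             # ['age', 'income']
```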
Four: Speaking with a picture (I still prefer to explain things with a chart)
The path of data cleansing: obtain the data --> consult with the data provider --> analyze the data (with a visualization tool) to find dirty data --> clean the dirty data (with MATLAB or Java/C++) --> run statistical analysis again (Excel's Data Analysis works well: maximum, minimum, median, mode, mean, variance, and scatter plots) --> remove any newly found dirty data or data unrelated to the experiment --> run the final analysis --> validate against real-world cases --> done.
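As a small example of the "statistical analysis" step, the same summary statistics can be computed in pandas (an arbitrary tool choice; the original mentions Excel). The sample values are made up, with one obvious outlier:

```python
import pandas as pd

# Hypothetical measurement column; the values are made up, with 250.0 as dirty data.
s = pd.Series([3.1, 2.9, 3.0, 3.2, 250.0, 3.1], name="value")

# Maximum, minimum, median, mode, mean, and variance, as listed above.
print(s.max(), s.min(), s.median(), s.mode().tolist(), s.mean(), s.var())

# A scatter plot of value against row index (requires matplotlib);
# the 250.0 reading stands out immediately.
s.reset_index().plot.scatter(x="index", y="value")
```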
One: Data cleaning
Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
1) Methods for handling missing values (a pandas sketch follows this list):
A. Ignore the tuple; this is usually done in classification tasks when the class label is missing
B. Fill in missing values manually; this is not feasible when the data volume is large
C. Fill in missing values with a global constant; simple but unreliable
D. Fill in missing values with the attribute's mean
E. Fill in missing values with the attribute mean of all samples belonging to the same class as the given tuple
F. Fill in missing values with the most probable value, determined by regression, Bayesian inference-based tools, or decision tree induction; this is a popular practice.
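A minimal pandas sketch of methods A, C, D, and E; the DataFrame, its columns, and the class label are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "label" is the class attribute, "income" has missing values.
df = pd.DataFrame({
    "label":  ["yes", "yes", "no", "no", None],
    "income": [3000.0, np.nan, 5200.0, np.nan, 4100.0],
})

# A. Ignore the tuple when the class label is missing.
df = df.dropna(subset=["label"])

# C. Fill with a global constant (simple but unreliable).
filled_constant = df["income"].fillna(-1)

# D. Fill with the attribute's overall mean.
filled_mean = df["income"].fillna(df["income"].mean())

# E. Fill with the mean of the samples in the same class as the tuple.
filled_class_mean = df.groupby("label")["income"].transform(
    lambda x: x.fillna(x.mean()))
```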
2) Data smoothing techniques: noise is random error or variance in a measured variable
A. Binning (see the binning sketch after this list): the binning method smooths sorted data values by consulting their "nearest neighbors" (the surrounding values); the sorted values are distributed into a number of "buckets" or bins. Because it consults the neighboring values, binning performs local smoothing. Common binning techniques: smoothing by bin means, smoothing by bin boundaries, and smoothing by bin medians.
B. Regression: data can be smoothed by fitting it to a function, such as a regression function. Linear regression finds the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other. Multiple linear regression extends this to more than two attributes, fitting the data to a multidimensional surface.
C. Clustering: outliers can be detected by clustering; values that fall outside of all clusters may be treated as outliers.
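For technique A, here is a minimal pandas sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the price values are illustrative only:

```python
import pandas as pd

# Illustrative sorted price data.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into three equal-frequency ("equal-depth") bins.
bins = pd.qcut(prices, q=3, labels=False, duplicates="drop")

# Smoothing by bin means: replace each value with the mean of its bin.
by_mean = prices.groupby(bins).transform("mean")

# Smoothing by bin boundaries: snap each value to the nearest bin boundary.
lo = prices.groupby(bins).transform("min")
hi = prices.groupby(bins).transform("max")
by_boundary = lo.where((prices - lo) <= (hi - prices), hi)

print(by_mean.tolist())        # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
print(by_boundary.tolist())    # [4, 4, 15, 21, 21, 24, 25, 25, 34]
```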
3) Data cleansing as a process: the first step is discrepancy detection. Many commercial tools can help with this step, including data scrubbing tools, data auditing tools, data migration tools, and ETL tools. Newer data cleansing approaches emphasize greater interactivity; Potter's Wheel, for example, integrates discrepancy detection and data transformation.
Two: Data integration and transformation
1) Data integration: most data analysis tasks involve data integration, which merges data from multiple sources into a consistent data store such as a data warehouse. The sources may include multiple databases, data cubes, or flat files. Data integration raises three main problems:
A. Schema integration and object matching (the entity identification problem): how can equivalent real-world entities from multiple information sources be matched? Metadata can help avoid errors in schema integration.
B. Redundancy: some redundancy can be detected by correlation analysis. For numerical data, the correlation between two attributes A and B can be judged by computing their correlation coefficient (the Pearson product-moment coefficient); for discrete data, a chi-square test can be used.
C. Detection and resolution of data value conflicts.
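For the redundancy check in point B, a minimal sketch using scipy (an assumed dependency) on hypothetical numeric and categorical attributes:

```python
import pandas as pd
from scipy import stats

# Hypothetical numeric attributes A and B.
a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
b = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson product-moment correlation: values close to +1 or -1 suggest redundancy.
r, p = stats.pearsonr(a, b)
print(f"correlation r={r:.3f}, p={p:.3f}")

# Hypothetical discrete (categorical) attributes: chi-square test of independence.
x = pd.Series(["red", "red", "blue", "blue", "red", "blue"])
y = pd.Series(["small", "small", "large", "large", "small", "large"])
chi2, p, dof, expected = stats.chi2_contingency(pd.crosstab(x, y))
print(f"chi2={chi2:.3f}, p={p:.3f}")
```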
2) Data transformation: transform or consolidate the data into a form suitable for mining. This includes the following:
A. Smoothing: remove noise from the data, using binning, regression, and clustering
B. Aggregation: summarize or aggregate the data. This step is typically used to construct a data cube for multi-granularity data analysis
C. Data generalization: use concept hierarchies to replace low-level or "raw" data with higher-level concepts.
D. Normalization (also known as feature scaling): scale the attribute data proportionally into a small, specified interval. Normalization methods (a numpy sketch of these formulas follows this list):
1. Min-max normalization: v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
2. Z-score normalization (zero-mean normalization): v' = (v - mean_A) / std_A, where mean_A and std_A are the mean and standard deviation of attribute A
3. Decimal scaling normalization: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
E. Attribute construction (feature construction): new attributes can be constructed and added to the attribute set to help the mining process.
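A minimal numpy sketch of the three normalization formulas listed under D; the data values and the target interval are arbitrary:

```python
import numpy as np

# Arbitrary attribute values and target interval.
v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
new_min, new_max = 0.0, 1.0

# 1. Min-max normalization into [new_min, new_max].
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# 2. Z-score (zero-mean) normalization.
v_zscore = (v - v.mean()) / v.std()

# 3. Decimal scaling: divide by 10**j, with j the smallest integer
#    such that max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / 10 ** j

print(v_minmax, v_zscore, v_decimal, sep="\n")
```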
Three: Data reduction
Data sets can be very large, and complex analysis and mining over massive data can take a long time. Data reduction techniques obtain a reduced representation of the data set that is much smaller yet closely preserves the integrity of the original data. Data reduction strategies are as follows:
1) Data cube aggregation: aggregation operations are applied to the data in a data cube structure; the data cube stores multidimensional aggregated information.
2) Attribute subset selection: see the overview of feature selection algorithms for text categorization
3) Dimensionality reduction: use data encoding or transformations to obtain a reduced or "compressed" representation of the original data. The reduction may be lossless or lossy. Effective lossy dimensionality reduction methods include the wavelet transform and principal component analysis (see the PCA sketch after this list)
4) Numerosity reduction: reduce the data volume by choosing alternative, "smaller" forms of data representation
5) Discretization and concept hierarchy generation
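For method 3, a minimal sketch of principal component analysis via numpy's SVD; the random data matrix and the number of retained components are assumptions for illustration:

```python
import numpy as np

# Hypothetical data matrix: 100 samples with 5 attributes; keep k = 2 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
k = 2

# Center the data, then take the top-k right singular vectors as principal components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:k].T          # lossy, lower-dimensional representation

# Approximate reconstruction shows how closely the original data is preserved.
X_approx = X_reduced @ Vt[:k] + X.mean(axis=0)
print(X_reduced.shape, X_approx.shape)      # (100, 2) (100, 5)
```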
Data preprocessing (full steps)