The purpose of data preprocessing is to improve data quality. The three elements of data quality are accuracy, completeness, and consistency.
Tasks for data preprocessing:
- Data cleaning
- Data integration
- Data reduction
- Data transformation
Data cleaning: fill in missing values, smooth noisy data, identify outliers, and correct inconsistencies in the data
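- Missing values: ignore the tuple
- Missing values: fill in the missing value manually
- Missing values: use a global constant
- Missing values: use a measure of central tendency of the attribute (mean or median)
- Missing values: use the attribute mean or median of all samples that belong to the same class as the given tuple
- Missing values: use the most probable value (the most popular strategy)
- Noisy data: binning
- Noisy data: regression
- Noisy data: outlier analysis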
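Two of the cleaning techniques listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the dictionary-based row layout, and the sample price values are assumptions made for the example. The first function fills a missing value with the mean of the same class, and the second smooths sorted values by equal-frequency bin means.

```python
from statistics import mean


def fill_with_class_mean(rows, attr, label):
    """Fill None in rows[i][attr] with the mean of attr within the same class.

    rows is a list of dicts (an assumed layout). Falls back to the overall
    mean when a class has no observed value; assumes at least one value of
    attr is observed somewhere.
    """
    by_class = {}
    for row in rows:
        if row[attr] is not None:
            by_class.setdefault(row[label], []).append(row[attr])
    class_means = {cls: mean(vals) for cls, vals in by_class.items()}
    overall = mean(v for vals in by_class.values() for v in vals)

    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input is left untouched
        if new_row[attr] is None:
            new_row[attr] = class_means.get(new_row[label], overall)
        filled.append(new_row)
    return filled


def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins, and replace each
    value with the mean of its bin. The last bin may be smaller."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Illustrative price values, smoothed with bins of three values each.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Smoothing by bin medians or bin boundaries follows the same pattern, with only the replacement value per bin changing.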
Data integration: merging data from multiple data stores
- Entity identification problem
- Redundancy and correlation analysis
- Tuple duplication
- Detection and resolution of data value conflicts
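For numeric attributes, redundancy is commonly checked with the Pearson correlation coefficient: a value close to +1 or -1 suggests that one attribute can largely be derived from the other. The sketch below shows that check together with a simple duplicate-tuple scan. The function names and the tuple-based row layout are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Raises ZeroDivisionError if either attribute is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def find_duplicate_tuples(rows):
    """Return every row (a hashable tuple) that repeats an earlier row."""
    seen = set()
    duplicates = []
    for row in rows:
        if row in seen:
            duplicates.append(row)
        else:
            seen.add(row)
    return duplicates
```

For nominal attributes, a chi-square test on the contingency table plays the corresponding role.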
Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data
- Overview of data reduction strategies
- Dimensionality reduction
- Numerosity reduction
- Data compression
- Wavelet transform: a linear signal processing technique, suitable for high-dimensional data (HTTP://HI.BAIDU.COM/QINGSHUANGCII/ITEM/31E8831E65350DDE64EABF4C)
- Principal component analysis: a dimensionality reduction method, suitable for sparse data
- Attribute subset selection
- Regression and log-linear models: parametric data reduction
- Histogram
- Clustering
- Sampling
- Data cube aggregation
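As one concrete case from the list above, the wavelet transform can be illustrated with the Haar wavelet, the simplest member of the family. The sketch below uses the unnormalized averaging and differencing form and assumes the input length is a power of two. Reduction comes from keeping only the strongest coefficients and zeroing the rest. The function names are hypothetical, and practical implementations usually normalize the coefficients before thresholding.

```python
def haar_transform(values):
    """Unnormalized Haar wavelet transform by pairwise averaging and differencing.

    Returns [overall average, detail coefficients from coarsest to finest].
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    current = list(values)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details  # coarser levels end up in front
        current = averages
    return current + details


def haar_inverse(coeffs):
    """Rebuild the signal from the output of haar_transform."""
    current = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        expanded = []
        for avg, diff in zip(current, diffs):
            expanded.extend([avg + diff, avg - diff])
        pos += len(current)
        current = expanded
    return current


def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients (the reduction step)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_transform(signal)
print(coeffs)  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(haar_inverse(keep_largest(coeffs, 2)))  # [1.5, 1.5, 1.5, 1.5, 4.0, 4.0, 4.0, 4.0]
```

Keeping all eight coefficients reconstructs the signal exactly. Keeping two gives a coarse step approximation stored in a quarter of the space.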
Data transformation and data discretization: the data are transformed or consolidated into forms appropriate for mining, so that the resulting patterns are easier to understand
- Overview of data transformation strategies
- Normalization
- Discretization by binning
- Discretization by histogram analysis
- Discretization by cluster, decision tree, and correlation analyses
- Concept hierarchy generation for nominal data
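Normalization and binning-based discretization from the list above reduce to short formulas. Min-max normalization maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min, and z-score normalization maps v to (v - mean) / standard deviation. The sketch below is a minimal illustration with hypothetical function names. It uses the population standard deviation and equal-width bins, both of which are choices made for this example.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, k):
    """Discretize values into k equal-width intervals, returning a bin index
    from 0 to k - 1 for each value. The maximum goes into the last bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]


print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
print(equal_width_bins([0, 5, 10, 15, 20, 30], 3))  # [0, 0, 1, 1, 2, 2]
```

Min-max normalization preserves the relationships among the original values but is sensitive to outliers, since a single extreme value stretches the range. Z-score normalization is the usual alternative when the minimum and maximum are unknown or dominated by outliers.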
Reading notes on "Data Mining: Concepts and Techniques": data preprocessing