Wang Yuefen, Zhang Chengzhi, Zhang Beibei, Wu Tingting
- Definition: Data cleansing is the final procedure for discovering and correcting recognizable errors in a data file, including checking data consistency and handling invalid and missing values. Unlike questionnaire review, data cleansing is usually done by computer rather than manually.
- Objective: The purpose of data cleansing is to provide accurate and effective data for information systems.
- Rationale: Relevant techniques, such as statistical methods, data mining methods, and pattern rules, are used to transform dirty data into data that meets data quality requirements. By implementation mode and scope, data cleansing can be divided into the following 4 kinds:
Ⅰ Manual implementation
Ⅱ Writing a dedicated application
Ⅲ Solving problems within a particular application domain
Ⅳ Data cleansing independent of any specific application domain
Approaches Ⅲ and Ⅳ are highly general.
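The pattern-rule idea mentioned under Rationale can be illustrated with a small sketch. This is not a method taken from the reviewed paper: the field names (`name`, `age`), the valid age range, the placeholder for missing names, and the choice to replace invalid values with `None` are all assumptions made for the example.

```python
# Hypothetical rule-based cleaning sketch; field names and rules are assumptions.

def clean_record(record, default_name="UNKNOWN", age_range=(0, 120)):
    """Return a cleaned copy of one record (a dict)."""
    cleaned = dict(record)

    # Missing value: fill an absent or blank name with a placeholder.
    name = cleaned.get("name")
    if name is None or not str(name).strip():
        cleaned["name"] = default_name
    else:
        # Consistency: normalize whitespace and capitalization.
        cleaned["name"] = " ".join(str(name).split()).title()

    # Invalid value: a non-numeric or out-of-range age becomes None.
    try:
        age = int(cleaned.get("age"))
    except (TypeError, ValueError):
        age = None
    if age is not None and not (age_range[0] <= age <= age_range[1]):
        age = None
    cleaned["age"] = age
    return cleaned


dirty = [
    {"name": "  alice   smith ", "age": "34"},
    {"name": "", "age": "-5"},
    {"name": "BOB", "age": "abc"},
]
cleaned = [clean_record(r) for r in dirty]
```

Each rule here is deliberately simple. In practice the rules would come from the data quality requirements of the target system, and invalid values might be flagged for review rather than silently replaced.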
Ⅰ Trillium model (Trillium [7] is enterprise-wide data cleansing software created by the Trillium Software Systems division of Harte-Hanks Data Technologies.)
The data cleansing process is divided into 5 steps:
Ⅱ Bohn model
Data cleansing is divided into the following 4 main parts:
Ⅲ AJAX model
Data cleansing is divided into 5 steps:
- Data Cleaning Tools
Ⅰ Cleaning tools for specific functions
Ⅱ ETL tools (data warehouse)
Ⅲ Other tools
Engine-based tools
Data analysis tools
Business process redesign tools
Data profiling tools
Data mining tools
- Data Cleansing Assessment
Ⅰ Credibility
Accuracy: Describes whether the data is consistent with the characteristics of the real-world entity it represents.
Completeness: Describes whether the data has missing records or missing fields.
Consistency: Describes whether the values of the same attribute of the same entity are consistent across different systems.
Validity: Describes whether the data meets user-defined conditions or falls within a certain range of domain values.
Uniqueness: Describes whether the data has duplicate records.
Ⅱ Availability
Timeliness: Describes whether the data is current or historical.
Stability: Describes whether the data is stable and whether it is within its validity period.
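Several of the assessment dimensions above can be expressed as simple ratios over a record set. The following is a minimal sketch under stated assumptions, not a scoring scheme from the reviewed paper: the function names, the sample fields (`id`, `email`, `age`, `updated`), the 0 to 120 age range, and the 90-day freshness window are all hypothetical.

```python
from datetime import date

def completeness(records, fields):
    """Fraction of field slots that are filled (neither None nor empty string)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
    return filled / total

def validity(records, field, is_valid):
    """Fraction of records whose field passes a user-defined predicate."""
    if not records:
        return 1.0
    return sum(1 for r in records if is_valid(r.get(field))) / len(records)

def uniqueness(records, key_fields):
    """Fraction of records that are distinct on the given key fields."""
    if not records:
        return 1.0
    keys = {tuple(r.get(f) for f in key_fields) for r in records}
    return len(keys) / len(records)

def timeliness(records, field, today, max_age_days):
    """Fraction of records whose date field is within max_age_days of today."""
    if not records:
        return 1.0
    fresh = sum(1 for r in records if (today - r[field]).days <= max_age_days)
    return fresh / len(records)


# Hypothetical sample data: one duplicate, two missing emails, one invalid age.
records = [
    {"id": 1, "email": "a@example.com", "age": 30, "updated": date(2024, 1, 10)},
    {"id": 2, "email": "", "age": 200, "updated": date(2023, 1, 10)},
    {"id": 1, "email": "a@example.com", "age": 30, "updated": date(2024, 1, 10)},
    {"id": 3, "email": None, "age": 45, "updated": date(2024, 1, 1)},
]

scores = {
    "completeness": completeness(records, ["id", "email", "age"]),
    "validity": validity(records, "age", lambda a: a is not None and 0 <= a <= 120),
    "uniqueness": uniqueness(records, ["id", "email"]),
    "timeliness": timeliness(records, "updated", date(2024, 1, 31), 90),
}
```

Accuracy and cross-system consistency are omitted because they require a reference source or a second system to compare against, which a single record set cannot provide.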
"Data Cleansing" 2007-Review of data cleansing