Classification of data quality problems
This paper focuses on data quality problems at the instance level.
Data quality assessment (12 dimensions)
1) Data specifications: a measure of the existence, completeness, quality, and documentation of data standards, data models, business rules, metadata, and reference data;
2) Data integrity fundamentals: a measure of the existence, validity, structure, content, and other basic characteristics of the data;
3) Duplication: a measure of unwanted duplication of a particular field, record, or data set within or across systems;
4) Accuracy: a measure of the correctness of the data content;
5) Consistency and synchronization: a measure of the equivalence of information stored or used in different data stores, applications, and systems, and of the processes that make the data equivalent;
6) Timeliness and availability: a measure of the degree to which data are current and available to a particular application within the expected time frame;
7) Ease of use and maintainability: a measure of the degree to which data can be accessed and used, and the degree to which data can be updated, maintained, and managed;
8) Data coverage: a measure of the availability and comprehensiveness of the data relative to the total data population or to all objects of interest;
9) Presentation quality: a measure of how effectively information is presented to users and how it is collected from them;
10) Perception, relevance, and trust: a measure of how data quality is perceived and how much confidence users place in it, as well as the importance, usefulness, and relevance of the data to business needs;
11) Data decay: a measure of the rate of negative change in the data;
12) Transactability: a measure of the degree to which data produce the expected business transaction or outcome.
When assessing the data quality of a project, first select the data quality dimensions that fit the project, then draw up an assessment plan for each selected dimension, choose a suitable assessment method for it, and finally combine and analyze all of the assessment results.
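The assessment procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two checks (`completeness`, `uniqueness`), the `assess` helper, the sample records, and the choice of dimensions are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def completeness(records: List[Record], field: str) -> float:
    """Share of records whose `field` is present and non-empty."""
    if not records:
        return 1.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def uniqueness(records: List[Record], key: str) -> float:
    """Share of distinct values in `key`; 1.0 means no duplicated keys."""
    if not records:
        return 1.0
    values = [r.get(key) for r in records]
    return len(set(values)) / len(values)


def assess(
    records: List[Record],
    plan: Dict[str, Callable[[List[Record]], float]],
) -> Dict[str, float]:
    """Run the chosen check for each selected dimension and collect the scores."""
    return {dimension: round(check(records), 2) for dimension, check in plan.items()}


# Hypothetical sample data: one duplicated id, two missing e-mail addresses.
records = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": ""},
    {"id": "2", "email": "b@example.com"},
    {"id": "3", "email": None},
]

# Steps 1-3: select dimensions and attach one assessment method to each.
plan = {
    "integrity fundamentals (email filled)": lambda rs: completeness(rs, "email"),
    "duplication (id unique)": lambda rs: uniqueness(rs, "id"),
}

# Step 4: combine the per-dimension results into one report for analysis.
report = assess(records, plan)
print(report)
```

Keeping the plan as a mapping from dimension name to check function makes it easy to add or drop dimensions per project without touching the code that combines the results.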
Cleaning methods
1) Missing data processing
2) Similar duplicate object detection
3) Abnormal data processing
4) Logic error detection
5) Inconsistent data processing
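As a rough illustration of the five cleaning methods listed above, the following Python sketch shows one simple technique per item. The specific techniques (median imputation, string similarity via `difflib`, a median absolute deviation rule, a date-order rule, and a lookup table) are assumptions chosen for brevity and are not necessarily the ones the paper uses. All function names, thresholds, and the `CANONICAL` table are hypothetical.

```python
from datetime import date
from difflib import SequenceMatcher
from statistics import median


def fill_missing(values):
    """1) Missing data: replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    mid = median(observed)
    return [mid if v is None else v for v in values]


def similar_pairs(names, threshold=0.85):
    """2) Similar duplicates: index pairs whose normalized strings nearly match."""
    norm = [" ".join(n.lower().split()) for n in names]
    return [
        (i, j)
        for i in range(len(norm))
        for j in range(i + 1, len(norm))
        if SequenceMatcher(None, norm[i], norm[j]).ratio() >= threshold
    ]


def outliers(values, k=3.0):
    """3) Abnormal data: values farther than k * MAD from the median."""
    mid = median(values)
    mad = median(abs(v - mid) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against; flag nothing
    return [v for v in values if abs(v - mid) > k * mad]


def logic_errors(rows):
    """4) Logic errors: rows whose end date precedes the start date."""
    return [r for r in rows if r["end"] < r["start"]]


# Hypothetical mapping from variant spellings to one canonical value.
CANONICAL = {"m": "male", "male": "male", "f": "female", "female": "female"}


def normalize_gender(value):
    """5) Inconsistent data: map known variants to a canonical value."""
    return CANONICAL.get(value.strip().lower(), value)


print(fill_missing([10, None, 14, 12]))      # [10, 12, 14, 12]
print(outliers([10, 12, 11, 13, 12, 95]))    # [95]
print(normalize_gender(" M "))               # male
print(similar_pairs(["John Smith", "john  smith", "Mary Jones"]))
print(logic_errors([{"start": date(2013, 5, 1), "end": date(2013, 4, 1)}]))
```

In practice each step usually needs domain-specific rules and thresholds; the pairwise comparison in `similar_pairs`, for example, is quadratic in the number of records and is only suitable for small inputs.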
"Data Cleansing" (2013): Data quality and data cleansing methods