Data quality is the foundation of any data application. Its evaluation criteria mainly cover five aspects: completeness, consistency, accuracy, validity, and timeliness.
Definition of data quality
From the perspective of data users, high-quality data is data that fully meets their requirements.
Data quality standards
1: Timeliness: whether data is acquired and made available in time, mainly covering the timeliness of data extraction, transmission, transformation, loading, and presentation. Timeliness is involved in every stage of data processing. Two aspects are usually considered: first, whether interface data can be extracted on time; second, whether the presentation layer can display the data on time.
2: Completeness: whether the data is complete, i.e., whether the described data elements, element attributes, and element relationships all exist. It mainly covers missing entities, missing attributes, missing records, and the referential integrity of primary and foreign keys.
3: Consistency: first, whether the raw data is consistent, i.e., whether the number of records in the file interface matches the number of records loaded into the warehouse; second, whether the same indicator has the same value everywhere it appears.
4: Validity: whether a data value falls within its defined value domain, mainly covering the validity of data formats, data types, value domains, and related business rules.
5: Accuracy: mainly the accuracy of the indicator algorithms and of the data processing itself. Accuracy is guaranteed by a combination of the indicator algorithms defined in metadata management, the data processing order, and manual inspection.
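As a rough illustration, the sketch below checks a few of these dimensions on a small in-memory record set. The table, column names, and the interface row count are made-up examples, not anything prescribed by the text.

```python
# Minimal sketch of a few dimension checks on a toy record set.
records = [
    {"id": "1", "sex": "f", "amount": 120.0},
    {"id": "2", "sex": "m", "amount": 80.0},
    {"id": None, "sex": "x", "amount": 75.0},   # missing id, invalid sex
]

total = len(records)

# Completeness: share of rows where a required attribute is present.
complete = sum(1 for r in records if r["id"] not in (None, "")) / total

# Validity: share of rows whose value lies in the defined value domain.
valid_sex = sum(1 for r in records if r["sex"] in ("f", "m")) / total

# Consistency: record count in the warehouse matches the interface file count.
interface_row_count = 3          # hypothetical count reported by the source file
consistent = (total == interface_row_count)

print(f"completeness={complete:.2%}, validity={valid_sex:.2%}, consistent={consistent}")
```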
Data quality inspection in the data warehouse
Checks on interface data. Interface data mainly consists of files and databases.
Data quality of the interface content itself: timeliness, validity, completeness
Monitoring of the file interface collection program: whether the collection program starts and ends normally, etc.
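A minimal sketch of what such monitoring could look like for a single interface file, assuming a hypothetical file name, expected row count, and arrival deadline:

```python
# Hypothetical sketch: verify that an interface file arrived before its deadline
# and that its record count matches the count declared by the source system.
import os
from datetime import datetime, time

DATA_FILE = "customer_20240101.dat"      # hypothetical interface file
EXPECTED_ROWS = 1000                      # e.g. taken from a control/ok file
CUTOFF = time(6, 0)                       # data must arrive before 06:00

def check_interface_file(path: str, expected_rows: int) -> dict:
    """Return simple timeliness and completeness flags for one interface file."""
    if not os.path.exists(path):
        return {"arrived": False, "on_time": False, "row_count_ok": False}
    mtime = datetime.fromtimestamp(os.path.getmtime(path))
    with open(path, encoding="utf-8") as f:
        rows = sum(1 for _ in f)
    return {
        "arrived": True,
        "on_time": mtime.time() <= CUTOFF,
        "row_count_ok": rows == expected_rows,
    }

print(check_interface_file(DATA_FILE, EXPECTED_ROWS))
```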
Checks on data at the data warehouse level
Monitoring of the data processing jobs: whether they are scheduled on time and whether they complete successfully.
Inspection of key indicators:
Inspection of basic indicators
Value check: examine the value of a single indicator to detect anomalies and sudden changes. Appropriate thresholds need to be set for this.
Fluctuation check: mainly year-on-year or month-on-month comparison. First compute the year-on-year or month-on-month change rate of the indicator, then compare it with the predefined upper and lower volatility limits (thresholds).
Association check: analyze the changes and fluctuations of two indicators that have a known relationship (for example, a positive correlation, where they should rise and fall together).
Balance check: use simple arithmetic (addition, subtraction, multiplication, division) on several indicator values to verify the balance or other comparative relationships that should hold among them.
Weighted fluctuation check: weight and combine the basic check results of a single indicator with its influencing factors to assess the indicator's fluctuation comprehensively.
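The sketch below shows one possible way to express the value, fluctuation, and balance checks as simple functions; the indicator values and thresholds used at the end are invented examples.

```python
# Sketch of the basic indicator checks described above.
def value_check(value: float, lower: float, upper: float) -> bool:
    """Value check: the indicator must fall inside a preset threshold range."""
    return lower <= value <= upper

def fluctuation_check(current: float, previous: float, max_change: float) -> bool:
    """Fluctuation check: month-on-month (or year-on-year) change within limits."""
    change = (current - previous) / previous
    return abs(change) <= max_change

def balance_check(parts: list[float], total: float, tol: float = 1e-6) -> bool:
    """Balance check: component indicators should add up to the total indicator."""
    return abs(sum(parts) - total) <= tol

print(value_check(1050, 800, 1200))            # within threshold -> True
print(fluctuation_check(1050, 1000, 0.10))     # 5% month-on-month change -> True
print(balance_check([600, 450], 1050))         # parts sum to the total -> True
```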
Data quality evaluation process
Analyze data quality requirements
Determine the evaluation object and scope
Select data quality dimensions and evaluation criteria
Determine quality metrics and evaluation methods
Apply the methods and carry out the evaluation
Analyze the results and assign a rating
Produce the quality results and report
Data quality evaluation method
Basic concepts
The model is M = <D, I, R, W, E, S>, where:
D (dataset): the dataset to be evaluated.
I (indicator): the indicators to be evaluated on dataset D, such as completeness, accuracy, consistency, etc.
R (rule): the rules corresponding to the evaluation indicators.
W (weight): the weight assigned to each rule in R (an integer greater than 0), describing the rule's proportion among all rules.
E (expectation): the expected value (a real number between 0 and 100) assigned to each rule, i.e., the anticipated result of the rule before the evaluation.
S (result): the final score (a real number between 0 and 100) for each rule, obtained after the rule has been checked.
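One possible way to encode this model in code is sketched below; the class and field names are illustrative choices, not part of the model definition.

```python
# One way to encode the model M = <D, I, R, W, E, S> in Python.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Rule:
    name: str                              # human-readable rule description
    indicator: str                         # I: the indicator the rule belongs to
    check: Callable[[dict], bool]          # R: the check applied to one tuple
    weight: int                            # W: an integer greater than 0
    expectation: float                     # E: expected result, 0..100
    result: Optional[float] = None         # S: filled in after evaluation

@dataclass
class EvaluationModel:
    dataset: List[dict]                    # D: the dataset to be evaluated
    rules: List[Rule] = field(default_factory=list)

# Example: a model for a (hypothetical) customer dataset with one rule.
model = EvaluationModel(
    dataset=[{"id": "110101199003070011"}],
    rules=[Rule("ID is not empty", "completeness",
                lambda row: row["id"] not in (None, ""), weight=5, expectation=90)],
)
```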
Construction technique
Constructing the data quality evaluation model involves four steps: determining the application view for dataset evaluation, selecting evaluation indicators, formulating the rule set, and calculating the rule result scores.
The following uses a concrete example to illustrate how to construct a data quality evaluation model.
1. Determine the application view of dataset evaluation
The first step of a data quality assessment is to state the assessment requirements: determine which data the users care about (databases, datasets within databases, and fields within datasets) and establish the corresponding user views for them.
2. Select evaluation indicators
For each given dataset, select the required evaluation indicators. For the customer dataset, completeness and validity are selected.
3. Formulate the rules
Based on the selected evaluation indicators, formulate the data quality evaluation rules and determine their corresponding weights and expected values. For the customer dataset, set the following rules:
(1) ID is not empty (weight: 5, expected value: 90): completeness
(2) ID length is 18 characters (weight: 10, expected value: 90): accuracy
(3) The value of sex is f or m (weight: 10, expected value: 98): validity
4. Calculate the rule result scores
For each rule r in the rule set R, check the data instances in the dataset and calculate the percentage of data tuples that satisfy r; this percentage of the total number of tuples is the result s corresponding to r, i.e., the final score of the rule.
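Putting steps 3 and 4 together, the sketch below defines the three example rules and computes s for each of them on a toy customer dataset. The column names, the sample rows, and the weighted overall score at the end are assumptions for illustration; the weights and expected values follow the example above.

```python
# Minimal end-to-end sketch of steps 3 and 4 for the customer example.
customers = [
    {"id": "110101199003070011", "sex": "f"},
    {"id": "110101199003070029", "sex": "m"},
    {"id": "",                   "sex": "x"},   # violates all three rules
]

# Step 3: the rule set, each rule carrying its weight and expected value.
rules = [
    {"name": "ID is not empty", "weight": 5, "expectation": 90,
     "check": lambda row: row["id"] not in (None, "")},
    {"name": "ID length is 18 characters", "weight": 10, "expectation": 90,
     "check": lambda row: row["id"] is not None and len(row["id"]) == 18},
    {"name": "sex is 'f' or 'm'", "weight": 10, "expectation": 98,
     "check": lambda row: row["sex"] in ("f", "m")},
]

# Step 4: s for each rule is the percentage of tuples that satisfy it.
total_weight = sum(r["weight"] for r in rules)
overall = 0.0
for r in rules:
    s = 100.0 * sum(1 for row in customers if r["check"](row)) / len(customers)
    overall += s * r["weight"] / total_weight
    verdict = "meets" if s >= r["expectation"] else "below"
    print(f"{r['name']}: s={s:.1f} ({verdict} expectation {r['expectation']})")

# A weighted average of the rule scores gives one possible overall score.
print(f"weighted overall score: {overall:.1f}")
```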