The previous article introduced some basic concepts of data quality. Data quality control is a fundamental part of a data warehouse and the foundation that guarantees the applications built on top of the data. Data quality assurance consists of three main parts: data profiling, data auditing and data correcting. The previous article covered data profiling, the process of obtaining summary statistics about the data; this article uses those statistics to audit the quality of the data and check it for dirty records, so its focus is data auditing.
Basic elements of data quality
First of all, how do we evaluate the quality of data, or in other words, what kind of data meets the requirements? This can be considered from four aspects, which together constitute the four basic elements of data quality.
Completeness
Whether data records and the information within them are complete, and whether anything is missing.
Missing data mainly takes two forms: missing records and missing values for a field within a record. Both lead to inaccurate statistics, so completeness is the basic guarantee of data quality, and it is relatively easy to evaluate.
Consistency
Whether data records conform to the agreed specification, and whether they are consistent over time and with other data sets.
The consistency of data covers both the conformance of data records to the specification and the consistency of data logic. Record specification issues are mainly about encoding and format: for example, the site's user ID is 15 digits, the product ID is 10 digits, products fall into 20 categories, an IP address must be four numbers between 0 and 255 separated by ".", plus constraints defined on the data such as non-null constraints and unique-value constraints. Data logic is mainly about the consistency of metric calculations, such as PV >= UV, or the share of new users lying between 0 and 1. Consistency auditing is a relatively important and complex part of data quality auditing.
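As a rough illustration, the record-format rules above can be expressed directly as patterns. The sketch below is in Python; the field names, lengths and the IP pattern are illustrative examples taken from the description, not an actual schema.

```python
import re

# Hypothetical format rules for the fields mentioned above; the names,
# lengths and patterns are illustrative, not a real schema.
FORMAT_RULES = {
    "user_id": re.compile(r"^\d{15}$"),      # user ID: exactly 15 digits
    "product_id": re.compile(r"^\d{10}$"),   # product ID: exactly 10 digits
    "ip": re.compile(                        # four 0-255 numbers joined by "."
        r"^((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
        r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$"
    ),
}

def check_record_format(record):
    """Return the fields of a record that violate their format rule."""
    return [field for field, rule in FORMAT_RULES.items()
            if field in record and not rule.match(str(record[field]))]

print(check_record_format(
    {"user_id": "123456789012345", "product_id": "98765", "ip": "10.0.0.300"}
))  # ['product_id', 'ip']
```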
Accuracy
Whether the information recorded in the data is accurate, and whether it contains anomalies or errors.
A consistency problem may simply mean that the rules for recording data differ, which is not necessarily an error; accuracy is concerned with actual errors in the data records. Garbled characters in string data fall under accuracy, as do abnormal values: numbers that are unusually large or small, or values that fail validity requirements, for example the number of visits must be an integer, age generally lies between 1 and 100, and a conversion rate must be a value between 0 and 1. Auditing accuracy can be difficult, because error values that are not obviously abnormal are hard to find.
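As a small sketch of such validity checks (the field names and bounds here are only the illustrative ones mentioned above, not fixed rules):

```python
def audit_validity(row):
    """Flag values that break the validity rules described above.
    The field names and bounds are illustrative assumptions."""
    problems = []
    visits = row.get("visits")
    if visits is not None and (not isinstance(visits, int) or visits < 0):
        problems.append("visits should be a non-negative integer")
    age = row.get("age")
    if age is not None and not (1 <= age <= 100):
        problems.append("age outside the usual 1-100 range")
    rate = row.get("conversion_rate")
    if rate is not None and not (0.0 <= rate <= 1.0):
        problems.append("conversion rate outside [0, 1]")
    return problems

print(audit_validity({"visits": 12.5, "age": 210, "conversion_rate": 1.2}))
# flags all three fields
```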
Timeliness
The interval between when data is generated and when it can be viewed, also known as the data's delay.
Although the real-time requirements on analytical data are not very high, that does not mean there are no requirements at all. Analysts can accept seeing a day's data on the next day, but if the data takes two or three days to arrive, or a weekly analysis report can only be produced two weeks later, the conclusions may already have lost their relevance and the analysts' work is wasted. At the same time, some real-time analysis and decision-making needs hour-level or minute-level data, which places very high demands on timeliness. So timeliness is also one of the components of data quality.
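A minimal sketch of how timeliness might be monitored, assuming we know when the newest data was loaded; the one-day and one-hour thresholds are only examples:

```python
from datetime import datetime, timedelta

def is_on_time(latest_loaded, max_delay=timedelta(days=1)):
    """Return True if the newest loaded data falls within the accepted delay.
    The one-day default mirrors the 'yesterday's data visible today' expectation."""
    return datetime.now() - latest_loaded <= max_delay

latest = datetime.now() - timedelta(hours=6)       # when the last batch landed (hypothetical)
print(is_on_time(latest))                          # True: fine for day-level reporting
print(is_on_time(latest, timedelta(hours=1)))      # False: too late for hour-level analysis
```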
Data Auditing
Based on the four elements of data quality, data can be audited to assess whether it meets the requirements of completeness, consistency, accuracy and timeliness. Timeliness mainly depends on the efficiency of data synchronization and processing and is usually ensured by monitoring ETL tasks, so the data auditing discussed here covers completeness, consistency and accuracy.
Completeness
Let's look at which of the statistics obtained from data profiling can be used to audit the completeness of the data. The first aspect is the completeness of the records, which typically relies on record counts and unique-value counts. For example, if the number of log records per day is relatively constant, fluctuating around 10 million, and the count for one day drops to only 1 million, records are very likely missing. If the site's access records should be spread across all 24 hours of the day, an hour with no access records at all may mean the site had a problem at that time, or that log transmission failed at that time. And if the geographical distribution of visitors should cover the country's 32 provinces, a unique-province count below 32 probably indicates missing data.
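A minimal sketch of these record-level completeness checks, with the 10-million baseline, the 24-hour coverage and the 32-province count used purely as illustrative parameters:

```python
def audit_record_completeness(hourly_counts, provinces,
                              expected_daily=10_000_000, tolerance=0.3,
                              expected_provinces=32):
    """Record-level completeness checks; all baselines are illustrative."""
    issues = []
    total = sum(hourly_counts.values())
    if abs(total - expected_daily) > expected_daily * tolerance:
        issues.append(f"daily record count {total} is far from the usual {expected_daily}")
    missing_hours = sorted(set(range(24)) - set(hourly_counts))
    if missing_hours:
        issues.append(f"no records at all for hours {missing_hours}")
    if len(provinces) < expected_provinces:
        issues.append(f"only {len(provinces)} provinces seen, expected {expected_provinces}")
    return issues

counts = {h: 400_000 for h in range(24) if h != 3}         # hour 3 has no records
print(audit_record_completeness(counts, {"Shanghai", "Zhejiang"}))
```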
The other aspect is missing data for a field within a record, which can be audited with the null (NULL) counts from the statistics. If a field's information is theoretically required, such as the address of the visited page or the ID of the purchased product, its null count should be 0, and we can enforce this with NOT NULL constraints to guarantee completeness. For fields that are allowed to be null, for example the user's cookie, which does not necessarily exist (the user may have disabled cookies), the share of nulls is usually fairly stable, say 2%-3% of users with an empty cookie. We can compute the null ratio from the null counts, and if the ratio rises noticeably, it is quite likely that the recording of this field has a problem and information is missing.
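A small sketch of the null-ratio check; the 2.5% baseline and the tolerance are illustrative stand-ins for whatever the historical ratio of the field actually is:

```python
def null_ratio_ok(null_count, total_count, baseline=0.025, tolerance=0.01):
    """Return True if the share of NULLs stays close to its historical baseline.
    The 2.5% baseline echoes the 'cookie disabled' example; both numbers are assumptions."""
    ratio = null_count / total_count
    return abs(ratio - baseline) <= tolerance

print(null_ratio_ok(30_000, 1_000_000))   # True: 3% is within the expected band
print(null_ratio_ok(80_000, 1_000_000))   # False: 8% suggests the field is being lost
```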
Consistency
If data records follow standard encoding rules, checking the consistency of the records is relatively simple: verify that every record satisfies the encoding rules, most simply by using statistics such as field length and the number of unique values. For example, if the user ID is encoded as 15 digits, the longest and shortest values of that field should both be 15 characters; a 10-digit product ID can be checked in the same way. If a field must be unique, such as a user's registered email address, the number of unique values should equal the number of records. If provinces and municipalities must be coded uniformly, records must say "Shanghai" rather than "Shanghai Municipality" and "Zhejiang" rather than "Zhejiang Province"; we can map the field's unique values to the valid list of 32 provinces and municipalities, and any value that cannot be mapped fails the consistency check.
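These field-level checks can be sketched roughly as follows; the field values and the (truncated) valid province list are hypothetical:

```python
# A truncated, purely illustrative list standing in for the 32 valid provinces.
VALID_PROVINCES = {"Shanghai", "Zhejiang", "Beijing"}

def audit_field_consistency(values, fixed_length=None, must_be_unique=False,
                            valid_values=None):
    """Field-level consistency checks: fixed length, uniqueness, valid-value mapping."""
    issues = []
    if fixed_length is not None:
        bad = [v for v in values if len(v) != fixed_length]
        if bad:
            issues.append(f"{len(bad)} value(s) do not have length {fixed_length}")
    if must_be_unique and len(set(values)) != len(values):
        issues.append("unique value count differs from record count")
    if valid_values is not None:
        unmapped = set(values) - valid_values
        if unmapped:
            issues.append(f"values that cannot be mapped: {sorted(unmapped)}")
    return issues

print(audit_field_consistency(["123456789012345", "12345"], fixed_length=15))
print(audit_field_consistency(["Shanghai", "Zhejiang Province"],
                              valid_values=VALID_PROVINCES))
```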
Validating the logical rules of consistency is more complex. In many cases the consistency of metric calculation logic depends both on the quality of the underlying data and on statistical logic that is defined in a standardized way and applied consistently. A mistake we often make is that aggregated data and its breakdowns do not add up, because records that cannot be clearly assigned to any segment are excluded during the breakdown. For example, when segmenting traffic sources, if some sources cannot be clearly classified into the established categories of direct access, external links, search engines, advertising and so on, they should not simply be filtered out; instead they should be assigned to an "unknown source" category, so that the segmented source data, when recombined, still matches the overall data. To audit this kind of data logic, we can define "validation rules", such as a >= b, or "if c = b / a then c must lie in [0, 1]"; data that fails these rules does not pass the consistency check.
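A minimal way to express such validation rules is as a list of named predicates over the daily metrics; the metric names and the rules below are only examples:

```python
# Each validation rule is a (description, predicate) pair over a dict of daily
# metrics. The metric names and the rules themselves are only examples.
VALIDATION_RULES = [
    ("PV must be >= UV", lambda m: m["pv"] >= m["uv"]),
    ("new-visitor share must lie in [0, 1]",
     lambda m: 0.0 <= m["new_visitors"] / m["uv"] <= 1.0),
    ("bounce rate must lie in [0, 1]", lambda m: 0.0 <= m["bounce_rate"] <= 1.0),
]

def audit_logic(metrics):
    """Return the description of every validation rule the metrics break."""
    return [desc for desc, check in VALIDATION_RULES if not check(metrics)]

daily = {"pv": 120_000, "uv": 150_000, "new_visitors": 40_000, "bounce_rate": 0.45}
print(audit_logic(daily))   # ['PV must be >= UV']
```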
Accuracy
Accuracy problems may exist in individual records or across the entire dataset. If a whole field of the dataset contains an error, such as a common order-of-magnitude mistake, the problem is easy to spot using the mean and median from data profiling. When a dataset contains individual abnormal values, the maximum and minimum statistics can be used to audit them, or a box plot can make the abnormal records obvious at a glance.
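A rough sketch of both ideas: comparing the mean against the median and applying the 1.5 * IQR box-plot rule to flag individual outliers. The 3x mean-to-median threshold is an assumption, not a standard:

```python
import statistics

def find_outliers(values):
    """Flag individual outliers with the 1.5 * IQR rule that box plots use."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def mean_median_gap_suspicious(values, ratio=3.0):
    """A mean far above the median often points to order-of-magnitude errors."""
    return statistics.mean(values) > ratio * statistics.median(values)

visits = [120, 95, 130, 110, 105, 98, 115_000]   # one record is about 1000x too large
print(find_outliers(visits))                      # [115000]
print(mean_median_gap_suspicious(visits))         # True
```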
Several other accuracy issues can be audited. Garbled or truncated characters can be found from the value distribution: data records usually follow an approximately normal distribution or something close to it, so values that account for an abnormally small share are likely to be problematic. For example, if one character value accounts for only 0.1% of records while every other value accounts for more than 3%, that value is probably abnormal; the data quality audits of some ETL tools identify such low-proportion anomalous values automatically. For numeric fields with a bounded range, validity limits can also be applied, and records falling outside the valid range are treated as errors.
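A small sketch of the low-share check; the 0.5% cutoff and the garbled value are illustrative:

```python
from collections import Counter

def rare_values(values, min_share=0.005):
    """Return values whose share of all records is suspiciously small.
    The 0.5% cutoff is an illustrative threshold, not a fixed rule."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items() if c / total < min_share}

# "chann?l" stands in for a garbled or truncated channel value.
channels = ["search"] * 500 + ["direct"] * 300 + ["ads"] * 198 + ["chann?l"] * 2
print(rare_values(channels))   # {'chann?l': 0.002}
```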
If the data show no obvious anomaly but some records still hold wrong values close to the normal range, this is the hardest kind of accuracy check. Generally the only way to find such problems is to compare against other sources or other statistical results. If more than one data collection system or web analytics tool is in use, comparing the same records or metrics across the different data sources can reveal accuracy problems in some of the data.
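A minimal sketch of such a cross-source comparison; the 5% tolerance and the example numbers are assumptions:

```python
def sources_agree(value_a, value_b, max_relative_gap=0.05):
    """Compare the same metric from two collection systems; a gap above roughly 5%
    (an illustrative threshold) suggests one of the records is wrong."""
    if max(value_a, value_b) == 0:
        return True
    return abs(value_a - value_b) / max(value_a, value_b) <= max_relative_gap

# Daily visits from the logging system vs. a web analytics tool (hypothetical numbers).
print(sources_agree(98_200, 101_500))   # True: about a 3% gap
print(sources_agree(98_200, 152_000))   # False: the two systems clearly disagree
```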
Starting from the statistics produced by data profiling, we have now used data auditing to find a number of data quality problems. The next step is to clean and correct the data to address these problems, which is the topic of the next article: data correcting.