0x00 Overview
With the arrival of the big data era, applications of data are flourishing. More and more applications and services are built on data, and its importance is self-evident. Moreover, data quality is the foundation of the validity and accuracy of any conclusion drawn from data analysis or data mining, and the premise of all data-driven decision-making. Ensuring data quality and data availability is a link that no data practitioner can afford to ignore.
Data quality is mainly evaluated from four aspects: integrity, accuracy, consistency and timeliness. This article analyzes and explains these four aspects in detail, in combination with the business process and the data processing pipeline.
Data ultimately exists to serve business value, so this article will not simply recite theory. Instead, taking the application of data quality monitoring as the starting point, it shares Jushi's thinking on data quality. From this article you will take away the following points:
The core points that data quality should pay attention to
Which data quality problems tend to appear at each stage of the data computation chain
How data quality monitoring helps, viewed from business logic
Points to watch when implementing a data quality monitoring system
Some difficulties in data quality monitoring and their solutions
0x01 Four major concerns
In this section, let's briefly go over the four aspects of data quality that need attention: integrity, accuracy, consistency and timeliness. These four concerns surface throughout the data processing pipeline.
1、 Integrity
Integrity refers to whether data records and their information are complete, that is, whether anything is missing. Missing data mainly takes two forms: missing records, and missing values of certain fields within records. Both distort statistical results, so integrity is the most basic guarantee of data quality.
In short, integrity monitoring needs to consider two things: first, whether the number of records has dropped; second, whether the values of certain fields are missing. Integrity monitoring mostly happens at the log level and is generally performed when data is first ingested.
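To make this concrete, here is a minimal sketch in Python. It assumes records arrive as a list of dicts; the field names and thresholds are illustrative, not recommendations.

```python
# Minimal integrity check: record count plus per-field null rate.
def check_integrity(records, expected_min_rows, fields, max_null_ratio=0.01):
    problems = []
    if len(records) < expected_min_rows:
        problems.append(f"row count {len(records)} below expected {expected_min_rows}")
    total = len(records) or 1
    for field in fields:
        nulls = sum(1 for r in records if r.get(field) in (None, ""))
        ratio = nulls / total
        if ratio > max_null_ratio:
            problems.append(f"field '{field}' null ratio {ratio:.1%} exceeds {max_null_ratio:.1%}")
    return problems

# Example: flag a batch where 'device_id' is largely missing.
batch = [{"uid": 1, "device_id": "a"}, {"uid": 2, "device_id": None}]
print(check_integrity(batch, expected_min_rows=2, fields=["uid", "device_id"]))
```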
2、 Accuracy
Accuracy refers to whether the information recorded in the data is correct, and whether it contains abnormal or erroneous values.
Intuitively, it asks whether the values are right. Accuracy monitoring generally focuses on business result data, such as whether daily active users, revenue and similar metrics look normal.
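A simple accuracy rule can combine an absolute range with a day-over-day comparison. The sketch below is one such rule; the thresholds are assumptions for illustration.

```python
# Accuracy rule: a business metric must sit in an absolute range and not
# deviate too far from the previous day's value.
def check_metric(value, low, high, previous=None, max_change=0.3):
    alerts = []
    if not low <= value <= high:
        alerts.append(f"value {value} outside expected range [{low}, {high}]")
    if previous:
        change = abs(value - previous) / previous
        if change > max_change:
            alerts.append(f"day-over-day change {change:.1%} exceeds {max_change:.0%}")
    return alerts

# Example: DAU fell 40% versus yesterday, tripping the day-over-day rule.
print(check_metric(120_000, low=50_000, high=500_000, previous=200_000))
```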
3、 Consistency
Consistency refers to whether the same indicator produces the same result in different places.
Data inconsistency tends to appear once a data system reaches a certain complexity, because the same indicator ends up being computed in multiple places. Differences in calculation logic (the metric's caliber) or between developers easily lead to different results for the same indicator.
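Checking consistency can be as simple as comparing the same indicator from two systems within a small tolerance, as in this sketch (the values and tolerance are hypothetical):

```python
# Consistency check: the same indicator from two systems should agree
# within a small relative tolerance.
def check_consistency(value_a, value_b, tolerance=0.001):
    baseline = max(abs(value_a), abs(value_b), 1e-9)  # avoid dividing by zero
    rel_diff = abs(value_a - value_b) / baseline
    return rel_diff <= tolerance, rel_diff

# Example: revenue from the warehouse vs. the reporting system.
ok, diff = check_consistency(1_000_000, 1_003_500)
print(f"consistent={ok}, relative diff={diff:.3%}")
```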
4、 Timeliness
Once integrity, accuracy and consistency are ensured, the next step is to make sure the data is produced in time, so that it can actually deliver its value.
Timeliness is easy to understand: it mainly asks whether data is computed fast enough. In data quality monitoring, this shows up as checking whether the monitored result data has been computed before a specified deadline.
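A timeliness check can reduce to comparing a landing time against an SLA deadline. A sketch, assuming the landing time would come from the scheduler or table metadata:

```python
from datetime import datetime, time

# Timeliness check: did the data land before its SLA deadline?
def check_timeliness(landed_at, deadline=time(7, 0)):
    sla = landed_at.replace(hour=deadline.hour, minute=deadline.minute,
                            second=0, microsecond=0)
    return landed_at <= sla

print(check_timeliness(datetime(2021, 5, 1, 6, 45)))  # True: landed before 07:00
print(check_timeliness(datetime(2021, 5, 1, 8, 10)))  # False: SLA missed
```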
0x02 Data quality at each stage of data processing
The reason data quality monitoring is hard to do well is that data quality problems can appear at every stage. This section therefore takes a typical data processing chain as an example and shares which data quality problems each stage is prone to.
I divide data processing into three stages: data ingestion, intermediate data cleaning, and result data computation.
Data ingestion
The ingestion stage is the most prone to data integrity problems. Here, pay special attention to sharp increases or drops in data volume.
A sharp increase may mean a large amount of data is being reported repeatedly, or that abnormal data has crept in; a sharp drop may mean data has been lost.
On the other hand, also check whether the values of individual fields are missing, for example whether the address or device fields contain a large number of nulls.
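A spike/drop check can compare today's volume against a trailing average. A sketch with placeholder bounds (the 2x and 0.5x limits are illustrative):

```python
# Spike/drop check at the ingestion layer: compare today's log volume
# against a trailing average.
def check_volume(today, history, high=2.0, low=0.5):
    if not history:
        return None  # not enough history to judge
    avg = sum(history) / len(history)
    if today > avg * high:
        return f"sharp increase: {today} vs avg {avg:.0f} (duplicate reporting or abnormal data?)"
    if today < avg * low:
        return f"sharp drop: {today} vs avg {avg:.0f} (possible data loss)"
    return None

print(check_volume(12_000_000, history=[4_800_000, 5_100_000, 5_000_000]))
```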
Data cleaning
Here I limit the scope of data cleaning to the cleaning of intermediate tables in the data warehouse, which is also the core of data warehouse construction. To a certain extent, building the intermediate data layer is essential.
At this stage, the most likely problems are data consistency and data accuracy. The intermediate layer guarantees that data is exported through one unified path, so downstream results are right or wrong together rather than diverging silently. Full accuracy is still hard to guarantee, so the cleaning stage should ensure it as far as possible.
Result data
Result data concerns the process of serving data externally: generally the presentable data computed from intermediate tables, or taken from them directly. This is what business stakeholders and bosses perceive most readily, so at this stage the main concerns are data accuracy and data timeliness.
Generally speaking, integrity, accuracy, consistency and timeliness all deserve attention at every stage of data processing, but the core problems can be tackled first.
0x03 Data quality at each stage of the business process
Having covered data processing, let's continue with business processes. The ultimate value of data is to serve the business, so data quality work is best approached from solving business problems. This section therefore explains how to do data quality starting from a typical business scenario.
First of all, Jushi believes that monitoring must consider its users, and a very important role of a data quality monitoring platform is to let those users, namely bosses, product managers and operations staff, feel assured about our data. So what do they care about? Jushi thinks it is business indicators!
These business indicators can then be examined from two perspectives:
Is the value of a single indicator abnormal? For example, has the data crossed some critical threshold, or spiked or dropped sharply?
Is the data abnormal anywhere along the whole business chain, for example in the conversion from exposure to registration?
An app's user behavior funnel analysis is in fact just such a chain, from acquiring users to converting them.
So for this chain, data quality monitoring should not only tell users that the value at some node has a problem, but also where along the whole chain the problem lies and where conversion is low.
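As a sketch of what monitoring the whole chain might look like, the following computes step-to-step conversion and flags every stage that falls below an assumed baseline (the stage names, counts and baselines are all hypothetical):

```python
# Funnel check: compute step-to-step conversion and flag every stage whose
# conversion falls below its baseline.
def check_funnel(stages, baselines):
    # stages: ordered (name, count) pairs; baselines: minimum acceptable
    # conversion rate into each stage after the first.
    alerts = []
    for (prev_name, prev), (name, count), floor in zip(stages, stages[1:], baselines):
        rate = count / prev if prev else 0.0
        if rate < floor:
            alerts.append(f"{prev_name} -> {name}: conversion {rate:.1%} below {floor:.0%}")
    return alerts

funnel = [("exposure", 100_000), ("click", 8_000), ("register", 400)]
print(check_funnel(funnel, baselines=[0.10, 0.08]))
```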
0x04 How to implement data quality monitoring
Earlier sections covered what data quality focuses on and how to approach it from technical and business perspectives. This section briefly shares how to implement data quality monitoring, from two angles: the macro design idea and the technical implementation.
1、 Design ideas
The design of data quality monitoring divides into four modules: data, rules, alarms and feedback.
Data: the data whose quality is to be monitored. It may live in different storage engines, such as Hive, PG or ES.
Rules: how to design rules for discovering anomalies. These mainly cover numeric anomaly checks and period-over-period fluctuation checks, plus some algorithmic methods for discovering abnormal data.
Alarms: the action of raising an alert. The alert content can be pushed via WeChat message, phone call, SMS, or a WeChat mini program.
Feedback: this deserves special attention. Feedback means responding to the alert content: once an alert is received, its owner should confirm whether it is a real anomaly, whether it should be ignored, and whether it has been handled. With a feedback mechanism, the whole data quality monitoring loop closes more easily and delivers more business value.
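One plausible way to wire the four modules together is a small data model where a rule drives a check, a violation produces an alert, and the alert carries a feedback status so the loop can close. The field names below are assumptions for illustration, not the article's design:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    metric: str   # what to monitor, e.g. "dau"
    source: str   # where it lives: hive / pg / es
    check: str    # rule type, e.g. "range" or "day_over_day"
    params: dict  # thresholds for the check

@dataclass
class Alert:
    rule: Rule
    message: str
    channel: str = "wechat"    # wechat / sms / phone / mini program
    feedback: str = "pending"  # pending / real_issue / ignored / resolved

alert = Alert(Rule("dau", "hive", "range", {"low": 50_000, "high": 500_000}),
              message="dau=12000 outside expected range")
alert.feedback = "real_issue"  # the owner confirms the anomaly, closing the loop
print(alert)
```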
2、 Technical proposal
As for the technical scheme, not much detail is given here, because different companies and teams weigh implementation choices differently. Done simply, it can be a handful of scheduled scripts; done thoroughly, it can be a distributed system. You can also refer to No.22 rambling about data quality monitoring, which Jushi wrote back in 2017.
This article simply lists several points to pay attention to in the technical implementation:
At the beginning, focus on the core content to be monitored, such as accuracy; monitor a few core indicators first, and do not start out by building a large system.
As far as possible, the monitoring platform should not carry overly complex rule logic and should monitor only result data. For example, to monitor whether log volume fluctuates too much, put the computation upstream by producing a result table first; the monitoring platform then only checks whether that result table is abnormal (see the first sketch after this list).
There are two ways to monitor multiple data sources: implement a piece of custom computation logic for each data source, or use extra tasks to write the results from the various sources into a single data source and monitor only that one, which reduces the monitoring platform's development logic (see the second sketch). Weigh the specific pros and cons yourself.
The main difference between real-time and offline data monitoring lies in the scan cycle, so the design can center on offline data while reserving as much room as possible for real-time monitoring.
At the design stage, also reserve room for algorithm-based monitoring; it is a big bonus. Its integration can follow the second suggestion above: for example, write the algorithm's anomaly output into a result table and configure simple alarm rules on top of it (see the third sketch after this list).
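First, a sketch of the precompute-upstream, monitor-the-result-table split from the second point. In practice the aggregation would be a scheduled Hive/Spark job; here it is a plain function, and the names are hypothetical:

```python
# Upstream aggregation task (in practice a scheduled Hive/Spark job)
# turns raw logs into one summary row per day.
def build_result_row(raw_logs, day):
    return {"day": day, "log_count": len(raw_logs)}

# The monitoring platform's rule stays trivially simple: one comparison.
def monitor_result_row(row, min_count):
    if row["log_count"] < min_count:
        return f"{row['day']}: log_count {row['log_count']} below {min_count}"
    return None

row = build_result_row(raw_logs=[{}] * 4_200, day="2021-05-01")
print(monitor_result_row(row, min_count=5_000))
```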
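Second, a sketch of unifying multiple data sources from the third point: per-source extract tasks (stubs standing in for real Hive/PG clients) all land their results in one table, so the platform needs only a single reader:

```python
# Per-source extract tasks, standing in for real Hive/PG clients.
def read_hive_metric():
    return {"source": "hive", "metric": "orders", "value": 10_250}

def read_pg_metric():
    return {"source": "pg", "metric": "orders", "value": 10_244}

# Extra tasks land everything in one unified table; the platform then
# monitors only this table, whatever the original source was.
unified_table = [task() for task in (read_hive_metric, read_pg_metric)]
for row in unified_table:
    print(row)
```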
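Third, a sketch of the algorithmic hook from the last point. A simple z-score detector stands in for a smarter algorithm; the threshold is illustrative:

```python
from statistics import mean, stdev

# A simple z-score detector stands in for a smarter algorithm; it writes
# its findings into a result table.
def zscore_anomalies(series, threshold=2.0):
    mu, sigma = mean(series), stdev(series)
    return [{"index": i, "value": v, "zscore": round((v - mu) / sigma, 2)}
            for i, v in enumerate(series)
            if sigma and abs(v - mu) / sigma > threshold]

anomaly_table = zscore_anomalies([100, 102, 98, 101, 99, 100, 180])
# The alarm rule on top stays trivial: alert whenever the table is non-empty.
if anomaly_table:
    print("anomaly rows:", anomaly_table)
```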