0x00 foreword
Often it is the seemingly unimportant functions that end up doing the most damage to your work.
This article shares some content related to data quality monitoring.
Data quality monitoring is one of the functions most easily sacrificed and ignored during rapid business development, yet it is truly crucial.
Article structure
The significance and value of data quality monitoring will not be discussed further. This article mainly covers the following three topics:
What is worth monitoring
How to do data quality monitoring
Data verification
This article also touches on some other data warehouse topics; please refer to the earlier articles for those.
0x01 what is worth monitoring
I break data quality monitoring down into three parts:
Monitoring
Alerting
Multiple data sources
The focus is on monitoring. Multiple data sources are called out separately because in a big data scenario there are many open-source components to choose from, each storing data differently, and the data in many of them needs to be monitored. Monitoring them in a unified way is an important topic in its own right.
1. Monitoring
Monitoring is a large topic. Generally speaking, I divide it into three parts: daily monitoring, data reconciliation, and performance monitoring. Let's discuss them separately.
Daily monitoring
The most important part of daily monitoring is checking that data has landed. This should be the basis of all other monitoring; without data, there is nothing else to monitor.
Here are some common checks I consider worthwhile:
Data landing monitoring
Zero-data monitoring: this naturally extends to data threshold monitoring, where an alert fires once the data volume falls below a certain amount
Duplicate data monitoring: many tables must be checked for duplicate rows; this is very important
Key metric monitoring
Year-over-year data volume monitoring
These are some commonly used checks. We can build them into a rule engine, as discussed later: all of the above become rules that can be attached to any table that needs them.
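To make this concrete, below is a minimal sketch of how such rules might be declared and attached to tables; the table names, rule types, and thresholds are illustrative assumptions rather than a fixed schema.

```python
# A minimal sketch of table-level monitoring rules. Names and thresholds are
# placeholders; each entry later becomes one task for the execution engine.
MONITORING_RULES = [
    {"table": "dw.orders_daily",        # hypothetical table name
     "rule": "data_landing",            # has today's partition arrived?
     "deadline": "08:00"},
    {"table": "dw.orders_daily",
     "rule": "row_count_threshold",     # zero-data / threshold monitoring
     "min_rows": 10000},
    {"table": "dw.user_profile",
     "rule": "duplicate_check",         # duplicate data monitoring
     "unique_keys": ["user_id"]},
    {"table": "dw.orders_daily",
     "rule": "yoy_volume",              # year-over-year volume comparison
     "max_change_ratio": 0.3},
]
```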
Data reconciliation
This mainly applies to real-time data, especially Kafka data landing into storage. There must be a reconciliation mechanism so we know whether the data has actually landed in full.
Of course, offline data also needs reconciliation. There are many methods, such as comparing row counts against the business database.
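As an illustration, here is a rough reconciliation sketch that compares a day's row count in the business MySQL table with the landed Hive table queried through Presto; the connection details, table names, and the 0.1% tolerance are all assumptions for the example.

```python
import pymysql                 # business database
from pyhive import presto      # query the warehouse table landed in Hive

def mysql_count(day):
    # Row count on the business side for the given day (placeholder schema).
    conn = pymysql.connect(host="mysql-host", user="reader",
                           password="***", database="business")
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM orders WHERE DATE(create_time) = %s", (day,))
        return cur.fetchone()[0]

def hive_count(day):
    # Row count of the landed partition, queried via Presto (placeholder table).
    cur = presto.connect(host="presto-host", port=8080).cursor()
    cur.execute(f"SELECT COUNT(*) FROM dw.orders_daily WHERE dt = '{day}'")
    return cur.fetchone()[0]

def reconcile(day, tolerance=0.001):
    src, dst = mysql_count(day), hive_count(day)
    ok = abs(src - dst) <= src * tolerance
    print(f"{day}: source={src}, landed={dst}, ok={ok}")
    return ok
```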
Performance monitoring
I understand this as monitoring data availability, and I think it is a very important point. If the data you produce is painful for others to use, or too slow to query at all, what is the difference between having built it and not having built it?
There are a few points worth paying attention to in performance monitoring:
Query performance, for example how fast a given ES index responds at different times of day; likewise, Presto, Hive, and Kylin queries need attention. This can be observed through task monitoring (a small probing sketch follows this list).
The impact of reads and writes on each other, and of machine failures. We usually pay little attention to this, but in ES, for example, heavy writes will actually degrade read performance, so we need to monitor and adjust accordingly.
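Here is the probing sketch referred to above: it times a cheap representative query against an ES index and flags slow responses. The endpoint, index name, and 2-second threshold are assumptions.

```python
import time
import requests

def probe_es_latency(index="app_logs", endpoint="http://es-host:9200",
                     slow_threshold_s=2.0):
    # A cheap representative query: match_all with size 0 only returns counts.
    body = {"query": {"match_all": {}}, "size": 0}
    start = time.monotonic()
    resp = requests.post(f"{endpoint}/{index}/_search", json=body, timeout=30)
    elapsed = time.monotonic() - start
    resp.raise_for_status()
    if elapsed > slow_threshold_s:
        print(f"SLOW QUERY: {index} took {elapsed:.2f}s")  # hand off to alerting
    return elapsed
```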
2. Alerting
Needless to say, WeChat, SMS, and phone-call alerts are all necessary.
Regular email summary alerts are also necessary.
Many alerts are also better presented through an alert/report system, especially content such as data volume trends, where a visual comparison is far more effective.
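A bare-bones sketch of the email summary piece, assuming an internal SMTP relay; the host, sender, and recipients are placeholders, and WeChat/SMS/phone channels would plug in behind a similar function.

```python
import smtplib
from email.mime.text import MIMEText

def send_daily_summary(alerts, smtp_host="smtp.example.com",
                       sender="dq-monitor@example.com",
                       recipients=("data-team@example.com",)):
    # One summary mail per day listing all fired data-quality alerts.
    body = "\n".join(alerts) if alerts else "No data-quality alerts today."
    msg = MIMEText(body)
    msg["Subject"] = f"Data quality daily summary ({len(alerts)} alerts)"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    with smtplib.SMTP(smtp_host) as server:
        server.sendmail(sender, list(recipients), msg.as_string())
```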
3. Multiple data sources
In today's big data scenarios, many open-source components are already in use and new ones keep being introduced, so we need to consider how to monitor data across different components.
At present, the components I deal with most are Hive (queried via Presto and Spark SQL), MySQL, ES, Redis, and Kylin (mainly the built cubes), but the introduction of graph databases (Neo4j, OrientDB) or Druid cannot be ruled out.
0x02 how to monitor
Data monitoring is a back-office system rather than an externally facing business system, so its importance is often challenged. It is still worth doing, but it may require a shift in thinking: how to implement it quickly while still capturing the core functionality is worth pondering.
There is no full implementation here, only some design ideas; discussion is welcome.
The figure shows the overall idea. I will first call out a few key components as I understand them, then analyze them in detail.
Rule engine: defines the various alerting rules; a rule may be an SQL template or a specific algorithm.
Execution engine: executes the various rules, while handling the differences between data sources.
Metadata system: data quality monitoring is really a part of the metadata system. It is discussed separately here, but in any case, configuring a table's alerting rules should be tied into the metadata system.
Let's analyze these components separately.
1. Rule engine
A few typical examples of rules: late-arriving data, year-over-year comparisons, data trends, and some custom algorithms.
The design can be made very flexible, or a simple version can be built first. Here are a few points.
SQL template
In most cases, the data accessed through SQL (for example, in Hive or MySQL) makes up the bulk of what we care about, and for this data we can consider using SQL templates.
We keep a table or some configuration files that define our rules. For example, we can write a Presto SQL template that compares today's data volume with historical data; the SQL itself is very simple, so just write the template yourself.
This kind of template is the simplest and fastest approach, and I believe it can solve most problems.
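For instance, here is a sketch of one such template, which compares today's row count with the average of the previous seven days; the table name, partition column, and dates are placeholders filled in from the rule configuration.

```python
from string import Template

# Compare today's partition volume with the average of the previous 7 days.
VOLUME_VS_HISTORY_SQL = Template("""
SELECT
  SUM(CASE WHEN dt = '$today' THEN 1 ELSE 0 END)        AS today_rows,
  SUM(CASE WHEN dt <  '$today' THEN 1 ELSE 0 END) / 7.0 AS avg_last_7d_rows
FROM $table
WHERE dt BETWEEN '$week_ago' AND '$today'
""")

sql = VOLUME_VS_HISTORY_SQL.substitute(
    table="dw.orders_daily", today="2018-06-10", week_ago="2018-06-03")
```

The execution engine then runs the rendered SQL and compares today_rows against avg_last_7d_rows using the threshold configured for the rule.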
Metadata
Many databases have metadata management; Hive, for example, records table and partition information in its metastore. We can retrieve a table's daily data volume directly from Hive's metadata.
Note: this is very important. It can save a great deal of work and is stable, but it covers fewer checks, so it needs to be combined with the other approaches.
Custom template
Many checks cannot be expressed as simple SQL, and not all storage systems support SQL well (ES, for example), so we need some custom algorithms.
Most of the workload here falls on the execution engine, but the rule engine must be designed to accommodate such rules.
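As one example of such a rule, here is a sketch of a non-SQL check that verifies the document count of an ES index for a given day through the _count API; the endpoint, index, timestamp field, and threshold are assumptions.

```python
import requests

def es_daily_count_ok(index, day, min_docs, endpoint="http://es-host:9200"):
    # ES rounds bare dates in range queries, so gte/lte on the same day
    # covers the whole day for the (assumed) @timestamp field.
    body = {"query": {"range": {"@timestamp": {"gte": day, "lte": day}}}}
    resp = requests.post(f"{endpoint}/{index}/_count", json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["count"] >= min_docs
```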
2. Execution engine
This is the more important part. The implementation can be simple or complex; let's go through it.
SQL execution
Many rules can be executed through SQL, as mentioned in the rule engine section.
In fact, I recommend starting with this for rough monitoring: configure most of the SQL templates in advance, and then simply add a configuration entry for each table that needs to be monitored.
As for the actual execution engine, Presto or Spark SQL can be considered, and Hive for very large tasks.
Advantages:
Simple and easy to implement
Covers most needs
Disadvantages:
Not flexible enough; ES, for example, does not support SQL well
Slow: many of these SQL queries run slowly, especially on the Hive engine
Unstable: some checks are unstable, such as duplicate-data monitoring; on some large tables Presto struggles to produce a result and often fails, while switching to Hive is very slow
So how do we address this?
I only have a few ideas for mitigating these problems:
Schedule tasks sensibly: a typical cluster can absorb quite a lot of tasks, so scheduling the monitoring tasks appropriately matters.
Swap the execution engine where it makes sense; the next subsection offers one option.
Set up sensible task dependencies: duplicate-data monitoring, for example, inevitably depends on whether the data has arrived; if it has not, there is no point running the duplicate check (see the sketch below).
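A minimal sketch of that dependency idea, with stub check functions standing in for the real rule implementations:

```python
def check_landing(table, day):
    return True  # stub: e.g. verify that partition `day` exists for `table`

def check_duplicates(table, day):
    return True  # stub: e.g. run the (expensive) duplicate-detection SQL

def run_table_checks(table, day):
    # Heavier checks only run after the landing check for the same table passes.
    if not check_landing(table, day):
        print(f"ALERT {table} {day}: data not landed, skipping downstream checks")
        return
    if not check_duplicates(table, day):
        print(f"ALERT {table} {day}: duplicate rows detected")
```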
Direct data acquisition
We mentioned the efficiency problem of SQL execution earlier; this subsection offers an optimization. Since Hive is the most important engine at the moment, let's take Hive as the example.
Hive has metadata management: its metastore records information about every Hive table, and those records can be used directly for volume-related monitoring, such as zero-data checks, month-over-month volume comparisons, and volume trends.
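A sketch of reading daily partition row counts straight from the metastore, assuming a MySQL-backed metastore with statistics gathering enabled so that 'numRows' is populated; connection details are placeholders.

```python
import pymysql

# Per-partition row counts from the Hive metastore tables (MySQL backend assumed).
METASTORE_SQL = """
SELECT p.PART_NAME, pp.PARAM_VALUE AS num_rows
FROM DBS d
JOIN TBLS t               ON t.DB_ID  = d.DB_ID
JOIN PARTITIONS p         ON p.TBL_ID = t.TBL_ID
JOIN PARTITION_PARAMS pp  ON pp.PART_ID = p.PART_ID
WHERE d.NAME = %s AND t.TBL_NAME = %s AND pp.PARAM_KEY = 'numRows'
ORDER BY p.PART_NAME
"""

def partition_row_counts(db, table):
    conn = pymysql.connect(host="metastore-host", user="reader",
                           password="***", database="hive_metastore")
    with conn.cursor() as cur:
        cur.execute(METASTORE_SQL, (db, table))
        return {part: int(rows) for part, rows in cur.fetchall()}
```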
Algorithm execution engine
Many algorithms have to be implemented as custom, locally executed code, which is more complex.
Because of the heavy customization, this part needs a more flexible architecture. I will not go into it here, since for common data scenarios the first two approaches already cover most needs.
Multiple data sources
For multiple data sources, we need to add some distinction between sources in the rule engine; since it is tied to the metadata system anyway, making that distinction is relatively simple.
At execution time, each source may need its own implementation, but relatively speaking this is not very complicated.
0x03 data verification
I did not pay much attention to data verification before, so I am filling in this block now. This is mostly personal understanding and has not yet formed a complete body of knowledge. It is mainly about how to judge whether your data is normal and trustworthy, which should be a very important part of data quality.
Methods include cross validation, monitoring for abnormal fluctuations, and so on. I will not go into them for now and will write them up properly later; this is just a reminder to myself.
0xff summary
This article mainly shares some thoughts on data quality monitoring, and much of it is fairly general. However, once the ideas are clear, many of them are easy to implement; if you just want something simple and usable, you can put it together in half a day with Python.
Only the main ideas are given here; the concrete implementation is not spelled out, since how far you take it will differ depending on business requirements.