Data Warehouse: Data Quality Monitoring

Last Update: 2020-06-18 Source: Internet

Author: User

Keywords: data warehouse, data quality, data quality monitoring


How to monitor
Data monitoring is more of a back-office system than an external business system, so its importance may be questioned. It is still worth doing, although it may require a change in thinking. The question worth pondering is how to identify and implement the core functions quickly. A monitoring system has three main components:
Rule engine: defines the alarm rules, which may be SQL templates or specific algorithms.
Execution engine: executes the rules while accounting for the differences between data sources.
Metadata system: data quality monitoring is originally a part of the metadata system. It is discussed separately here, but in any case the alarm configuration of a table should be tied to the metadata system.
Let's analyze these components separately.
1、 Rule engine
Typical rules include late data arrival, year-on-year comparison, data trends, and some custom algorithms.
The design can be made very flexible, or a simple version can be built first as a stopgap. A few points follow.
1. SQL template
Data that is accessed through SQL (for example in Hive or MySQL) makes up an important share of what most storage engines hold. SQL templates are a good fit for this data.
The rules are defined in a table or in configuration files. For example, we can write a Presto SQL template that compares current data with historical data. The SQL is very simple, so you can write the template yourself.
This approach is the simplest and fastest, and I believe it can solve most problems.
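As a rough illustration of the SQL template idea, here is a minimal Python sketch. The template text, the `render_rule` and `exceeds_threshold` helpers, the table name `dw.orders`, the partition column `dt`, and the 30% threshold are all assumptions made for this example, not part of the original design. The template values are expected to come from trusted rule configuration, not from user input.

```python
from string import Template

# Hypothetical rule: compare the row count of the current partition with a
# historical baseline partition. All names here are illustrative assumptions.
ROW_COUNT_VS_HISTORY = Template("""
SELECT
  SUM(CASE WHEN $partition_col = '$current_day' THEN 1 ELSE 0 END) AS current_rows,
  SUM(CASE WHEN $partition_col = '$baseline_day' THEN 1 ELSE 0 END) AS baseline_rows
FROM $table
WHERE $partition_col IN ('$current_day', '$baseline_day')
""")


def render_rule(table, partition_col, current_day, baseline_day):
    """Fill in the template for one monitored table."""
    return ROW_COUNT_VS_HISTORY.substitute(
        table=table,
        partition_col=partition_col,
        current_day=current_day,
        baseline_day=baseline_day,
    ).strip()


def exceeds_threshold(current_rows, baseline_rows, max_change=0.3):
    """Return True when the relative change against the baseline is too large."""
    if baseline_rows == 0:
        # Any data at all is a change when the baseline was empty.
        return current_rows != 0
    return abs(current_rows - baseline_rows) / baseline_rows > max_change


sql = render_rule("dw.orders", "dt", "2020-06-18", "2020-06-11")
```

The rendered `sql` string would then be submitted to whichever engine runs the rule, and the two returned counts passed to `exceeds_threshold` to decide whether to raise an alarm.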
2. Metadata
Many systems maintain metadata about their tables. Hive, for example, stores table row counts in its metastore database, so the daily data volume of a table can be read directly from Hive's metadata.
Note: this is very important. It can save a large part of the work and it is stable, but it covers only a few kinds of checks, so it needs to be combined with the other approaches.
3. User-defined template
Many algorithms cannot be expressed in simple SQL, and not all storage systems support SQL; ES (Elasticsearch) is one example. Some custom algorithms are therefore needed.
Most of the work in this area falls on the execution engine, but the rule engine must be designed to accommodate it too.
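One way the rule engine could leave room for user-defined rules is a small registry of named checks. The sketch below is only an assumption about how that might look; the `Rule` fields, the `register_check` decorator, and the `max_null_ratio` check are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical registry: each custom check receives the rule parameters and
# the fetched values, and returns True when the data looks fine.
CUSTOM_CHECKS: Dict[str, Callable[[dict, list], bool]] = {}


def register_check(name):
    """Decorator that makes a function available to rules under `name`."""
    def decorator(func):
        CUSTOM_CHECKS[name] = func
        return func
    return decorator


@dataclass
class Rule:
    table: str
    kind: str                      # "sql_template", "metadata", or "custom"
    check: str = ""                # name of a registered custom check
    params: dict = field(default_factory=dict)


@register_check("max_null_ratio")
def max_null_ratio(params, values):
    """Pass when the share of missing (None) values stays within the limit."""
    if not values:
        return False
    nulls = sum(1 for v in values if v is None)
    return nulls / len(values) <= params.get("limit", 0.05)


def run_custom_rule(rule, values):
    """Look up the check named by the rule and apply it to the values."""
    if rule.kind != "custom":
        raise ValueError(f"not a custom rule: {rule.kind}")
    return CUSTOM_CHECKS[rule.check](rule.params, values)
```

With this shape, a rule row only stores a check name and its parameters, while the code that fetches values from a non-SQL store stays in the execution engine.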
2、 Execution engine
This part is arguably the more important one. The implementation can be simple or complex; the options are discussed below.
1. SQL execution
Many rules can be executed through SQL, as mentioned in the rule engine section.
I recommend starting this way for coarse-grained monitoring. Most of the SQL templates are configured in advance; to monitor a table, you then only need to add an entry for it to the rule configuration table.
As the execution engine, consider Presto or Spark SQL, and Hive for very large tasks.
Advantages:
Simple and easy to implement.
Meets most needs.
Disadvantages:
Not flexible enough: for example, ES does not support SQL very well.
Slow: many SQL statements run slowly, especially on the Hive engine.
Unstable: some checks are unreliable, such as duplicate data monitoring. On some large tables Presto struggles to produce a result and often fails, while switching to Hive makes the check very slow.
So how can this be solved?
I only have the following ideas:
Sensible task scheduling: a typical cluster can accommodate many tasks, so what matters more is scheduling the monitoring tasks appropriately.
Switching execution engines where appropriate: the next section provides one solution.
Sensible task dependencies: duplicate data monitoring, for example, necessarily depends on the data having arrived. If the data has not arrived, there is no need to run the duplicate check.
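The dependency idea in the last point can be sketched as follows. This is a minimal illustration, not a scheduler: the function name `run_with_dependencies`, the check names, and the three result strings are assumptions made for the example, and it assumes each prerequisite is listed before the checks that depend on it.

```python
def run_with_dependencies(checks, depends_on):
    """Run checks in order, skipping any whose prerequisite did not pass.

    checks: dict mapping check name -> zero-argument callable that returns
            True on success, listed with prerequisites first.
    depends_on: dict mapping check name -> name of its prerequisite check.
    Returns a dict mapping check name -> "passed", "failed", or "skipped".
    """
    results = {}
    for name, check in checks.items():
        prerequisite = depends_on.get(name)
        if prerequisite is not None and results.get(prerequisite) != "passed":
            # The expensive check is never started when its input is missing.
            results[name] = "skipped"
            continue
        results[name] = "passed" if check() else "failed"
    return results
```

For example, with an `arrival` check that fails and a `duplicates` check that depends on it, the duplicate scan is reported as `skipped` and its query is never sent to the cluster.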
2. Direct data acquisition
We mentioned the efficiency problem of SQL execution earlier; this section offers an optimization. Because Hive is a very important engine at present, we use Hive as the example.
Hive has metadata management. Its metastore records the row counts of all tables in Hive. These records can be used directly for volume-related monitoring, such as data volume dropping to zero, month-on-month change in data volume, and data volume trends.
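A minimal sketch of reading partition row counts from the metastore database might look like the following. It assumes the metastore is backed by a relational database with the commonly seen `DBS`, `TBLS`, `PARTITIONS`, and `PARTITION_PARAMS` tables and a `numRows` parameter; verify these names against your own Hive version, and note that `numRows` is only as accurate as the statistics Hive has gathered. The function names and the 50% drop threshold are assumptions for illustration.

```python
# Query against the metastore's backing database (not HiveQL).
# The "?" placeholders follow the qmark style used by sqlite3; MySQL drivers
# typically expect "%s" instead.
ROW_COUNT_QUERY = """
SELECT p.PART_NAME, pp.PARAM_VALUE
FROM DBS d
JOIN TBLS t ON t.DB_ID = d.DB_ID
JOIN PARTITIONS p ON p.TBL_ID = t.TBL_ID
JOIN PARTITION_PARAMS pp ON pp.PART_ID = p.PART_ID
WHERE d.NAME = ? AND t.TBL_NAME = ? AND pp.PARAM_KEY = 'numRows'
ORDER BY p.PART_NAME
"""


def partition_row_counts(conn, db_name, table_name):
    """Return [(partition_name, row_count), ...] using a DB-API connection."""
    cursor = conn.cursor()
    cursor.execute(ROW_COUNT_QUERY, (db_name, table_name))
    # Parameter values are stored as strings, so convert them to integers.
    return [(name, int(value)) for name, value in cursor.fetchall()]


def volume_alerts(counts, max_drop=0.5):
    """Flag empty partitions and sharp drops against the previous partition."""
    alerts = []
    previous = None
    for name, rows in counts:
        if rows == 0:
            alerts.append((name, "zero rows"))
        elif previous and rows < previous * (1 - max_drop):
            alerts.append((name, "sharp drop"))
        previous = rows
    return alerts
```

Because this only reads a few small metadata rows, it avoids scanning the monitored table at all, which is the efficiency gain this section describes.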
3. Algorithm execution engine
Many algorithms can be implemented as custom local code, which is more complex.
Because this part is highly customized, it needs a more flexible architecture. We will not go into it here, because in common data scenarios the first two approaches already meet many needs.
4. Multiple data sources
With multiple data sources, the rule engine needs to distinguish between them. Since the rules are tied to the metadata system anyway, making that distinction is relatively simple.
At execution time, each source may need its own implementation, but this is not very complicated either.
Data verification
I had not paid much attention to data verification before, so I am filling in this section now. It reflects my personal understanding and is not yet a complete body of knowledge. Data verification mainly means how to judge that your data is normal and trustworthy, which is very important for data quality.
Methods include cross-validation, abnormal fluctuation monitoring, and so on.
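The two methods named above could be sketched roughly as follows. The article does not specify any algorithm, so the z-score test for fluctuation and the relative-difference test for cross-validation are choices made for this example, as are the function names and the default thresholds.

```python
from statistics import mean, pstdev


def is_abnormal(history, today, z_limit=3.0):
    """Flag today's value when it lies more than z_limit standard deviations
    away from the mean of the recent history (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any different value is a fluctuation.
        return today != mu
    return abs(today - mu) / sigma > z_limit


def cross_validate(value_a, value_b, tolerance=0.01):
    """Check that the same metric computed from two independent sources
    agrees within a relative tolerance."""
    largest = max(abs(value_a), abs(value_b))
    if largest == 0:
        return True
    return abs(value_a - value_b) / largest <= tolerance
```

For instance, `cross_validate` could compare an order count taken from the warehouse with the count reported by the source system, while `is_abnormal` could be fed the last few days of a metric.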

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
