Data Quality Monitoring: Design of the Monitoring System

(1) Source system
The source system mainly manages the various rules, receives exception information, and analyzes the exceptions.
Based on the analysis results, the relevant findings are pushed to information-source managers, collection staff, and other stakeholders so that the collection strategy and the collectors can be optimized, closing the collection loop (collect → feed back → optimize → collect again).
1) Related rules:
Purpose of designing data verification rules:
① To ensure the quality of the data entering the product.
② To expose weaknesses in collection, so that the collection strategy can be optimized and the collectors improved.
The ultimate goal is to improve the user experience of the data products and increase user stickiness.
Verification rule description
Only records with a non-empty title proceed to data quality verification, data correction, and other downstream operations.
Rule classification description
① Non-empty verification rules: type = 1
② Data quality verification rules: type = 2
③ Data cleaning rules: type = 3
④ Secondary deduplication rules: type = 4
Non-empty rules
The specific checks are governed by the configuration in index field management. They include, but are not limited to, the following (a sketch follows the list):
① Whether the title/comment is non-empty;
② Whether the release time is non-empty;
③ Whether the content is non-empty;
④ Whether the collection time is non-empty;
⑤ Whether the insertion time is non-empty;
⑥ Whether the data type is non-empty;
⑦ If the data type is blank, it is impossible to determine which verification rules apply to the record;
⑧ Kafka uses the value of this field to decide which ES index the record is stored in;
⑨ Whether the collector ID is non-empty;
⑩ This field records the data source so that the responsible person can be located quickly;
⑪ For news or website-type data, site_id and site_name must not be empty.
Note:
Only records whose title/comment content is non-empty flow into the subsequent stages.
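
A minimal sketch of how these non-empty (type = 1) checks might be applied. The field names and record structure are illustrative assumptions, not the system's actual schema:

```python
# Minimal sketch of non-empty (type = 1) validation.
# Field names below are illustrative assumptions, not the actual schema.

REQUIRED_FIELDS = [
    "title",        # title/comment
    "ptime",        # release time
    "content",      # body
    "gather_time",  # collection time
    "insert_time",  # insertion time
    "group_id",     # data type; also selects the ES index downstream
    "crawler_id",   # collector ID, used to locate the responsible person
]

def check_non_empty(record: dict) -> list[str]:
    """Return the list of required fields that are missing or blank."""
    failures = [f for f in REQUIRED_FIELDS
                if not str(record.get(f, "")).strip()]
    # News/website-type data must additionally carry its site identity.
    if record.get("group_id") in ("news", "website"):
        failures += [f for f in ("site_id", "site_name")
                     if not str(record.get(f, "")).strip()]
    return failures
```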
Data quality rules
1) Title:
① Whether the title contains garbled (mojibake) text;
② Whether stray date information appears in the title;
③ Whether the title ends with a suffix such as "..._XXX website/portal site";
④ Whether the title contains JS or CSS fragments;
⑤ Whether the title contains HTML escape sequences, for example &nbsp;;
⑥ Whether the title contains other special formats, and so on (see the sketch below).
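
An illustrative sketch of the title checks (type = 2). The regular expressions are assumed examples of what each rule might match, not the production rules:

```python
# Sketch of title quality (type = 2) checks; patterns are assumptions.
import re

TITLE_RULES = {
    "date_fragment": re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}"),
    "portal_suffix": re.compile(r"_[^_]{1,30}(website|portal)$", re.IGNORECASE),
    "js_css":        re.compile(r"</?(script|style)|function\s*\(", re.IGNORECASE),
    "html_escape":   re.compile(r"&(nbsp|amp|lt|gt|quot|#\d+);"),
}

def check_title(title: str) -> list[str]:
    """Return the names of all title rules that the text violates."""
    hits = [name for name, pattern in TITLE_RULES.items()
            if pattern.search(title)]
    if "\ufffd" in title:  # crude mojibake heuristic: replacement character
        hits.append("garbled_text")
    return hits
```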
2) Body/comments:
① Whether the text contains garbled (mojibake) characters;
② Whether JS or CSS fragments are included;
③ Whether useless boilerplate is included, for example: "open the app", "view more", "wonderful pictures", "expand full text", "scan the QR code", etc.;
④ Whether the content is consistent with what the title describes;
⑤ Whether the content contains HTML escape sequences;
⑥ Whether special repeated formats appear, for example many quotation marks or newlines in a row;
⑦ Whether copyright notices are included, such as "exclusive manuscript", "reprinting prohibited", etc.
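
A sketch of the body/comment checks. The phrase lists are assumed examples of the "useless content" and copyright keywords, not an exhaustive production list:

```python
# Sketch of body/comment checks; the phrase lists are illustrative.
import re

BOILERPLATE_PHRASES = ["open the app", "view more", "expand full text",
                       "scan the QR code", "wonderful pictures"]
COPYRIGHT_PHRASES = ["exclusive manuscript", "reprinting prohibited"]
REPEATED_FORMAT = re.compile(r'(["\n])\1{3,}')  # runs of quotes/newlines

def check_body(body: str) -> list[str]:
    hits = []
    if any(p in body for p in BOILERPLATE_PHRASES):
        hits.append("useless_content")
    if any(p in body for p in COPYRIGHT_PHRASES):
        hits.append("copyright_notice")
    if REPEATED_FORMAT.search(body):
        hits.append("repeated_format")
    return hits
```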
3) Release time:
① Whether it is later than the collection time;
② Whether its length is 19 characters;
③ Whether its format is yyyy-MM-dd HH:mm:ss.
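
A sketch of these release-time checks, assuming the collection time itself is already well-formed:

```python
# Release-time checks: 19 characters, yyyy-MM-dd HH:mm:ss format,
# and not later than the collection time.
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # yyyy-MM-dd HH:mm:ss, 19 characters

def check_ptime(ptime: str, gather_time: str) -> list[str]:
    hits = []
    if len(ptime) != 19:
        hits.append("bad_length")
    try:
        published = datetime.strptime(ptime, TIME_FORMAT)
    except ValueError:
        hits.append("bad_format")
        return hits
    if published > datetime.strptime(gather_time, TIME_FORMAT):
        hits.append("later_than_gather_time")
    return hits
```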
Data cleaning rules
1) By domain name
Filter out all of a website's data by its domain name. This mainly handles abnormal domains; for example, some websites actually redirect to gambling sites.
2) Targeted cleaning by keyword
① Delete any whole line that contains a given keyword;
② Delete a specific keyword from the text.
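
A sketch of the type = 3 cleaning rules, combining domain filtering with the two keyword modes (drop the whole line vs. strip just the keyword). All of the rule values are invented examples:

```python
# Sketch of cleaning (type = 3): domain blocking plus two keyword modes.
BLOCKED_DOMAINS = {"bad-example.com"}       # assumed domain blocklist
DROP_LINE_KEYWORDS = ["click to download"]  # delete the whole line
STRIP_KEYWORDS = ["advertisement"]          # delete only the keyword

def clean(record: dict) -> dict | None:
    """Return the cleaned record, or None if the whole record is dropped."""
    if record.get("domain") in BLOCKED_DOMAINS:
        return None                         # filter the entire site's data
    lines = []
    for line in record.get("content", "").splitlines():
        if any(k in line for k in DROP_LINE_KEYWORDS):
            continue                        # drop the whole offending line
        for k in STRIP_KEYWORDS:
            line = line.replace(k, "")      # strip just the keyword
        lines.append(line)
    record["content"] = "\n".join(lines)
    return record
```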
Secondary deduplication rules
1) Deduplicate by field.
The key can be a single field or a combination of fields. For example, WeChat data is deduplicated on the combination of "official account name" + "title".
2) Uniqueness of deduplication rules
Each data type can have only one deduplication rule.
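
A sketch of secondary deduplication (type = 4): build a key from the configured field combination for the data type and check it against Redis. The key prefix, expiry, and field names are illustrative assumptions:

```python
# Secondary deduplication: one field-combination rule per data type.
import hashlib
import redis  # assumes the redis-py client is installed

r = redis.Redis()
DEDUP_FIELDS = {"wechat": ("account_name", "title")}  # one rule per type

def is_duplicate(record: dict) -> bool:
    fields = DEDUP_FIELDS.get(record["group_id"], ("title",))
    raw = "|".join(str(record.get(f, "")) for f in fields)
    key = "dedup:%s:%s" % (record["group_id"],
                           hashlib.md5(raw.encode("utf-8")).hexdigest())
    # SET NX succeeds only for the first writer, so later copies are dupes.
    return not r.set(key, 1, nx=True, ex=7 * 24 * 3600)
```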
2) System function design
Rule base management
① Add, edit, delete, and query non-empty rules (type = 1);
② Add, edit, delete, and query data quality verification rules (type = 2);
③ Add, edit, delete, and query data cleaning rules (type = 3);
④ The default rules are the non-empty, data quality, and data cleaning rules described in the related rules section above;
⑤ The corresponding matching keywords must be added under each rule.
Kafka unified push interface management
① Supports adding, editing, deleting, and querying the existing interfaces; the information is simultaneously synchronized to the Redis database in the format described in ④;
② Each interface service record must include the deployment server IP, the interface port number, and related information;
③ Because of distributed deployment, each interface has multiple address URLs;
④ Storage format in Redis: key = method name, value = the concatenated string of addresses (a sketch follows below).
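
A sketch of this registry, assuming the concatenated string is a JSON-encoded list of addresses (the method name and hosts are invented examples):

```python
# Unified push-interface registry: key = method name, value = addresses.
import json
import redis  # assumes the redis-py client

r = redis.Redis()

def register_interface(method: str, endpoints: list[str]) -> None:
    # Distributed deployment: one method name maps to several host:port URLs.
    r.set(method, json.dumps(endpoints))

register_interface("push_news", ["10.0.0.11:8080", "10.0.0.12:8080"])
```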
Data type management
① Supports adding, editing, deleting, and querying data types;
② Allows setting the deduplication key, which can be a single field or a combination of fields;
③ Allows selecting the address URL of the data push interface, associated via the interface method name;
④ Data types generally include: news (or website), forum, blog, microblog, print media, foreign media, client app, WeChat, video, radio, TV, comments, etc.;
⑤ Each data type is associated with an ES index type;
⑥ Each data type is associated with responsible personnel.
Index field management
① Manages the fields of the index library corresponding to each data type;
② Recorded attributes include, but are not limited to: field name, type, length, nullability, and whether the field is a constraint key (the ES primary key);
③ Non-empty rules, data quality verification rules, and cleaning rules can be attached to fields here (see the sketch below).
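
A hypothetical field definition under index field management; the attribute names are assumptions to illustrate what such a configuration record might hold:

```python
# A hypothetical field-definition record; attribute names are illustrative.
title_field = {
    "name": "title",
    "type": "text",
    "length": 256,
    "nullable": False,           # a non-empty rule applies to this field
    "is_constraint_key": False,  # not the ES primary key
    "rules": [1, 2, 3],          # attached rule types: non-empty, quality, cleaning
}
```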
Verification rule settings
① The function is presented as a tree whose nodes are the data types and their fields.
Clicking a data type displays the rules for that type;
Clicking a field displays the non-empty rules, data quality verification rules, and cleaning rules configured for it.
② Clicking a node allows adding the corresponding rules; rules that are already set can be edited and deleted.
Collection personnel management
Collector management temporarily reuses the user-management function of the information source system and will be refined according to actual needs.
Abnormal data management
1) Fields recorded for abnormal data:
① Title (exe_title)
② Link (exe_url)
③ Domain name (exe_domain): used to aggregate exceptions by website;
④ Data type (exe_group_id): used to aggregate abnormal data by data type;
⑤ Index type (exe_index_type)
⑥ Release time (exe_ptime)
⑦ Collection time (exe_gather_time)
⑧ Verification time (exe_check_time)
⑨ Exception rule (exe_rule_id): used for statistics by rule;
⑩ Collected by (exe_crawler_id): used to measure the quality of the data each person collects;
⑪ Collector ID (exe_gather_id): used to quickly locate the collector.
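
A sketch of the exception record a checker might write. The field names follow the list above; the construction logic and source-record fields are illustrative assumptions:

```python
# Build an exception record from a failed source record and the violated rule.
from datetime import datetime

def build_exception(record: dict, rule_id: int) -> dict:
    return {
        "exe_title":       record.get("title"),
        "exe_url":         record.get("url"),
        "exe_domain":      record.get("domain"),      # per-website statistics
        "exe_group_id":    record.get("group_id"),    # per-data-type statistics
        "exe_index_type":  record.get("index_type"),
        "exe_ptime":       record.get("ptime"),
        "exe_gather_time": record.get("gather_time"),
        "exe_check_time":  datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "exe_rule_id":     rule_id,                   # per-rule statistics
        "exe_crawler_id":  record.get("crawler_id"),  # per-person statistics
        "exe_gather_id":   record.get("gather_id"),   # locate the collector
    }
```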
2) Manual edit-and-push function
If a record fits the business well, or carries negative information, it can be edited manually and pushed to Kafka to preserve data timeliness (the normal operations-and-optimization loop takes time and would delay the data); this improves the user experience of the product.
① Statistical functions should be available to analyze and summarize recurring problems:
1) Statistics by verification rule;
2) Statistics by collection staff;
3) Statistics by website;
4) Statistics by data type.
3) Redis rule base design
Storage format
To improve the performance of data quality verification and deduplication, the data type is used as the key, and all rules applicable to that type (non-empty rules, data quality verification rules, cleaning rules, and secondary deduplication rules) are combined into a single JSON string that is stored as the value in the Redis rule library.
For uniform processing, the format can match the storage format of the Kafka unified interface.
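
A sketch of this layout: key = data type, value = one JSON string bundling every rule that applies to that type. The rule bodies are invented examples:

```python
# Redis rule library: one JSON bundle of rules per data type.
import json
import redis

r = redis.Redis()

rules_for_news = {
    "non_empty": [{"type": 1, "field": "title"}],
    "quality":   [{"type": 2, "field": "ptime",
                   "format": "yyyy-MM-dd HH:mm:ss"}],
    "cleaning":  [{"type": 3, "keyword": "advertisement", "mode": "strip"}],
    "dedup":     {"type": 4, "fields": ["title"]},
}
r.set("rules:news", json.dumps(rules_for_news))

# A checker fetches and parses the whole bundle in one round trip.
rules = json.loads(r.get("rules:news"))
```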