Data Quality Monitoring of Python Crawler Series (1)

Source: Internet
Author: User
Keywords python data quality data quality monitoring
Overview
1. Current situation
Recently, in SaaS platform, app and other products, there are always various problems in the collected data, such as the title parsing into JavaScript code, or including a useless character, or a garbled string.
The disadvantages of the previous monitoring mechanism seem to be more and more serious, which can no longer meet the needs of data monitoring.
With the increasing number of data types, customized collection scripts, and personnel involved, the collection difficulty is increasing, and various problems appear frequently.
In order to develop a system that can truly monitor the data quality in real time, locate problems quickly, feed back in time and iterate the collector or script quickly, a layer of centralized monitoring is added at the data push interface on the basis of the original decentralized monitoring.
2. Advantages and disadvantages
Decentralized monitoring means that each collector or script monitors the quality of data by itself. But sometimes because of the urgent task, or to save time, there is no monitoring module added at all.
Centralized monitoring refers to the processing of data quality and weight removal at Kafka unified push interface;
1) Advantages and disadvantages of decentralized monitoring:
(1) Advantages
① It can reduce the pressure at the unified push interface and shorten the time of data entering Kafka;
② Reduce the frequency of interface exceptions;
(2) Disadvantages
① The relevant personnel may modify the monitoring indicators, resulting in confusion, unable to achieve the data quality monitoring effect, and unable to locate the problem;
② Because the task is urgent, or in order to save time, there is no monitoring and scheduling mechanism added at all, resulting in a large number of data duplication and poor quality data, affecting es performance, and seriously affecting the user experience of the product.
③ Waste of resources. Because each collector or customized script needs to consider the monitoring problem, there are many repetitive work and increased labor cost;
④ Product iterations are slow. Laziness is human nature. Without the supervision of process and mechanism, most people will deal with problems in the most convenient way. I even think it's a small problem. It doesn't matter. I just drag it and forget it.
2) Advantages and disadvantages of centralized monitoring:
(1) Advantages
① Reduce the waste of human resources;
② Unified and standardized monitoring mechanism;
③ Advance the abnormal problems to improve the user experience of the product;
④ Reduce human risk.
⑤ According to the monitoring results, supervise the relevant personnel to iterate the products continuously through the process and designation.
⑥ For managers, they can understand the problems existing in the collection of each part in real time, think about the whole situation and optimize the collection strategy.
⑦ According to the monitoring results, to a certain extent, it can provide managers with the basis for performance appraisal.
(2) Disadvantages
① The complexity of unified interface logic processing is increased, and the probability of exception is increased;
② The speed of data processing is reduced. Comprehensive consideration, in the acceptable range, or to meet the needs, can be temporarily ignored.
Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.