Before data analysis: data quality

Source: Internet
Author: User

What is data quality?

Data analysis has recently become a hot topic. Traditionally, data analysis is divided into two types: EDA (Exploratory Data Analysis) and CDA (Confirmatory Data Analysis). EDA focuses on letting the data speak for itself, whereas CDA usually starts from a preset model that the analysis then tests.

In fact, the focus of data analysis and data mining is not the data itself, but solving practical business problems in data operations. To solve those problems, data must be analyzed and mined so that it generates value. Before any analysis or mining, however, we must first ensure that the data is of high quality by completing data quality processing, that is, data integration and processing. Better data means better decisions; otherwise, the result is GIGO: garbage in, garbage out.

Therefore, the premise of data analysis is to ensure data quality.

What does data quality work involve?

Traditional data quality work mainly involves data integration and data cleansing, and focuses on raw data and metadata.

I. Data Integration

Data integration mainly solves the problem of information islands (data silos) and covers two aspects:

1) The data warehouse integrates source data.

2) The metadata system integrates metadata from different data sources.

Correspondingly, data quality management also focuses on two aspects:

1) Quality exploration and analysis of the actual data in the data warehouse.

2) Data quality checks on the metadata in the metadata system.

II. Data Cleansing

Data quality processing mainly applies data cleansing rules to process missing data, remove duplicate data, remove noisy data, and handle abnormal (but real) data. This ensures data quality along dimensions such as integrity, uniqueness, consistency, accuracy, legitimacy, and timeliness.
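The four kinds of cleansing rules named above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records, the field names, the choice of median imputation, and the `outlier_factor` threshold are not taken from any particular tool. Note that the abnormal value is only flagged, not deleted, because it may be real.

```python
from statistics import median

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "city": "Hangzhou", "amount": 120.0},
    {"id": 2, "city": " hangzhou ", "amount": None},   # missing amount, untidy text
    {"id": 2, "city": " hangzhou ", "amount": None},   # duplicate of the row above
    {"id": 3, "city": "Beijing", "amount": 95.0},
    {"id": 4, "city": "Beijing", "amount": 110.0},
    {"id": 5, "city": "Shanghai", "amount": 9000.0},   # abnormal but possibly real
]

def cleanse(rows, outlier_factor=10.0):
    # 1) Remove noise: normalize whitespace and capitalization in text fields.
    #    New dicts are built so the input rows are left unchanged.
    tidy = [{**r, "city": r["city"].strip().title()} for r in rows]

    # 2) Remove duplicates: keep the first row seen for each id.
    seen, unique = set(), []
    for r in tidy:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)

    # 3) Process missing data: fill missing amounts with the median of the
    #    known amounts, and record that the value was imputed.
    known = [r["amount"] for r in unique if r["amount"] is not None]
    mid = median(known)
    for r in unique:
        r["amount_imputed"] = r["amount"] is None
        if r["amount"] is None:
            r["amount"] = mid

    # 4) Process abnormal (but real) data: flag it for review, do not delete it.
    for r in unique:
        r["is_outlier"] = r["amount"] > outlier_factor * mid
    return unique

cleaned = cleanse(records)
for row in cleaned:
    print(row)
```

In practice the order of the rules and the imputation strategy depend on the business meaning of each field, so this should be read as one possible arrangement rather than a fixed recipe.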

Metadata management aims to integrate enterprise information assets and to support transparent, visualized data usage, which improves the reliability of data reporting, data analysis, and data mining. Therefore, metadata quality checks focus on the uniqueness, consistency, and accuracy of metadata information.
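As a rough illustration of what a metadata quality check might look like, the Python sketch below tests two of the three dimensions: uniqueness (no column registered twice) and consistency (a column name shared across sources has a single type). The metadata layout, the source names, and the `check_metadata` helper are all hypothetical. Accuracy is not covered here, since it usually requires comparing the metadata against the real systems.

```python
from collections import defaultdict

# Hypothetical metadata entries collected from two source systems.
metadata = [
    {"source": "crm", "table": "customer", "column": "customer_id", "type": "bigint"},
    {"source": "crm", "table": "customer", "column": "customer_id", "type": "bigint"},   # registered twice
    {"source": "crm", "table": "customer", "column": "signup_date", "type": "date"},
    {"source": "billing", "table": "invoice", "column": "customer_id", "type": "varchar"},  # type conflict
    {"source": "billing", "table": "invoice", "column": "signup_date", "type": "date"},
]

def check_metadata(entries):
    # Uniqueness: each (source, table, column) should be registered only once.
    counts = defaultdict(int)
    for e in entries:
        counts[(e["source"], e["table"], e["column"])] += 1
    duplicates = sorted(key for key, n in counts.items() if n > 1)

    # Consistency: a column name shared across sources should have one type.
    types = defaultdict(set)
    for e in entries:
        types[e["column"]].add(e["type"])
    conflicts = {col: sorted(t) for col, t in types.items() if len(t) > 1}
    return duplicates, conflicts

duplicates, conflicts = check_metadata(metadata)
print("duplicate entries:", duplicates)
print("type conflicts:", conflicts)
```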

How to implement data quality

I. Difficulties in data quality

Even now, many people are still not fully aware of the importance of data quality, for the following reasons:

1) Data quality problems have not yet become serious enough to affect the assessment of their core KPIs.

2) When data delivery teams or data application teams point out data quality problems, responsibility is easily passed back and forth and shirked, because data quality problems are often the combined result of problems in many links of the chain. Many people also fear that introducing data quality management will jeopardize their own work.

3) Data quality teams often work from the perspective of monitoring and supervision rather than from the perspective of the data users' own interests. Improving data quality from the users' perspective would help them obtain more value from data governance, improve work efficiency, increase the authority and credibility of their work, and bring them direct business value, which in turn would encourage more data-related personnel to take the initiative to participate in data quality. As a result, many people talk about data quality, but few are willing to take practical action.

II. Steps for improving data quality

Even when risk is not yet a critical issue, setting up a risk analysis team is an important precaution for an enterprise. In the same way, the leadership of the enterprise data department must reach a consensus that a comprehensive data quality solution can bring great value to the company. On this basis, internal data quality can be improved through planned steps:

Step 1: Hold data quality discussions across the enterprise, taking into account the company's goals and the interests of all parties, and define the objectives, policies, strategies, and steps of data quality management. At a minimum, a broad consensus should be reached within the data management and data governance teams.

Step 2: Establish internal responsibilities and data quality policies, and establish a system for evaluating the economic impact of poor-quality data and the value of high-quality data.

Step 3: Establish an open data quality management system, so that data quality work shifts from being the responsibility of a single data management team to involving all data providers, data processors, data users, and other data stakeholders within the company. Data-related personnel will then watch the data operation panorama and the data quality heat map for data operation processes such as data quality and data security, just as drivers watch real-time traffic conditions. From the heat map, they can easily see which issues relate to their own responsibilities and take part in handling them in a timely manner.
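To make the heat map idea in Step 3 more concrete, here is a small Python sketch that counts open issues per process stage and responsible team, then filters the result for one stakeholder. The issue records, stage names, team names, and both helper functions are hypothetical; a real system would draw these from its own issue tracking and lineage data.

```python
from collections import Counter

# Hypothetical open data quality issues, tagged with the process stage
# where they were found and the team responsible for that stage.
issues = [
    {"stage": "ingestion", "owner": "platform", "kind": "missing"},
    {"stage": "ingestion", "owner": "platform", "kind": "duplicate"},
    {"stage": "transformation", "owner": "warehouse", "kind": "inconsistent"},
    {"stage": "reporting", "owner": "analytics", "kind": "stale"},
    {"stage": "ingestion", "owner": "platform", "kind": "missing"},
]

def heat_map(rows):
    # Count open issues per (stage, owner) cell; a higher count is a "hotter" cell.
    return Counter((r["stage"], r["owner"]) for r in rows)

def my_hot_spots(rows, owner):
    # Let one stakeholder see only the cells tied to their own responsibilities.
    return {stage: n for (stage, who), n in heat_map(rows).items() if who == owner}

# Print a simple text heat map, hottest cell first.
for (stage, owner), n in heat_map(issues).most_common():
    print(f"{stage:<15}{owner:<12}{'#' * n}")

print(my_hot_spots(issues, "platform"))
```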

Challenges to data quality in the big data age

I. Do we still need to pay attention to data quality in the big data era?

In the relational database era, we could use data warehouse and business intelligence technologies to complete data integration, data analysis, and data presentation. But as we all know, data in the big data era has the following four V characteristics:

1) Volume: huge data volume, from the TB level to the PB level

2) Variety: diverse data types, including structured, semi-structured, and unstructured data

3) Velocity: fast processing speed. The "1-second law" is what essentially distinguishes this from traditional data mining.

4) Value: low value density but high commercial value

II. How to implement data quality in the big data age

Positioning of the data quality team
