Data quality management: data extraction and cleaning


Web data integration technology can automatically extract data from the Web, but much of it is dirty: misused abbreviations and idioms, data entry errors, duplicate records, missing values, spelling variants, and inconsistent units of measurement. Such data is meaningless as-is and cannot support later data mining or decision analysis.
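A minimal sketch of how a few of the dirty-data problems listed above (abbreviation misuse, mixed units, missing values, duplicates) might be handled. All records, field names, and the abbreviation dictionary are hypothetical.

```python
records = [
    {"name": "Intl. Business Machines", "weight": "2.5 kg"},
    {"name": "International Business Machines", "weight": "2500 g"},
    {"name": "Acme Corp", "weight": None},
]

ABBREVIATIONS = {"Intl.": "International"}  # assumed abbreviation dictionary

def normalize(rec):
    """Expand abbreviations and convert weights to a single unit (grams)."""
    name = " ".join(ABBREVIATIONS.get(w, w) for w in rec["name"].split())
    weight = rec["weight"]
    if weight is not None:
        value, unit = weight.split()
        grams = float(value) * (1000 if unit == "kg" else 1)
    else:
        grams = None  # missing value: flag it for later imputation
    return {"name": name, "weight_g": grams}

cleaned = [normalize(r) for r in records]
# After normalization, the first two records collapse into one entity:
unique = {(r["name"], r["weight_g"]) for r in cleaned}
```

Once abbreviations and units are normalized, duplicates that were invisible in the raw data become exact matches and can be removed.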

Data cleaning mainly improves the usability of data. It is currently applied in three areas:
1 Data warehousing (DW)
2 Knowledge discovery in databases (KDD)
3 (Total) data quality management (TDQM)
My first project at the company was data quality management. In outline, it works like this: develop and run data quality checks to expose quality problems in each system; continuously monitor each system's quality fluctuations and the pass rate of each quality rule; regularly generate key quality reports for each system to track its quality status; and, by combining the cleaning components the system provides with a defined problem-handling process, give effective support for improving each system's data quality.

Data quality management spans the entire data lifecycle, covering quality assessment, data de-noising, data monitoring, data exploration, data cleansing, and data diagnosis. Data metrics and the frequency of change provide a means of measuring data quality. The main metrics are completeness, uniqueness, consistency, accuracy, and legitimacy (validity). Change frequency covers the change period of business-system data and the refresh period of entity data. Data quality management guidelines include methods for measuring and improving the quality and integration of an organization's data. Data quality processing includes data standardization, matching, survivorship, and quality monitoring. Data must reach the right quality level to meet business requirements.
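Three of the metrics named above (completeness, uniqueness, and legitimacy) can be sketched as simple functions over a small hypothetical table. The column names and the validity rule are illustrative assumptions.

```python
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "b@example"},  # duplicate id, malformed email
]

def completeness(rows, col):
    """Fraction of rows where the column is present (non-null)."""
    return sum(r[col] is not None for r in rows) / len(rows)

def uniqueness(rows, col):
    """Fraction of distinct values among the non-null values."""
    values = [r[col] for r in rows if r[col] is not None]
    return len(set(values)) / len(values)

def validity(rows, col, predicate):
    """Fraction of non-null values satisfying a legitimacy rule."""
    values = [r[col] for r in rows if r[col] is not None]
    return sum(predicate(v) for v in values) / len(values)

print(completeness(rows, "email"))  # 2 of 3 rows have an email
print(uniqueness(rows, "id"))       # 2 distinct ids among 3 values
print(validity(rows, "email", lambda v: "." in v.split("@")[-1]))
```

Tracking these ratios over time is one concrete way to monitor the "quality fluctuation" of a system.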
Based on a big data reference framework and the actual demands of data processing, the main functions of a quality management system are data discovery, quality management, metadata management, master data management, and information policy management.

Within the data lifecycle, the acquisition and usage cycle comprises a series of activities: evaluation, analysis, tuning, and discarding of data.

Current models of data cleansing include:
Data cleaning based on rough set theory
Data cleaning based on clustering
Data cleaning based on fuzzy matching
Data cleaning based on genetic neural networks
Architectures based on expert systems, etc.
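The fuzzy-matching model above can be sketched with the standard library's `difflib`: two strings are treated as the same real-world entity when their similarity ratio exceeds a threshold. The names and the threshold value are assumptions.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Fuzzy match: ratio of matching characters vs. total length."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Hypothetical customer names with a spelling variation
names = ["Jonh Smith", "John Smith", "Jane Doe"]
dupes = [(a, b) for i, a in enumerate(names)
         for b in names[i + 1:] if similar(a, b)]
```

A production system would typically block records first (e.g. by a sorted key) so that only candidate pairs, not all pairs, are compared.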

Data validation and conversion
The purpose of data validation is to ensure the correctness and completeness of the extracted data itself.
The purpose of data conversion is to ensure the consistency of the data.
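A small sketch of validation (correctness and completeness checks) followed by conversion (enforcing one consistent representation). The field names and date formats are assumptions for illustration.

```python
from datetime import datetime

def validate(rec):
    """Reject records with missing keys or an unparseable date."""
    if not rec.get("id") or not rec.get("date"):
        return False
    try:
        datetime.strptime(rec["date"], "%d/%m/%Y")
        return True
    except ValueError:
        return False

def convert(rec):
    """Rewrite the date into the single target format (ISO 8601)."""
    d = datetime.strptime(rec["date"], "%d/%m/%Y")
    return {"id": rec["id"], "date": d.strftime("%Y-%m-%d")}

raw = [{"id": 7, "date": "31/01/2024"},
       {"id": 8, "date": "2024-01-31"}]  # wrong format: fails validation
clean = [convert(r) for r in raw if validate(r)]
```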

Data cleansing process

1 Data preprocessing: including standardizing data elements and classifying records
2 Determine the cleaning method
3 Verify the cleaning method: first confirm that the chosen method is appropriate by running it on a small sample and measuring its recall and precision
4 Execute the cleaning tool
5 Data archiving: archive both the old and the new data sources so the cleaning can be audited and repeated
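Step 3 above can be sketched as follows: apply a candidate cleaning rule to a small labelled sample and compute its precision and recall. The rule and the labels are hypothetical.

```python
def is_dirty(value):
    """Candidate cleaning rule: flag empty or whitespace-only strings."""
    return value is None or value.strip() == ""

# (value, truly_dirty) pairs labelled by hand for verification
sample = [("", True), ("  ", True), ("ok", False), ("N/A", True)]

flagged = [(v, t) for v, t in sample if is_dirty(v)]
true_positives = sum(t for _, t in flagged)
precision = true_positives / len(flagged)          # flagged items that are dirty
recall = true_positives / sum(t for _, t in sample)  # dirty items that got flagged
```

Here the rule never flags clean data (precision 1.0) but misses the `"N/A"` sentinel (recall 2/3), which tells you the rule needs another pattern before being run on the full dataset.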

In general, the metadata reflected in the schema is not sufficient to judge the quality of a data source, so it is important to obtain metadata about the data's actual characteristics and unusual value patterns by examining concrete instances. Such metadata helps uncover data quality problems and also helps identify dependencies between attributes.

1 Data analysis
There are two methods of data analysis:
Data profiling: analyzes instances of a single attribute. Profiling yields a great deal of information about an attribute, such as its data type, length, value range, discrete values and their frequencies, and the number of distinct values; applying statistical techniques also yields the attribute's mean, median, and so on.
Data mining: helps discover specific patterns in large datasets; it can be used to discover integrity constraints between attributes, such as functional dependencies and business rules.
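Single-attribute profiling as described above can be sketched with the standard library alone. The column and its values are hypothetical.

```python
from collections import Counter
from statistics import mean, median

ages = [23, 35, 35, 41, None, 29]          # one attribute's instances
non_null = [v for v in ages if v is not None]

profile = {
    "inferred_type": type(non_null[0]).__name__,  # data type
    "distinct_count": len(set(non_null)),         # number of distinct values
    "frequencies": Counter(non_null),             # value frequencies
    "mean": mean(non_null),                       # statistical summaries
    "median": median(non_null),
    "null_count": ages.count(None),               # completeness signal
}
```

A profile like this is exactly the per-attribute metadata that schema inspection alone cannot provide.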

2 Defining cleaning transformation rules and workflows
Depending on the extent of inconsistent and "dirty" data in the source, a larger or smaller number of transformation and cleaning steps will be required.
3 Verification
The correctness and efficiency of the defined cleaning rules and workflows should be validated and evaluated; in practice the cleansing process iterates through analysis, design, and verification.

4 Cleaning errors in the data
Note: back up the source data first.
5 Clean data reflow
The cleaned data replaces the original "dirty data" in the data source.

Data cleansing frameworks
A. A domain-independent data cleansing framework
Metadata is "data about data": the key information about source definitions, target definitions, transformation rules, and so on used during the cleansing process. In this framework the metadata includes the following components:
1 Basic component: describes the characteristics of the metadata, including the database name and number, table names and numbers, and attribute names and numbers within each table.

2 Cleaning-rule component: data quality rules that define, in metadata, the quality problems and the cleansing rules, including the error-data tables.

3 Data-loading component: determines, for heterogeneous metadata, when and which data is loaded into the destination database.
The framework additionally defines three workflows:
(1) data analysis workflow (2) data cleansing workflow (3) cleaning-result validation workflow
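The three metadata components above can be sketched as plain data structures. Every name and field here is an illustrative assumption, not a standard schema.

```python
# Basic component: identifies databases, tables, and attributes.
basic_metadata = {
    "database": {"name": "crm", "id": 1},
    "tables": [{"name": "customer", "id": 10,
                "attributes": [{"name": "email", "id": 101}]}],
}

# Cleaning-rule component: each rule names the quality problem it
# detects and the error table that captures records failing it.
cleaning_rules = [
    {"rule": "email_format", "target": "customer.email",
     "error_table": "err_customer_email"},
]

# Data-loading component: when and how data reaches the destination.
load_policy = {"schedule": "daily", "mode": "incremental"}
```

Keeping this information as metadata, rather than hard-coding it into cleaning scripts, is what makes the framework domain-independent.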

B. A domain-knowledge-based data cleansing framework
This knowledge-based framework extracts and validates knowledge from sample data under the guidance of domain knowledge, then cleans the whole dataset through an expert-system engine.
1 Rule generation phase: first extract a small sample data set from the entire database; with domain experts participating, generate a preliminary rule base; apply the rules to the sample, observe the intermediate results, and refine the rules further. Machine learning and statistical techniques can assist in this process.

2 Preprocessing phase: correct all anomalies detectable by the generated preprocessing rules. Basic preprocessing includes data type detection, data format standardization, and resolving data inconsistencies.

3 Processing phase: the data then flows into the expert-system engine. Typical rules include dirty-data detection rules, duplicate-record detection, and handling of more formal error data.

4 Data loading phase: load the cleansed data into the destination database according to the data-loading rules.
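The phases above can be sketched as a minimal rule-driven pipeline: preprocessing rules repair each record, detection rules decide whether it is dirty, and surviving records are "loaded". The rules and records are hypothetical; a real expert-system engine would be far richer.

```python
# Preprocessing rules: format standardization (trim and title-case city).
preprocess_rules = [lambda r: {**r, "city": r["city"].strip().title()}]
# Detection rules: flag records that violate a domain constraint.
detect_rules = [lambda r: r["age"] < 0]

def clean(records):
    loaded, rejected = [], []
    for rec in records:
        for rule in preprocess_rules:                # preprocessing phase
            rec = rule(rec)
        if any(rule(rec) for rule in detect_rules):  # processing phase
            rejected.append(rec)                     # route to error table
        else:
            loaded.append(rec)                       # data loading phase
    return loaded, rejected

loaded, rejected = clean([{"city": " beijing ", "age": 30},
                          {"city": "shanghai", "age": -1}])
```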

Data cleansing framework design

As an aside, a short NoSQL note:
Hypertable aims to meet the needs of databases with high concurrency and large data volumes: it can handle many concurrent requests, manage massive amounts of data, and scales well.

