In this era of data explosion, we generate data all the time. Extracting value from massive amounts of data requires a sharp eye; for a data practitioner, that means cleaning and integrating the data.
There are three steps:
Data extraction, transformation, and loading
Data cleaning
Data integration
We discuss each of these three aspects below.
Data extraction, transformation, and loading
Databases that serve operational (OLTP) applications generally do not run particularly complex analysis tasks. To analyze the data, we extract, transform, and load (Extract, Transform and Load, ETL) it from these databases into a data warehouse and then run complex analytical workloads on it, including OLAP analysis and data mining, to mine patterns and knowledge from the data.
In this process, if we ETL data from multiple heterogeneous data sources into the data warehouse, and these sources exhibit various kinds of heterogeneity and inconsistency, the data must be integrated. Data integration is therefore the technique of building a unified data source out of multiple data sources.
When performing ETL, if the quality of the source data is poor, data cleaning techniques are needed during the transformation step to fix the quality problems. If data quality is already guaranteed, no cleaning is required and the transformation step is relatively simple, for example a basic normalization.
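As a rough illustration, here is a minimal ETL sketch in Python; the employees table, the column names, and the min-max normalization are hypothetical choices, not part of any particular system. Rows are extracted from a source database, a simple normalization is applied, and the result is loaded into a warehouse table.

```python
import sqlite3

def extract(src):
    # Extract: read raw rows from the (hypothetical) operational source.
    return src.execute("SELECT id, name, salary FROM employees").fetchall()

def transform(rows):
    # Transform: min-max normalize the salary column into [0, 1].
    salaries = [sal for _, _, sal in rows]
    lo, hi = min(salaries), max(salaries)
    span = (hi - lo) or 1.0
    return [(rid, name, (sal - lo) / span) for rid, name, sal in rows]

def load(dw, rows):
    # Load: write the transformed rows into the warehouse table.
    dw.execute("CREATE TABLE IF NOT EXISTS dw_employees (id INTEGER, name TEXT, salary_norm REAL)")
    dw.executemany("INSERT INTO dw_employees VALUES (?, ?, ?)", rows)
    dw.commit()

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
    src.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                    [(1, "Alice", 5000), (2, "Bob", 8000), (3, "Carol", 6500)])
    dw = sqlite3.connect(":memory:")
    load(dw, transform(extract(src)))
    print(dw.execute("SELECT * FROM dw_employees").fetchall())
```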
Data cleaning
What is data cleaning?
Data cleaning is a technique for eliminating errors in data and removing duplicate records. It can be embedded in the ETL process, playing its part while a data warehouse is built from the data sources, or it can be run directly on a database, in which case the cleaned data is written back to the original database.
The significance of data cleaning
Only analysis based on accurate, high-quality data can yield credible results, and only on such results can correct decisions be made; analysis of inaccurate data can lead to wrong conclusions and wrong decisions. It is estimated that anomalies and impurities generally account for about 5% of the total data.
Data cleaning is often coupled with data integration: when data from multiple sources is integrated, data cleaning techniques are used to eliminate errors so that high-quality data is obtained.
Different types of data anomalies
The purpose of data cleaning is to remove anomalies in the data. Data anomalies fall into three categories: syntactical anomalies (Syntactical), semantic anomalies (Semantic), and coverage anomalies (Coverage Anomaly).
Syntactical anomalies
These are errors in the values and formats of the data used to represent entities. Syntactical anomalies can be divided into three types:
Lexical Error
The actual structure of the data does not match the specified structure.
For example, in a person table each entity should have four attributes: name, age, gender, and height. A record that has only three attributes is a lexical error.
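A minimal sketch of how such a check might look, assuming a hypothetical person table with exactly four fields per record (the sample rows are made up):

```python
import csv, io

EXPECTED_FIELDS = 4  # name, age, gender, height

# The second row is missing a field and should be flagged as a lexical error.
raw = "John,35,M,180\nMary,29,F\nLee,41,M,172\n"

for line_no, row in enumerate(csv.reader(io.StringIO(raw)), start=1):
    if len(row) != EXPECTED_FIELDS:
        print(f"lexical error on row {line_no}: expected {EXPECTED_FIELDS} fields, got {len(row)}")
```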
Domain Format Error
The value of an attribute does not conform to the expected format within its domain, where the domain is the set of all possible values for that attribute.
For example, suppose names are strings in which the first and last name are separated by a ","; then "John, Smith" is a correct value and "John Smith" is not.
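A small sketch of a domain format check under that assumed "First, Last" format; the regular expression is just one possible way to express the rule:

```python
import re

# Expected format: letters, a comma, optional whitespace, then letters.
NAME_FORMAT = re.compile(r"^[A-Za-z]+,\s*[A-Za-z]+$")

for name in ["John, Smith", "John Smith"]:
    if NAME_FORMAT.match(name):
        print(f"{name!r}: OK")
    else:
        print(f"{name!r}: domain format error")
```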
Irregularity
Values, units, and abbreviations are used inconsistently and without a standard.
For example, some employees' salary fields use "yuan" as the unit while others use "ten thousand yuan".
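One possible way to repair this kind of irregularity is to convert every value to a single unit; the records and unit labels below are hypothetical:

```python
# Normalize all salaries to "yuan" so the column uses one consistent unit.
salaries = [
    {"name": "A", "salary": 8000, "unit": "yuan"},
    {"name": "B", "salary": 1.2,  "unit": "ten thousand yuan"},
]

for row in salaries:
    if row["unit"] == "ten thousand yuan":
        row["salary"] *= 10000
        row["unit"] = "yuan"

print(salaries)
```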
Semantic anomalies
The data fails to represent the entities of the objective world comprehensively and without repetition. There are four types of semantic anomalies:
Integrity constraint violations (Integrity Constraint Violation).
One or more tuples do not satisfy the integrity constraint rules (entity integrity, referential integrity, or user-defined integrity).
For example, suppose we stipulate that the salary field in the employee table must be greater than 0. If some employee's salary is less than 0, this integrity constraint is violated.
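A minimal sketch of checking such a user-defined constraint; the employee tuples are made up for illustration:

```python
# Rule: every employee's salary must be greater than 0.
employees = [("Alice", 5000), ("Bob", -300), ("Carol", 0)]

violations = [(name, salary) for name, salary in employees if salary <= 0]
print("integrity constraint violations:", violations)
```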
Contradictions in the data.
The attribute values within a tuple, or across different tuples, violate the dependencies that should hold among those values.
For example, in a bill table the bill amount should be the total amount of the goods minus the discount; if, for some bill in the database, the amount actually paid does not equal the total minus the discount, the data is contradictory.
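A small sketch of detecting this kind of contradiction, assuming hypothetical bill records with total, discount, and paid fields:

```python
bills = [
    {"id": 1, "total": 100.0, "discount": 10.0, "paid": 90.0},
    {"id": 2, "total": 200.0, "discount": 20.0, "paid": 190.0},  # contradictory
]

for bill in bills:
    # The paid amount should equal total minus discount (small tolerance for floats).
    if abs(bill["total"] - bill["discount"] - bill["paid"]) > 1e-6:
        print(f"bill {bill['id']} is contradictory: "
              f"{bill['total']} - {bill['discount']} != {bill['paid']}")
```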
Duplicate data.
It means that two or more tuples represent the same entity.
Invalid tuples.
Certain tuples do not represent valid entities in the objective world.
For example, there is an employee in the employee table named "Wang Chao", but there is no such person in the company.
Coverage anomalies
Missing values (Missing Value). The corresponding data was never collected during data acquisition.
Missing tuples (Missing Tuple). Certain entities exist in the objective world but are not represented by any tuple in the database.
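A minimal sketch of spotting missing values, assuming missing attributes are stored as None in these hypothetical records:

```python
rows = [
    {"name": "Alice", "age": 30,   "height": None},
    {"name": "Bob",   "age": None, "height": 175},
]

# Count missing values per column.
missing = {}
for row in rows:
    for column, value in row.items():
        if value is None:
            missing[column] = missing.get(column, 0) + 1

print("missing values per column:", missing)
```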
Data quality
Data quality is a fairly broad concept with many aspects. Its evaluation criteria can be organized into a hierarchy, in which the score of a higher-level criterion is a weighted aggregate of the scores of its sub-criteria. Commonly cited criteria are listed below, followed by a small sketch of how some of them can be computed.
Correctness. The ratio of correct values to all values in the data set.
Integrity. Can be divided into completeness and validity.
Consistency. Can be divided into schema conformance and uniformity.
Density. The proportion of missing values among all attribute values that should exist across all tuples.
Uniqueness. The proportion of duplicate tuples (tuples representing the same entity) among all tuples in the data set.
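As a rough illustration, the sketch below computes three of these scores on a tiny hypothetical data set, using made-up per-column correctness rules:

```python
rows = [("Alice", 30), ("Bob", -5), ("Alice", 30), ("Carol", None)]

def value_ok(column, value):
    # Hypothetical correctness rules: names are non-empty strings, ages are > 0.
    if value is None:
        return False
    if column == 0:
        return isinstance(value, str) and value != ""
    return isinstance(value, (int, float)) and value > 0

values = [(c, v) for row in rows for c, v in enumerate(row)]
correctness = sum(value_ok(c, v) for c, v in values) / len(values)   # correct values / all values
density = sum(v is None for _, v in values) / len(values)            # missing values / all values
uniqueness = (len(rows) - len(set(rows))) / len(rows)                # duplicate tuples / all tuples

print(f"correctness={correctness:.2f} density={density:.2f} uniqueness={uniqueness:.2f}")
```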
Data cleaning tasks and processes
Data cleaning removes anomalies from the data so that it becomes an accurate, non-redundant representation of the real world.
The data cleaning process can be divided into four main steps:
Data auditing;
Defining the data cleaning workflow;
Executing the data cleaning workflow;
Post-processing and control.
Specific data cleaning methods
Data analysis;
Data transformation;
Enforcing integrity constraints;
Data deduplication (see the sketch after this list);
Statistical methods.
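A minimal deduplication sketch; the records and the normalization key are hypothetical. Tuples whose normalized name (lower-cased, extra whitespace removed) matches are treated as the same entity and collapsed into one record.

```python
records = [
    {"name": "John Smith",  "city": "Boston"},
    {"name": "john  smith", "city": "Boston"},
    {"name": "Mary Jones",  "city": "Austin"},
]

def dedup_key(record):
    # Normalize the name so different spellings of the same entity collide.
    return " ".join(record["name"].lower().split())

deduplicated = {}
for record in records:
    deduplicated.setdefault(dedup_key(record), record)  # keep the first occurrence

print(list(deduplicated.values()))
```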
Data integration
What is data integration?
In many applications, data from different sources must be integrated to obtain meaningful analysis results; otherwise, incomplete data leads to inaccurate results.
Data integration refers to the process of integrating data from multiple data sources to provide a unified view of these data.
Problems to be solved by data integration
The first problem to be solved by data integration is the heterogeneity (Heterogeneity) between different data sources.
The heterogeneity between data sources is reflected in the following aspects:
Heterogeneity of data management systems;
Communication protocol heterogeneity;
Heterogeneity of data schemas;
Heterogeneity of data types;
Value heterogeneity;
Semantic heterogeneity;
Data integration models
We have three basic strategies for data integration, namely Federated Database, Data Warehousing, and Mediation.
1. The federated database model
The federated database is the simplest data integration mode. Mapping and transformation software, called a wrapper, must be created between each pair of data sources: when data source X needs to communicate and integrate with data source Y, a wrapper must be built between X and Y.
Advantages: if there are many data sources but only a few of them need to communicate and integrate with each other, the federated database is the most cost-effective mode.
Disadvantages: if many data sources need to communicate and exchange data with one another, a large number of wrappers must be created; with n data sources, up to n(n-1)/2 wrappers are needed, which is very heavy work. And if a data source changes, the mapping and transformation logic must be modified and a large number of wrappers updated, which becomes very difficult.
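As a rough illustration of the wrapper idea, here is a hypothetical wrapper that maps records from one source's layout into another's; the class name, fields, and mapping are all made up:

```python
class XtoYWrapper:
    """Maps records from source X's schema into source Y's schema."""

    def translate(self, x_record):
        # Source X stores a full name; source Y expects separate first/last fields.
        first, last = x_record["full_name"].split(" ", 1)
        return {"first_name": first, "last_name": last, "wage": x_record["salary"]}

wrapper = XtoYWrapper()
print(wrapper.translate({"full_name": "John Smith", "salary": 5000}))
```

With n sources integrated pairwise in this way, up to n(n-1)/2 such wrappers may be needed, which is why the approach scales poorly.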
2. The data warehouse model
The data warehouse is the most common data integration model: data is copied from the various data sources, transformed, and then stored in a target database.
ETL is the abbreviation of Extract, Transform, Load. The ETL process is carried out outside the data warehouse; the data warehouse itself is responsible for storing the data for querying.
In the data warehouse model, the data integration process is the ETL process, which must resolve the heterogeneity and inconsistency among the various data sources.
In the data warehouse model, the same data exists in two copies: one in the data source and one in the data warehouse. Keeping the warehouse copy up to date therefore becomes a problem to consider (full update versus incremental update).
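A minimal sketch of an incremental update, assuming a hypothetical updated_at column and a watermark recorded after the previous load; only rows changed since the watermark are copied, instead of the whole table (a full update):

```python
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, "2024-01-01"), (2, 25.0, "2024-02-01")])

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE dw_orders (id INTEGER, amount REAL, updated_at TEXT)")

last_load = "2024-01-15"   # watermark recorded after the previous load
changed = src.execute("SELECT * FROM orders WHERE updated_at > ?", (last_load,)).fetchall()
dw.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", changed)

print(dw.execute("SELECT * FROM dw_orders").fetchall())   # only order 2 is copied
```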
3. The mediation model
The mediator acts as a virtual view over the data sources. The mediator itself does not store the data; the data remains in the data sources. The mediator maintains a virtual schema (Virtual Schema) that combines the schemas of the various data sources.
Data mapping and transfer only actually happen at query time (Query Time).
When a user submits a query, it is translated into several queries over the individual data sources. These sub-queries are sent to the data sources, each source executes its sub-query and returns its results, and the results from all sources are merged (Merge) and returned to the end user.
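A conceptual sketch of this query-time behavior; the sources, the query, and the merge step are all simplified and hypothetical:

```python
# The mediator holds no data of its own; each "source" below stands in for an
# external data source that answers sub-queries over its own records.
source_a = [{"name": "Alice", "dept": "Sales"}]
source_b = [{"name": "Bob", "dept": "Sales"}, {"name": "Carol", "dept": "HR"}]

def query_source(source, dept):
    # Each source executes the sub-query over its own data.
    return [row for row in source if row["dept"] == dept]

def mediator_query(dept):
    # Decompose the user's query, dispatch it to every source, then merge the results.
    results = []
    for source in (source_a, source_b):
        results.extend(query_source(source, dept))
    return results

print(mediator_query("Sales"))   # merged view: Alice and Bob
```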