Key technologies of ETL in BI
ETL (extract, transform, load) is the core of BI/DW. It integrates data and enhances its value according to unified rules, and it is responsible for moving data from the source systems into the target data warehouse, which makes it a key step in implementing a data warehouse.
The main steps of the ETL process are data extraction, data transformation and processing, and data loading. To support these functions, ETL tools generally add capabilities such as workflow, a scheduling engine, a rule engine, scripting support, and statistics.
Data extraction
Data extraction is the process of pulling data out of a data source. In practice the data source is usually a relational database, and there are several ways to extract data from a database.
(1) Full extraction. Full extraction is similar to data migration or data replication: it extracts the data of a table or view in the data source in its entirety and converts it into a format that the ETL tool can recognize. Full extraction is relatively simple.
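As a concrete illustration, below is a minimal Python sketch of full extraction using the standard sqlite3 module. The database file and table name (source.db, customers) are assumptions; a production job would use the driver for the actual source database.

```python
# Minimal full-extraction sketch (database file and table name are illustrative).
import sqlite3

def full_extract(conn: sqlite3.Connection, table: str):
    """Read every row of a source table and return it as a list of dicts."""
    conn.row_factory = sqlite3.Row
    cursor = conn.execute(f"SELECT * FROM {table}")   # table name must be trusted
    # Convert rows into a driver-independent structure the ETL tool can consume.
    return [dict(row) for row in cursor]

if __name__ == "__main__":
    src = sqlite3.connect("source.db")           # hypothetical source database
    rows = full_extract(src, "customers")        # hypothetical table name
    print(f"extracted {len(rows)} rows")
```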
(2) Incremental extraction. Incremental extraction retrieves only the data that has been added or modified in the source tables since the last extraction. In practice, incremental extraction is used far more widely than full extraction. The key to incremental extraction is how to capture the changed data, and a capture method must satisfy two requirements: accuracy, meaning it can capture the changes made in the business system at a given frequency; and performance, meaning it must not put excessive pressure on the business system or disrupt existing operations. The methods commonly used to capture changed data for incremental extraction are:
A. Triggers: Create the required triggers on the tables to be extracted, usually one each for insert, update, and delete. Whenever data in a source table changes, the corresponding trigger writes the changed data into a temporary table; the extraction thread then reads from the temporary table and marks or deletes the rows it has extracted. The advantage of the trigger approach is high extraction performance; the disadvantage is that triggers must be created on the business tables, which has some impact on the business system.
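The following sketch illustrates the trigger approach using SQLite syntax. The orders and orders_changes tables and the single insert trigger are illustrative assumptions; in practice, insert, update, and delete triggers would all be created using the source database vendor's own trigger syntax.

```python
# A minimal sketch of the trigger approach: an AFTER INSERT trigger copies new
# rows into a change table, and the extraction step drains that table.
import sqlite3

def setup(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS orders(id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE IF NOT EXISTS orders_changes(order_id INTEGER, op TEXT);
        CREATE TRIGGER IF NOT EXISTS trg_orders_ins AFTER INSERT ON orders
        BEGIN
            INSERT INTO orders_changes(order_id, op) VALUES (NEW.id, 'I');
        END;
    """)

def drain_changes(conn: sqlite3.Connection):
    """Hand captured changes to the ETL pipeline, then clear the change table."""
    rows = conn.execute("SELECT order_id, op FROM orders_changes").fetchall()
    conn.execute("DELETE FROM orders_changes")   # or mark the rows as extracted
    conn.commit()
    return rows

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    setup(db)
    db.execute("INSERT INTO orders(amount) VALUES (42.0)")
    print(drain_changes(db))                     # -> [(1, 'I')]
```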
B. Timestamps: This is a change-capture method based on snapshot comparison. A timestamp field is added to the source table, and its value is updated whenever the table data is modified. At extraction time, the data to extract is determined by comparing the system time with the timestamp values. Some databases support automatically updated timestamps, that is, the timestamp field is refreshed whenever any other field in the row changes; in databases without this support, the business system has to update the timestamp field itself when it updates business data. Like the trigger approach, the timestamp approach performs well and the extraction logic is clear and simple, but it is also fairly intrusive to the business system (an extra timestamp field must be added), and for databases that do not support automatic timestamp updates, the business system must perform the additional timestamp updates itself. In addition, delete and update operations on data from before the timestamp was introduced cannot be captured, so data accuracy is subject to certain limitations.
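A minimal sketch of timestamp-based incremental extraction is shown below. The customers table, the last_updated column, and the way the watermark is persisted are assumptions.

```python
# Sketch of timestamp-based incremental extraction: pull rows whose last_updated
# value is newer than the previous extraction time, then advance the watermark.
import sqlite3
from datetime import datetime, timezone

def incremental_extract(conn: sqlite3.Connection, last_extract_time: str):
    """Return rows changed since last_extract_time plus the new watermark."""
    now = datetime.now(timezone.utc).isoformat()
    rows = conn.execute(
        "SELECT id, name, last_updated FROM customers WHERE last_updated > ?",
        (last_extract_time,),
    ).fetchall()
    return rows, now   # persist `now` so the next run starts from this point
```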
C. Full-table comparison: A typical full-table comparison uses MD5 checksums. The ETL tool creates an MD5 temporary table in advance for each table to be extracted; it stores the primary key of the source table together with an MD5 checksum computed over all of the row's fields. At each extraction, the checksums of the source table and the MD5 temporary table are compared to decide whether each row in the source table is new, modified, or deleted, and the MD5 checksums are then updated. The advantage of the MD5 approach is that it is only mildly intrusive to the source system (only an MD5 temporary table has to be created). The disadvantages are obvious: unlike the active notification of the trigger and timestamp approaches, the MD5 approach passively compares the whole table's data, so performance is poor. It is also less accurate when the table has no primary key or unique column and contains duplicate records.
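The sketch below illustrates the MD5 comparison idea: compute a checksum per source row, compare it with a snapshot table keyed by primary key, and classify rows as inserted, updated, or deleted. Table and column names (customers, customers_md5) are assumptions.

```python
# Sketch of full-table comparison against an MD5 snapshot table.
import hashlib
import sqlite3

def row_md5(values) -> str:
    """Checksum over all non-key fields of a row."""
    return hashlib.md5("|".join(map(str, values)).encode("utf-8")).hexdigest()

def diff_against_snapshot(conn: sqlite3.Connection):
    source = {pk: row_md5(rest) for pk, *rest in conn.execute(
        "SELECT id, name, email FROM customers")}
    snapshot = dict(conn.execute("SELECT id, md5 FROM customers_md5"))

    inserted = [pk for pk in source if pk not in snapshot]
    deleted  = [pk for pk in snapshot if pk not in source]
    updated  = [pk for pk in source
                if pk in snapshot and source[pk] != snapshot[pk]]
    # After loading, the customers_md5 table would be refreshed with `source`.
    return inserted, updated, deleted
```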
D. Log comparison: Changes are identified by analyzing the database's own logs. Oracle's Change Data Capture (CDC) technology is representative of this approach. The CDC feature was introduced in the Oracle 9i database. CDC helps identify the data that has changed since the last extraction: changes can be captured while INSERT, UPDATE, or DELETE operations are performed on the source tables, and the changed data is stored in the database's change tables. The captured changes can then be made available to target systems in a controlled way through database views. The CDC architecture is based on a publisher/subscriber model. The publisher captures the change data and provides it to subscribers, and subscribers consume the change data obtained from the publisher; typically a CDC system has one publisher and multiple subscribers. The publisher first identifies the source tables for which change data must be captured, then captures the changes and stores them in specially created change tables; it also controls subscribers' access to the change data. Subscribers need to know which change data they are interested in, and may not care about everything the publisher publishes; they create subscriber views to access the change data they are authorized to read. CDC has a synchronous mode and an asynchronous mode: in synchronous mode the change data is captured in real time and stored in the change tables, with the publisher and subscribers in the same database, while asynchronous mode is based on Oracle's Streams replication technology.
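To make the publisher/subscriber flow concrete, here is a schematic sketch of a subscriber consuming rows from a change table and advancing its window. It deliberately does not use Oracle's actual CDC packages or views; the change_tab table, the scn column, and the operation codes are assumptions for illustration only.

```python
# Conceptual sketch only: the "publisher" side is assumed to have filled
# change_tab with one row per captured change, tagged with an operation code.
import sqlite3

def consume_change_window(conn: sqlite3.Connection, last_scn: int):
    """Read change records newer than last_scn and return the new high-water mark."""
    changes = conn.execute(
        "SELECT scn, op, order_id, amount FROM change_tab "
        "WHERE scn > ? ORDER BY scn",
        (last_scn,),
    ).fetchall()
    new_scn = changes[-1][0] if changes else last_scn
    # Apply `changes` to the target system, then persist new_scn for the next run.
    return changes, new_scn
```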
Besides relational databases, the data sources handled by ETL may also be files, such as txt files, Excel files, or XML files. File data is generally extracted in full each time; however, the file's timestamp or an MD5 checksum of the file can be saved before an extraction, and at the next extraction, if they are unchanged, the extraction can be skipped.
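Below is a small sketch of that file-level check: the file's modification time and MD5 checksum from the previous run are kept in a state file, and extraction is skipped when neither has changed. The state-file layout is an assumption.

```python
# Skip re-extracting a file whose timestamp and MD5 checksum are unchanged.
import hashlib
import json
import os

def file_needs_extraction(path: str, state_file: str = "extract_state.json") -> bool:
    state = {}
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)

    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    mtime = os.path.getmtime(path)

    previous = state.get(path, {})
    if previous.get("md5") == md5 and previous.get("mtime") == mtime:
        return False                        # unchanged since the last extraction

    state[path] = {"md5": md5, "mtime": mtime}
    with open(state_file, "w") as f:
        json.dump(state, f)
    return True
```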
Data transformation and processing
The data extracted from the source does not necessarily meet the requirements of the destination database: formats may be inconsistent, there may be data entry errors, data may be incomplete, and so on. The extracted data therefore needs to be transformed and processed. This can be done inside the ETL engine, or it can be performed during extraction by exploiting the capabilities of the relational database itself.
(1) Data transformation and processing in the ETL engine
The ETL engine typically implements data transformation in a componentized way. Common data transformation components include field mapping, data filtering, data cleansing, data substitution, data calculation, data validation, data encryption and decryption, data merging, data splitting, and so on. These components are like the stations on an assembly line: they are pluggable, can be assembled in any order, and share data with each other through a data bus. Some ETL tools also provide scripting support, allowing users to customize data transformation and processing programmatically.
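The sketch below shows one way such pluggable components can be modeled: each component is a function that takes a record and returns a record (or None to discard it), and the engine simply chains whatever components are configured. The field names and the two sample components are assumptions, not any particular tool's API.

```python
# Minimal "pluggable component" pipeline: components are chained per record.
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]
Component = Callable[[Record], Optional[Record]]

def field_mapping(rec: Record) -> Record:
    """Rename source fields to the names expected by the target."""
    return {"customer_id": rec["id"], "full_name": rec["name"]}

def data_filter(rec: Record) -> Optional[Record]:
    """Drop records with an empty name."""
    return rec if rec.get("full_name") else None

def run_pipeline(records: List[Record], components: List[Component]) -> List[Record]:
    out = []
    for rec in records:
        for component in components:
            rec = component(rec)
            if rec is None:            # a component may discard the record
                break
        else:
            out.append(rec)
    return out

if __name__ == "__main__":
    rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": ""}]
    print(run_pipeline(rows, [field_mapping, data_filter]))
    # -> [{'customer_id': 1, 'full_name': 'Ada'}]
```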
(2) Data processing in the database
The relational database itself already provides powerful SQL and built-in functions for processing data: for example, adding a WHERE condition to the extraction query for filtering, renaming fields in the query so that they map onto the destination table, using the SUBSTR function, using CASE expressions for conditional logic, and so on.
Compared with transforming data in the ETL engine, doing the transformation directly in the SQL statement is simpler, clearer, and performs better. Processing that cannot be expressed in SQL can then be handled by the ETL engine.
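For example, the sketch below pushes filtering (WHERE), field renaming (AS), string processing (SUBSTR), and conditional recoding (CASE) into the extraction query itself; the table and column names are assumptions.

```python
# Transformation performed in the extraction SQL rather than in the ETL engine.
import sqlite3

TRANSFORM_SQL = """
SELECT id                       AS customer_id,  -- rename to match the target table
       SUBSTR(phone, 1, 3)      AS area_code,    -- string processing in the database
       CASE WHEN gender = 'M' THEN 1
            WHEN gender = 'F' THEN 2
            ELSE 0 END          AS gender_code   -- standardize codes
FROM customers
WHERE created_at >= :since                       -- filter at the source
"""

def extract_transformed(conn: sqlite3.Connection, since: str):
    return conn.execute(TRANSFORM_SQL, {"since": since}).fetchall()
```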
Data loading
Loading the transformed and processed data into the destination database is usually the last step of the ETL process. The best loading method depends on the type of operation being performed and on how much data needs to be loaded. When the destination is a relational database, there are generally two loading methods:
(1) Direct SQL statements, that is, INSERT, UPDATE, and DELETE operations.
(2) Bulk loading methods, such as BCP, BULK INSERT, and the bulk loading tools or APIs specific to each relational database. In most cases the first method is used, because the operations are logged and recoverable. Bulk loading, however, is easy to use and much more efficient when loading large volumes of data. Which loading method to use depends on the needs of the business system.
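The sketch below contrasts the two approaches: row-by-row SQL statements versus a batched load, with executemany standing in for a database-specific bulk loader such as BCP or SQL*Loader. The target table and columns are assumptions.

```python
# Row-by-row loading versus batched loading of the same rows.
import sqlite3

def load_row_by_row(conn: sqlite3.Connection, rows):
    for customer_id, name in rows:
        conn.execute(
            "INSERT INTO dim_customer(customer_id, name) VALUES (?, ?)",
            (customer_id, name),
        )
    conn.commit()          # each statement is logged and individually recoverable

def load_bulk(conn: sqlite3.Connection, rows):
    conn.executemany(
        "INSERT INTO dim_customer(customer_id, name) VALUES (?, ?)", rows
    )
    conn.commit()          # far fewer round trips; better for large volumes
```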
Common data quality issues
Data quality problem | Description | Example
Format | Does the value conform to a consistent format standard? | A phone number may appear as xxxxxxxxxx, (xxx) xxx-xxxx, 1.xxx.xxx.xxxx, and so on.
Standards | Are data elements defined and understood consistently? | One system uses gender codes M, F, U; another system uses 0, 1, 2.
Consistency | Do values carry a single, unified meaning? | Is turnover always expressed in USD, or could it also appear in other currencies?
Completeness | Is all required data present? | 20% of customer last names are empty; 50% of zip codes are 99999.
Accuracy | Does the data accurately reflect reality or a verifiable source? | A supplier is shown as 'active', although it has had no business dealings for 6 years.
Validity | Are data values within the accepted range? | Salary values should be between 60,000 and 120,000.
Duplication | Does the same data appear multiple times? | Both John Ryan and Jack Ryan appear in the system: are they the same person?