Introduction: ETL, the abbreviation of Extraction-Transformation-Loading, is the process of data extraction (Extract), transformation (Transform), and loading (Load), and is an important part of building a data warehouse.
Keywords: ETL Data Warehouse OLTP OLAP
ETL is the process in which the data of the business system is extracted into the data warehouse after being cleaned and transformed. Its purpose is to integrate the enterprise's scattered, messy data with inconsistent standards and provide an analysis basis for the enterprise's decisions. ETL is an important part of a BI project; typically, ETL takes up 1/3 of the project time, and ETL design is directly linked to the success or failure of the BI project.
In data warehouse construction, ETL runs through the whole project; it is the lifeline of the entire data warehouse and includes the processes of data cleaning, integration, transformation, loading, and so on. If the data warehouse is a building, then ETL is its foundation. How well ETL extracts and integrates the data directly affects the final results that are presented. So ETL plays a very important role in the whole data warehouse project and must be given a critical position.
With ETL, we can build the data warehouse from the data in the source systems. ETL constructs the bridge between the OLTP system and the OLAP system and is the channel through which data flows from the source system to the data warehouse. In a data warehouse implementation it determines the data quality of the whole project, so it cannot be treated carelessly; it must be put in an important position so that the foundation of the data warehouse building is laid solidly.
ETL mainly relies on the processing power of a transformation server: after the data is extracted from the source tables, it is cleaned and transformed on the transformation server and then loaded into the target database. Because the transformation is done on the transformation server, the performance bottleneck lies in the transformation server.
ETL design is divided into three parts: data extraction, data cleaning and transformation, and data loading. The design of ETL also starts from these three parts. Data is extracted from the various data sources into the ODS (Operational Data Store); some data cleaning and transformation can also be done in this process, and different extraction methods should be chosen to make ETL run as efficiently as possible. Of the three parts of ETL, the one that takes the longest is "T" (transform, i.e. cleaning and transformation), which generally accounts for 2/3 of the whole ETL workload. Data loading is usually done by writing the data directly into the DW (Data Warehouse) after it has been cleaned.
ETL can be implemented in several ways, and three are commonly used. One is to use ETL tools (such as Oracle's OWB, SQL Server 2000 DTS, SQL Server 2005 SSIS, Informatica, and so on); another is to implement it in SQL; the third is to combine ETL tools with SQL. The first two methods each have their advantages and disadvantages: tools allow an ETL project to be set up quickly, shield the complex coding tasks, increase speed, and reduce difficulty, but they lack flexibility; the SQL approach is flexible and improves ETL running efficiency, but the coding is complex and the technical requirements are higher. The third approach combines the advantages of the previous two and greatly improves the speed and efficiency of ETL development.
First, data extraction
Most of the work in this part is done during the investigation phase: first find out how many business systems the data comes from, which DBMS each business system's database server runs, whether there is manual data and how much, whether there is unstructured data, and so on. Once this information is collected, the data extraction can be designed.
1. Data sources that use the same DBMS as the data warehouse
This type of data source is relatively easy to handle. In general, the DBMS (SQL Server, Oracle) provides a database link function: a direct link is established between the DW database server and the original business system, and the data can then be accessed directly by writing SELECT statements.
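On SQL Server, for example, such a link can take the form of a linked server, after which the source tables can be read with an ordinary SELECT. A minimal sketch, assuming a hypothetical linked server named ERP_SRC and hypothetical table names:

```sql
-- A minimal sketch, assuming a linked server named ERP_SRC (hypothetical) has
-- already been created on the DW server and points at the business system.
-- The staging table and column names are also assumptions.
INSERT INTO ODS.dbo.stg_orders (order_id, customer_id, order_date, amount)
SELECT order_id, customer_id, order_date, amount
FROM ERP_SRC.SalesDB.dbo.orders;   -- linked server.database.schema.table
```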
2. Data sources that use a different DBMS from the data warehouse
For this type of data source, a database link can generally still be established through ODBC, for example between SQL Server and Oracle. If a database link cannot be established, there are two ways to proceed: export the source data to .txt or .xls files with a tool and then import those files into the ODS, or go through a program interface.
3. For file-type data sources (.txt, .xls), business staff can be trained to import the data into a specified database with database tools, and the data is then extracted from that database. Alternatively, a tool can do the import, for example the flat file source and flat file destination of SQL Server 2005 SSIS, loading the data into the ODS.
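As one alternative to the SSIS flat file source, a delimited text file can also be loaded into the ODS with a plain BULK INSERT statement; the file path, delimiters, and staging table below are assumptions.

```sql
-- A minimal sketch: load a tab-delimited text file exported from the source
-- system into an ODS staging table (path and table name are hypothetical).
BULK INSERT ODS.dbo.stg_branch_sales
FROM 'D:\etl\input\branch_sales.txt'
WITH (
    FIELDTERMINATOR = '\t',   -- tab-separated columns
    ROWTERMINATOR   = '\n',   -- one record per line
    FIRSTROW        = 2       -- skip the header row
);
```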
4. Issues with incremental updates
For systems with large data volumes, incremental extraction must be considered. In general, the business system records the time at which each transaction takes place, and this can be used as the incremental flag: before each extraction, determine the maximum time recorded in the ODS, and then fetch from the business system all records whose timestamp is greater than that time. This relies on the business system's timestamp; in practice, business systems often have no timestamp or only partial timestamps.
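A minimal sketch of timestamp-based incremental extraction, assuming the source table carries a last_update_time column and reusing the hypothetical ERP_SRC linked server from above:

```sql
-- 1. Find the newest timestamp already loaded into the ODS; fall back to a
--    very old date for the first load.
DECLARE @last_load DATETIME;
SELECT @last_load = ISNULL(MAX(last_update_time), '1900-01-01')
FROM ODS.dbo.stg_orders;

-- 2. Pull only records newer than that timestamp from the business system.
INSERT INTO ODS.dbo.stg_orders (order_id, customer_id, amount, last_update_time)
SELECT order_id, customer_id, amount, last_update_time
FROM ERP_SRC.SalesDB.dbo.orders
WHERE last_update_time > @last_load;
```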
Second, data cleaning and transformation
In general, the data warehouse is divided into two parts: ODS and DW. The common practice is to clean the data on the way from the business system to the ODS, filtering out dirty and incomplete data, and to transform it on the way from the ODS to the DW, performing business rule calculation and aggregation.
1. Data cleaning
The task of data cleaning is to filter out the data that does not meet the requirements, hand the filtered results to the responsible business department, and have the business department confirm whether the data should be discarded or corrected before it is extracted. Non-conforming data mainly consists of incomplete data, erroneous data, and duplicate data.
(1) Incomplete data: this is mainly data with missing information that should be present, for example a missing supplier name, branch name, or customer region, or a main table in the business system that cannot be matched with its detail table. This kind of data is filtered out and, sorted by the missing content, written into different Excel files and submitted to the customer, who is required to complete it within a specified time. It is only written to the data warehouse after it has been completed.
(2) Erroneous data: this type of error arises because the business system is not robust enough and writes input directly to the backend database without validation, for example numeric data entered as full-width characters, string data followed by a carriage return, incorrect date formats, or dates out of bounds. This kind of data should also be classified. Problems such as full-width characters or invisible characters before and after the data can only be found by writing SQL statements, after which the customer is asked to correct them in the business system before extraction. Incorrect date formats or out-of-bounds dates will cause the ETL run to fail; these errors need to be picked out of the business system database with SQL, handed to the business department with a deadline for correction, and extracted again after they are corrected (a SQL sketch follows at the end of this subsection).
(3) Duplicate data: for this kind of data, especially in dimension tables, export all the fields of the duplicated records so the customer can confirm and consolidate them.
Data cleaning is an iterative process that cannot be completed in a few days; problems can only be identified and solved continuously. Whether data should be filtered out or corrected generally requires confirmation from the customer. Filtered data should be written to Excel files or to a filtered-data table; in the early stage of ETL development, a daily mail of the filtered data can be sent to the business department, prompting them to correct the errors as soon as possible, and it can also serve as a basis for verifying the data later. Take care during data cleaning not to filter out useful data: verify every filtering rule carefully and have it confirmed by the user.
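The sketch below illustrates the kind of SQL checks described under (2) and (3) above: picking out invalid or out-of-bounds dates and duplicated dimension records. All table and column names are hypothetical, and the date column is assumed to arrive as a character field.

```sql
-- Erroneous data: dates that do not parse, or that lie in the future, are
-- exported for the business department to correct before extraction.
SELECT order_id, order_date
FROM ERP_SRC.SalesDB.dbo.orders
WHERE ISDATE(order_date) = 0
   OR order_date > CONVERT(CHAR(10), GETDATE(), 120);

-- Duplicate data: export every field of the duplicated supplier records so
-- the customer can confirm and consolidate them.
SELECT s.*
FROM ERP_SRC.SalesDB.dbo.suppliers AS s
JOIN (SELECT supplier_code
      FROM ERP_SRC.SalesDB.dbo.suppliers
      GROUP BY supplier_code
      HAVING COUNT(*) > 1) AS d
  ON s.supplier_code = d.supplier_code;
```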
2. Data transformation
The task of data transformation mainly consists of converting inconsistent data, converting data granularity, and calculating some business rules.
(1) Inconsistent data conversion: this is an integration process that unifies the same kind of data from different business systems; for example, the same vendor is coded XX0001 in the billing system and YY0001 in the CRM system, so after extraction the two are unified into a single code (see the sketch after this list).
(2) Data granularity conversion: the business system generally stores very detailed data, while data warehouse data is used for analysis and does not need to be that detailed. In general, the business system data is aggregated according to the granularity of the data warehouse.
(3) Calculation of business rules: different enterprises have different business rules and different data indicators, and these indicators cannot always be obtained by simple addition and subtraction. In such cases the indicators need to be calculated during ETL and stored in the data warehouse for analysis.
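A minimal sketch of the first two conversions, assuming a hypothetical supplier code mapping table maintained during ETL and a daily-grain fact table in the warehouse:

```sql
-- (1) Inconsistent codes: map each system's own supplier code (for example
--     'XX0001' from billing and 'YY0001' from CRM) to one unified code.
-- (2) Granularity conversion: aggregate the detailed ODS records up to the
--     daily grain used by the data warehouse fact table.
INSERT INTO DW.dbo.fact_daily_sales (sale_date, supplier_code, total_amount, order_count)
SELECT CONVERT(CHAR(10), o.order_date, 120) AS sale_date,
       m.unified_code                       AS supplier_code,
       SUM(o.amount)                        AS total_amount,
       COUNT(*)                             AS order_count
FROM ODS.dbo.stg_orders        AS o
JOIN ODS.dbo.supplier_code_map AS m
  ON o.supplier_code = m.source_code
 AND o.source_system = m.source_system
GROUP BY CONVERT(CHAR(10), o.order_date, 120), m.unified_code;
```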
Third, ETL logging and warning notification
1. ETL logs
ETL logs fall into three categories. The first is the execution-process log, which records every step during the ETL run: the start time of each step in each run, the number of data rows affected, and so on, in journal form. The second is the error log, written when a module fails; it records the time of each error, the module that failed, and the error message. The third is the overall log, which records only the ETL start time, end time, and whether the run succeeded. If an ETL tool is used, the tool automatically generates some logs, and these can also be used as part of the ETL log. The purpose of logging is to be able to know at any time how ETL is running and, if an error occurs, to know where it occurred.
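One possible layout for these three logs, sketched as plain tables (the names and columns are assumptions rather than a fixed standard):

```sql
-- Execution-process log: one row per step per run.
CREATE TABLE etl_step_log (
    run_id        INT          NOT NULL,
    step_name     VARCHAR(100) NOT NULL,
    start_time    DATETIME     NOT NULL,
    end_time      DATETIME     NULL,
    rows_affected INT          NULL
);

-- Error log: one row per failed module.
CREATE TABLE etl_error_log (
    run_id        INT           NOT NULL,
    module_name   VARCHAR(100)  NOT NULL,
    error_time    DATETIME      NOT NULL,
    error_message VARCHAR(4000) NULL
);

-- Overall log: only the start time, end time, and outcome of each whole run.
CREATE TABLE etl_run_log (
    run_id     INT      NOT NULL,
    start_time DATETIME NOT NULL,
    end_time   DATETIME NULL,
    succeeded  BIT      NULL
);
```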
2. Warning notification
If ETL fails, it should not only write an ETL error log but also send a warning to the system administrator. There are many ways to send a warning; a common one is to send an email to the system administrator with the error message attached, so that the administrator can troubleshoot the error.
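On SQL Server, for example, Database Mail can deliver such a warning; the sketch below assumes a configured Database Mail profile named 'ETL' and a hypothetical recipient address and error text.

```sql
-- Send the error message to the administrator when a module fails.
EXEC msdb.dbo.sp_send_dbmail
     @profile_name = 'ETL',
     @recipients   = 'dba@example.com',
     @subject      = 'ETL run failed',
     @body         = 'Module load_fact_daily_sales failed; see etl_error_log for details.';
```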
ETL is a key part of a BI project and is also a long-term process. Only by continuously identifying and solving problems can ETL run more efficiently and provide accurate data for the development of the BI project.
Fourth, supplementary understanding of ETL:
1. Data collection:
a) Collection locations: different servers, different locations;
b) Forms of the collected data: different storage files, different storage formats;
2. Analysis and transformation into a unified data form:
a) Analyze and parse the different files and different data formats;
b) Convert data in different formats into a uniform format;
c) Merge the collected data;
3. Import the data in the unified format into the data warehouse.
If the data warehouse model design is the blueprint of a building and the data is the bricks, then ETL is the process of constructing the building. The most difficult part of the whole project is user requirement analysis and model design, while ETL rule design and implementation carries the largest workload, accounting for 60%~80% of the whole project; this is the general consensus from many practices at home and abroad.
At present, the typical representatives of ETL tools are Microsoft SSIS (which replaces the earlier DTS), Informatica, DataStage, and Oracle OWB and ODI; in addition, Sun also has a complete set of ETL tools. Among open source tools there are Eclipse ETL plug-ins.
The quality of ETL is characterized by correctness, integrity, consistency, completeness, validity, timeliness, and accessibility.
To implement ETL well, we suggest that users pay attention to the following points in the ETL process:
First, ensure the efficiency of integration and loading; analysis can be distributed first and the results aggregated afterwards;
Second, enhance controllability by actively collecting data rather than passively receiving pushed data;
Third, establish process-based configuration management and standard protocols;
Fourth, formulate standards for both key data and non-key data.
Four types of data ETL modes:
1) Full refresh: the data warehouse table contains only the latest data; each load deletes the original data and then fully loads the latest source data. In this mode, the data extractor extracts all records from the source, empties the target table before loading, and then loads all the records. To speed up the deletion, the table is usually emptied with TRUNCATE. This mode is used for the current-information tables in this system (see the sketch after this list).
2) Mirror increment: the records in the source data are updated periodically, but each record includes a record-time field and the source keeps the data's history. ETL can extract the incremental data from the source into the data warehouse by record time, and the data's history is also kept in the data warehouse.
3) Event increment: each record is a new event with no necessary connection to the others, and new records are not changes to existing record values. Each record includes a time field, so new data can be extracted by that time field and loaded into the warehouse.
4) Mirror comparison: the data warehouse keeps an effective-date field to hold the data's history, while the source data keeps no history and may be updated every day. Therefore, the newly extracted mirror can only be compared with the mirror loaded last time to find the changed part: update the effective end date of the records whose history has changed, and insert the changed data as new records. Most such source data are dimension tables that need to keep historical information.
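The sketch below illustrates modes 1) and 4) under assumed table and column names: a TRUNCATE-and-reload for full refresh, and an effective-date update plus insert for mirror comparison.

```sql
-- 1) Full refresh: empty the target with TRUNCATE, then reload everything.
TRUNCATE TABLE DW.dbo.current_info;
INSERT INTO DW.dbo.current_info (id, info, load_time)
SELECT id, info, GETDATE()
FROM ODS.dbo.stg_current_info;

-- 4) Mirror comparison: close out the current records whose attributes have
--    changed by filling in their effective end date ...
UPDATE d
SET    d.effective_end = GETDATE()
FROM   DW.dbo.dim_customer AS d
JOIN   ODS.dbo.stg_customer AS s
  ON   d.customer_id = s.customer_id
WHERE  d.effective_end IS NULL
  AND (d.customer_name <> s.customer_name OR d.region <> s.region);

-- ... then insert new customers and the changed records as new current versions.
INSERT INTO DW.dbo.dim_customer (customer_id, customer_name, region, effective_start, effective_end)
SELECT s.customer_id, s.customer_name, s.region, GETDATE(), NULL
FROM   ODS.dbo.stg_customer AS s
LEFT JOIN DW.dbo.dim_customer AS d
  ON   d.customer_id = s.customer_id
 AND   d.effective_end IS NULL
WHERE  d.customer_id IS NULL;
```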