ETL Introduction

ETL, short for Extract-Transform-Load, refers to data extraction, transformation, and loading. ETL tools include OWB (Oracle Warehouse Builder), ODI (Oracle Data Integrator), Informatica PowerCenter, AICloudETL, DataStage, Repository Explorer, BeeLoad, Kettle, and DataSpider.

ETL extracts data from distributed, heterogeneous data sources, such as relational databases and flat data files, into a temporary staging layer, where it is cleaned, transformed, and integrated; finally the data is loaded into a data warehouse or data mart, where it becomes the basis for online analytical processing (OLAP) and data mining.
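
To make the extract/transform/load stages concrete, here is a minimal sketch in Python, assuming a hypothetical flat-file source orders.csv with order_id, amount, and customer columns, and a SQLite database standing in for the warehouse; all names and cleaning rules are illustrative, not a prescribed implementation.

```python
# Minimal ETL sketch: extract rows from a flat CSV file, clean them in a
# staging list (the "temporary middle layer"), and load them into SQLite.
# File name, column names, and cleaning rules are illustrative assumptions.
import csv
import sqlite3

def extract(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def transform(rows):
    staged = []
    for row in rows:
        # Cleaning and conversion happen here, before loading.
        staged.append({
            "order_id": int(row["order_id"]),
            "amount": round(float(row["amount"]), 2),
            "customer": row["customer"].strip().upper(),
        })
    return staged

def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                "(order_id INTEGER PRIMARY KEY, amount REAL, customer TEXT)")
    con.executemany("INSERT OR REPLACE INTO fact_orders VALUES "
                    "(:order_id, :amount, :customer)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```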

ETL is most often used in data warehousing, but its targets are not limited to data warehouses.

ETL is an important and necessary part of building a data warehouse. Unlike relational database theory, data warehouse technology has no strict mathematical foundation; it is oriented toward practical engineering. From an engineering perspective, data is loaded and processed in a series of steps according to the requirements of the physical data model, and the process depends heavily on experience. At the same time, this work directly determines the data quality in the warehouse, and thus the quality of OLAP and data mining results.

A data warehouse is an independent data environment: data must be imported from the online transaction processing (OLTP) environment, external data sources, and offline storage media into the warehouse through the extraction process. Technically, ETL mainly involves association, transformation, incremental processing, scheduling, and monitoring. Data in the warehouse does not need to be synchronized with the OLTP system in real time, so ETL can run on a regular schedule; however, the timing, order, and success or failure of multiple ETL jobs are critical to the timeliness and validity of the information in the warehouse.

ETL (Extract-Transform-Load) is the core and soul of BI/DW (business intelligence / data warehousing): it integrates and adds value to data according to unified rules, moving data from the sources into the target data warehouse, and it is a key step in implementing a warehouse. If the warehouse's model design is the blueprint of a building and the data is the bricks, then ETL is the construction process. The hardest part of a project is requirements analysis and model design, but ETL rule design and implementation carry the largest workload, typically about 60% to 80% of the project. This is a general consensus drawn from many projects at home and abroad.

ETL is the process of data extraction (extract), cleaning (cleaning), transformation (transform), and loading (load). It is an important part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the warehouse according to the predefined warehouse model.

Information is an important resource of the modern enterprise and the basis for scientific management and decision analysis. Most enterprises today spend considerable money and time building online transaction processing (OLTP) business systems and office automation systems that record transaction data. By some estimates, this data volume doubles every two to three years. The data holds enormous commercial value, yet enterprises typically pay attention to only about 2% to 4% of it. They therefore fail to make full use of existing data resources, wasting time and money and missing the best opportunities for key business decisions. How an enterprise uses technical means to turn data into information and knowledge has thus become a major bottleneck for improving its core competitiveness, and ETL is one of the principal technical means. How should an ETL tool be selected, and how should ETL be applied correctly?

Currently, typical ETL tools include Informatica, DataStage, OWB, Microsoft DTS, BeeLoad, Kettle, and others.

Open-source tools include CloverETL, which ships as an Eclipse plug-in.

Data integration: Quick ETL implementation

ETL quality problems show up along these dimensions: correctness, integrity, consistency, completeness, validity, timeliness, and accessibility. The causes are many; for system integration and historical data, the main ones include: inconsistent data models across different periods of the business systems; business processes that changed over time; inconsistent information across legacy modules in operations, personnel, finance, and office systems; and inconsistency between legacy systems and new business and management systems because data was never fully integrated.

Implementing ETL starts with implementing the ETL transformation process, which can involve the following aspects:

Null value handling: capture a field's null values and either load them as-is or replace them with a value that carries business meaning, and route records to different target tables based on whether the field is null.
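
As an illustration of such null handling, here is a small sketch, assuming hypothetical discount and customer_id fields: a null discount is replaced with a default that has business meaning, while rows missing the key field are routed to a separate target.

```python
# Null-value handling sketch: substitute a meaningful default for a missing
# value, and route rows to different targets based on a null key field.
# Field names and the default are illustrative assumptions.
def handle_nulls(row, main_target, quarantine):
    # Replace a missing discount with an agreed business default.
    if row.get("discount") in (None, ""):
        row["discount"] = 0.0
    # Route rows lacking a customer id to a separate target for review.
    if row.get("customer_id") in (None, ""):
        quarantine.append(row)
    else:
        main_target.append(row)

main_rows, review_rows = [], []
handle_nulls({"customer_id": "", "discount": None}, main_rows, review_rows)
handle_nulls({"customer_id": "C042", "discount": ""}, main_rows, review_rows)
print(len(main_rows), len(review_rows))  # 1 1
```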

Data format normalization: define field format constraints, and customize the loading format for time, numeric, character, and other data in the source.
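
A sketch of what such format normalization might look like; the accepted source layouts are assumptions for illustration.

```python
# Format normalization sketch: coerce source values into the formats the
# target model expects. The source date layouts shown are assumptions.
from datetime import datetime

def normalize_date(value):
    # Accept a few source layouts, emit ISO 8601.
    for fmt in ("%Y/%m/%d", "%d-%m-%Y", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"unrecognized date format: {value!r}")

def normalize_amount(value):
    # Strip thousands separators, keep two decimals.
    return round(float(str(value).replace(",", "")), 2)

print(normalize_date("2008/05/01"))  # 2008-05-01
print(normalize_amount("1,234.5"))   # 1234.5
```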

Data splitting: decompose fields according to business rules. For example, the caller ID 861082585313-8148 can be split into an area code and a telephone number.
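
A sketch of the split, assuming the layout country code 86, area code, subscriber number, then an extension after the hyphen; the exact numbering plan is an assumption made for illustration.

```python
# Field-splitting sketch for the caller id from the text, 861082585313-8148:
# country code 86, area code, subscriber number, and an extension after the
# hyphen. The assumed layout is illustrative only.
import re

CALLER_RE = re.compile(r"^(86)(10|\d{3})(\d{7,8})-(\d+)$")

def split_caller_id(caller_id):
    m = CALLER_RE.match(caller_id)
    if not m:
        raise ValueError(f"unexpected caller id: {caller_id!r}")
    country, area, number, extension = m.groups()
    return {"country": country, "area": area,
            "number": number, "extension": extension}

print(split_caller_id("861082585313-8148"))
# {'country': '86', 'area': '10', 'number': '82585313', 'extension': '8148'}
```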

Data verification: use lookup and splitting together to verify correctness. For example, after the caller ID 861082585313-8148 is split into area code and phone number, a lookup can return the caller's region as recorded by the gateway or switch, against which the data is checked.
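
Continuing the example, here is a sketch of the lookup check; the reference table is a stand-in for the region data recorded by the gateway or switch.

```python
# Lookup-based verification sketch: after splitting, look the area code up
# in a reference table and compare with the recorded region. The reference
# data here is an illustrative assumption.
AREA_LOOKUP = {"10": "Beijing", "21": "Shanghai", "20": "Guangzhou"}

def verify_region(area_code, recorded_region):
    expected = AREA_LOOKUP.get(area_code)
    return expected is not None and expected == recorded_region

print(verify_region("10", "Beijing"))   # True
print(verify_region("10", "Shanghai"))  # False: flag for review
```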

Data replacement: replace data that is invalid, or missing for business reasons (a combined sketch of replacement and lookup follows the next item).

Lookup for missing data: use a lookup (as a subquery) to retrieve missing fields that were obtained by other means, ensuring field completeness.
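
A combined sketch of the two rules above, replacement and lookup; the valid-value set and the reference table are illustrative assumptions, with a dictionary standing in for a subquery against a reference table.

```python
# Replacement + lookup sketch: replace invalid values with business
# defaults, and fill fields lost in the source via a lookup. All names
# and reference data are illustrative assumptions.
VALID_CHANNELS = {"web", "store", "phone"}
CUSTOMER_REGION = {"C042": "North", "C007": "East"}  # stand-in reference table

def repair_row(row):
    # Replacement: an invalid channel becomes the agreed default.
    if row.get("channel") not in VALID_CHANNELS:
        row["channel"] = "unknown"
    # Lookup: recover a missing region from the reference table.
    if not row.get("region"):
        row["region"] = CUSTOMER_REGION.get(row.get("customer_id"), "unassigned")
    return row

print(repair_row({"customer_id": "C042", "channel": "fax", "region": ""}))
# {'customer_id': 'C042', 'channel': 'unknown', 'region': 'North'}
```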

Constraint handling: invalid data that violates the primary and foreign key constraints established for the ETL process can be replaced or exported to an error data file, ensuring that only records with unique primary keys are loaded.
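
A sketch of such constraint handling: duplicate primary keys and unknown foreign keys are diverted to an error file rather than loaded. The file name and key fields are assumptions.

```python
# Constraint-enforcement sketch: rows that duplicate a primary key or
# reference a missing foreign key go to an error file, so only unique,
# consistent records are loaded. Names are illustrative assumptions.
import csv

def enforce_keys(rows, valid_customer_ids, error_path="rejects.csv"):
    seen, accepted = set(), []
    with open(error_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "customer_id", "reason"])
        writer.writeheader()
        for row in rows:
            if row["order_id"] in seen:
                writer.writerow({**row, "reason": "duplicate primary key"})
            elif row["customer_id"] not in valid_customer_ids:
                writer.writerow({**row, "reason": "unknown foreign key"})
            else:
                seen.add(row["order_id"])
                accepted.append(row)
    return accepted

rows = [{"order_id": 1, "customer_id": "C042"},
        {"order_id": 1, "customer_id": "C042"},  # duplicate PK -> rejected
        {"order_id": 2, "customer_id": "C999"}]  # unknown FK -> rejected
print(enforce_keys(rows, {"C042", "C007"}))      # only the first row loads
```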

To implement ETL well, we recommend paying attention to the following points during implementation:

First, where conditions permit, pre-process operational data in a data staging area to keep integration and loading efficient;

Second, controllability is greatly enhanced when the ETL process pulls data from the sources rather than having the sources push it;

Third, establish process-based configuration management and standard protocols before ETL begins;

Fourth, key data standards are crucial. The greatest challenge ETL faces today is the heterogeneity and low quality of incoming source data. Taking China Telecom as an example: system A manages data by statistical code, system B by account number, and system C by voice ID. When ETL must integrate these three systems to obtain a unified view of the customer, the process requires complex matching rules and name/address normalization and standardization. ETL therefore defines key data standards during processing and then develops data interface standards from them.
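
A sketch of what applying such a key data standard could look like: a cross-reference table maps each system's local key to one standard customer key. The system names and keys are invented for illustration.

```python
# Key-data-standard sketch: three source systems identify the same customer
# by different keys (statistical code, account number, voice id); a
# cross-reference maps each to one standard key. Data is illustrative.
XREF = {
    ("system_a", "STAT-7731"): "CUST-0001",
    ("system_b", "ACCT-90210"): "CUST-0001",
    ("system_c", "VOICE-5512"): "CUST-0001",
}

def to_standard_key(system, local_key):
    try:
        return XREF[(system, local_key)]
    except KeyError:
        raise LookupError(f"no standard key for {system}:{local_key}") from None

print(to_standard_key("system_b", "ACCT-90210"))  # CUST-0001
```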

The ETL process is shaped largely by how well the enterprise understands its source data, which makes data integration critical from the business perspective. An excellent ETL design should have the following characteristics:

Simple, centralized management through metadata; strict specifications for interfaces, data formats, and transmission; as little software installed on the external data source systems as possible; a fully automated extraction process with automatic scheduling; extracted data that is timely, accurate, and complete; interfaces to a wide variety of data systems, with strong adaptability; and a software framework such that when system functions change, the application can adapt with few modifications, giving strong scalability.

Data model: standardized data definitions

A reasonable business model design is crucial to ETL. The data warehouse is the enterprise's single, true, and reliable integrated data platform. Warehouse design and modeling are generally based on third normal form, the star schema, or the snowflake schema. Whatever the design philosophy, key business data should be covered as fully as possible, and the messy, disordered structures of the operational environment unified into a reasonable, related, analyzable new structure. ETL then extracts from the data sources according to the model's definitions, transforms and cleans the data, and finally loads it into the target warehouse.

What matters in the model is standardizing the definition of data: unified coding, unified classification, and unified organization. Standardized definition includes unified standard codes and unified business terminology. ETL integrates data against the model, with patterns such as initial load, incremental load, slowly growing dimensions, slowly changing dimensions, and fact table loading; based on business needs, corresponding loading, refresh, summarization, and maintenance policies are formulated.
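
As one example of these loading policies, here is a sketch of a slowly-changing-dimension load in the type 2 style, using an in-memory SQLite table; the table layout is an assumption, not a prescribed design.

```python
# Slowly-changing-dimension sketch (type 2 style): when a tracked attribute
# changes, expire the current dimension row and insert a new current one.
# The table layout is an illustrative assumption.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id TEXT, city TEXT, is_current INTEGER)""")

def scd2_upsert(customer_id, city):
    cur = con.execute("SELECT customer_key, city FROM dim_customer "
                      "WHERE customer_id = ? AND is_current = 1", (customer_id,))
    row = cur.fetchone()
    if row and row[1] == city:
        return                      # no change: nothing to do
    if row:                         # change: expire the current version
        con.execute("UPDATE dim_customer SET is_current = 0 "
                    "WHERE customer_key = ?", (row[0],))
    con.execute("INSERT INTO dim_customer (customer_id, city, is_current) "
                "VALUES (?, ?, 1)", (customer_id, city))

scd2_upsert("C042", "Beijing")
scd2_upsert("C042", "Shanghai")     # customer moved: two versions now exist
print(con.execute("SELECT customer_id, city, is_current "
                  "FROM dim_customer").fetchall())
```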

Metadata: expanding to new applications

Metadata is the description and definition of business data and of the environment in which it runs; in short, metadata is data that describes data. In a sense, business data mainly supports the applications of the business systems, while metadata is indispensable for newer applications such as enterprise information portals, customer relationship management, data warehouses, decision support, and B2B.

A typical example of metadata is the description of an object: of a database, a table, a column and its attributes (type, format, constraints, and so on), primary/foreign key associations, and the like. As the heterogeneity and distribution of applications become ever more common, unified metadata grows ever more important. "Information islands" was once a common complaint enterprises made about the state of their applications; reasonable metadata effectively captures how pieces of information relate.

For ETL, metadata is concentrated in: defining data source locations and attributes, determining the mapping rules from source data to target data, determining the relevant business logic, and other preparation needed before data is actually loaded. Metadata generally runs through the entire data warehouse project, and every ETL process should reference it as fully as possible, so that ETL can be implemented quickly.
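
A sketch of metadata-driven ETL in this spirit: the source location and field mapping rules are declared as data that the process references at run time. All names are illustrative assumptions.

```python
# Metadata-driven ETL sketch: the source-to-target mapping is declared as
# data (source location, field rules) rather than hard-coded, so every run
# references the metadata. All names are illustrative assumptions.
METADATA = {
    "source": {"type": "csv", "path": "orders.csv"},
    "target": {"table": "fact_orders"},
    "mappings": [
        {"source_field": "order_id", "target_field": "order_id", "cast": int},
        {"source_field": "amount",   "target_field": "amount",   "cast": float},
        {"source_field": "customer", "target_field": "customer", "cast": str},
    ],
}

def apply_mappings(source_row, metadata):
    # Build the target row purely from the declared mapping rules.
    return {m["target_field"]: m["cast"](source_row[m["source_field"]])
            for m in metadata["mappings"]}

print(apply_mappings({"order_id": "7", "amount": "19.90", "customer": "C042"},
                     METADATA))
```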

ETL Architecture

The components described below make up the framework of mainstream ETL products. ETL means extracting data from the source systems, transforming it into a standard format, and loading it into the target data store, usually a data warehouse.

[Figure: ETL architecture diagram]

Design manager: provides a graphical mapping environment in which developers define source-to-target mappings, transformations, and processing flows. The logical definition of each object designed is stored in a metadata repository.

Metadata management: provides a metadata repository holding the definitions and management information for ETL design and runtime processing. The ETL engine at run time, as well as other applications, can reference the metadata in this repository.

Extract: extracts source data through interfaces such as ODBC, native database interfaces, and flat-file extractors, determining what data to extract and how by referencing the metadata.

Transform: developers convert the extracted data into the target data structures according to business needs and build data summaries.

Load: loads the transformed and summarized data into the target data warehouse, through either SQL statements or bulk loading.

Transport services: move data between the source and target systems, and between ETL processing components, using network or file protocols.

Administration and operation: lets administrators schedule, run, and monitor ETL jobs, manage error messages, recover from failures, and adjust output from the source systems, driven by events and schedules.
