ETL (data extraction)

Source: Internet
Author: User

ETL: abbreviation of extract-transform-load, that is, data extraction, transformation, and loading. ETL extracts data from distributed, heterogeneous data sources, such as relational databases and flat files, into a temporary staging layer, where it is cleaned, converted, and integrated; it then loads the data into a data warehouse or data mart, where it becomes the basis for online analytical processing (OLAP) and data mining.

ETL is a very important, indeed necessary, part of building a data warehouse. Compared with relational databases, data warehouse technology has no strict mathematical foundation and is oriented toward practical engineering. In practice, data is loaded and processed in stages according to the requirements of the physical data model, and the processing depends heavily on experience. This work also directly determines the quality of the data in the warehouse, and therefore the quality of the OLAP and data-mining results built on top of it.

A data warehouse is an independent data environment: data must be imported into it from online transaction processing (OLTP) systems, external data sources, and offline storage media through the extraction process. Technically, ETL mainly involves association, conversion, incremental processing, scheduling, and monitoring. Data in the warehouse need not be synchronized in real time with the OLTP systems, so ETL is usually run on a regular schedule. When multiple ETL jobs run, however, their timing, order, and success or failure are crucial to the validity of the information in the warehouse.

ETL, the abbreviation of extract-transform-load, names the process of data extraction, transformation, and loading. It is the core and soul of building a data warehouse, and of business intelligence (BI) generally: it integrates data and improves its value according to unified rules, and it is the key step responsible for moving data from the sources into the target warehouse. If the warehouse's model design is the blueprint of a building and the data is the brick, then ETL is the construction itself. The most difficult parts of such a project are user requirement analysis and model design, while the design and implementation of the ETL rules account for the largest share of the work, roughly 60% to 80% of the total project; this is a broad consensus drawn from practice both in China and abroad.

ETL is the process of data extraction (extract), transformation (transform), cleansing, and loading (load), and is an important part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the warehouse according to a pre-defined data warehouse model.
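The extract-transform-load flow described above can be sketched in a few lines of Python. This is a minimal illustration only; the field names (`customer_id`, `name`) and the in-memory "warehouse" are hypothetical stand-ins for a real source system and target store.

```python
# Minimal ETL sketch: extract rows from a source, transform/clean them,
# and load the result into a target (plain Python lists stand in for a
# real source system and data warehouse).

def extract(source_rows):
    """Extract: pull raw records from the source system."""
    return list(source_rows)

def transform(rows):
    """Transform/cleanse: normalize names, reject records missing a key."""
    cleaned = []
    for row in rows:
        if row.get("customer_id") is None:  # cleansing: drop bad rows
            continue
        cleaned.append({
            "customer_id": row["customer_id"],
            "name": row.get("name", "").strip().title(),
        })
    return cleaned

def load(rows, warehouse):
    """Load: append the cleaned rows to the target warehouse table."""
    warehouse.extend(rows)
    return warehouse

source = [
    {"customer_id": 1, "name": "  alice SMITH "},
    {"customer_id": None, "name": "broken record"},
    {"customer_id": 2, "name": "bob jones"},
]
warehouse = []
load(transform(extract(source)), warehouse)
# warehouse now holds two cleaned records; the row without a key was rejected
```

A real pipeline would replace the lists with database cursors and bulk-load calls, but the three-stage shape is the same.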

Information is an important resource of the modern enterprise and the foundation for scientific management and decision analysis. At present, most enterprises spend large amounts of money and time building OLTP business systems and office automation systems to record all kinds of transaction data. By some estimates this data volume doubles every two to three years, and it contains great commercial value, yet enterprises generally make use of only about 2% to 4% of their total data. Failing to exploit existing data resources wastes time and money and costs the enterprise its best opportunities for key business decisions. How to convert data into information and knowledge by technical means has therefore become a major bottleneck in improving core competitiveness, and ETL is one of the main such means. How, then, should an ETL tool be selected, and how should ETL be carried out correctly?

Typical ETL tools currently include Informatica, DataStage, Oracle Warehouse Builder (OWB), and Microsoft DTS, among others.

Data integration: Quick ETL implementation

Data quality problems manifest as issues of correctness, integrity, consistency, completeness, validity, timeliness, and availability. They have many causes, mostly arising from system integration and historical data. The main ones include: inconsistent data models between versions of the business systems in different periods; changes in business processes over time; inconsistencies among legacy modules in operations, personnel, finance, office, and other related systems; and inconsistencies left by incomplete integration of new systems with existing services and management systems.

Implementing ETL means first implementing the ETL transformation process, which can involve the following aspects:

Null-value handling: capture a field's null values, load them as-is or replace them with other meaningful data, and optionally route records to different target tables depending on whether the field is null.
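A small sketch of this null-handling rule, with hypothetical field names (`region`) and a hypothetical default value: complete records go to one target, records with a null field get a substitute value and go to another.

```python
# Null-value handling sketch: replace a null with a default, and route
# records to different targets based on whether the field was null.

DEFAULT_REGION = "UNKNOWN"  # hypothetical replacement value

def handle_nulls(record, with_region, without_region):
    """Route by nullness: records with a null region get a default value
    and go to a separate target; complete records pass through."""
    if record.get("region") is None:
        without_region.append({**record, "region": DEFAULT_REGION})
    else:
        with_region.append(record)

ok, repaired = [], []
for rec in [{"id": 1, "region": "north"}, {"id": 2, "region": None}]:
    handle_nulls(rec, ok, repaired)
# ok holds the complete record; repaired holds the defaulted one
```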

Data-format standardization: define field-format constraints, and customize the loading format for time, numeric, character, and other data in the source.

Data splitting: decompose fields according to business requirements. For example, the caller ID 861084613409 can be split into an area code and a telephone number.
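Splitting the article's example caller ID might look like the sketch below. The fixed-width assumptions (country code 86, two-digit area code such as 10 for Beijing) are illustrative only; real numbering plans need proper routing tables.

```python
def split_caller_id(caller_id):
    """Split a caller ID of the form <country><area><number>.
    Assumes country code 86 and a two-digit area code; a real
    implementation would consult a numbering-plan table."""
    if not caller_id.startswith("86"):
        raise ValueError("expected country code 86")
    rest = caller_id[2:]
    return {"country": "86", "area": rest[:2], "number": rest[2:]}

parts = split_caller_id("861084613409")
# parts == {"country": "86", "area": "10", "number": "84613409"}
```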

Data-correctness verification: use lookup and splitting together to verify data. For example, after splitting the caller ID 861084613409 into an area code and a phone number, a lookup can return the caller's region as recorded by the gateway or switch, validating the record against that reference.
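Continuing the example, a lookup-based check could be sketched as follows. The reference table here is a hypothetical stand-in for the region data a gateway or switch would provide.

```python
# Lookup-based validation sketch: after splitting the caller ID, look the
# area code up in a reference table to confirm the record is plausible.

AREA_LOOKUP = {"10": "Beijing", "21": "Shanghai"}  # hypothetical reference data

def validate_area(area_code):
    """Return the region name for a known area code, or None if unknown
    (the record would then be flagged as invalid)."""
    return AREA_LOOKUP.get(area_code)

region = validate_area("10")   # known code: record passes validation
bad = validate_area("99")      # unknown code: record is flagged
```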

Data replacement: replace invalid data, or data missing for business reasons, with substitute values.

Lookup of missing data: detect lost data with subquery-style lookups and fill in missing fields obtained by other means, ensuring field integrity.

Handling of key constraints: data that violates the primary-key and foreign-key constraints established for the ETL process can be replaced or exported to an error-data file, ensuring that only records with unique primary keys are loaded.
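One way to sketch this rule: rows whose foreign key has no match in the dimension table, or whose primary key repeats, are diverted to an error list instead of being loaded. The table and column names (`order_id`, `customer_id`) are hypothetical.

```python
# Key-constraint sketch: divert rows with duplicate primary keys or
# dangling foreign keys to an error list, so the target keeps unique keys.

def filter_by_keys(rows, dimension_keys):
    loaded, errors, seen = [], [], set()
    for row in rows:
        pk, fk = row["order_id"], row["customer_id"]
        if pk in seen or fk not in dimension_keys:
            errors.append(row)   # would be written to an error-data file
        else:
            seen.add(pk)
            loaded.append(row)
    return loaded, errors

rows = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 1, "customer_id": "C1"},  # duplicate primary key
    {"order_id": 2, "customer_id": "C9"},  # foreign key with no dimension row
]
loaded, errors = filter_by_keys(rows, dimension_keys={"C1", "C2"})
# one row loads cleanly; two are diverted to the error list
```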

To implement ETL well, we recommend paying attention to the following points during implementation:

First, if conditions permit, pre-process operational data in a data staging area to keep integration and loading efficient;

Second, if the ETL process actively "pulls" data rather than having the source systems "push" it in, its controllability is greatly enhanced;

Third, process-based configuration management and standard protocols should be established before ETL begins;

Fourth, key data standards are crucial. The greatest challenge ETL currently faces is the heterogeneity and low quality of the data it receives from its various sources. Taking China Telecom as an example: system A manages data by statistical code, system B manages data by account number, and system C manages accounts by voice ID. When ETL must integrate these three systems to give a comprehensive view of the customer, the process requires complex matching rules, name/address normalization, and standardization. Defining key data standards during ETL processing, and developing corresponding data-interface standards on that basis, addresses this.

The ETL process is strongly influenced by how well the enterprise understands its source data, which is why data integration matters so much from the business perspective. A good ETL design should have the following characteristics:

simple management; centralized management through metadata; strict specifications for interfaces, data formats, and transmission; as little software as possible installed at the external data sources; an automated extraction process with automatic scheduling; retrieved data that is timely, accurate, and complete; interfaces to a wide range of data systems with strong adaptability, so that when the surrounding software framework or system functions change, the application can adapt with few changes; and strong scalability.

Data model: standardized data definitions

Business-model design is crucial to ETL. The data warehouse is the enterprise's single, true, and reliable integrated data platform, and its design and modeling generally follow third normal form, the star schema, or the snowflake schema. The design should cover key business data as fully as possible, unifying the messy, disordered data structures of the operational environment into a reasonable, associated, analysis-ready structure. ETL then extracts data from the sources, transforms and cleans it, and finally loads it into the target warehouse according to the model's definitions.

Standardized data definitions, achieving unified coding and unified classification and organization, are equally important; they include unified standard codes and unified business terminology. Based on the model, ETL performs the initial load and the subsequent data integration, such as full loads, incremental loads, slowly growing dimensions, slowly changing dimensions, and fact-table loads, with the corresponding loading, refresh, summarization, and maintenance policies formulated according to business needs.
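Of the loading strategies just listed, incremental loading is the most common refresh policy; a minimal sketch (hypothetical field names, integer timestamps for simplicity) is to extract only rows changed since the previous run's high-water mark:

```python
# Incremental-load sketch: select only rows modified after the last
# successful load, rather than re-extracting the full table.

def incremental_extract(rows, last_load_time):
    """Return rows whose modification time is past the high-water mark."""
    return [r for r in rows if r["updated_at"] > last_load_time]

source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
]
delta = incremental_extract(source, last_load_time=200)
# delta contains only the row updated after timestamp 200
```

After a successful run, the high-water mark would be advanced to the newest `updated_at` seen, ready for the next scheduled extraction.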

Metadata: enabling new applications

Metadata is the description and definition of business data itself and of its operating environment; in other words, metadata is data about data. In a sense, business data mainly supports the applications of the business systems, while metadata is indispensable for newer applications such as enterprise information portals, customer relationship management, data warehouses, decision support, and B2B.

A typical manifestation of metadata is the description of objects: databases, tables, columns, column attributes (type, format, constraints, and so on), primary-key/foreign-key associations, and the like. As applications become increasingly heterogeneous and distributed, unified metadata becomes ever more important. "Information islands" was a common complaint many enterprises made about the state of their applications; reasonable metadata, by contrast, effectively captures how pieces of information relate to one another.

For ETL, metadata concentrates the preparations that must be made before data is actually loaded: defining the locations and attributes of the data sources, determining the rules that map source data to target data, and determining the relevant business logic. Metadata generally runs through the entire data warehouse project, and every ETL step should reference it as much as possible; this is what makes rapid ETL implementation feasible.
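The idea of metadata-driven ETL can be sketched as follows: the source-to-target mapping rules are themselves described as data, and the transformation step merely interprets them. The field names and rules here are hypothetical.

```python
# Metadata-driven mapping sketch: mapping rules live in a metadata table,
# and the ETL step interprets them rather than hard-coding each column.

MAPPING_METADATA = [
    {"source": "cust_nm", "target": "customer_name", "rule": str.strip},
    {"source": "cust_id", "target": "customer_id",   "rule": int},
]

def apply_mapping(source_row, metadata):
    """Build a target row by applying each metadata-defined rule."""
    return {m["target"]: m["rule"](source_row[m["source"]]) for m in metadata}

row = apply_mapping({"cust_nm": " Ada ", "cust_id": "7"}, MAPPING_METADATA)
# row == {"customer_name": "Ada", "customer_id": 7}
```

Adding a new column then means adding a metadata entry, not changing code, which is the practical payoff of referencing metadata throughout the process.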

ETL Architecture

The main components of a mainstream ETL product framework are described below. ETL extracts data from the source system, transforms it into a standard format, and loads it into the target data store, usually a data warehouse.

[Figure: ETL architecture diagram]

Design manager: provides a graphical mapping environment in which developers define the mappings, transformations, and processing flows from source to target. The logical definitions of each design object are stored in a metadata repository.

Metadata management: provides a metadata repository for the definitions and management information of ETL design and runtime processing. The ETL engine at runtime, as well as other applications, can reference the metadata in this repository.

Extract: pulls source data through interfaces such as ODBC, dedicated database interfaces, and flat-file extractors, determining the extraction method from the metadata.

Transform: converts the extracted data into the target data structure according to business needs, and summarizes (aggregates) it.

Load: writes the converted and aggregated data into the target data warehouse, either through SQL statements or bulk loading.

Transport services: move data between the source and target systems, and between the components of the ETL process, using network or file protocols.

Administration and operation: lets administrators schedule, run, and monitor ETL jobs, manage error messages, recover from failures, and reconcile output against the source systems, driven by events and by time.
