What is ETL?

Last Update: 2018-07-26 Source: Internet

Author: User


ETL is the process of extracting (Extract), transforming (Transform), and loading (Load) data, and it is an important part of building a data warehouse. A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data that supports management decision making. A data warehouse system may contain a large amount of noisy data. The main causes include misused abbreviations and idioms, data entry errors, duplicate records, missing values, and spelling variations. Even a well-designed database system is of little use if it holds a lot of noisy data: by the principle of "garbage in, garbage out", it cannot provide meaningful support for decision analysis. To eliminate noisy data, data cleaning must be carried out in the database system. There is a good deal of research on data cleaning and on ETL separately, but little on how to clean data effectively within the ETL process and how to make that process visual. This article describes the implementation of ETL and data cleaning from two aspects: the ETL processing mode [19] and the methods used to implement data cleaning.
(1) The ETL processing mode
The ETL method used in this article performs ETL processing in a database staging area: the database serves as the single control point, and no external engine is used. Because the source system, SQL Server 2000, is a relational database, its data is also held in typical relational tables. The external data is first loaded into the database unmodified and then transformed inside the database. Processing in the database staging area therefore runs in the order extract, load, transform, which is commonly called ELT [21]. The advantage of this approach is that the extracted data first lands in a buffer, which makes complex transformations easier and reduces the complexity of the ETL process.
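The extract, load, transform order described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module in place of SQL Server 2000, and the table names (staging_orders, orders), the columns, and the sample rows are hypothetical and do not come from the original system.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive unmodified from a source system.
raw_rows = [
    ("1001", " Alice ", "2018-07-01", "120.50"),
    ("1002", "Bob", "2018-07-02", "80"),
    ("1002", "Bob", "2018-07-02", "80"),    # duplicate record
    ("1003", "Carol", "2018-07-03", None),  # missing value
]

def run_elt(conn, rows):
    cur = conn.cursor()
    # Extract + Load: copy the data unchanged into a staging table (all TEXT).
    cur.execute(
        "CREATE TABLE staging_orders "
        "(order_id TEXT, customer TEXT, order_date TEXT, amount TEXT)"
    )
    cur.executemany("INSERT INTO staging_orders VALUES (?, ?, ?, ?)", rows)
    # Transform: done inside the database with SQL, not in an external engine.
    cur.execute(
        "CREATE TABLE orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, order_date TEXT, amount REAL)"
    )
    cur.execute(
        """
        INSERT INTO orders (order_id, customer, order_date, amount)
        SELECT DISTINCT CAST(order_id AS INTEGER), TRIM(customer),
                        order_date, CAST(amount AS REAL)
        FROM staging_orders
        WHERE amount IS NOT NULL
        """
    )
    conn.commit()
    return cur.execute(
        "SELECT order_id, customer, order_date, amount FROM orders ORDER BY order_id"
    ).fetchall()

conn = sqlite3.connect(":memory:")
result = run_elt(conn, raw_rows)
```

The staging table plays the role of the buffer: type casting, trimming, de-duplication, and filtering of missing values all happen in one SQL statement after the raw data is already inside the database.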
(2) Implementing data cleaning in the ETL process
First, the attributes of the data tables are made consistent, based on an understanding of the source data. Source data often contains synonyms (different names for the same thing) and homonyms (the same name used with different meanings). To resolve this, the metadata management subsystem, while the source data is being analyzed, redefines the field names of the different tables in the data mining library according to their meanings and stores the new definitions in the metadata database as conversion rules. When the data is integrated, the system automatically converts the field names in the source data to the newly defined field names according to these rules, so that in the data mining library the same name always carries the same meaning.
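A rule-driven renaming step of this kind might look like the sketch below. The rule table, the source table names, and the field names are all hypothetical; the original does not describe how the metadata database stores its conversion rules, so a plain dictionary keyed by (table, field) stands in for it here.

```python
# Hypothetical conversion rules, standing in for the metadata database:
# (source table, source field) -> unified field name.
CONVERSION_RULES = {
    ("crm_customers", "cust_no"): "customer_id",
    ("billing_accounts", "client_id"): "customer_id",  # synonym: different name, same meaning
    ("crm_customers", "name"): "customer_name",
    ("products", "name"): "product_name",              # homonym: same name, different meaning
}

def unify_field_names(table, record, rules=CONVERSION_RULES):
    """Return a copy of record with field names rewritten per the rules.

    Fields that have no rule keep their original name.
    """
    return {rules.get((table, field), field): value for field, value in record.items()}

crm_row = unify_field_names("crm_customers", {"cust_no": 7, "name": "Alice"})
billing_row = unify_field_names("billing_accounts", {"client_id": 7, "balance": 12.5})
product_row = unify_field_names("products", {"name": "Router"})
```

Keying the rules by table as well as field is what lets the two "name" fields map to different targets, while "cust_no" and "client_id" converge on one.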
Second, data reduction greatly decreases the volume of data. Because the source data is large and time-consuming to process, data reduction can be performed first to make subsequent processing and analysis more efficient.
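The original does not say which reduction techniques are used. As one possible illustration, the sketch below assumes two simple ones: keeping only the fields needed for analysis and dropping records that become exact duplicates afterwards. The function name and the sample records are hypothetical.

```python
def reduce_records(records, keep_fields):
    """Project each record onto keep_fields and drop exact duplicates.

    The order of first occurrences is preserved.
    """
    seen = set()
    reduced = []
    for record in records:
        projected = tuple(record.get(field) for field in keep_fields)
        if projected not in seen:
            seen.add(projected)
            reduced.append(dict(zip(keep_fields, projected)))
    return reduced

records = [
    {"customer_id": 1, "city": "Hangzhou", "note": "a"},
    {"customer_id": 1, "city": "Hangzhou", "note": "b"},
    {"customer_id": 2, "city": "Beijing", "note": "c"},
]
reduced = reduce_records(records, ["customer_id", "city"])
```

Running the reduction before the other steps means every later cleaning or conversion step touches fewer rows and fewer columns.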
Finally, data cleaning and data conversion are made visual by defining visual function nodes for data processing in advance. For data reduction and integration, the combined preprocessing subsystem provides a variety of data processing function nodes, so that data cleaning and data conversion can be completed quickly and efficiently in a visual way.
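Behind a visual editor, predefined function nodes can be modeled as small composable functions that are chained in the order the user connects them. The sketch below shows only that underlying idea, not the platform's actual design: the node names (drop_missing, strip_text, map_values) and the run_pipeline helper are hypothetical.

```python
from functools import reduce

def drop_missing(field):
    """Node: remove records whose field is missing (None)."""
    def node(records):
        return [r for r in records if r.get(field) is not None]
    return node

def strip_text(field):
    """Node: trim surrounding whitespace from a text field."""
    def node(records):
        return [{**r, field: r[field].strip()} for r in records]
    return node

def map_values(field, mapping):
    """Node: replace known abbreviations with their full form."""
    def node(records):
        return [{**r, field: mapping.get(r[field], r[field])} for r in records]
    return node

def run_pipeline(nodes, records):
    """Feed the records through each node in order."""
    return reduce(lambda data, node: node(data), nodes, records)

pipeline = [
    drop_missing("name"),
    strip_text("name"),
    map_values("city", {"HZ": "Hangzhou"}),
]
cleaned = run_pipeline(pipeline, [
    {"id": 1, "name": " Alice ", "city": "HZ"},
    {"id": 2, "name": None, "city": "Beijing"},
])
```

Because every node takes a list of records and returns a list of records, nodes can be reordered or recombined freely, which is what makes a drag-and-drop presentation practical.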

(Excerpt from Wang Maihui: Research on the construction of a data mining business platform)

This article is an English version of an article originally published in Chinese on aliyun.com.
