ETL is the process of extracting (Extract), transforming (Transform), and loading (Load) data, and it is a key part of building a data warehouse. A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data that supports management decision making. A data warehouse system may contain a large amount of noisy data, whose main causes include misused abbreviations and idioms, data entry errors, duplicate records, missing values, and spelling variations. Even a well-designed database system is meaningless if it contains large amounts of noisy data: "garbage in, garbage out," and such a system simply cannot provide any support for decision analysis. To eliminate noisy data, data cleaning must be performed in the database system. There is already much research on data cleaning and on ETL, but relatively little on how to perform effective data cleaning within the ETL process and how to make that process visual. This article describes the implementation of ETL and data cleaning from two aspects: the ETL processing mode [19] and the implementation methods of data cleaning.
(1) The ETL processing mode
The ETL approach used in this article performs processing in the database staging area, using the database itself as the single control point rather than an external engine. Because the source system, SQL Server 2000, is a relational database, its data sources are also typical relational tables. External data is loaded into the database unmodified and then transformed inside the database. ETL processing in the database staging area therefore proceeds as extract, load, and then transform, that is, the commonly described ELT [21]. The advantage of this approach is that the extracted data first lands in a buffer, which makes complex transformations easier to carry out and reduces the complexity of the ETL process.
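The ELT pattern described above can be sketched in a few lines: raw rows are loaded into a staging table unchanged, and all cleaning happens afterwards, inside the database. This is a minimal illustration only; it uses SQLite as a stand-in for the SQL Server 2000 system in the source, and the table and column names are invented.

```python
import sqlite3

# Minimal ELT sketch: SQLite stands in for the relational database;
# table/column names (staging, cleaned, cust_name, amount) are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract + Load: copy source rows into a staging table unmodified.
cur.execute("CREATE TABLE staging (cust_name TEXT, amount TEXT)")
source_rows = [("  Alice ", "120.5"), ("BOB", "80"), ("carol", "99.9")]
cur.executemany("INSERT INTO staging VALUES (?, ?)", source_rows)

# Transform: cleaning happens inside the database, after loading.
cur.execute("""
    CREATE TABLE cleaned AS
    SELECT TRIM(LOWER(cust_name)) AS cust_name,
           CAST(amount AS REAL)   AS amount
    FROM staging
""")
rows = cur.execute("SELECT cust_name, amount FROM cleaned").fetchall()
print(rows)  # cleaned, typed rows ready for the warehouse
```

Because the untransformed data already sits in the staging table, arbitrarily complex SQL transformations can be re-run against it without re-extracting from the source system.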
(2) Implementation of data cleaning in the ETL process
First, uniform table attributes are realized on the basis of understanding the source data. To resolve synonyms (different names with the same meaning) and homonyms (the same name with different meanings) in the source data, the metadata management subsystem, while analyzing the source data, redefines the field names of the different tables in the data mining library according to their meanings and stores these definitions in the metadata database as conversion rules. When the data is integrated, the system automatically converts the field names in the source data to the newly defined field names according to these conversion rules, thereby unifying synonymous and homonymous names in the data mining library.
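The metadata-driven renaming above amounts to a lookup table of conversion rules applied at integration time. A minimal sketch, assuming invented field names (`cust_nm`, `kh_name`, etc.) purely for illustration:

```python
# Conversion rules map source field names to the unified names defined
# in the data mining library. All names here are hypothetical examples.
conversion_rules = {
    "cust_nm": "customer_name",   # synonym: different name, same meaning
    "kh_name": "customer_name",   # another source spelling of the same field
    "amount":  "order_amount",
}

def unify_fields(record: dict) -> dict:
    """Rename a record's fields using the stored conversion rules;
    fields without a rule keep their original name."""
    return {conversion_rules.get(k, k): v for k, v in record.items()}

src = {"cust_nm": "Alice", "amount": 120.5}
print(unify_fields(src))  # {'customer_name': 'Alice', 'order_amount': 120.5}
```

In the system described by the source, these rules would live in the metadata database rather than in code, so new source tables can be mapped without changing the integration program.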
Second, data reduction greatly decreases the volume of data. Because the source data is large and processing it is time-consuming, performing data reduction first improves the efficiency of the subsequent data processing and analysis.
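Two common reduction tactics consistent with this step are dropping attributes that the mining task does not need and sampling rows. The sketch below assumes these two tactics; the source does not specify which reduction techniques its system uses, and all names here are illustrative.

```python
import random

# Hypothetical data-reduction sketch: project away unused attributes,
# then keep a random sample of rows (seeded for reproducibility).
def reduce_data(rows, keep_fields, sample_rate, seed=0):
    rng = random.Random(seed)
    projected = [{k: r[k] for k in keep_fields} for r in rows]
    return [r for r in projected if rng.random() < sample_rate]

rows = [{"id": i, "amount": i * 10, "note": "x" * 100} for i in range(1000)]
reduced = reduce_data(rows, keep_fields=("id", "amount"), sample_rate=0.1)
print(len(reduced))  # roughly 10% of the 1000 rows survive
```

Reducing first means every later cleaning and integration pass touches a fraction of the original volume, which is exactly the efficiency gain the text describes.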
Finally, the visualization of data cleaning and data conversion is achieved by defining visual data-processing function nodes in advance. For data reduction and integration, the preprocessing subsystem provides a variety of data-processing function nodes that can be combined, so the data cleaning and conversion process can be completed quickly and efficiently in a visual way.
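The "function node" idea can be sketched as a chain of named, single-purpose processing steps; a visual front end would wire such nodes together, but the underlying abstraction can be this simple. The node names and record fields below are invented for illustration.

```python
# Hypothetical sketch of composable data-processing function nodes.
class Node:
    """One named processing step; a visual editor would draw one box per Node."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def __call__(self, rows):
        return self.fn(rows)

def run_pipeline(rows, nodes):
    for node in nodes:  # apply each node's transformation in sequence
        rows = node(rows)
    return rows

pipeline = [
    Node("drop_nulls", lambda rs: [r for r in rs if r["amount"] is not None]),
    Node("dedupe",     lambda rs: list({r["id"]: r for r in rs}.values())),
]
data = [{"id": 1, "amount": 5}, {"id": 1, "amount": 5}, {"id": 2, "amount": None}]
print(run_pipeline(data, pipeline))  # [{'id': 1, 'amount': 5}]
```

Because each node is self-contained, the same node library can be recombined for different cleaning tasks, which is what makes the visual composition quick and flexible.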
(Excerpt from Wang Maihui: Research on the construction of a data mining business platform)