What is ETL?
In a data warehouse project, ETL runs throughout the entire lifecycle. It is the lifeline of the data warehouse, covering data cleansing, integration, transformation, and loading. If the data warehouse is a building, ETL is its foundation. The quality of data extraction and integration directly affects the final results, so ETL plays a key role in the whole project and must be given a very high priority.
ETL is short for extract, transform, and load. It refers to extracting data from OLTP systems, transforming and integrating data from different sources into a consistent form, and then loading it into the data warehouse.
Through ETL, we can build a data warehouse from the data in the source systems. ETL is the bridge between the OLTP system and the OLAP system: the channel through which data flows from the source systems into the warehouse. In a data warehouse project, it determines the data quality of the entire effort, so it must be treated as the foundation on which the warehouse is built.
The ETL process is generally considered from the following four aspects:
(1) Use of a staging area (ODS)
When building a data warehouse, the data source is often on one server and the data warehouse on another. Since the source is frequently accessed by its own applications, the data volume is large, and the warehouse must be updated continuously, it helps to create a staging database (ODS). Data is first extracted into the staging area, and all further processing works on the staged copy. The advantage of this approach is that it avoids hitting the original OLTP system repeatedly for data operations or aggregation. If the environment contains several different database systems, an ODS can also simplify the ETL development work.
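The extract-to-staging step can be sketched as follows. This is a minimal illustration, not a production pipeline: `sqlite3` stands in for whichever database engines are actually in use, and the `orders`/`stg_orders` tables are hypothetical names.

```python
import sqlite3

# Hypothetical source (OLTP) and staging (ODS) databases; in-memory
# sqlite3 stands in for the real engines on separate servers.
source = sqlite3.connect(":memory:")
ods = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 10.0), (2, 25.5), (3, 7.25)])

# Step 1: extract raw rows into the staging area with no transformation,
# so the OLTP system is touched only once per run.
rows = source.execute("SELECT id, amount FROM orders").fetchall()
ods.execute("CREATE TABLE stg_orders (id INTEGER, amount REAL)")
ods.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)

# Step 2: all further cleansing and aggregation run against the ODS copy,
# never against the source.
total = ods.execute("SELECT SUM(amount) FROM stg_orders").fetchone()[0]
print(total)  # 42.75
```

The key point is the separation: the source connection is used only for the initial `SELECT`, and every later query goes to the staging connection.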
(2) Use of timestamps
The time dimension is very important for a fact subject: statistics differ across time periods, so the time recorded with each row plays an important role. In ETL, timestamps serve special purposes. In the slowly changing dimensions mentioned earlier, timestamps can be used to identify dimension members; they are also used to identify records when logging operations in the database and the data warehouse. During extraction, data is pulled from the OLTP system by timestamp. To retrieve the previous day's data, for example, we select the rows whose timestamp falls between the current time minus one day and the current time (in SQL Server, between GETDATE() minus one day and GETDATE()).
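The "previous day" extraction window described above can be computed like this. The helper name and the `updated_at` column in the sample query are assumptions for illustration only.

```python
from datetime import datetime, timedelta

def previous_day_window(now=None):
    """Return (start, end) timestamps covering the last 24 hours,
    mirroring the GETDATE()-minus-one-day pattern described above."""
    now = now or datetime.now()
    return now - timedelta(days=1), now

start, end = previous_day_window(datetime(2024, 5, 2, 3, 0))
# The extraction query would then be parameterized on this window, e.g.:
#   SELECT * FROM sales WHERE updated_at >= :start AND updated_at < :end
print(start, end)
```

Using a half-open interval (`>= start`, `< end`) ensures that consecutive runs never extract the same row twice and never skip one.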
(3) Use of log tables
Errors will inevitably occur during data processing. How can we capture the error information and correct it in time? The method is to use one or more log tables. For each run, the log table records the number of rows extracted, the number processed successfully, the number that failed, the data that failed, and the processing time. When a data error occurs, we can then easily locate the problem and correct or reprocess the failed data.
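A minimal sketch of such a run log follows. The `etl_log` table, the `load_with_logging` helper, and the non-negative-amount rule are all hypothetical; the point is only the shape of what gets recorded per run.

```python
import sqlite3
from datetime import datetime

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE etl_log (
    run_time TEXT, rows_extracted INTEGER,
    rows_loaded INTEGER, rows_failed INTEGER, error_detail TEXT)""")

def load_with_logging(rows):
    """Process rows, counting successes and failures, then write one
    summary record to the log table."""
    loaded, failed, errors = 0, 0, []
    for row in rows:
        try:
            # Hypothetical validation rule: amounts must be non-negative.
            if row["amount"] < 0:
                raise ValueError(f"negative amount in row {row['id']}")
            loaded += 1
        except ValueError as exc:
            failed += 1
            errors.append(str(exc))  # keep the failed data for reprocessing
    db.execute("INSERT INTO etl_log VALUES (?, ?, ?, ?, ?)",
               (datetime.now().isoformat(), len(rows), loaded, failed,
                "; ".join(errors)))
    return loaded, failed

loaded, failed = load_with_logging(
    [{"id": 1, "amount": 5.0}, {"id": 2, "amount": -1.0}])
print(loaded, failed)  # 1 1
```

Because the counts and the failing rows are persisted, a bad run can be diagnosed from the log table alone and only the failed rows need to be reprocessed.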
(4) Scheduling
Scheduling is required when the data warehouse is updated incrementally, that is, when the fact table receives incremental loads. Before setting up a schedule, consider the actual data volume and decide how often to update. For example, if users view data by day, it is best to extract data by day; if the data volume is small, updating by month or half-year may suffice. If there are slowly changing dimensions, the schedule must also account for dimension-table updates: the dimension tables must be updated before the fact table.
Scheduling is a key part of the data warehouse and must be considered carefully, because once the ETL process is built it must run regularly. In addition to writing processing information to the log tables, each scheduled run should also use an email or alerting service, so that technical staff can monitor the ETL process; this improves both safety and data-processing accuracy.
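One scheduled run, with the dimensions-before-fact ordering and a failure alert hook, can be sketched as below. The step functions and the `alert` helper are placeholders, not a real scheduler: in production the alert would send mail (e.g. via `smtplib`) or page someone, and each step would perform the actual loads.

```python
# Sketch of one scheduled ETL run: dimension tables are refreshed before
# the fact table so that incoming fact rows can resolve their dimension keys.
alerts = []

def refresh_dimension_tables():
    return True  # placeholder for the real dimension load

def load_fact_increment():
    return True  # placeholder for the incremental fact load

def alert(message):
    # Production version would email or page on-call staff; here we
    # just record the message.
    alerts.append(message)

def run_schedule():
    # Order matters: dimensions first, then the fact increment.
    for step in (refresh_dimension_tables, load_fact_increment):
        if not step():
            alert(f"ETL step failed: {step.__name__}")
            return False  # stop the run; the next window can retry
    return True

ok = run_schedule()
print(ok)  # True
```

Stopping at the first failed step prevents the fact load from running against stale or missing dimension keys, which is the main reason the ordering is fixed.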