Much has been said about the ETL process. Recently I have been comparing SSIS, OWB, and Informatica, and together with experience from earlier projects this has deepened my understanding of ETL.
In fact, platform differences aside, each of the three tools has its own strengths and weaknesses. Here I would like to share some experience from the angle of extensibility and maintenance.
1. Data staging and ELT
Many people talk about staging data. Put plainly, whenever data flows into or out of the ODS or DWH, the rows to be operated on are first landed in a temporary table whose structure is identical at both ends of the transfer. This makes the load step convenient, since almost no conversion needs to be considered at that point.
ELT deserves a mention here as well. ELT and ETL each have their advantages: one uses the tool to transform the data, generally row by row, while the other uses stored procedures (SPs) in the database, which generally process whole columns at a time. Personally I think the ELT approach of transforming with SPs is better, because during testing and later maintenance we frequently need to change the mapping between tables; with SPs that is just a modification inside the database, and it is easier to debug and find the root cause of a problem.
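To make the staging-then-transform pattern concrete, here is a minimal sketch in Python, with SQLite standing in for the warehouse database; the stg_orders/ods_orders tables and their columns are invented for illustration, and in a real warehouse the transform statement would live inside a stored procedure rather than the load script:

```python
import sqlite3

conn = sqlite3.connect("etl_demo.db")
cur = conn.cursor()

# The staging table mirrors the source structure exactly, so the
# landing step itself involves no transformation logic at all.
cur.execute("""CREATE TABLE IF NOT EXISTS stg_orders (
    order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)""")
cur.execute("""CREATE TABLE IF NOT EXISTS ods_orders (
    order_id INTEGER, customer TEXT, amount REAL, order_date TEXT)""")

# Step 1 (load): land the extracted rows as-is.
extracted = [(1, "acme", 120.50, "2009-06-01")]  # placeholder for the real extract
cur.execute("DELETE FROM stg_orders")            # staging data is disposable
cur.executemany("INSERT INTO stg_orders VALUES (?, ?, ?, ?)", extracted)

# Step 2 (transform): set-based SQL inside the database -- the part
# an SP would own, and the part that changes when mappings change.
cur.execute("""INSERT INTO ods_orders
               SELECT order_id, UPPER(customer), amount, order_date
               FROM stg_orders""")
conn.commit()
```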
2. Script files
Script files fall into several kinds: FTP scripts, table scripts, stored-procedure scripts, and shell scripts. Whether DOS batch or Unix shell, these are all operating-system control files. Why bring them up? Designing and developing inside an ETL tool is simple enough, but modifying things there is quite troublesome: opening a package triggers loading and validation, which can be painfully slow, and if all you want is to change a small parameter, the wait is not worth it. If instead the script files are kept together as plain files, you can just open the txt or BAT file and edit it. In addition, for some temporary tables we occasionally need to delete all the data; TRUNCATE TABLE works, and sometimes DROP and CREATE is the better choice.
Of course, if script files are used this way, host security must be solid, and the account's table-level permissions must be allocated carefully.
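As a small sketch of this idea, the maintenance SQL can live in an ordinary text file that any editor can change in seconds, with a thin runner executing it; cleanup.sql is a hypothetical file name, and SQLite (which has no TRUNCATE) stands in for the real database:

```python
import sqlite3

# cleanup.sql is a plain-text script kept with the other scripts,
# e.g. a single line:
#     DELETE FROM stg_orders;
# On SQL Server or Oracle this would be TRUNCATE TABLE stg_orders,
# or a DROP followed by a CREATE.
with open("cleanup.sql") as f:
    script = f.read()

conn = sqlite3.connect("etl_demo.db")
conn.executescript(script)  # run every statement in the file
conn.commit()
conn.close()
```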
3. Policy table
We run packages on different schedules and frequencies. Sometimes a package errors out; sometimes it cannot run on time for other reasons. A policy table helps here: it stores the scheduling rule for each package. Before execution starts each day, we initialize a row for every package due to run that day, recording start time, end time, and a success flag. At execution time we can also read the previous run day's status and carry any unfinished package forward to the current day.
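A minimal sketch of such a policy table follows, again in Python with SQLite; the table name, columns, and package names are all illustrative:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect("etl_demo.db")
cur = conn.cursor()

# One row per package per scheduled run day.
cur.execute("""CREATE TABLE IF NOT EXISTS etl_policy (
    run_date   TEXT,
    package    TEXT,
    start_time TEXT,
    end_time   TEXT,
    success    INTEGER DEFAULT 0,   -- the success flag
    PRIMARY KEY (run_date, package)
)""")

today = date.today().isoformat()

# Initialize today's schedule before any package executes.
for pkg in ("load_orders", "load_customers"):
    cur.execute("INSERT OR IGNORE INTO etl_policy (run_date, package) VALUES (?, ?)",
                (today, pkg))

# Carry any unfinished package from earlier run days forward to today.
cur.execute("""INSERT OR IGNORE INTO etl_policy (run_date, package)
               SELECT ?, package FROM etl_policy
               WHERE success = 0 AND run_date < ?""", (today, today))
conn.commit()
```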
4. Log table
Strictly speaking, in ETL work a log entry is written for every small step from the moment a package starts executing: what happened, what was done, any error information, and control data. When necessary, we roll back all of a package's execution data using the database's transaction mechanism. Joining the log table to the policy table then gives a clear view of performance, efficiency, and the exact execution history. Based on package priority, log reports and urgent errors can be pushed to the relevant people by SMS or email.
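Continuing the previous sketch, a step-level log table and the join that turns it into an execution report might look like this (the column names are again invented):

```python
import sqlite3

conn = sqlite3.connect("etl_demo.db")
cur = conn.cursor()

# One row per small step a package performs.
cur.execute("""CREATE TABLE IF NOT EXISTS etl_log (
    run_date  TEXT,
    package   TEXT,
    step      TEXT,
    logged_at TEXT DEFAULT CURRENT_TIMESTAMP,
    message   TEXT,
    is_error  INTEGER DEFAULT 0
)""")

# Joining the log to the policy table shows, per package and day,
# what was scheduled, what actually ran, and how many errors occurred.
report = cur.execute("""
    SELECT p.run_date, p.package, p.success,
           COUNT(l.step)                AS steps_logged,
           SUM(COALESCE(l.is_error, 0)) AS errors
    FROM etl_policy p
    LEFT JOIN etl_log l
      ON l.run_date = p.run_date AND l.package = p.package
    GROUP BY p.run_date, p.package, p.success
""").fetchall()
for row in report:
    print(row)   # feed this into the SMS/email alerting instead
```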
5. Re-execution
Refreshing data is a frequent need. If we find the data is inaccurate, we may have to re-execute: resume a run that was interrupted, completely re-run one batch of data, or even re-run everything.
We use a batchid to identify each day's (or each batch's) data. Sometimes a package is interrupted during execution, and we have to dig through the logs, fix the problem, and run it again; this step involves a lot of manual work. Re-executing a package for a given day means operating on the corresponding data tables, for example deleting that day's rows and loading them all again. Re-running everything means rebuilding all the data in the warehouse, which is equivalent to starting from zero.
Since we will keep running into these situations, a package should be designed to complete its work by itself as far as possible, with human intervention kept to a minimum; a few parameters are enough for control.
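One such parameter is the batchid itself. Here is a sketch of a re-run helper, assuming (unlike the earlier sketch) that every row in ods_orders carries a batch_id column; load_batch is a placeholder for the real package logic:

```python
import sqlite3

conn = sqlite3.connect("etl_demo.db")

def rerun_batch(batch_id: str) -> None:
    """Back out one batch and reload it, driven only by its batchid."""
    cur = conn.cursor()
    # Because every loaded row carries its batch_id, one day's data can
    # be deleted and re-run without touching any other day.
    cur.execute("DELETE FROM ods_orders WHERE batch_id = ?", (batch_id,))
    # load_batch(batch_id)  # placeholder: re-extract and reload this batch
    conn.commit()
```

Re-running a single day then becomes a one-line call such as rerun_batch("2009-06-01") instead of a series of manual table operations.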
6. Team collaboration
ETL accounts for 50%-70% of the workload of the whole data warehouse, so good team collaboration is essential. ETL covers not only E, T, and L but also log control, the data model, source-data validation, and data quality. Suppose, for example, we want to integrate an enterprise's data across the Asia-Pacific region, where every country has its own data source: some are ERP systems, some are Access, and the databases all differ. Network performance has to be considered, and using ODBC to connect data sources across two sites directly is clearly unreasonable: with a poor network and frequent connections, database links easily hang because they cannot be released. If instead we place a data-export program on the servers in each region, exporting to Access or flat files, the files can be transmitted easily over FTP. This breaks down into several tasks:
1) Someone writes a common data-export tool, in Java, VB, or anything else. The point is that it is universal: it can be driven by different script files, the file format it exports is identical across the different databases and regions, and it can run exports in parallel. (A sketch of such a driver appears after this list.)
2) Someone writes the FTP program; a bat file, the ETL tool, or any other method will do, so long as it is reliable and easy to call and control.
3) Someone designs the data model, including the structure of the files exported at each site as well as the table structures in the ODS and DWH.
4) Someone writes the SPs, both those used inside the ETL itself and those for routine system maintenance, such as data-quality checks.
5) Someone analyzes the source data, including table structures, data quality, null values, and business logic.
6) Someone is responsible for the development process, including implementing the various functions and the logging.
7) Someone tests.
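For task 1), here is a minimal sketch of what such a universal export driver could look like in Python; SQLite stands in for whatever database each region actually runs, and the function name, pipe delimiter, and file names are invented for the example:

```python
import csv
import sqlite3  # stand-in for each region's real database driver

def export_table(db_path: str, query: str, out_path: str) -> None:
    """Dump one query to a flat file in the agreed common format."""
    conn = sqlite3.connect(db_path)
    cur = conn.execute(query)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")  # one delimiter, everywhere
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)                                 # data rows
    conn.close()

# Each region supplies only a small script/config line, e.g.:
#   export_table("apac_cn.db", "SELECT * FROM orders", "cn_orders.txt")
# Because the format is fixed in one place, every region's file looks
# identical to the FTP step and to the loader downstream.
```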
In short, ETL is the work of a team; the strength of one person is limited.