ETL Design and Considerations in BI Projects
ETL is the process of extracting, cleaning, and transforming data from business systems and loading it into a data warehouse. Its aim is to integrate an enterprise's scattered, messy, and non-standardized data and provide a basis for analysis and decision-making. ETL is an important part of BI projects; it typically takes about one third of the total project time, and the quality of the ETL design directly determines the success or failure of a BI project.
ETL consists of three parts: data extraction, data cleansing and transformation, and data loading, and ETL design proceeds from these same three parts. Data extraction pulls data from the different source systems into the ODS (some cleansing and transformation can also happen during this step); here you need to choose extraction methods that keep the ETL run as efficient as possible. Of the three parts, the T (cleansing and transformation) takes the longest; in general it accounts for about two thirds of the total ETL workload. Data loading is usually straightforward: after cleansing, the data is written directly into the DW.
There are several ways to implement ETL; three are common. The first uses ETL tools such as Oracle OWB, SQL Server 2000 DTS, SQL Server 2005 SSIS, or Informatica; the second uses hand-written SQL; the third combines ETL tools with SQL. The first two each have advantages and disadvantages. Tools let you set up an ETL project quickly, shield you from complex coding, and reduce difficulty, but they lack flexibility. SQL is flexible and can improve ETL runtime efficiency, but the coding is complex and demands more skill. The third approach combines the strengths of the first two and can greatly improve both ETL development speed and efficiency.
I. Data Extraction
This part requires a lot of work during the research phase. First, figure out how many business systems the data comes from, which DBMS each business system's database server runs, whether manual data exists and how large it is, and whether unstructured data exists. Only after this information is collected can data extraction begin.
1. Data sources on the same DBMS as the DW
This type of data source is the easiest to handle. In general, the DBMS (SQL Server, Oracle) provides database link functionality: establish a direct link between the DW database server and the original business system, and the data can then be accessed directly with SELECT statements.
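As a minimal sketch of this idea (using SQLite in Python purely as a stand-in, since both "servers" here are hypothetical in-memory databases), ATTACH plays the role of the database link, after which extraction is a plain cross-database SELECT:

```python
import sqlite3

# Pretend this connection is the DW database server.
con = sqlite3.connect(":memory:")
# ATTACH stands in for the database link to the business system.
con.execute("ATTACH DATABASE ':memory:' AS biz")
con.execute("CREATE TABLE biz.orders (id INTEGER, amount REAL)")
con.execute("INSERT INTO biz.orders VALUES (1, 99.5)")

# With the link in place, extraction is a plain SELECT across databases.
con.execute("CREATE TABLE ods_orders AS SELECT id, amount FROM biz.orders")
total = con.execute("SELECT SUM(amount) FROM ods_orders").fetchone()[0]
```

In a real SQL Server or Oracle setup, the equivalent would be a linked server or database link plus a `SELECT ... FROM link.table` in the extraction job.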
2. Data sources on a different DBMS from the DW
For this type of data source, you can usually still establish a database connection through ODBC, for example between SQL Server and Oracle. If a database link cannot be established, there are two other options: export the source data to files (.txt or .xls) and then import those files into the ODS, or use a program interface.
3. File-type data sources (.txt, .xls)
Business staff can be trained to use database tools to import this data into the specified database, after which it is extracted from there. Alternatively, tools can do the import; for example, SQL Server 2005's SSIS provides flat-file source and destination components for importing such data into the ODS.
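A rough illustration of the file route, assuming a hypothetical comma-separated supplier file (simulated here with `io.StringIO`) loaded into an in-memory ODS table:

```python
import csv
import io
import sqlite3

# Stand-in for a .txt/.csv flat file handed over by business staff.
flat_file = io.StringIO("id,name\n1,Acme\n2,Beta\n")

ods = sqlite3.connect(":memory:")
ods.execute("CREATE TABLE ods_supplier (id INTEGER, name TEXT)")

# Parse the file and bulk-insert the rows into the ODS table.
rows = [(int(r["id"]), r["name"]) for r in csv.DictReader(flat_file)]
ods.executemany("INSERT INTO ods_supplier VALUES (?, ?)", rows)
n = ods.execute("SELECT COUNT(*) FROM ods_supplier").fetchone()[0]
```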
4. Incremental updates
For systems with large data volumes, incremental extraction must be considered. In general, the business system records the time at which each transaction occurs, and we can use it as the incremental indicator: before each extraction, determine the latest time recorded in the ODS, then fetch from the business system all records later than that time. This relies on the business system's timestamp; in practice, some business systems have no timestamp at all, or only partial timestamp coverage.
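The high-water-mark approach described above can be sketched as follows (SQLite in-memory databases stand in for the business system and the ODS; table and column names are hypothetical):

```python
import sqlite3

# Stand-in for the business system's database.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, biz_time TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, "2023-01-01"), (2, 20.0, "2023-01-02"),
                 (3, 30.0, "2023-01-03")])

# Stand-in for the ODS, already holding one earlier load.
ods = sqlite3.connect(":memory:")
ods.execute("CREATE TABLE ods_orders (id INTEGER, amount REAL, biz_time TEXT)")
ods.execute("INSERT INTO ods_orders VALUES (1, 10.0, '2023-01-01')")

# Step 1: find the latest business time already loaded into the ODS.
(high_water,) = ods.execute("SELECT MAX(biz_time) FROM ods_orders").fetchone()

# Step 2: pull only source rows newer than that high-water mark.
new_rows = src.execute(
    "SELECT id, amount, biz_time FROM orders WHERE biz_time > ?",
    (high_water,)).fetchall()
ods.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)", new_rows)

count = ods.execute("SELECT COUNT(*) FROM ods_orders").fetchone()[0]
```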
II. Data Cleansing and Transformation
Generally, the data warehouse is divided into ODS and DW. The common practice is to cleanse data on its way from the business systems into the ODS, filtering out dirty and incomplete data, and to transform it on its way from the ODS into the DW, computing and aggregating according to business rules.
1. Data cleansing
The task of data cleansing is to filter out data that does not meet requirements, hand the filtered results to the competent business department, and have the business department confirm whether each record should be discarded or corrected before being extracted again. Non-conforming data falls into three main classes: incomplete data, erroneous data, and duplicate data.
A. Incomplete data. This is mainly data with missing information, such as a supplier name or branch name that is absent, a customer's region information that is missing, or a master table in the business system that does not match its detail table. This data should be filtered out and written, by type of missing content, into separate Excel files submitted to the customer, with a deadline for completion. It is written into the data warehouse only after it has been completed.
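A minimal sketch of the filtering step, assuming a hypothetical supplier table where `name` and `region` are required fields: rows with anything missing are held back for the business department instead of being loaded.

```python
import sqlite3

# Hypothetical supplier data with some required fields missing.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE supplier (id INTEGER, name TEXT, region TEXT)")
db.executemany("INSERT INTO supplier VALUES (?, ?, ?)",
               [(1, "Acme", "North"), (2, None, "South"), (3, "Beta", None)])

# Incomplete rows go back to the business department for completion...
incomplete = db.execute(
    "SELECT id FROM supplier WHERE name IS NULL OR region IS NULL"
).fetchall()
# ...while only complete rows continue toward the warehouse.
complete = db.execute(
    "SELECT id FROM supplier WHERE name IS NOT NULL AND region IS NOT NULL"
).fetchall()
```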
B. Erroneous data. These errors arise because the business system is not robust enough and writes input directly to the backend database without validation: for example, numeric data entered as full-width characters, string data followed by a carriage return, dates in the wrong format, or dates out of range. Such data must also be classified. Problems like full-width characters and invisible characters before or after the data can only be found by writing SQL statements, and the data is extracted again after the customer corrects it in the business system. Wrong date formats and out-of-range dates can cause the ETL run to fail; these must be identified with SQL in the business system database, submitted to the competent business department for correction within a deadline, and then extracted after correction.
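Two of the error classes above (full-width digits, invalid dates) are easy to check programmatically; here is a small sketch, where the function names and the date format are illustrative assumptions:

```python
from datetime import datetime

# Map full-width digits (e.g. "１２３") to their ASCII equivalents.
FULLWIDTH = str.maketrans("０１２３４５６７８９", "0123456789")

def clean_numeric(text):
    """Strip stray whitespace/carriage returns and normalize full-width digits."""
    return text.strip().translate(FULLWIDTH)

def valid_date(text, fmt="%Y-%m-%d"):
    """Flag dates that would make the ETL run fail (wrong format or out of range)."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

cleaned = clean_numeric("１２３\r")  # full-width digits plus a trailing CR
```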
C. Duplicate data. For this type of data, especially common in dimension tables, export all fields of the duplicated records so that the customer can confirm and clean them up.
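Exporting "all fields of the duplicated records" is typically a GROUP BY/HAVING query over the business key; a sketch against a hypothetical customer dimension:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dim_customer (code TEXT, name TEXT)")
db.executemany("INSERT INTO dim_customer VALUES (?, ?)",
               [("C1", "Alice"), ("C1", "Alice"), ("C2", "Bob")])

# Export every field of records whose business key repeats, so the
# customer can confirm which copy to keep.
dups = db.execute("""
    SELECT * FROM dim_customer
    WHERE code IN (SELECT code FROM dim_customer
                   GROUP BY code HAVING COUNT(*) > 1)
""").fetchall()
```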
Data cleansing is an iterative process that cannot be completed in a few days; problems can only be discovered and resolved continuously. Whether to filter or to correct generally requires the customer's confirmation. Filtered-out data can be written into Excel files or into a data table; in the early stages of ETL development, emailing the filtered data to the business department daily prompts them to correct errors quickly, and it also serves as a basis for future data verification. One caution on data cleansing: do not filter out useful data. Verify every filtering rule carefully and have the users confirm it.
2. Data Conversion
The data conversion task mainly covers converting inconsistent data, converting data granularity, and computing some business rules.
A. Inconsistent data conversion. This is an integration process that unifies identical data represented differently across business systems. For example, the same supplier's code in the settlement system is XX0001, while in CRM it is YY0001; after extraction, both are converted to a single unified code.
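In practice this is driven by a mapping (or conformance) table; a tiny sketch, where the system names and unified code `SUP0001` are illustrative assumptions:

```python
# Hypothetical mapping of per-system supplier codes to one unified code.
# In a real project this would live in a mapping table joined during ETL.
code_map = {
    ("settlement", "XX0001"): "SUP0001",
    ("crm", "YY0001"): "SUP0001",
}

def unify(system, code):
    """Translate a source-system code into the warehouse's unified code."""
    return code_map[(system, code)]
```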
B. Data granularity conversion. Business systems generally store very fine-grained data, while data in the data warehouse is for analysis and does not need to be that detailed. Generally, the business system data is aggregated to the data warehouse's granularity.
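Granularity conversion is essentially a GROUP BY from transaction-level detail up to the warehouse grain; a sketch assuming a hypothetical day-by-store grain:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_detail (day TEXT, store TEXT, amount REAL)")
db.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)",
               [("2023-01-01", "S1", 10.0), ("2023-01-01", "S1", 5.0),
                ("2023-01-01", "S2", 7.0)])

# Roll transaction-level detail up to the warehouse grain (day x store).
daily = db.execute("""
    SELECT day, store, SUM(amount) FROM sales_detail
    GROUP BY day, store ORDER BY store
""").fetchall()
```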
C. Calculation of business rules. Different enterprises have different business rules and different data indicators, and these indicators cannot always be computed by simple addition and subtraction. In such cases, the indicators must be computed in ETL and stored in the data warehouse for analysis.
III. ETL Logs and Warning Notifications
1. ETL logs
ETL logs fall into three classes. The first is the execution-process log, which records each step of the ETL run: the start time of every step and the number of rows of data affected. The second is the error log, written when a module fails, recording the time of each error, the module in which it occurred, and the error message. The third is the overall log, which records only the ETL start and end times. If an ETL tool is used, the tool automatically generates some logs of its own, which can also form part of the ETL logs. The purpose of logging is to be able to know the ETL run status at any time, and to have a record of errors when they occur.
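The three log classes can be sketched with Python's standard `logging` module; the step names and row counts here are made up, and a real project would likely write the records to a log table rather than an in-memory list:

```python
import logging
import time

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")
log.setLevel(logging.INFO)
records = []  # in-memory stand-in for a log table

def run_step(name, func):
    """Execution-process log: start time and rows affected per step;
    failures go to the error log with the failing module and message."""
    start = time.time()
    try:
        rows = func()
        records.append((name, "ok", rows))
        log.info("step %s: %d rows affected in %.2fs",
                 name, rows, time.time() - start)
    except Exception as exc:
        records.append((name, "error", str(exc)))
        log.error("step %s failed: %s", name, exc)

def extract_orders():
    return 42  # pretend 42 rows were extracted

def load_dw():
    raise ValueError("date out of range")  # simulated module failure

etl_start = time.time()  # overall log: ETL start time
run_step("extract_orders", extract_orders)
run_step("load_dw", load_dw)
etl_end = time.time()    # overall log: ETL end time
```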
2. Send a warning
If an ETL error occurs, it is not enough to write the error log; a warning should also be sent to the system administrator. There are several ways to send warnings; the usual one is to email the system administrator with the error message attached, which helps the administrator troubleshoot.
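Building such a warning mail is straightforward with the standard library; this sketch only constructs the message (the addresses are placeholders, and actually sending it would go through `smtplib`):

```python
from email.message import EmailMessage

def build_alert(step, error):
    """Build (but do not send) a warning mail for the administrator,
    with the failing step and error message in the body."""
    msg = EmailMessage()
    msg["From"] = "etl@example.com"      # placeholder sender
    msg["To"] = "admin@example.com"     # placeholder administrator address
    msg["Subject"] = f"ETL error in step {step}"
    msg.set_content(f"Step {step} failed during the ETL run:\n{error}")
    return msg

alert = build_alert("load_dw", "invalid date format")
# In production this would be handed to smtplib.SMTP(...).send_message(alert).
```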
ETL is a key part of BI projects and a long-term process. Only by constantly discovering and solving problems can ETL be made more efficient and provide accurate data for the later stages of the project.