ETL is the process by which data from the business systems is extracted, cleaned and transformed, and loaded into the data warehouse. Its purpose is to integrate the enterprise's scattered, messy data with inconsistent standards and to provide a basis for analysis that supports the enterprise's decisions. ETL is an important part of a BI project: it typically takes at least one third of the total project time, and the quality of the ETL design directly determines the success or failure of the BI project.
ETL design is divided into three parts: data extraction, data cleaning and transformation, and data loading, and the design work proceeds along the same three parts. Data is extracted from the various data sources into the ODS (Operational Data Store); some cleaning and transformation can already be done during this step, and the extraction method should be chosen to make the ETL run as efficient as possible. Of the three parts, the "T" (Transform, i.e. cleaning and transformation) takes the longest; it generally accounts for about two thirds of the total ETL workload. Data loading is usually done immediately after cleaning, by writing the cleaned data into the DW (Data Warehouse).
ETL can be implemented in several ways, of which three are common. The first is to use an ETL tool, such as Oracle OWB, SQL Server 2000 DTS, SQL Server 2005 SSIS, or Informatica; the second is to code it in SQL; the third combines ETL tools with SQL. The first two approaches each have advantages and disadvantages: a tool lets you set up the ETL project quickly and shields you from complex coding work, which increases speed and lowers difficulty, but it lacks flexibility; hand-written SQL is flexible and can make the ETL run more efficiently, but the coding is complex and the technical requirements are higher. The third approach combines the advantages of the first two and greatly improves both the development speed and the efficiency of the ETL.
I. Data Extraction (Extract)
A lot of the work for this part is done during the requirements research phase: first find out how many business systems the data comes from, which DBMS each business system's database server runs, whether there is manually maintained data and how much of it, whether there is unstructured data, and so on. Once this information has been collected, the data extraction can be designed.
1. Data sources that use the same DBMS as the data warehouse
This type of data source is relatively easy to handle. In general, the DBMS (SQL Server, Oracle) provides a database link feature: a direct link is established between the DW database server and the source business system, and the data can then be accessed simply by writing a SELECT statement.
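As a minimal sketch, assuming a linked server named ERP_SRV has already been created on the DW server (the server, database, and table names here are illustrative), the source table can then be read with an ordinary four-part-name SELECT:

    -- Read a source table over an existing linked server.
    SELECT CustomerID, CustomerName, Region
    FROM ERP_SRV.ErpDb.dbo.Customer;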
2. Data sources that use a different DBMS from the data warehouse
For this type of data source, a database link can generally still be established, for example through ODBC between SQL Server and Oracle. If a database link cannot be established, there are two alternatives: export the source data to .txt or .xls files with a tool and then import those files into the ODS, or transfer the data through a program interface.
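For the file route, a minimal sketch might look like the following, assuming the source system has exported a tab-delimited .txt file with a header row and a staging table with a matching layout already exists in the ODS (the file path and table name are illustrative):

    -- Load an exported text file into an ODS staging table.
    BULK INSERT ods.dbo.Stg_Supplier
    FROM 'D:\etl\export\supplier.txt'
    WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2);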
3. For file-type data sources (.txt, .xls), business staff can be trained to use database tools to import the data into a specified database, from which it is then extracted, or the import can be done with a tool.
4. Issues with incremental updates
For systems with large data volumes, incremental extraction must be considered. In general, the business system records the time at which each transaction occurs, and this can be used as the incremental flag: before each extraction, determine the maximum timestamp already recorded in the ODS, then fetch from the business system all records whose timestamp is greater than that value. This relies on the business system's timestamp; in practice, business systems often have no timestamp at all, or only on some of their tables.
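A minimal sketch of timestamp-based incremental extraction, assuming the source table carries a LastModified column that is also kept in the ODS (all table and column names are illustrative):

    -- Find the high-water mark already loaded into the ODS ...
    DECLARE @last_loaded DATETIME;
    SELECT @last_loaded = MAX(LastModified) FROM ods.dbo.Ods_Order;

    -- ... and pull only the source rows that are newer than it.
    INSERT INTO ods.dbo.Ods_Order (OrderID, CustomerID, Amount, LastModified)
    SELECT OrderID, CustomerID, Amount, LastModified
    FROM ERP_SRV.ErpDb.dbo.Orders
    WHERE LastModified > @last_loaded;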
II. Data Cleaning and Transformation (Cleaning, Transform)
In general, the data warehouse environment is divided into two layers, the ODS and the DW. Common practice is to clean the data on the way from the business system to the ODS, filtering out dirty and incomplete data, and to transform it on the way from the ODS to the DW, where business rules are calculated and aggregated.
1. Data cleaning
The task of data cleansing is to filter out data that does not meet the requirements and hand the filtered results to the responsible business department, which confirms whether the data should be discarded or corrected by the business unit before it is extracted again.
Non-conforming data falls mainly into three categories: incomplete data, erroneous data, and duplicate data.
(1) Incomplete data: this is mainly data in which information that should be present is missing, such as a missing supplier name, branch name, or customer region, or a master record in the business system with no matching detail records. This kind of data is filtered out and, grouped by what is missing, written to separate Excel files that are submitted to the customer, who is asked to complete it within a specified time. The data is written to the data warehouse only after it has been completed.
(2) Erroneous data: these errors arise because the business system is not robust enough and writes input straight to the back-end database without validation, for example numeric fields entered as full-width characters, string data followed by a stray carriage return, incorrect date formats, or out-of-range dates. This type of data also has to be classified. Problems such as full-width characters or invisible characters before or after the data can only be found by writing SQL statements; the customer is then asked to correct them in the business system, and the data is extracted again afterwards. Incorrect date formats and out-of-range dates will cause the ETL run to fail; these errors have to be picked out of the business system database with SQL and handed to the business department with a deadline for correction, after which the data is extracted again.
(3) Duplicate data: for this type of data, which occurs especially in dimension tables, export all the fields of the duplicated records so that the customer can confirm and consolidate them; query sketches for all three categories follow below.
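As a minimal illustration of the checks described above, the sketches below use hypothetical ODS tables; every table and column name is an assumption made for the example.

    -- (1) Incomplete data: master records with no matching detail rows,
    --     and records with a missing supplier name.
    SELECT m.OrderID
    FROM ods.dbo.OrderMaster m
    LEFT JOIN ods.dbo.OrderDetail d ON d.OrderID = m.OrderID
    WHERE d.OrderID IS NULL;

    SELECT SupplierID
    FROM ods.dbo.Supplier
    WHERE SupplierName IS NULL OR SupplierName = '';

    -- (2) Erroneous data: rows whose date value cannot be parsed as a date
    --     (assumes the source stores the date in a character column), which
    --     would otherwise make the load fail.
    SELECT OrderID, OrderDate
    FROM ods.dbo.Stg_Order
    WHERE ISDATE(OrderDate) = 0;

    -- (3) Duplicate data: every field of the dimension records whose business
    --     key appears more than once, for the customer to review.
    SELECT s.*
    FROM ods.dbo.Dim_Customer s
    JOIN (SELECT CustomerCode
          FROM ods.dbo.Dim_Customer
          GROUP BY CustomerCode
          HAVING COUNT(*) > 1) dup
      ON dup.CustomerCode = s.CustomerCode
    ORDER BY s.CustomerCode;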
Data cleansing is an iterative process that cannot be completed in a few days; problems can only be found and solved continuously. Whether a record is filtered out or corrected generally needs the customer's confirmation. Filtered data should be written to Excel files or to a dedicated filter table; in the early stages of ETL development, an e-mail containing the day's filtered data can be sent to the business units every day, prompting them to correct the errors as soon as possible, and these files also serve as a basis for verifying the data later. One thing to watch in data cleaning is not to filter out useful data: every filtering rule should be carefully verified and confirmed by the users.
2. Data Conversion
The task of data transformation consists mainly of converting inconsistent data, converting data granularity, and calculating certain business rules.
(1) Inconsistent data conversion: this is an integration step in which the same kind of data from different business systems is unified. For example, the same vendor may be coded XX0001 in the billing system but YY0001 in the CRM system; after extraction, these are unified into a single code.
(2) Data granularity conversion: business systems generally store very detailed data, whereas the data in the data warehouse is used for analysis and does not need to be that detailed. The business system data is therefore usually aggregated to the granularity of the data warehouse.
(3) Calculation of business rules: different enterprises have different business rules and different indicators, and these indicators often cannot be obtained by simple addition and subtraction. In that case the indicators have to be calculated during the ETL process and stored in the data warehouse for analysis. Sketches of all three kinds of transformation follow below.
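As with cleaning, the sketches below are only illustrations; the mapping table, fact tables, and all other names are assumptions made for the example.

    -- (1) Inconsistent data: translate the billing system's vendor code into
    --     the unified DW code via a manually maintained mapping table.
    SELECT map.DwVendorCode, src.VendorName, src.InvoiceAmount
    FROM ods.dbo.Stg_Invoice src
    JOIN dw.dbo.VendorCodeMap map
      ON map.SourceSystem = 'BILLING'
     AND map.SourceVendorCode = src.VendorCode;

    -- (2) Granularity: roll transaction-level ODS rows up to a daily,
    --     per-store fact table.
    INSERT INTO dw.dbo.Fact_DailySales (SalesDate, StoreID, SalesAmount, OrderCount)
    SELECT CONVERT(VARCHAR(10), OrderTime, 120), StoreID, SUM(Amount), COUNT(*)
    FROM ods.dbo.Ods_Order
    GROUP BY CONVERT(VARCHAR(10), OrderTime, 120), StoreID;

    -- (3) Business rule: a derived indicator (here a hypothetical gross-margin
    --     rate) that is computed during the load rather than simply summed.
    INSERT INTO dw.dbo.Fact_ProductMargin (ProductID, SalesAmount, CostAmount, MarginRate)
    SELECT ProductID, SUM(SalesAmount), SUM(CostAmount),
           CASE WHEN SUM(SalesAmount) = 0 THEN 0
                ELSE (SUM(SalesAmount) - SUM(CostAmount)) * 1.0 / SUM(SalesAmount) END
    FROM ods.dbo.Ods_SalesDetail
    GROUP BY ProductID;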
III. ETL Logs and Warning Notifications
1. ETL Log
ETL logs fall into three categories.
The first is the execution log. It records every step of the ETL run: for each run it records the start time of each step and the number of rows of data affected, in the form of a running journal.
The second is the error log. When a module fails, an error log entry is written, recording the time of the error, the module that failed, and the error message.
The third is the overall log, which records only the ETL start time and end time and whether the run was successful. If an ETL tool is used, the tool automatically generates some logs of its own, which can also be treated as part of the ETL log.
The purpose of logging is to make it possible to know at any time how the ETL is running and, if it fails, to know where the error occurred.
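A sketch of what an execution-log table and one of its entries might look like; the table layout and all names are assumptions rather than a prescribed design.

    CREATE TABLE dw.dbo.EtlStepLog (
        LogID        INT IDENTITY(1,1) PRIMARY KEY,
        StepName     VARCHAR(100),
        StartTime    DATETIME,
        EndTime      DATETIME,
        RowsAffected INT,
        ErrorMessage VARCHAR(4000) NULL
    );

    -- Example entry written by a step when it finishes (the values are illustrative).
    INSERT INTO dw.dbo.EtlStepLog (StepName, StartTime, EndTime, RowsAffected)
    VALUES ('Load Fact_DailySales', '2012-05-01 02:00:00', '2012-05-01 02:05:12', 183250);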
2. Warning Notification
If the ETL fails, not only should an ETL error log be written, but a warning should also be sent to the system administrator. There are many ways to send a warning; a common one is to e-mail the system administrator with the error message attached, so that the administrator can troubleshoot the problem.
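On SQL Server, for instance, the e-mail could be sent through Database Mail (available from SQL Server 2005); the sketch below assumes a mail profile named 'EtlMail' has already been configured, and the recipient and message text are illustrative.

    -- Notify the administrator of a failed step via Database Mail.
    EXEC msdb.dbo.sp_send_dbmail
         @profile_name = 'EtlMail',
         @recipients   = 'dba@example.com',
         @subject      = 'ETL run failed',
         @body         = 'Step "Load Fact_DailySales" failed; see EtlStepLog for details.';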
ETL is a key part of a BI project and is also a long-term process. Only by continuously finding and solving problems can the ETL be made to run efficiently and provide accurate data for the later stages of the BI project.
PostScript: in building a data warehouse system, ETL is a key link. Described broadly, ETL is a data integration solution; described narrowly, it is just a tool for moving data around. Looking back over my work, I have handled a great deal of data migration and transformation, but those jobs were basically one-off tasks or involved small amounts of data. In a data warehouse system, however, ETL has been raised to a certain theoretical level, which sets it apart from the way such tools used to be applied. The difference is visible in the name itself: the processing of the data has been divided into three steps, E, T, and L, standing for extraction, transformation, and loading.
In fact, the ETL process is simply the process of data flowing from various sources to various targets. In a data warehouse, however, ETL has several distinctive characteristics.
One is data synchronization: ETL is not a one-off job that pulls the data across and is done with it, but a regular activity that runs on a fixed cycle, and some people have even proposed the concept of real-time ETL.
The other is the data volume, which is generally huge and makes it worthwhile to split the flow of data into E, T, and L.

There are now many mature tools that provide ETL functionality, and I will not say whether they are good or bad. From an application point of view, the ETL process itself is not very complex, and these tools bring great convenience to data warehouse projects, especially convenience in development and maintenance. On the other hand, developers easily get lost in these tools. For example, VB is a very simple language and a very easy-to-use programming tool, and it is particularly quick to pick up, yet how many true VB masters are there? Microsoft's products usually follow the principle of "treat the user as a fool"; under this principle Microsoft's products are indeed very easy to use, but for a developer, if you treat yourself as a fool, then you really become one. The same is true of ETL tools. They give us a graphical interface and let us concentrate our effort on the rules in order to improve development efficiency. Judging by the results, these tools can indeed be used to build a job very quickly to process a certain set of data, but on the whole it is unlikely that overall development efficiency will be much higher.

The problem lies not mainly with the tools but with the designers and developers: they get lost in the tool and fail to explore the nature of ETL. These tools have been applied for so long, in so many projects and environments, that they must have their merits and must reflect something of the nature of ETL. If we only use the tools on the surface, without looking at the ideas behind them, what we end up with is a pile of isolated jobs, and putting them together is still a huge amount of work. As the saying goes, "theory must be combined with practice": to excel in a field, one must reach a certain height at the theoretical level.