ETL is the process that the data of the business system is loaded into the data warehouse after being extracted and cleaned, the aim is to integrate the scattered, messy and standard data in the enterprise to provide the analysis basis for the decision of the enterprise.
ETL is the most important aspect of BI project, usually the ETL will spend 1/3 of the whole project time, ETL design directly connected to the success or failure of the BI project. ETL is also a long-term process, only the constant discovery of problems and solve problems, can make ETL running more efficient, for the project later development to provide accurate data.
ETL design is divided into three parts: data extraction, data cleaning conversion, data loading. In the design of the ETL is also from these three parts set out. Data extraction from different data sources to the ODS (this process can also do some data cleaning and conversion), in the extraction process needs to select different extraction methods, as far as possible to improve the operational efficiency of the ETL. ETL three parts, the longest time is the T (cleaning, conversion) of the part, in general, this part of the workload is the entire ETL 2/3. The loading of the data is typically written directly to the DW after the data has been cleaned.
The implementation of the ETL has a variety of methods, commonly used in three kinds, the first is the use of ETL tools such as Oracle OWB, SQL Server 2000 DTS, SQL Server2005 SSIS Services, informatic and so on, the second is the implementation of SQL, The third is the combination of ETL tools and SQL. The first two methods have advantages and disadvantages, the use of tools can quickly establish the ETL project, shielding complex coding tasks, improve speed, reduce the difficulty, but lack of flexibility. The advantage of SQL method is flexible, improve the efficiency of ETL operation, but the coding is complex and the technical requirement is high. The third is the combination of the previous two advantages, greatly improve the development of ETL speed and efficiency.
Extraction of data
Data extraction needs to do a lot of work in the research phase, first of all to understand the following questions: data from several business systems? What DBMS does the database server for each business system run? Is there manual data, how much is the manual data quantity? Is there unstructured data? And so on similar issues, When this information is collected, the design of data extraction can be done.