ETL Design in BI project of data extraction, cleaning and transformation

Source: Internet
Author: User
Tags manual

ETL is the process that the data of the business system is loaded into the data warehouse after being extracted and cleaned, the aim is to integrate the scattered, messy and standard data in the enterprise to provide the analysis basis for the decision of the enterprise.

ETL is the most important aspect of BI project, usually the ETL will spend 1/3 of the whole project time, ETL design directly connected to the success or failure of the BI project. ETL is also a long-term process, only the constant discovery of problems and solve problems, can make ETL running more efficient, for the project later development to provide accurate data.

ETL design is divided into three parts: data extraction, data cleaning conversion, data loading. In the design of the ETL is also from these three parts set out. Data extraction from different data sources to the ODS (this process can also do some data cleaning and conversion), in the extraction process needs to select different extraction methods, as far as possible to improve the operational efficiency of the ETL. ETL three parts, the longest time is the T (cleaning, conversion) of the part, in general, this part of the workload is the entire ETL 2/3. The loading of the data is typically written directly to the DW after the data has been cleaned.

The implementation of the ETL has a variety of methods, commonly used in three kinds, the first is the use of ETL tools such as Oracle OWB, SQL Server 2000 DTS, SQL Server2005 SSIS Services, informatic and so on, the second is the implementation of SQL, The third is the combination of ETL tools and SQL. The first two methods have advantages and disadvantages, the use of tools can quickly establish the ETL project, shielding complex coding tasks, improve speed, reduce the difficulty, but lack of flexibility. The advantage of SQL method is flexible, improve the efficiency of ETL operation, but the coding is complex and the technical requirement is high. The third is the combination of the previous two advantages, greatly improve the development of ETL speed and efficiency.

Extraction of data

Data extraction needs to do a lot of work in the research phase, first of all to understand the following questions: data from several business systems? What DBMS does the database server for each business system run? Is there manual data, how much is the manual data quantity? Is there unstructured data? And so on similar issues, When this information is collected, the design of data extraction can be done.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.