ETL design in BI projects: data extraction, cleaning, and conversion

Source: Internet
Author: User
Tags: SSIS

ETL is the process of extracting data from business systems, cleaning and transforming it, and loading it into a data warehouse. It aims to integrate an enterprise's scattered, disorderly, and inconsistently standardized data to provide a basis for analysis in enterprise decision-making.

ETL is the most important part of a BI project. It usually takes about 1/3 of the total project time, and the quality of the ETL design largely determines the success or failure of the BI project. ETL is also a long-term process: only by constantly discovering and solving problems can ETL run more efficiently and provide accurate data for later project development.

ETL design is divided into three parts: data extraction, data cleaning and conversion, and data loading. Designing an ETL process starts from these same three parts. Data extraction pulls data from the various data sources into the operational data store (ODS); some cleaning and conversion can also be done at this stage. During extraction, choose the extraction method for each source carefully so that ETL runs as efficiently as possible. Of the three parts, T (cleaning and conversion) takes the longest; it generally accounts for 2/3 of the whole ETL workload. Data loading generally writes the data directly to the data warehouse (DW) once cleaning is complete.

There are several ways to implement ETL, three of which are common. The first uses ETL tools such as Oracle's OWB, DTS in SQL Server 2000, SSIS in SQL Server, and Informatica. The second uses SQL. The third combines ETL tools with SQL. The first two methods each have advantages and disadvantages. Tools let you set up an ETL project quickly and shield you from complicated coding, which increases speed and reduces difficulty, but they lack flexibility. The SQL approach is flexible and makes ETL run more efficiently, but the code is complex and the technical requirements are high. The third method combines the advantages of the first two and can greatly improve ETL development speed and efficiency.

  Data Extraction

Data extraction requires a lot of work in the research phase. First, clarify the following questions: How many business systems does the data come from? Which DBMS does each business system's database server run? Is there manually maintained data? Is there any unstructured data? And so on. Data extraction can be designed only after this information has been collected.

1. Data sources that use the same database system as the DW

This type of data source is easy to design for. Generally, a DBMS (including SQL Server and Oracle) provides a database link function. Once a direct link is established between the DW database server and the original business system, the source can be accessed directly with SELECT statements.
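As a rough illustration of the idea, the sketch below uses SQLite's `ATTACH DATABASE` as a stand-in for a database link, since it lets one connection query a second database with a plain SELECT. This is only an analogy: SQL Server and Oracle use their own linked-server and database-link mechanisms. The schema name `src` and the tables `orders` and `ods_orders` are hypothetical.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create a demo DW connection with a second database attached as 'src'."""
    dw = sqlite3.connect(":memory:")
    # ATTACH plays the role of the database link in this sketch.
    dw.execute("ATTACH DATABASE ':memory:' AS src")
    dw.execute("CREATE TABLE src.orders (order_id INTEGER, amount REAL)")
    dw.executemany("INSERT INTO src.orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])
    dw.execute("CREATE TABLE ods_orders (order_id INTEGER, amount REAL)")
    dw.commit()
    return dw


def extract_orders(dw: sqlite3.Connection) -> int:
    """Copy source rows into the ODS table with a single SELECT across the link."""
    cur = dw.execute(
        "INSERT INTO ods_orders (order_id, amount) "
        "SELECT order_id, amount FROM src.orders"
    )
    dw.commit()
    return cur.rowcount  # number of rows inserted
```

Calling `extract_orders(build_demo())` copies the two demo rows into `ods_orders`.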

2. Data sources that use a different database system from the DW

For this type of data source, a database link can generally also be established through ODBC, for example between SQL Server and Oracle. If a database link cannot be created, there are two options. One is to export the source data to .txt files with a tool and then import these files into the ODS. The other is to go through a program interface.
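The export-and-import route might look like the following minimal sketch. It assumes the source system was exported as a tab-delimited text file with a header row; the table `ods_supplier` and the column names are hypothetical, and SQLite stands in for the ODS database.

```python
import csv
import sqlite3
from typing import TextIO


def load_text_export(ods: sqlite3.Connection, text_file: TextIO) -> int:
    """Load a tab-delimited export (with header row) into the ODS table."""
    reader = csv.DictReader(text_file, delimiter="\t")
    rows = [(r["supplier_code"], r["supplier_name"]) for r in reader]
    ods.executemany(
        "INSERT INTO ods_supplier (supplier_code, supplier_name) VALUES (?, ?)",
        rows,
    )
    ods.commit()
    return len(rows)  # number of rows loaded
```

In practice `text_file` would be an opened export file, for example `open(path, newline="", encoding="utf-8")`. The encoding is an assumption to check against the actual export.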

3. For file data sources (.txt, .xls), you can train business personnel to import the data into a designated database with database tools, and then extract from that database. Alternatively, you can use tools, such as the flat file source and flat file destination components of SSIS in SQL Server 2005, to import the data into the ODS.

4. Incremental updates

For systems with large data volumes, incremental extraction must be considered. Generally, the business system records the time at which each business event occurs, and this time can be used as the incremental marker. Before each extraction, determine the maximum time recorded in the ODS, then fetch from the business system all records later than that time. This approach relies on timestamps in the business system; in practice, a business system often has no timestamps or has them on only some of its tables.
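The maximum-time approach described above can be sketched as follows. This is a minimal illustration, assuming timestamps are stored as ISO-style text so that string comparison matches time order; the tables `orders` and `ods_orders`, the column `order_time`, and the helper `make_demo_databases` are all hypothetical.

```python
import sqlite3

COLUMNS = "order_id, order_time, amount"


def extract_increment(source: sqlite3.Connection, ods: sqlite3.Connection) -> int:
    """Copy source rows newer than the latest order_time already in the ODS."""
    last_time = ods.execute("SELECT MAX(order_time) FROM ods_orders").fetchone()[0]
    if last_time is None:
        # First run: the ODS is empty, so take everything.
        rows = source.execute(f"SELECT {COLUMNS} FROM orders").fetchall()
    else:
        rows = source.execute(
            f"SELECT {COLUMNS} FROM orders WHERE order_time > ?", (last_time,)
        ).fetchall()
    ods.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)", rows)
    ods.commit()
    return len(rows)  # number of newly extracted rows


def make_demo_databases():
    """Build a small in-memory source system and an empty ODS for experiments."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (order_id INTEGER, order_time TEXT, amount REAL)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "2024-01-01 09:00:00", 10.0), (2, "2024-01-02 14:30:00", 25.5)],
    )
    source.commit()
    ods = sqlite3.connect(":memory:")
    ods.execute("CREATE TABLE ods_orders (order_id INTEGER, order_time TEXT, amount REAL)")
    return source, ods
```

One design caveat: the strict `>` comparison skips a late-arriving row whose time equals the current maximum in the ODS, so a real job may need an overlap window or a sequence column instead.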

  Data Cleaning and Conversion

Generally, the data warehouse is divided into two parts: ODS and DW. The common practice is to clean data on the way from the business system to the ODS, filtering out dirty and incomplete data, and then to convert it on the way from the ODS to the DW, computing and aggregating according to business rules.

1. Data cleansing

The task of data cleansing is to filter out non-conforming data and hand the filtered results to the competent business department, which confirms whether the data should be discarded or corrected by the business unit and then extracted again. Non-conforming data falls mainly into three categories: incomplete data, erroneous data, and duplicate data.

  • A. Incomplete data is characterized by missing information that should be present, such as a missing supplier name, branch name, or customer region, or master-table rows in the business system that cannot be matched to their detail-table rows. This type of data needs to be filtered out, written into separate Excel files according to what is missing, and submitted to the customer, who must supply the missing values within a specified time. The data is written to the data warehouse only after it has been completed.
  • B. Erroneous data arises because the business system is not robust enough and writes input directly to the back-end database without validating it. Examples include numeric data entered as full-width digit characters, string data followed by a carriage return, incorrectly formatted dates, and out-of-range dates. This type of data also needs to be classified. Problems such as full-width characters and invisible characters before or after the data can only be found by writing SQL; the customer is then asked to correct them in the business system before the data is extracted again. Errors such as incorrect date formats or out-of-range dates can cause ETL runs to fail; they need to be picked out of the business system database with SQL and handed to the competent business department for correction within a time limit, after which the data is extracted again.
  • C. Duplicate data is especially common in dimension tables. Export all fields of the duplicated records for the customer to confirm and sort out.
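The three categories above could be separated with checks along the following lines. This is a sketch under stated assumptions, not a complete rule set: the field names `supplier_name`, `amount`, and `order_date`, the `YYYY-MM-DD` date format, and the bucket labels are all hypothetical, and rows are assumed to arrive as dictionaries of strings.

```python
from datetime import datetime


def classify(row: dict, seen: set) -> str:
    """Return 'incomplete', 'error', 'duplicate', or 'ok' for one row."""
    # A. Incomplete: a required field is missing or empty.
    if not row.get("supplier_name"):
        return "incomplete"
    # B. Erroneous: full-width digits, stray whitespace such as a trailing
    # carriage return, or a date that does not parse or is out of range.
    amount = row["amount"]
    if amount != amount.strip() or not amount.isascii():
        return "error"
    try:
        datetime.strptime(row["order_date"], "%Y-%m-%d")
    except ValueError:
        return "error"
    # C. Duplicate: every field matches a row already seen.
    key = tuple(sorted(row.items()))
    if key in seen:
        return "duplicate"
    seen.add(key)
    return "ok"


def split_rows(rows: list) -> dict:
    """Group rows into one bucket per category for review by the business unit."""
    buckets = {"ok": [], "incomplete": [], "error": [], "duplicate": []}
    seen = set()
    for row in rows:
        buckets[classify(row, seen)].append(row)
    return buckets
```

Note that this sketch keeps the first copy of a repeated row and flags only the later copies. To export every copy for the customer to confirm, as described above, the first occurrence would need to be reported as well.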

Data cleansing is an iterative process that cannot be completed within a few days; it proceeds only by continuously discovering and solving problems. The customer is generally required to confirm whether data should be filtered out or corrected. Filtered-out data should be written to an Excel file or to a data table. In the early stage of ETL development, you can email the filtered data to the business unit to prompt them to correct errors as soon as possible; the same data can also serve as a basis for future data verification. When cleansing, take care not to filter out useful data: verify each filtering rule carefully and confirm it with the user.

2. Data Conversion

Data conversion tasks mainly involve inconsistent data conversion, data granularity conversion, and calculation of some business rules.

  • A. Conversion of inconsistent data. This is an integration process that unifies data of the same type from different business systems. For example, the same supplier may have the code xx0001 in the settlement system and yy0001 in CRM; after extraction, both are converted to a single code.
  • B. Data granularity conversion. Business systems generally store very detailed data, while data in the data warehouse is used for analysis and does not need to be as detailed. Generally, the business system data is aggregated to the granularity of the data warehouse.
  • C. Calculation of business rules. Different enterprises have different business rules and different data indicators, and these indicators sometimes cannot be obtained by simple addition and subtraction. In such cases they are computed in ETL and then stored in the data warehouse for analysis.
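The first two kinds of conversion can be illustrated together. The sketch below maps the source codes from the example (xx0001 and yy0001) to one unified code and aggregates detailed records to daily totals. The unified code `S0001`, the daily granularity, and the record layout `(code, timestamp, amount)` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from each system's supplier code to one unified code.
CODE_MAP = {"xx0001": "S0001", "yy0001": "S0001"}


def unify_code(code: str) -> str:
    """Translate a source-system code; unknown codes pass through unchanged."""
    return CODE_MAP.get(code, code)


def aggregate_daily(records: list) -> dict:
    """Sum amounts per (unified supplier code, day).

    Each record is (code, timestamp, amount), with the timestamp given as
    'YYYY-MM-DD HH:MM:SS' text so the first ten characters are the day.
    """
    totals = defaultdict(float)
    for code, timestamp, amount in records:
        day = timestamp[:10]
        totals[(unify_code(code), day)] += amount
    return dict(totals)
```

For example, a settlement record for xx0001 and a CRM record for yy0001 on the same day end up in a single `("S0001", day)` total.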

  ETL Logs and Warnings

1. ETL logs. The purpose of logging is to know at any time how ETL is running and, if an error occurs, where it occurred.

ETL logs are classified into three types. The first type is the execution process log, which records each step as ETL executes: the start time of each step in each run and the number of rows affected, kept as a running chronological record. The second type is the error log. When an error occurs in a module, an error log entry records the time of the error, the module in which it occurred, and the error information. The third type is the overall log, which records only the ETL start time and end time.

If you use an ETL tool, the tool automatically generates some logs, and these can also serve as part of the ETL logs.
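For hand-written ETL, the three log types could be kept in three tables, roughly as sketched below. The table names `etl_step_log`, `etl_error_log`, and `etl_run_log`, and the convention that each step is a function returning its affected row count, are assumptions made for this example.

```python
import sqlite3
from datetime import datetime


def now() -> str:
    return datetime.now().isoformat(timespec="seconds")


def create_log_tables(db: sqlite3.Connection) -> None:
    db.execute("CREATE TABLE etl_step_log (step TEXT, started TEXT, row_count INTEGER)")
    db.execute("CREATE TABLE etl_error_log (occurred TEXT, module TEXT, message TEXT)")
    db.execute("CREATE TABLE etl_run_log (started TEXT, finished TEXT)")


def run_step(db: sqlite3.Connection, step: str, func) -> bool:
    """Run one step; write an execution log row on success, an error row on failure."""
    started = now()
    try:
        row_count = func()
    except Exception as exc:
        db.execute("INSERT INTO etl_error_log VALUES (?, ?, ?)", (now(), step, repr(exc)))
        return False
    db.execute("INSERT INTO etl_step_log VALUES (?, ?, ?)", (step, started, row_count))
    return True


def run_etl(db: sqlite3.Connection, steps: list) -> bool:
    """Run steps in order, stopping at the first failure, then write the overall log."""
    started = now()
    # all() short-circuits, so steps after a failed one are not executed.
    ok = all(run_step(db, step, func) for step, func in steps)
    db.execute("INSERT INTO etl_run_log VALUES (?, ?)", (started, now()))
    db.commit()
    return ok
```

Stopping at the first failed step is a design choice of this sketch; a job whose steps are independent might instead log the error and continue.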

2. Sending warnings

If ETL fails, it is not enough to write an ETL error log; a warning must also be sent to the system administrator. There are many ways to send a warning. A commonly used one is to email the system administrator with the error information included, which makes it easier for the administrator to troubleshoot.
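An email warning of this kind might be built with Python's standard `email` and `smtplib` modules, as in the sketch below. The addresses, the SMTP host and port, and the message wording are placeholders, not values from any real system.

```python
import smtplib
from email.message import EmailMessage


def build_alert(module: str, error_text: str,
                admin: str = "etl-admin@example.com") -> EmailMessage:
    """Compose a warning email that carries the failing module and error text."""
    msg = EmailMessage()
    msg["Subject"] = f"ETL failure in {module}"
    msg["From"] = "etl@example.com"  # placeholder sender address
    msg["To"] = admin
    msg.set_content(f"Module: {module}\nError: {error_text}\n")
    return msg


def send_alert(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Send the message through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

Keeping message construction separate from sending makes the content easy to check without a mail server. `send_alert` would typically be called from the branch of the job that writes the error log.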
