Learning Notes on ETL Concepts

Source: Internet
Author: User
Tags: ssis

There are two common ways to describe it:
ETL, short for extract-transform-load, is the process of data extraction (Extract), transformation (Transform), and loading (Load). It is an important part of building a data warehouse.

ETL is the process of extracting data from business systems, cleaning and transforming it, and loading it into a data warehouse. Its goal is to integrate the scattered, disorderly, and inconsistently standardized data in an enterprise and provide a basis for analysis and decision-making. ETL is an important part of BI projects; in general, ETL takes about one third of the total project time, and the quality of the ETL design directly determines the success or failure of a BI project.

In the construction of a data warehouse, ETL runs throughout the project. It is the lifeline of the entire data warehouse and covers data cleansing, integration, conversion, and loading. If the data warehouse is a building, ETL is its foundation: the quality of data extraction and integration directly affects the final results. ETL therefore plays a key role in the entire data warehouse project and must be given a very important position.

Through ETL, we generate a data warehouse from the data in the source systems. ETL builds a bridge between the OLTP systems and the OLAP system; it is the channel through which data flows from the source systems into the data warehouse. In a data warehouse implementation, ETL determines the data quality of the entire project, so give it an important place and use it to lay a solid foundation for the data warehouse.

ETL relies mainly on the processing power of a dedicated transformation server: data is extracted from the source tables, cleaned and transformed on the transformation server, and then loaded into the target database. Because the transformation runs on this intermediate server, the performance bottleneck usually lies in the transformation server itself.

ETL is designed in three parts: data extraction, data cleaning and conversion, and data loading, and the design work starts from these three parts as well. Data extraction moves data from the various sources into the ODS (Operational Data Store); some cleansing and conversion can also happen during this step. Different extraction methods should be chosen to make the ETL run as efficiently as possible. Of the three parts, the most time-consuming is the "T" (cleaning and conversion); in general it accounts for about two thirds of the total ETL workload. Data loading is generally performed by writing directly to the DW (data warehouse) after cleansing.

There are multiple ways to implement ETL; three are commonly used. The first uses ETL tools (such as Oracle OWB, SQL Server 2000 DTS, SQL Server 2005 SSIS, or Informatica); the second uses hand-written SQL; the third combines ETL tools with SQL. The first two methods each have advantages and disadvantages. With tools, you can set up an ETL project quickly, shielding yourself from complex coding and reducing both effort and difficulty, but at the cost of flexibility. The SQL approach is flexible and can improve ETL efficiency, but the coding is complex and the technical requirements are relatively high. The third approach combines the advantages of the first two and can greatly improve ETL development speed and efficiency.

I. Data Extraction

This part requires a lot of work during the research phase. First, figure out how many business systems the data comes from, which DBMS each business system's database server runs, whether manually maintained data exists and how much of it there is, and whether unstructured data exists. Only after this information has been collected can data extraction be designed.
1. Data sources stored in the same DBMS as the DW
This type of data source is easy to handle. In general, the DBMS (SQL Server, Oracle) provides a database link feature: establish a direct link between the DW database server and the source business system, and you can extract with plain SELECT statements, as in the sketch below.
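
For example, on SQL Server this can be done through a linked server. A minimal sketch, assuming a linked server named SRC_ERP has already been registered (e.g. via sp_addlinkedserver) and that the databases, tables, and columns shown are illustrative:

```sql
-- Hypothetical linked-server extraction: SRC_ERP is an assumed linked
-- server, and the database, table, and column names are illustrative.
INSERT INTO ods.dbo.Customer (CustomerId, CustomerName, Region)
SELECT CustomerId, CustomerName, Region
FROM SRC_ERP.erp_db.dbo.Customer;
```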

2. Data sources stored in a different DBMS than the DW
For this type of source you can usually still establish a database connection, for example via ODBC between SQL Server and Oracle (sketched below). If a database link cannot be established, there are two options: one is to export the source data to flat (.txt) files with a tool and then import those files into the ODS; the other is to go through a program interface.
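
On SQL Server, the ODBC route can be sketched with OPENROWSET, which requires the 'Ad Hoc Distributed Queries' server option to be enabled; the DSN, credentials, and object names below are illustrative assumptions:

```sql
-- Hypothetical ad hoc ODBC access to a non-SQL-Server source via
-- OPENROWSET; DSN, credentials, and names are illustrative.
INSERT INTO ods.dbo.Supplier (SupplierId, SupplierName)
SELECT SupplierId, SupplierName
FROM OPENROWSET('MSDASQL',
                'DSN=OracleSource;UID=etl_user;PWD=secret',
                'SELECT SupplierId, SupplierName FROM suppliers');
```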

3. For flat files (.txt, .xls), you can train business personnel to import the data into a designated database with standard database tools and then extract it from that database, or you can import them with tools such as the Flat File Source and other components of SQL Server 2005 SSIS and load them into the ODS (see the sketch below).
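
As a hand-coded alternative to the SSIS Flat File Source, a .txt extract can be bulk-loaded directly; the file path, delimiters, and staging table below are illustrative assumptions:

```sql
-- Hypothetical flat-file load into an ODS staging table.
BULK INSERT ods.dbo.stg_Supplier
FROM 'D:\etl\inbound\supplier.txt'
WITH (
    FIELDTERMINATOR = ',',   -- column delimiter in the extract
    ROWTERMINATOR   = '\n',  -- one record per line
    FIRSTROW        = 2      -- skip the header row
);
```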

4. Incremental updates
For systems with large data volumes, incremental extraction must be considered. In general, the business system records the time at which each transaction occurred, and this can serve as the incremental marker: before each extraction, determine the maximum timestamp already recorded in the ODS, then fetch from the business system all records later than that time (as sketched below). Using the business system's own timestamp is the usual approach, though in practice a business system may or may not maintain a usable timestamp.
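
A hedged sketch of timestamp-based incremental extraction; the linked server SRC_ERP and the SalesOrder table and columns are illustrative assumptions:

```sql
DECLARE @last_load DATETIME;

-- Maximum time already recorded in the ODS (fallback for the first run).
SELECT @last_load = ISNULL(MAX(ModifiedTime), '19000101')
FROM ods.dbo.SalesOrder;

-- Fetch only records later than that time from the business system.
INSERT INTO ods.dbo.SalesOrder (OrderId, Amount, ModifiedTime)
SELECT OrderId, Amount, ModifiedTime
FROM SRC_ERP.erp_db.dbo.SalesOrder
WHERE ModifiedTime > @last_load;
```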

II. Data Cleaning and Conversion

Generally, a data warehouse environment is divided into the ODS and the DW. The common practice is to clean on the way from the business system to the ODS, filtering out dirty and incomplete data, and to convert on the way from the ODS to the DW, computing and aggregating according to the business rules.

1. Data cleansing
The task of data cleansing is to filter out data that does not meet requirements, hand the filtered results to the responsible business department, and have the business unit confirm whether each record should be discarded or corrected before re-extraction. Non-conforming data falls mainly into three categories: incomplete data, incorrect data, and duplicate data.
(1) Incomplete data: data with missing information, for example a missing supplier name or branch name, missing customer region information, or master and detail records in the business system that cannot be matched. This data is filtered out and written to separate Excel files, one per kind of missing content, and submitted to the customer, who must complete it within a specified time. Only after completion is it written to the data warehouse.
(2) Incorrect data: errors that arise because the business system is not robust enough and writes input directly to the backend database without validation, for example numeric data entered as full-width characters, string data with a trailing carriage return, incorrect date formats, or out-of-range dates. This data also needs to be classified. Problems such as full-width characters and invisible characters before and after values can only be found by writing SQL statements (see the sketch after this list); the customer is then asked to correct them in the business system before re-extraction. Incorrect date formats or out-of-range dates can cause the ETL run to fail; these must be identified with SQL in the business system database and submitted to the responsible business department for correction within a deadline, after which the corrected data is extracted.
(3) Duplicate data: for this category, which appears especially in dimension tables, export all fields of the duplicated records so that the customer can confirm and clean them up (also covered in the sketch below).
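
A minimal sketch of the validation queries described in items (2) and (3); all database, table, and column names are illustrative assumptions:

```sql
-- Strings with stray carriage returns / line feeds (item 2).
SELECT CustomerId, CustomerName
FROM src.dbo.Customer
WHERE CustomerName LIKE '%' + CHAR(13) + '%'
   OR CustomerName LIKE '%' + CHAR(10) + '%';

-- Date values that will not convert (wrong format or out of range),
-- assuming the date is stored as text in the business system.
SELECT OrderId, OrderDateText
FROM src.dbo.SalesOrder
WHERE ISDATE(OrderDateText) = 0;

-- Duplicate keys in a dimension source (item 3); export all fields of
-- the offending records for the customer to review.
SELECT SupplierCode, COUNT(*) AS duplicate_count
FROM src.dbo.Supplier
GROUP BY SupplierCode
HAVING COUNT(*) > 1;
```
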
Data cleansing is an iterative process that cannot be completed within a few days; problems can only be discovered and solved continuously. Whether to filter or to correct generally requires the customer's confirmation. Filtered data can be written to Excel files or to dedicated data tables. In the early stages of ETL development, the filtered data can be emailed to the business unit daily to prompt them to correct errors quickly; it also serves as a basis for future data verification. One caution: do not filter out useful data. Verify every filtering rule carefully and have the users confirm it.

2. Data Conversion
The data conversion task mainly covers conversion of inconsistent data, conversion of data granularity, and calculation of business rules.
(1) Conversion of inconsistent data: this is an integration process that unifies the same data across different business systems. For example, a supplier's code is xx0001 in the settlement system but yy0001 in the CRM system; after extraction, both are converted to a single unified code (see the sketch after this list).
(2) Conversion of data granularity: business systems generally store very detailed data, while data in the data warehouse is used for analysis and does not need that level of detail. Business system data is therefore generally aggregated to the data warehouse's granularity.
(3) Calculation of business rules: different enterprises have different business rules and different data indicators, and these indicators cannot always be produced by simple addition and subtraction. Such indicators need to be computed in the ETL process and stored in the data warehouse for analysis.
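
A hedged sketch combining items (1) and (2): supplier codes are unified through an assumed code-mapping table (e.g. CRM code yy0001 mapped to xx0001), and detail rows are aggregated to an assumed daily warehouse grain. All names are illustrative, and the DATE type assumes SQL Server 2008 or later:

```sql
INSERT INTO dw.dbo.FactSalesDaily (SupplierCode, SalesDate, TotalAmount)
SELECT m.UnifiedCode,                    -- (1) unified supplier code
       CAST(o.OrderTime AS DATE),       -- (2) roll up to daily grain
       SUM(o.Amount)
FROM ods.dbo.SalesOrder AS o
JOIN ods.dbo.SupplierCodeMap AS m
  ON m.SourceSystem = 'CRM'
 AND m.SourceCode   = o.SupplierCode
GROUP BY m.UnifiedCode, CAST(o.OrderTime AS DATE);
```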

III. ETL Logs and Alerts

1. ETL logs
ETL logs fall into three types. The first is the execution-process log, a running record of each step of the ETL execution, journal-style: the start time of each step and the number of rows of data it affected. The second type is the error log: when a module fails, record the time of the error, the module that failed, and the error message. The third type is the overall log, which records only the ETL start and end times. If you use an ETL tool, it will also generate logs of its own, which can form part of the ETL log. The purpose of logging is to know the ETL running status at any time and, if an error occurs, to know where it occurred.
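
An assumed minimal schema covering the three log types above; it is an illustrative sketch, not a standard. Step rows serve as the execution-process log, rows with Status = 'ERROR' serve as the error log, and a package-level row serves as the overall log:

```sql
CREATE TABLE etl.dbo.EtlLog (
    LogId        INT IDENTITY(1, 1) PRIMARY KEY,
    PackageName  NVARCHAR(128)  NOT NULL,
    StepName     NVARCHAR(128)  NULL,     -- NULL for the overall row
    StartTime    DATETIME       NOT NULL,
    EndTime      DATETIME       NULL,
    RowsAffected INT            NULL,
    Status       NVARCHAR(16)   NOT NULL, -- 'OK' / 'ERROR' / 'PACKAGE'
    ErrorMessage NVARCHAR(2000) NULL
);
```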

2. Sending alerts
If an ETL error occurs, you should not only write it to the error log but also send an alert to the system administrator. There are many ways to send alerts; a common one is to email the system administrator with the error message attached, which makes troubleshooting easier.
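
For example, on SQL Server this can be done with Database Mail (available since SQL Server 2005); the mail profile, recipient, and message text below are illustrative assumptions:

```sql
EXEC msdb.dbo.sp_send_dbmail
     @profile_name = 'ETL_Mail_Profile',
     @recipients   = 'dba@example.com',
     @subject      = 'ETL run failed',
     @body         = 'Step LoadFactSales failed; see EtlLog for details.';
```
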
ETL is a key part of BI projects and a long-running process. Only by continuously discovering and solving problems can ETL become more efficient and provide accurate data for the later stages of the BI project.

IV. A further understanding of ETL:
1. Data collection:
A) Diverse collection locations: different servers and different sites;
B) Diverse data formats: different storage files and different storage formats;

2. Analysis and conversion:
A) Analyze and parse the different files and data formats;
B) Convert data of different formats into a uniform format;
C) Merge the data;

3. Import the data, now in a unified format, into the data warehouse.

If the data warehouse's model design is the blueprint of a building and the data are the bricks, then ETL is the process of construction. The most difficult parts of the project are the user requirements analysis and the model design, but the design and implementation of the ETL rules is the largest workload, accounting for roughly 60% to 80% of the total project. This is a general consensus drawn from much practice at home and abroad.
Typical ETL tools currently include Microsoft SSIS (which replaced the earlier DTS), Informatica, DataStage, and Oracle OWB and ODI; Sun also offered a complete ETL toolset, and open-source ETL tools and plug-ins exist as well.
ETL quality problems show up along the following dimensions: correctness, integrity, consistency, completeness, validity, timeliness, and availability.

To implement ETL well, we recommend paying attention to the following points during implementation:
First, ensure the efficiency of integration and loading by analyzing and summarizing the data in a distributed manner;
Second, enhance controllability by actively pulling data rather than passively receiving pushed data;
Third, adopt process-based configuration management and formulate standard protocols;
Fourth, define standards for key data and for non-key data separately.

Four ETL data-loading modes:
1) Complete refresh: the data warehouse table holds only the latest data. Each load deletes the original data and fully loads the latest source data: the extraction program takes all records from the source, clears the target table before loading, and then loads all records. To speed up deletion, TRUNCATE is generally used to clear the table. This mode is used for the current-information tables in this system (see the sketch below).
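
A minimal sketch of the full-refresh pattern; table and column names are illustrative assumptions:

```sql
-- TRUNCATE clears the target faster than DELETE.
TRUNCATE TABLE dw.dbo.CurrentInfo;

-- Then fully reload the latest source data.
INSERT INTO dw.dbo.CurrentInfo (Id, Attribute1, Attribute2)
SELECT Id, Attribute1, Attribute2
FROM ods.dbo.CurrentInfo;
```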

2) Image increment: records in the source data are updated periodically, but each record carries a time field, and historical records are preserved in the source. ETL can therefore extract only the incremental data from the source and append it to the data warehouse, where the historical records are likewise preserved.

3) Event increment: each record is a new event with no necessary relationship to the others; new records are additions rather than changes to existing record values. Each record carries a time field, so the new data can be identified by that field, extracted, and loaded into the warehouse.

4) Image comparison: the data warehouse tables carry effective-date fields to preserve the history of the data, but the source does not retain history and may be updated every day. The only option is to compare the new image of the data with the image from the previous load to find the changed part, close out the changed records by setting their effective end date, and insert the changed data as new records. This mode is mostly used for dimension tables that must preserve historical information (see the sketch below).
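
A hedged sketch of the image-comparison pattern (essentially a type-2 slowly changing dimension); the DimCustomer and stg_Customer tables and their columns are illustrative assumptions:

```sql
-- Step 1: close out current rows whose attributes differ in the new
-- image by stamping their effective end date.
UPDATE d
SET    d.EffectiveEnd = GETDATE()
FROM   dw.dbo.DimCustomer AS d
JOIN   ods.dbo.stg_Customer AS s
  ON   s.CustomerId = d.CustomerId
WHERE  d.EffectiveEnd IS NULL
  AND (s.CustomerName <> d.CustomerName OR s.Region <> d.Region);

-- Step 2: insert changed and brand-new customers as the current image.
INSERT INTO dw.dbo.DimCustomer
       (CustomerId, CustomerName, Region, EffectiveStart, EffectiveEnd)
SELECT s.CustomerId, s.CustomerName, s.Region, GETDATE(), NULL
FROM   ods.dbo.stg_Customer AS s
LEFT JOIN dw.dbo.DimCustomer AS d
  ON   d.CustomerId = s.CustomerId
 AND   d.EffectiveEnd IS NULL
WHERE  d.CustomerId IS NULL;
```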

Reference Source: http://blog.9color.cn/author/fengbuyu/
