ETL Learning Experience: Exploring the Essence of ETL in Data Warehouses

Source: Internet
Author: User
In a data warehouse system, ETL is a key link. Viewed broadly, ETL is a data integration solution; viewed narrowly, it is just a tool for moving data around. Looking back over the past few years, I have done plenty of data migration and transformation work, but most of it was a one-off job or involved small volumes of data that could be handled with Access, DTS, or a small home-grown program. In a data warehouse system, however, ETL rises to a certain theoretical level, which sets it apart from that earlier use of small tools. Where is the difference? You can see it in the name: people have split the data-moving process into three steps, where E, T, and L stand for Extraction, Transformation, and Loading respectively.
In essence, the ETL process is a process of data flow, from various data sources to various target data stores. In a data warehouse, though, ETL has a few distinctive features. First, it is about data synchronization: it is not a one-off dump of data, but a regular activity that runs on a fixed cycle; some people have even proposed the concept of real-time ETL. Second, the data volume is usually huge, large enough to make it worthwhile to split the flow into the E, T, and L steps.
There are many mature tools that provide ETL functionality, such as DataStage and PowerMart. From an application perspective, the ETL process itself is not terribly complex. These tools bring great convenience to data warehouse projects, especially in development and maintenance. On the other hand, developers easily get lost in the tools. Take VB as an example: it is a very simple language and a very easy-to-use programming tool, and it is quick to get started with, but how many real VB experts are there? A general principle of Microsoft products is to treat the user as a fool; under this principle Microsoft products are indeed very easy to use, but if a developer also treats himself as a fool, then he really is one. The same is true of ETL tools. They give us graphical interfaces and let us focus on the rules, which improves development efficiency. In terms of effectiveness, these tools make it quick to construct a job that processes a specific piece of data, but the overall efficiency is not necessarily high. The problem lies not in the tools but in the designers and developers, who get lost in the tools instead of exploring the essence of ETL.
It can be said that these tools have been applied for a long time, in many projects and environments, and they must have had their successes; they must reflect the essence of ETL in some way. If we do not look through the simple use of these tools to the ideas behind them, what we end up building is a pile of independent jobs, and integrating them still takes a huge amount of work. As the saying goes, theory must be combined with practice: to go beyond the limits of any one field, you have to reach a certain theoretical level.

Exploring the Essence of ETL (Part 1)
The ETL process is a process of data flow, from different heterogeneous data sources to a unified target data store. Along the way, data extraction, cleaning, transformation, and loading form a serial or parallel process. The core of ETL is T, the transformation. Extraction and loading can generally be treated as the input and output of the transformation, or as separate components, and their complexity is low compared with the transformation component. Unlike an OLTP system, which is full of single-record operations such as insert, update, and select, operations in the ETL process are generally batch operations. Loading, for example, is usually done with a bulk loading tool, generally a utility that ships with the DBMS, such as Oracle SQL*Loader or DB2 autoloader.
 
ETL has certain characteristics that are embodied in the tools. The following uses DataStage and PowerMart as examples.
 
1. Static ETL units and dynamic ETL unit instances. A transformation specifies how data in one format is turned into data in another format; you do not need to specify the physical form of the data source at design time. It is specified only when the ETL unit is instantiated at runtime. DataStage does not strictly distinguish static from dynamic: a single DataStage job plays both roles. In earlier versions a job could not run twice at the same time, so a job was effectively equivalent to an instance; later versions support multiple instances, although that is not the default option. PowerMart does distinguish the two concepts: the static part is called a mapping and the dynamic part is called a session.
 
2. ETL metadata. Metadata is data that describes data; the term has a very broad meaning, and here it refers only to ETL metadata, which mainly covers the data structures before and after each transformation and the transformation rules. ETL metadata also includes the management of formal parameters, that is, the parameters defined on the ETL unit itself. Actual parameters, the values specified at runtime, are not within the scope of metadata management.

3. Data process control. A visual process-editing tool should be provided, offering both process definition and process monitoring. The smallest unit of process scheduling is an ETL unit instance; the ETL unit cannot be subdivided further at the process level. How units are carved up is, of course, up to the developers. For example, extraction and transformation can be placed in one ETL unit, in which case they can only run together; if you split them into two units, they can run separately, which helps with error recovery. How finely ETL units should be subdivided depends on the specific application, and so far no good subdivision strategy has emerged. For example, we can define loading one table as an ETL unit, but it is undeniable that such units share many common operations; two units may share the same hash table, which then gets loaded into memory twice.

4. Transformation rule definition. Provide a function library of common rule methods, and provide a rule definition language for describing rules.
 
5. Fast data indexing. Generally, hashing is used: the reference (lookup) tables are loaded into memory in advance and queried through a hash table during transformation. DataStage has its hashed file technology, and PowerMart has a similar lookup function.
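As an illustration of point 5 (and the kind of rule function mentioned in point 4), here is a minimal sketch in Python, not taken from any particular tool, of loading a reference table into an in-memory hash map and consulting it during transformation. The table contents and field names are invented for the example.

```python
# Minimal sketch of hash-based reference lookup during transformation.
# load_reference_rows() and the field names are hypothetical placeholders,
# not part of DataStage, PowerMart, or any other specific ETL tool.

def load_reference_rows():
    # In a real job this would read a small dimension/code table from the source DB.
    return [
        {"cust_code": "C001", "cust_key": 1001},
        {"cust_code": "C002", "cust_key": 1002},
    ]

def build_lookup(rows, key_field, value_field):
    """Load a reference table into a hash map before the transformation starts."""
    return {row[key_field]: row[value_field] for row in rows}

def transform(record, lookup):
    """Replace the business code with the surrogate key via an in-memory lookup."""
    out = dict(record)
    out["cust_key"] = lookup.get(record["cust_code"], -1)  # -1 = unknown member
    return out

if __name__ == "__main__":
    lookup = build_lookup(load_reference_rows(), "cust_code", "cust_key")
    print(transform({"cust_code": "C002", "amount": 42.0}, lookup))
```

The point of loading the table once, up front, is exactly the one made above: per-record lookups then cost a hash probe instead of a database round trip.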

Exploring the Essence of ETL (Part 2: Classification)
I recently read a report on IT-Director about classifying ETL products. In our minds, ETL tools are generally expensive things that can handle massive data volumes, but that is only one kind. They can be divided into four types according to different needs, mainly along two axes: the complexity of the transformation rules and the size of the data volume. They include:
1. Interactive runtime environments. You specify the data source and the target, specify the rules, and the ETL runs immediately. This kind of interaction is undoubtedly very convenient, but it is only suitable for ETL processes with small data volumes and low complexity, because once the rules become complex they may need a language-level description, which is hard to express by drag and drop. There is also the issue of data volume: this kind of interaction is usually built on an interpreted language, and its flexibility comes at some cost in performance. If you process massive data by reading one record, interpreting the rule, and writing one record at a time, the impact on performance is enormous.
2. Dedicated coding type. It provides a program framework based on some language, so you do not have to worry about peripheral functions such as reading files and writing to the database, and can focus on implementing the rules (a minimal sketch of this style appears after this list). The performance of such hand-written code speaks for itself, unless your programming skills are lacking (which is also a factor that cannot be ignored). This kind of ETL handles large data volumes and complex transformation logic with ease.
3. Code generator type. This works like an ETL code generator: it provides simple graphical operations that let you define transformation rules by drag and drop, while in the background it generates a program in some language, which must then be compiled before the ETL process can run. DataStage is roughly this kind of product: a finished job must be compiled, which avoids interpreting every transformation at execution time, although I do not know what intermediate language it generates. The ETL tool I designed earlier belongs to this category: it provided an interface for users to write rules and finally generated C++ code, which could be run after compilation. This type of tool has to put a great deal of effort into the interface; it must make it easy to define an ETL process and provide a rich set of plug-ins for reading, writing, and transformation. "Big migration" tools are too weak in this respect: the rules must be hand-written, and written in standard C++ syntax, which is a little difficult for end users; it would be better to just build a professional coding-oriented product. In addition, such tools must provide expert-level, application-oriented features, because they cannot anticipate every transformation rule and every kind of read and write. On the one hand they should offer a plug-in interface so that third parties can write specific plug-ins; on the other hand they should provide a specific language for advanced functionality. DataStage, for example, provides a BASIC language, although its job scripting does not seem very good: you can only draw jobs by hand, not program them.
4. The last type is the data hub. As the name suggests, it works like a hub. This type differs from the above in its classification standard: the first three refer mainly to ETL implementation methods from a data-processing perspective. Some current products belong to EAI (Enterprise Application Integration), whose data integration is mainly near-real-time. Such products, like a hub, constantly receive data from various heterogeneous data sources and, after processing, deliver it to different target data stores.
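As referenced in the second item above, here is a minimal sketch, in Python rather than in any particular product's language, of the "program framework" style: the framework owns reading and writing, and the ETL developer only supplies the rule function. All names here are invented for illustration.

```python
# Sketch of a dedicated-coding ETL framework: the framework owns the
# read/write loop; the developer only implements transform_record().
# extract() and load() below are stand-ins for real file/database I/O.

from typing import Dict, Iterable, Iterator

def extract() -> Iterator[Dict]:
    # Placeholder for the framework's reader (file, ODBC, bulk export, ...).
    yield {"user_id": "7", "amount": "19.90"}
    yield {"user_id": "8", "amount": "5.00"}

def load(records: Iterable[Dict]) -> None:
    # Placeholder for the framework's writer (bulk loader, target table, ...).
    for rec in records:
        print("LOAD", rec)

def transform_record(rec: Dict) -> Dict:
    # The only part the ETL developer writes: the rule itself.
    return {"user_id": int(rec["user_id"]), "amount": float(rec["amount"])}

def run() -> None:
    load(transform_record(r) for r in extract())

if __name__ == "__main__":
    run()
```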
Although these types look quite different from one another, in BI projects especially, when facing ETL over massive data the real choice usually comes down to the two types that can handle it, dedicated coding and code generation. In the selection process, development efficiency, maintainability, performance, learning curve, staff skills, and other factors must all be weighed; of course, the most important and most realistic factor is often the customer's impression.

Exploring the Essence of ETL (Part 3: Transformation)
As mentioned in Part 1, T is the most complex part of the ETL process. What kinds of processing does this transformation actually involve?
I. Macro Input and Output
The input and output of an ETL process can be divided into the following categories:
1. Large table with small tables. This kind of processing is common in data cleansing. For example, in the stage from the data source to the ODS, if the data warehouse uses dimensional modeling and the dimensions use surrogate keys, there must be a conversion from business codes to those key values. In SQL this means joining a large table with a bunch of small tables; with an ETL tool, the small tables are usually read into memory before processing. In this case the granularity of the output data is the same as that of the large table.
2. Large table with large table. Joins between large tables are also an important topic. There must, of course, be a primary table, and logically the processing is a left join from the primary table to the secondary table. The biggest problems with joins between large tables are performance and stability: for massive data volumes there must be an optimization strategy for the join, and processing such big data inevitably consumes a lot of system resources, so the chance of failure is high and effective error recovery becomes an issue. In this case it is advisable to split the large table into somewhat smaller tables, turning the problem back into the "large table with small tables" type. Here the granularity of the output data is the same as that of the primary table.
3. Comes in standing, goes out lying down. In transaction systems, to improve flexibility and scalability, a lot of information is kept in code tables, so the "fact table" there is a narrow table; in the data warehouse we usually need to widen the data, turning rows into columns, hence the saying "comes in standing, goes out lying down". You are probably familiar with decode, one of the common techniques for building wide tables. Widening a narrow table mainly revolves around operations on the code field of the narrow table. In this case the narrow table is the input and the wide table is the output, and the granularity of the wide table must be coarser than that of the narrow table, the difference being exactly that code field (see the sketch after this list).
4. Aggregation. An important task of a data warehouse is to accumulate data, and aggregation is an indispensable operation; it is a process of coarsening data granularity. Aggregation itself is actually simple, much like the GROUP BY operation in SQL: pick specific fields (dimensions) and apply an aggregation function to the measure fields. For large data volumes, however, optimizing the aggregation algorithm is still a topic worth exploring, for example whether to use SQL's GROUP BY directly or to sort first and then aggregate.
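To make items 3 and 4 concrete, here is a minimal Python sketch on invented sample data: it first widens a narrow code/value table into one wide record per key, then aggregates a measure by a dimension, roughly what a GROUP BY would do.

```python
# Sketch of two macro transformations on invented sample data:
# (1) narrow-to-wide pivot ("comes in standing, goes out lying down"),
# (2) aggregation, the equivalent of SQL GROUP BY ... SUM(...).
from collections import defaultdict

narrow_rows = [  # narrow, code-table-style fact rows
    {"acct_id": "A1", "code": "fee", "value": 10.0},
    {"acct_id": "A1", "code": "tax", "value": 1.5},
    {"acct_id": "A2", "code": "fee", "value": 20.0},
]

def pivot(rows, key_field, code_field, value_field):
    """Turn rows of (key, code, value) into one wide record per key."""
    wide = defaultdict(dict)
    for r in rows:
        wide[r[key_field]][r[code_field]] = r[value_field]
    return {k: dict(v) for k, v in wide.items()}

def aggregate(rows, dim_field, measure_field):
    """Group by a dimension and sum a measure, like SQL GROUP BY."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[dim_field]] += r[measure_field]
    return dict(totals)

if __name__ == "__main__":
    print(pivot(narrow_rows, "acct_id", "code", "value"))
    print(aggregate(narrow_rows, "code", "value"))
```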
II. Micro Rules
Looking at the micro details of data transformation, the rules can be divided into the following basic types. There are, of course, complex combinations of these, such as computing first and then doing a reference lookup; such combinations are not listed among the basic types. ETL rules are attached to the target data: there are as many rules as there are fields in the target data (a combined sketch of several rule types follows the list).
1. Direct mapping. Whatever the source value is, it is copied over unchanged. Even for such a simple rule, if the lengths or precisions of the source field and the target field do not match, you must pay special attention to whether a direct mapping is really possible or whether some simple conversion is still required.
2. Field calculation. The target field is obtained by a mathematical calculation on one or more fields of the data source. This rule generally applies to numeric fields.
3. Reference conversion. One or more fields of the data source are used as a key to look up a specific value in an associative array, and the lookup should return a unique value. This associative array is most naturally and most commonly implemented with hashing; it is loaded into memory before the ETL starts, which greatly improves performance.
4. String processing. You can often extract specific information from a string field of the data source, such as part of an ID card number. In addition, numeric values often arrive in string form. String operations include type conversion and substring extraction. The arbitrariness of character fields also creates a risk of dirty data, so when implementing such rules you must add exception handling.
5. Null handling. Handling null values is a common problem in data warehouses. Should a null be treated as dirty data or as a specific dimension member? This probably depends on the application and needs further exploration. In any case, for fields that may contain nulls, do not use the "direct mapping" rule type; nulls must be tested for, and the current recommendation is to convert them to a specific value.
6. Date conversion. Dates in a data warehouse generally have a specific representation, different from the native date type; for example, an 8-digit integer such as 20040801 may be used to represent a date. In the data source such fields are usually of the date type, so rules of this kind need common functions to convert a date into an 8-digit date value, a 6-digit month value, and so on.
7. Date calculation. Based on dates, we usually compute day, month, and duration values. The date calculation functions provided by databases generally work on the date type, so if the data warehouse represents dates with its own specific type, it must also have its own set of date calculation functions.
8. Aggregation calculation. The measure fields in fact tables are usually obtained by applying aggregation functions to one or more fields of the data source. These aggregation functions are part of the SQL standard: SUM, COUNT, AVG, MIN, MAX.
9. Predefined values. These rules differ from the ones above in that they do not depend on any data source field; the target field takes a fixed or system-dependent value.
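As mentioned before the list, here is a minimal Python sketch combining several of these rule types on a single invented record: direct mapping, field calculation, reference conversion, string processing with exception handling, null handling, date conversion to an 8-digit value, and a predefined value. The field names and lookup table are hypothetical.

```python
# Sketch of several micro rules applied to one invented source record.
from datetime import date

region_lookup = {"BJ": 11, "SH": 21}          # reference conversion table (hypothetical)

def to_yyyymmdd(d: date) -> int:
    """Date conversion: native date -> 8-digit integer, e.g. 20040801."""
    return d.year * 10000 + d.month * 100 + d.day

def transform(src: dict) -> dict:
    tgt = {}
    tgt["user_id"] = src["user_id"]                          # 1. direct mapping
    tgt["total_fee"] = src["base_fee"] + src["extra_fee"]    # 2. field calculation
    tgt["region_key"] = region_lookup.get(src["region"], -1) # 3. reference conversion
    try:                                                     # 4. string processing
        tgt["birth_year"] = int(src["id_card"][6:10])        #    (with exception handling)
    except (TypeError, ValueError):
        tgt["birth_year"] = -1
    tgt["vip_flag"] = src["vip_flag"] if src["vip_flag"] is not None else "U"  # 5. null handling
    tgt["stat_date"] = to_yyyymmdd(src["open_date"])         # 6. date conversion
    tgt["load_source"] = "CRM"                               # 9. predefined value
    return tgt

if __name__ == "__main__":
    print(transform({
        "user_id": 7, "base_fee": 10.0, "extra_fee": 2.5, "region": "SH",
        "id_card": "110101198001012345", "vip_flag": None, "open_date": date(2004, 8, 1),
    }))
```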

Exploring the Essence of ETL (Part 4: Data Quality)
"There is no absolutely accurate data, but you need to know why it is not accurate."
This is our requirement for data accuracy when building a BI system. Indeed, nobody can guarantee absolute accuracy, neither the system integrator nor the customer. Accuracy requires a standard, but first you must ensure that the standard itself is accurate, and at present no such standard exists. The customer will propose a relative standard, such as comparing your OLAP results with existing report results. Even if the comparison is not entirely fair, you have to accept it.
 
First, it is a fact that data quality is hard to guarantee at the data source. What can cause data quality problems at this layer? They can be divided into the following categories:
1. Data format errors, such as missing data, values out of range, or invalid formats. Bear in mind that source systems processing large amounts of data usually abandon some database checking mechanisms, such as field constraints; they try to check the data before it enters the database, but that is hard to guarantee. Typical examples include ID card numbers, mobile phone numbers, and non-date values in date fields.
2. Data inconsistency. Likewise, for performance reasons the source system will relax foreign key constraints to some extent, which often leads to inconsistent data, for example a user ID in the account table that does not exist in the user table, or codes that cannot be found in the code table.
3. Questionable business logic. This is hard to call right or wrong. In general, the design of source systems is not very rigorous; for example, a user's account-opening date may be later than the account-cancellation date, or multiple records for the same user ID may exist in the user table. In such situations, what can you do?
 
When building a BI system, it is impossible to understand the data source system completely. In particular, after the source system has been delivered, maintenance staff often play an ad hoc role, and tracking down the cause of a problem takes a lot of time. I once argued about whether the designer must describe every rule up front; someone suggested that all rules should be fully understood before ETL starts. I do not agree. I believe instead that the ETL process must be able to handle data with quality problems: dirty data must either be discarded or processed, and must not be allowed to slip through. Without such quality safeguards, errors accumulate step by step throughout the process. Setting aside the quality of the data source, let us look at which factors in the ETL process itself have a major impact on data accuracy.
1. Incorrect rule descriptions. As mentioned above, the designer's insufficient understanding of the source system leads to misunderstandings of the rules. There is also the question of how rules are described: describing rules without ambiguity is itself a topic worth exploring. Rules are attached to target fields, and Part 3 discussed their classification. Rules cannot always be described in plain language; they should have a strict, almost mathematical expression. I have even thought that if designers could describe rules in a rule language, our ETL units could be generated and kept in sync automatically, saving a lot of manual work.
2. ETL development errors. Even when the rules are clear, errors can occur during ETL development, such as logic errors and coding mistakes. For example, for a range of values the open and closed ends of the interval need to be specified, but developers often overlook this; writing "greater than or equal to" as "greater than" will produce data errors.
3. Manual processing errors. Before the overall ETL process is fully in place, it is usually run by hand to save time. One big problem is that people do not run it as the process prescribes but according to their own understanding; errors such as accidentally deleting data or loading data twice are easy to make this way.

Exploring the Essence of ETL (Part 5: Quality Assurance)
The ETL data quality problems described above cannot be eliminated entirely; they can only be mitigated by specific means, and a measurement method must be defined to judge whether data quality is good or bad. The quality of the data source is something the customer should care about even more: if clean data cannot be guaranteed at the source, the credibility of the analysis built on top of it becomes questionable. The data source system keeps evolving and the customer's operations gradually become standardized; the same is true of the BI system. This part discusses how to deal with data source quality and ETL processing quality.
How should data source quality be handled? I remember a discussion of "-1 processing" on a DataStage mailing list. In a data warehouse dimensional model, a dimension table usually has a -1 record meaning "unknown", and "unknown" here has a broad meaning: any data that may be faulty, NULL data, or even data not covered by the rules gets converted to -1. This is one way to handle dirty data, but it is also a way of hiding the facts. It is like writing a function fileopen(filename) that returns a single error code, say -1, for every failure; that is bad design, because the caller has to make decisions based on the error code, for example whether the file does not exist or read permission is insufficient, and handle each case accordingly. The same applies in a data warehouse, so it is advisable to convert different types of quality problem into different values, for example -1 for values that cannot be resolved by a reference lookup and -2 for NULL data. This only addresses the first category of errors above, data format errors; data inconsistency and business-logic problems still need exploring. One principle holds throughout, though: "the quality of the data source must be reflected in the data warehouse."
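A minimal sketch of this idea in Python, with invented codes and field names: different quality problems map to different negative dimension keys rather than a single catch-all -1.

```python
# Sketch: map different data-quality situations to different dimension keys,
# instead of collapsing everything into a single -1 "unknown" member.
UNRESOLVED_REFERENCE = -1   # code present but not found in the reference table
NULL_VALUE = -2             # source value was NULL

region_lookup = {"BJ": 11, "SH": 21}  # hypothetical reference table

def resolve_region_key(code):
    if code is None:
        return NULL_VALUE
    return region_lookup.get(code, UNRESOLVED_REFERENCE)

if __name__ == "__main__":
    for c in ("SH", "XX", None):
        print(c, "->", resolve_region_key(c))
```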
Quality problems arising from the ETL process itself must also be guarded against. Past experience shows that having no safeguards causes a lot of trouble for implementers. Implementers are all too familiar with reloading data over and over, or even carrying the data all the way to the final cube only to discover that the very first ETL step was wrong. The safeguard is a data verification mechanism, whose purpose is to monitor data quality and raise alarms during the ETL process. This module treats the implementers as its end users; they are, you might say, the direct beneficiaries of the verification mechanism.
First of all, there must be a method for measuring quality. What counts as high quality and what counts as low? We cannot rely on gut feeling, yet working without a measurement method is common practice. For business analysis systems, the China Unicom headquarters once issued a test specification that is in effect a measurement method, for example that the error of an indicator must not exceed 5%. The system itself must have such a measurement method, whether or not it is scientifically perfect. For ETL data processing quality, the measurement should be stricter than the one defined in the China Unicom headquarters test specification, because that specification treats the BI system as a black box and allows a certain error between the source data and the data finally displayed. Measuring ETL data processing quality is white-box measurement: it watches every step of the process, so in theory the input and output indicators of each step should match exactly. Exact agreement is of course only the ideal; for data with discrepancies, we must find the cause.
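A minimal sketch of such a white-box check, using invented totals: compare the record count (or a summed measure) at the input and output of one ETL step and compute an error rate to hold against a threshold.

```python
# Sketch of a white-box step check: input vs. output totals for one ETL step.
def step_error_rate(input_total: float, output_total: float) -> float:
    """Relative discrepancy between what went into a step and what came out."""
    if input_total == 0:
        return 0.0 if output_total == 0 else 1.0
    return abs(input_total - output_total) / input_total

if __name__ == "__main__":
    rate = step_error_rate(input_total=1_000_000, output_total=999_950)
    threshold = 0.0  # white-box ideal: each step's input and output should agree exactly
    print(f"error rate = {rate:.6f}", "OK" if rate <= threshold else "INVESTIGATE")
```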
With a quality measurement method in place, a data verification framework can be established. This framework is based on the total-amount and component data audit method described in the article "Data Audit Technology in Data Warehouses". As a supplement, here are some functional suggestions:
1. Provide a front end. Developers should treat implementers as users and give them a friendly interface too. The audit-technology article describes the form of a test report, which relies on a human to find patterns in a pile of numbers. It is better to provide the interface through OLAP, not only adding the measured results from the test statistics but also wiring in the calculation of the measurement method, for example flagging any case where the error rate is greater than zero so that the cause can be checked.
2. Provide a framework. Data verification is not a one-off task; it is needed in every run of the ETL process. So there must be a framework that automates the verification process and offers extension points for implementers to widen the scope of verification. Such a framework also standardizes the work: developers can focus on writing verification scripts without worrying about how verification is integrated into the process or how results are presented. To this end, a set of tables similar to DM tables should be designed, recording each verification result and automatically triggering the loading and publishing of the multidimensional analysis data, so that implementers can watch the data error rate during (or even in the middle of) each load. In particular, if the data warehouse model can be unified and even the verification scripts fixed, what remains is just a standard procedure (a minimal sketch of such a framework appears after this list).
3. Standardize the process. As noted above, one cause of ETL data quality problems is manual handling, and the main reason is that the process is not standardized. It is very convenient for developers to run a single ETL unit on its own. Making an ETL unit "reentrant", as recommended earlier, can solve the problems of accidental data deletion and duplicate loading. But remember that data verification is also part of the process: to make verification happen every day, do not let the implementer even feel that it is there. In general, standardizing the process is key to improving implementation efficiency, and it is a topic to explore further.
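As referenced in the second item above, here is a minimal sketch (all table and check names invented) of such a verification framework: implementers register check scripts, the framework runs them after each ETL load, and the results go into a result table for later OLAP-style inspection.

```python
# Sketch of a pluggable data-verification framework: checks are registered,
# run after each ETL load, and their results are recorded in a result table.
from datetime import datetime
from typing import Callable, Dict, List

CHECKS: Dict[str, Callable[[], float]] = {}
RESULT_TABLE: List[dict] = []  # stand-in for a DM-style verification result table

def register_check(name: str):
    """Extension point: implementers add checks without touching the framework."""
    def wrap(fn: Callable[[], float]):
        CHECKS[name] = fn
        return fn
    return wrap

@register_check("ods_vs_dw_row_count")
def ods_vs_dw_row_count() -> float:
    # Hypothetical totals; a real check would query the ODS and DW tables.
    ods_rows, dw_rows = 1_000_000, 999_950
    return abs(ods_rows - dw_rows) / ods_rows

def run_verification(load_id: str) -> None:
    for name, fn in CHECKS.items():
        RESULT_TABLE.append({
            "load_id": load_id, "check": name,
            "error_rate": fn(), "checked_at": datetime.now().isoformat(),
        })

if __name__ == "__main__":
    run_verification("20040801")
    for row in RESULT_TABLE:
        print(row)
```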

Exploring the Essence of ETL (Part 6: Musings on Metadata)
So far there is no definitive definition of metadata. The concept is very broad and is usually defined as "metadata is data about data", which produces a recursive definition, rather like asking where Xiaoqiang lives and being told "next door to Wangcai". By this definition, what is the data that describes metadata? Metadata again? So there could be metadata of metadata, and so on. I have also heard it put this way: if data is the files in a drawer, metadata is the classification label on the drawer. Then how is it different from an index?
Metadata is an abstraction. Philosophers have been abstracting the world since ancient times, trying to find its essence. Abstraction is not done in one step; it is a process from the concrete to the general. For example, "me" > "man" > "human" > "mammal" > "living thing" is a process of abstraction. If you have spent time in the software industry you will find this very familiar: object orientation is exactly such an abstraction process. It abstracts the things and processes of the world and builds an object model with object-oriented methods. Similarly, in object orientation a class is the abstraction of objects, and an interface is the abstraction of classes. So I think we could replace the word "meta" with "abstract" and see whether "abstract data" is easier to understand.
We often hear remarks like "the leader's speech was at a high level and points out the direction of our future work". The idiom behind "high level" here literally describes pouring water down from the top of a tall building, that is, looking at things from a certain height. Having a certain height means having enough "meta". In design patterns we emphasize programming to interfaces: you should not deal with the interaction between this object and that object, but with the interaction between this interface and that interface, without caring how they work internally.
This is where the significance of metadata lies. Although the discussion has drifted toward philosophy, the term really has to be understood in the context of software design. I do not know whether metadata exists as a term in other fields, though I believe something similar must. Metadata exists so that software can be designed at a higher level of abstraction. That must have benefits: flexibility, extensibility, and maintainability can all improve, and the architecture becomes clearer, but it also becomes more complex. A long time ago I read the BackOrifice code: a simple function call jumped from this class to a parent class and then to another parent class, which was hard to follow. Why not just implement a simple function in a single class method? Thinking about it now, that is not quite right either. It would make the code easy to understand, but the structure would be a mess, and the code could only ever do what it does now; if the functionality needed to be extended, that code would be useless.

I first came across the concept of metadata around 1998, when it was called a metadata-driven system architecture. Later I used the concept in qidss to build qinavigator, but looking back, the metadata there did not amount to much: it was just a bunch of tables describing interface elements, with the interface generated automatically from that data. In data warehouse systems the concept is taken much further and is treated as an important component. Yet to this day I still find the concept too mysterious, with little concrete to show for it. There are metadata management products on the market, but from what I know of their use, they are not used much. The reason is that the layers of abstraction are not clearly drawn. The key is the classification of metadata (itself an abstraction process) and how the metadata is used. You could abstract metadata all the way down to 0s and 1s, but would that be useful to your business? It has to be abstracted to the appropriate degree, and the final question is that "degree".
What role does metadata play in the data warehouse system? Surely not just to let the system manage itself automatically; for that there would be no need to abstract the system into taiji, the two forms, and the eight trigrams. The industry has also defined some metadata standards worth consulting, such as CWM and XMI.
