ETL Considerations
In a data warehouse system, ETL is the key link. In the broadest sense, ETL is a data integration solution; in the narrowest sense, it is simply a tool for moving data. Looking back over the years, I have done a great deal of data migration and conversion work, but those jobs were mostly one-off tasks or involved small volumes of data, handled with Access, DTS, or a small homemade program. In a data warehouse system, however, ETL rises to a certain theoretical level and differs from that earlier ad hoc tooling. The difference is visible in the name itself: the handling of data is divided into three steps, where E, T, and L stand for extraction, transformation, and loading respectively.
In essence, the ETL process is a process of data flow, from various source systems to various target data stores. In a data warehouse, however, ETL has several distinctive characteristics. First, it is about data synchronization: it is not a one-time pull of data but a regular activity that runs on a fixed schedule, and some people have even proposed the concept of real-time ETL. Second, the data volume is generally huge, which is what makes it worthwhile to split the flow of data into the separate E, T, and L stages.
There are now many mature tools that provide ETL functionality, such as DataStage and PowerMart, and I will not comment on whether they are good or bad. From an application point of view the ETL process is not terribly complex, and these tools bring great convenience to data warehouse projects, especially convenience in development and maintenance. On the other hand, developers easily get lost in these tools. Take VB as an example: it is a very simple language and a very easy-to-use programming environment, and getting started is especially fast, but how many real VB masters are there? Microsoft's products usually follow the principle of "treat the user as a fool"; under that principle Microsoft's tools are indeed very easy to use, but if a developer really treats himself as a fool, he truly becomes one. The same is true of ETL tools. They provide a graphical interface that lets us concentrate our effort on the rules, in order to improve development efficiency. In practice these tools do let you build a job very quickly to process some particular data, but overall it is unlikely that overall efficiency will be much higher. The problem lies not with the tools but with the designers and developers, who get lost in the tool instead of exploring the essence of ETL.
It can be said that, having been applied for so long in so many projects and environments, these tools must have merits that reflect the essence of ETL. If we only use them superficially and do not look at the ideas behind them, what we end up with is a pile of isolated jobs, and stitching them together is still an enormous amount of work. As the saying goes, "combine theory with practice": to excel in any field, you must reach a certain height at the theoretical level.
Exploring the Essence of ETL I
The ETL process is a process of data flow, from heterogeneous data sources to unified target data. Along the way, extraction, cleansing, transformation, and loading form a serial or parallel process. The core of ETL is the T, the transformation; extraction and loading can generally be treated as the input and output of the transformation, or as separate components whose complexity is lower than that of the transformation components. Unlike an OLTP system, which is full of single-record INSERT, UPDATE, and SELECT operations, the ETL process is generally batch oriented; loading, for example, is done with a bulk loading tool, typically one that comes with the DBMS itself, such as Oracle SQL*Loader or DB2 Autoloader.
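To make the batch-versus-single-record point concrete, here is a minimal sketch, using SQLite purely as a stand-in for the target DBMS (in a real warehouse the file would normally be handed to the vendor's bulk loader such as SQL*Loader); the table name and row layout are invented for illustration.

```python
import sqlite3

# SQLite stands in for the target DBMS here; a real warehouse load would
# normally go through the vendor's bulk loader (SQL*Loader, Autoloader, ...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_call (user_id INTEGER, duration INTEGER)")

rows = [(i, i % 600) for i in range(100_000)]   # pretend this came from extraction

# OLTP style: one INSERT per record -- simple, but far too slow at ETL volumes.
# for row in rows:
#     conn.execute("INSERT INTO fact_call VALUES (?, ?)", row)

# ETL style: load in batches so round trips and commits are amortized.
BATCH = 10_000
for start in range(0, len(rows), BATCH):
    conn.executemany("INSERT INTO fact_call VALUES (?, ?)", rows[start:start + BATCH])
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM fact_call").fetchone()[0])
```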
ETL itself has certain characteristics, which are reflected in many tools. The points below use DataStage and PowerMart as examples.
1. The static ETL unit and the dynamic ETL unit instance. A transformation describes how data in one format is converted into another format; the physical data source may be left unspecified at design time and supplied at run time, when an instance of the ETL unit is created. DataStage does not strictly distinguish static from dynamic ETL units: its job plays both roles. In earlier versions a job could not be run twice concurrently, so a job was effectively equivalent to an instance; later versions support multiple instances, though not as the default option. PowerMart does distinguish the two concepts: the static definition is called a mapping, and the dynamic, run-time form is called a session.
2. ETL metadata. Metadata is data that describes data. Its meaning is very broad; here it refers only to ETL metadata, which mainly covers the data structures before and after each transformation and the transformation rules. ETL metadata also includes the management of formal parameters: the parameters of an ETL unit are defined as formal parameters, while the corresponding actual arguments are supplied at run time and fall outside the scope of metadata management.

3. Control of the data flow. The tools provide a visual process editor offering process definition and process monitoring. The smallest unit of process scheduling is the ETL unit instance; the scheduler does not look inside an ETL unit. How finely the process is divided is, of course, up to the developer. For example, extraction and transformation can be placed in a single ETL unit, in which case they can only run together; if they are split into two units they can be run separately, which makes error recovery easier. How far an ETL unit should be subdivided depends on the specific application, and there is no universally good strategy. For instance, we can make "load one table" the granularity of an ETL unit, but it is undeniable that such units will share many common operations, for example two units sharing the same hash table, which then has to be loaded into memory twice.

4. How transformation rules are defined. The tools provide a set of common rule functions, plus a rule-definition language for describing rules.
5. Fast indexing of data. This is generally implemented with hashing: the reference (lookup) tables are loaded into memory in advance, and during transformation the hash table is probed. DataStage has its hashed file technology, and PowerMart has a similar lookup function.
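As a rough illustration of point 5 (and of the shared hash table mentioned in point 3), the sketch below preloads a reference table into a Python dict and probes it while streaming the source records; all table and field names are invented for the example.

```python
# Hypothetical reference data: source code -> surrogate key, loaded once
# before the run. In DataStage this would be a hashed file, in PowerMart a
# lookup transformation.
region_lookup = {
    "BJ": 101,
    "SH": 102,
    "GZ": 103,
}

def transform(record):
    """Replace the source region code with the warehouse surrogate key."""
    out = dict(record)
    code = out.pop("region_code", None)
    # Unknown codes fall back to -1 ("unknown" member); see the quality notes later.
    out["region_key"] = region_lookup.get(code, -1)
    return out

source_rows = [
    {"user_id": 1, "region_code": "BJ", "amount": 20.0},
    {"user_id": 2, "region_code": "XX", "amount": 5.5},   # dirty code
]
for row in source_rows:
    print(transform(row))
```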
Exploring the Essence of ETL II (Classification)
Yesterday I read a report on IT-Director about the classification of ETL products. The ETL tools we usually have in mind are expensive and able to handle huge volumes of data, but those are only one kind. ETL products can be divided into four kinds according to different requirements, mainly along two dimensions: the complexity of the transformation rules and the size of the data. They include:
1. The interactive operating environment. You specify the data source, the target data, and the rules, and the ETL runs immediately. This kind of interactive operation is undoubtedly very convenient, but it is only suitable for small data volumes and ETL processes of low complexity, because once the rules become complex they may require a language-level description and cannot simply be dragged and dropped into place. There is also the issue of data volume: such tools are necessarily interpretation based, and their flexibility inevitably comes at some cost in performance. If we are dealing with massive amounts of data, then interpreting and executing the rules for every record read and every record written has a significant performance impact.
2. The specially coded type. It provides a program framework based on some language, so that you need not worry about peripheral functions such as reading files and writing to the database, and can concentrate on implementing the rules. The performance of this near-hand-written code is of course excellent, unless your programming skills are not up to it (a factor that cannot be ignored). This kind of ETL implementation is very straightforward for large data volumes and complex transformation logic (a minimal sketch of such a framework follows this list).
3. The code generator type. This works like an ETL code generator: it provides a simple graphical interface that lets you set up the transformation rules by dragging and dropping, and behind the scenes it generates a program in some language; to run the ETL process, that program must first be compiled. DataStage is a product of this kind: a designed job must be compiled, which avoids interpreting every transformation at execution time, though I do not know what intermediate language it produces. The ETL tool I designed in the past, "Big Mover", also falls into this category: it provided an interface for users to write rules and finally generated C++ that could be compiled and run. Tools of this type must put real effort into the interface, making it easy for users to define an ETL process, and must provide a rich set of plug-ins for reading, writing, and transformation. Big Mover was too weak in this respect: rules had to be handwritten in standard C++ syntax, which is still rather difficult for end users; in that case one might as well build a purely code-oriented product. On the other hand, such tools must also provide facilities for expert use, because it is impossible to anticipate every transformation rule and every reader and writer: they should offer plug-in interfaces so that third parties can write specific plug-ins, and specific languages for advanced functionality. DataStage, for example, provides a BASIC-like language, but its job scripting does not seem to be done well: jobs can only be drawn by hand, not generated programmatically.
4. Finally, there is a type that could be called a data hub. As the name implies, it works like a hub. The standard for this category differs from the three above: those three describe how the ETL is implemented, while this one is mainly about how the data is handled. Some current products belong to EAI (Enterprise Application Integration), where data integration is quasi-real-time in nature. Such a product works like a hub, continuously receiving data from heterogeneous sources, processing it, and delivering it to different targets.
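To give a feel for the "specially coded" type in point 2, here is a minimal sketch of such a framework, assuming a CSV extract and an in-memory target; all file, field, and rule names are invented for illustration.

```python
import csv
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, str]

def extract(path: str) -> Iterator[Record]:
    """Peripheral function: read the source extract (here a CSV file)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def load(records: Iterable[Record], target: list) -> None:
    """Peripheral function: write to the target (a list standing in for a table)."""
    target.extend(records)

def run_etl(path: str, rule: Callable[[Record], Record], target: list) -> None:
    """The framework: extraction and loading are fixed, the rule is pluggable."""
    load((rule(r) for r in extract(path)), target)

# The implementer only writes the rule; everything else is framework code.
def my_rule(record: Record) -> Record:
    record["amount"] = str(round(float(record.get("amount", "0") or 0), 2))
    return record

# Example usage (assumes a file named source.csv exists):
# target_table: list = []
# run_etl("source.csv", my_rule, target_table)
```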
Although these categories look very different, in a BI project facing huge volumes of data the choice usually falls between the middle two. During selection you must take into account development efficiency, maintainability, performance, the learning curve, staff skills, and other factors; of course, the most important and most realistic factor is often the customer's preference.
Exploring the Essence of ETL III (Transformation)
Part one of this series noted that the most complicated part of the ETL process is the T, the transformation. So what types does the transformation process include?
I. Macro-level input and output
Viewed macroscopically, from the overall handling of the data sources and the input and output of an entire ETL process, transformations can be divided into the following categories:
1. Big-to-small joins. This kind of processing is common in data cleansing, for example from the data source to the ODS stage. If the data warehouse uses dimensional modeling and the dimensions basically use surrogate keys, there must be a conversion from source codes to those key values. Implemented in SQL, this means joining one large table with a pile of small tables; with an ETL tool, the small tables are generally read into memory before processing. In this case the output data has the same granularity as the large table.
2. Big-to-big joins. Associating one large table with another large table is also an important topic. There must of course be a main table; logically it should be the main table LEFT JOIN the auxiliary table. The biggest problems with joins between large tables are performance and stability: for massive data there must be an optimized way to handle the join, and processing such volumes inevitably consumes a lot of system resources, so the probability of failure is high and effective error recovery becomes an issue. In this case we recommend splitting the large auxiliary table, where possible, into moderately small tables, turning the problem into the big-to-small type above. The granularity of the output data is the same as that of the main table.
3. Narrow in, wide out. To improve flexibility and extensibility, source systems keep much of their information in code tables, so their "fact table" is a narrow table; in the data warehouse it usually has to be widened, turning rows into columns, which is why this kind of processing is described as "come in standing, go out lying down". DECODE, a common means of widening, is certainly familiar to everyone. The processing of the narrow table mainly revolves around its code field. Here the narrow table is the input and the wide table is the output; the granularity of the wide table must be coarser than that of the narrow table, and it is coarser precisely on that code field.
4. Aggregation. An important task of the data warehouse is to accumulate data, and aggregation is an indispensable operation; it is the process of coarsening data granularity. Aggregation itself is very simple, like the GROUP BY operation in SQL: select certain fields (the dimensions) and apply some aggregation function to the measure fields. For large data volumes, however, optimizing the aggregation algorithm is still a topic worth exploring, for example whether to use SQL's GROUP BY directly or to sort first and then aggregate. (A small sketch combining a big-to-small lookup with an aggregation follows below.)
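Here is a minimal sketch of the two most common macro patterns above, big-to-small key substitution followed by aggregation, with made-up field names: it streams the "big table" once, resolves surrogate keys from an in-memory dimension, then groups by dimension key.

```python
from collections import defaultdict

# Small dimension table, preloaded into memory: natural code -> surrogate key.
product_dim = {"P001": 1, "P002": 2}

# The "big table": in practice this would be streamed from a file or cursor.
source_rows = [
    {"product_code": "P001", "amount": 10.0},
    {"product_code": "P002", "amount": 4.0},
    {"product_code": "P001", "amount": 6.5},
    {"product_code": "P999", "amount": 1.0},   # code missing from the dimension
]

# Step 1: big-to-small -- replace the code with the surrogate key (-1 if unknown).
def resolve(row):
    return {"product_key": product_dim.get(row["product_code"], -1),
            "amount": row["amount"]}

# Step 2: aggregation -- coarsen the granularity, like GROUP BY product_key.
totals = defaultdict(float)
for row in source_rows:
    keyed = resolve(row)
    totals[keyed["product_key"]] += keyed["amount"]

for key, amount in sorted(totals.items()):
    print(key, amount)
```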
II. Micro-level rules
Viewed microscopically, at the level of individual fields, data transformations can be divided into the following basic types. There are of course more complex combinations, for example performing an operation first and then applying a reference conversion; such combinations of the basic types are not listed separately. ETL rules hang off the target data: however many fields the target has, that is how many rules there are. (A consolidated sketch of several of these rule types follows the list.)
1. Direct mapping. The target is whatever the source is, copied over unchanged. Even for such a rule, if the source field and the target field differ in length or precision, you need to pay special attention to whether a truly direct mapping is possible or whether some simple operation is needed.
2. Field operations. The target field is obtained by a mathematical calculation over one or more source fields; this rule generally applies to numeric fields.
3. Reference conversion. One or more source fields are used as a key to search an associative array for a specific value, and only a unique value should be obtained. Implementing this associative array with a hash is the most suitable and most common approach: it is loaded into memory before the whole ETL run starts, which helps performance enormously.
4. String processing. Specific information can often be extracted from a string field of the data source, such as an ID number. Numeric values are also often represented as strings. Typical operations on strings include type conversion, substring extraction, and so on. Because of the arbitrariness of character fields, they are also a hidden source of dirty data, so exception handling must be added when implementing this kind of rule.
5. Null-value handling. Handling NULL values is a common problem in the data warehouse: should a NULL be treated as dirty data or as a specific dimension member? I am afraid this depends on the application and needs further exploration. In any case, for fields that may contain NULLs, do not use the "direct mapping" rule type; you must test for NULL, and our recommendation is to convert it to a specific value.
6. Date conversion. Dates in the data warehouse usually have a specific representation that differs from the date type, for example an 8-digit integer such as 20040801. In the data source the field is usually of a genuine date type, so rules of this kind need some common functions to convert dates into 8-digit day values, 6-digit month values, and so on.
7. Date arithmetic. Based on dates we usually compute day differences, month differences, durations, and so on. The date functions provided by databases work on the date type, whereas in the data warehouse dates are represented by a specific type of their own, so a corresponding set of date operations is needed.
8. Aggregation operations. The measure fields of a fact table are usually derived from one or more source fields using aggregation functions, namely those defined in the SQL standard: SUM, COUNT, AVG, MIN, MAX.
9. Predefined values. This rule differs from the types above in that it does not depend on any source field: the target field takes a fixed value or a system-supplied value.
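The sketch below strings several of these rule types together into one record-level transformation, with invented source and target field names: direct mapping, a field operation, a reference conversion, string handling with exception handling, null handling, a date conversion, and a predefined value.

```python
from datetime import date, datetime

# Reference table for rule 3 (reference conversion), loaded before the run.
status_lookup = {"A": 1, "C": 2}

def transform(src: dict) -> dict:
    tgt = {}

    # 1. Direct mapping.
    tgt["user_id"] = src["user_id"]

    # 2. Field operation: fee is stored in cents in the source.
    tgt["fee_yuan"] = src["fee_cents"] / 100.0

    # 3. Reference conversion, falling back to -1 for unknown codes.
    tgt["status_key"] = status_lookup.get(src.get("status_code"), -1)

    # 4. String processing with exception handling: duration arrives as text.
    try:
        tgt["duration"] = int(str(src.get("duration", "")).strip())
    except ValueError:
        tgt["duration"] = -2          # flag dirty data instead of failing the run

    # 5. Null-value handling: never map a nullable field directly.
    tgt["city"] = src.get("city") or "UNKNOWN"

    # 6. Date conversion: date type -> 8-digit integer such as 20040801.
    d = src.get("call_date") or date(1900, 1, 1)
    tgt["date_key"] = d.year * 10000 + d.month * 100 + d.day

    # 9. Predefined value: a system-supplied load timestamp.
    tgt["load_time"] = datetime.now().isoformat(timespec="seconds")
    return tgt

print(transform({"user_id": 7, "fee_cents": 1250, "status_code": "A",
                 "duration": " 42 ", "city": None, "call_date": date(2004, 8, 1)}))
```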
Exploring the Essence of ETL IV (Data Quality)
"Don't make absolute data accurate, but know why it's not accurate. ”
This is our requirement for data accuracy when building a BI system. Indeed, nobody can guarantee absolutely accurate data, neither the system integrator nor the customer. Accuracy needs a standard, but the standard itself must first be guaranteed to be accurate, and at present no such standard exists. The customer will put forward a relative standard, for example comparing your OLAP results with the figures in existing reports. It is a less than fair comparison, but you have to accept it.
It is also true that data quality is hard to guarantee at the data source in the first place. What can cause data quality problems at this level? They can be divided into the following categories:
1. Data format errors. For example, missing data, values out of range, or illegal formats. Bear in mind that source systems which themselves handle large volumes of data often abandon some of the database's own checking mechanisms, such as field constraints. They try to check the data before it is stored, but this is hard to guarantee. Typical cases are ID card numbers, mobile phone numbers, dates stored in non-date fields, and so on. (A small format-checking sketch follows this list.)
2. Data consistency. Likewise, source systems abandon foreign key constraints to some extent for performance reasons, which often leads to inconsistent data, for example a user ID appearing in the billing table that does not exist in the user table, or a code that cannot be found in the code table.
3. Reasonableness of the business logic. This is hard to judge as simply right or wrong. In general, the design of source systems is not very rigorous: a user whose activation date is later than its deactivation date is entirely possible, and so is one user having multiple user IDs in the user table. Is there anything that can be done about it?
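A minimal sketch of the kind of format checks mentioned in point 1, using invented field names and deliberately simplified patterns (a real mobile-number or ID-number rule would follow the customer's actual specification):

```python
import re
from datetime import datetime

def check_record(rec: dict) -> list:
    """Return a list of format problems found in one source record."""
    problems = []

    # Simplified pattern: 11 digits starting with 1 (an assumption, not a real spec).
    if not re.fullmatch(r"1\d{10}", rec.get("mobile", "") or ""):
        problems.append("mobile format")

    # Date stored as text in a non-date field, expected as YYYYMMDD.
    try:
        datetime.strptime(rec.get("open_date", "") or "", "%Y%m%d")
    except ValueError:
        problems.append("open_date format")

    # Value out of range.
    if not (0 <= rec.get("age", -1) <= 120):
        problems.append("age out of range")
    return problems

print(check_record({"mobile": "13800000000", "open_date": "20040801", "age": 30}))
print(check_record({"mobile": "abc", "open_date": "2004-08-01", "age": 200}))
```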
It is simply impossible to fully understand the data source system when building a BI system, especially after the source system has been delivered and maintenance staff keep improvising changes; finding out why takes a lot of time. Some argue that the designers are at fault with the rules and suggest that all rules must be completely clarified before ETL starts. I do not agree: I think the ETL process itself must include safeguards for handling this quality-problem data. Dirty data must be confronted up front, whether it is discarded or processed; there is no escaping it. Without quality assurance, errors will gradually be amplified as the process goes on. Setting aside the quality problems of the data source, let us look at which factors in the ETL process itself have a significant impact on data accuracy.
1. Errors in rule descriptions. As mentioned above, the designer's inadequate understanding of the source system leads to misunderstood rules; that is one aspect. The other aspect is the description of the rules themselves: how to describe a rule without ambiguity is also a subject worth exploring. Rules are attached to target fields, and part three of this series classified the rule types, but rules cannot always be described in words; they need a rigorous mathematical expression. I have even thought that if designers could describe rules in some rule language, then our ETL units could be generated automatically and kept in sync, saving a great deal of manual work.
2. ETL development errors. Even when the rules are perfectly clear, errors occur during ETL development, such as logic errors and coding mistakes. For example, for a banded value the specification may call for a particular open or closed interval, but the developer fails to notice and writes a greater-than sign where a greater-than-or-equal sign is required, producing wrong data. (A tiny illustration follows this list.)
3. Errors from manual operation. Before the overall ETL process is finished, implementers often run ETL by hand to save time, and the big problem is that they do not follow the normal process but run things according to their own understanding; typical mistakes are deleting data by accident and loading the same data twice.
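A tiny illustration of the interval bug in point 2, with invented band boundaries: the buggy version uses a strict comparison where the (assumed) specification requires a closed lower bound, so boundary values land in the wrong band.

```python
def band_correct(fee):
    """Bands are [0, 50), [50, 100), [100, +inf): the lower bound is closed."""
    if fee >= 100:
        return "high"
    if fee >= 50:
        return "mid"
    return "low"

def band_buggy(fee):
    """Same intent, but '>' was written instead of '>=' at the boundaries."""
    if fee > 100:
        return "high"
    if fee > 50:
        return "mid"
    return "low"

for fee in (49, 50, 100):
    print(fee, band_correct(fee), band_buggy(fee))   # 50 and 100 disagree
```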
Exploring the Essence of ETL V (Quality Assurance)

The previous part discussed ETL data quality problems. They cannot be eliminated outright; we can only take concrete measures to mitigate them, and we must define some metric to judge whether data quality is good or bad. As for the quality of the data source, that is something the customer should care about even more: if relatively clean data cannot be guaranteed at the source, the credibility of the analysis built on top of it is also questionable. The data source system keeps evolving and the customer's operations gradually become more disciplined, and the same goes for the BI system. This part discusses how to deal with the quality of the data source and of the ETL processing.
How should quality problems in the data source be handled? I remember the "handling of -1" topic once discussed on a DataStage mailing list. In the dimension tables of a data warehouse model there is usually a -1 record representing "unknown", and this "unknown" can be very broad: any potentially erroneous data, NULL data, even data not covered by the rules, is all turned into -1. This is one way of dealing with dirty data, but it is also a way of concealing the facts. It is like writing a function FileOpen(filename) that returns an error code: of course you could always return the same code, say -1, but that is bad design, because the caller needs to make decisions based on the code, such as whether the file does not exist or read permission is insufficient, and to have corresponding handling logic. The same applies in the data warehouse, so it is recommended to convert different types of data quality problems into different values, for example -1 for a failed reference lookup and -2 for NULL data. But this only deals with the first type of error mentioned previously, data format errors; for data consistency and business-logic reasonableness, the approach remains to be explored. One principle stands, though: the quality of the data source must be reflected in the data warehouse.
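A minimal sketch of that idea, assuming invented codes for the distinct "unknown" reasons (the specific values -1/-2/-3 are only for illustration):

```python
# Distinct codes so that "unknown" does not hide why the value is unknown.
NO_REFERENCE = -1   # lookup key not found in the dimension
NULL_SOURCE  = -2   # source field was NULL
RULE_MISS    = -3   # value not covered by any transformation rule

region_dim = {"BJ": 101, "SH": 102}
valid_codes = {"BJ", "SH", "GZ"}

def region_key(code):
    if code is None:
        return NULL_SOURCE
    if code not in valid_codes:
        return RULE_MISS
    return region_dim.get(code, NO_REFERENCE)

for code in ("BJ", None, "ZZ", "GZ"):
    print(code, region_key(code))
```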
For quality problems produced by the ETL process itself, there must also be safeguards. Experience shows that implementation without safeguards is painful. Every implementer is familiar with loading the same data repeatedly, or with data having already reached the final cube before anyone discovers that the very first ETL step was wrong. The safeguard is a data validation mechanism whose purpose is to monitor data quality during the ETL process and raise alarms. This module should treat the implementers as its end users; it is they who benefit most directly from a data validation mechanism.
First of all there must be a quality metric: what counts as high quality and what counts as low quality cannot rest on gut feeling, although gut feeling is the usual practice in the absence of a measurement method. For the operations analysis system, Unicom headquarters issued an acceptance-testing specification, which is in effect such a metric, for example that the error of an indicator must not exceed 5%. Whether that measurement method is scientific is another matter, but the system itself must have one. For measuring the quality of ETL data processing, the method should be stricter than the one defined in the headquarters test specification, because the latter treats the BI system more as a black box, allowing a certain error between the source data and the displayed figures, whereas measuring ETL processing quality is a white-box measurement that should pay attention to every step of the process. In theory, therefore, the indicators computed from input and output should agree exactly. Complete agreement is only an ideal, but for any data that disagrees we must find the cause.
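A tiny sketch of the black-box style of check, with an invented 5% tolerance and invented indicator values; the white-box check is the same comparison with a zero tolerance applied at every intermediate step.

```python
def error_rate(source_value: float, target_value: float) -> float:
    """Relative error of an indicator between source and target."""
    if source_value == 0:
        return float("inf") if target_value != 0 else 0.0
    return abs(target_value - source_value) / abs(source_value)

# Black-box acceptance: the displayed indicator may drift up to 5% (assumed).
TOLERANCE_BLACK_BOX = 0.05
# White-box ETL check: every step should reproduce its input exactly.
TOLERANCE_WHITE_BOX = 0.0

src_total, dw_total = 1_000_000.0, 998_500.0   # invented figures
rate = error_rate(src_total, dw_total)
print(f"error rate {rate:.4%}",
      "pass" if rate <= TOLERANCE_BLACK_BOX else "fail",
      "(black box)")
print("white box requires", rate <= TOLERANCE_WHITE_BOX)
```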
On top of the quality metric, a data validation framework can be established. This framework is based on the method of total-and-component data auditing, which was described in the earlier article on data warehouse auditing techniques. As a supplement, here are a few functional recommendations:
1. Provide a front end. Treat the developers and implementers as users and give them a friendly user interface. The auditing article mentioned test reports; such reports still rely on human judgment to find patterns in a mass of figures. Instead, provide an OLAP-style interface that presents not only the raw statistics from the checks but also the computed quality measures, such as the error rate. Any indicator with an error rate greater than zero needs its cause investigated.
2. Provide a framework. Data validation is not a one-off job; it must be performed in every ETL run. So there must be a framework that automates the validation process and offers extension points that let implementers widen the scope of validation. Such a framework in effect standardizes the operation: developers and implementers can focus on writing validation scripts rather than on how to integrate validation into the process and how to present the results. To this end, design a set of tables, similar to DM tables, in which every validation result is recorded, and automatically trigger the loading and publishing of the corresponding multidimensional analysis data. This lets implementers observe the error rate of the data at every load and even at every step of the process. Moreover, if the data warehouse model can be unified, even the validation scripts can be standardized, leaving only the process to be regulated. (A bare-bones sketch of such a framework follows this list.)
3. Standardize the process. As mentioned earlier, one cause of ETL data quality problems is manual operation, and the root cause is an unstandardized process. Letting developers and implementers run a single ETL unit is convenient, and it has long been suggested that an ETL unit must be "reentrant", which solves the problems of accidentally deleted data and repeatedly loaded data. But remember that data validation also lives inside the process, so that validation happens routinely, day after day, without the implementers even noticing it. In general, standardizing the process is the key to improving implementation efficiency, and it too deserves continued exploration.
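A bare-bones sketch of recommendation 2: validation checks are registered with the framework, executed after each ETL run, and their results written to a results table (here just a list of dicts); all names, figures, and the result-table layout are invented.

```python
from datetime import date
from typing import Callable, Dict, List, Tuple

# The "DM-like" results table: one row per check per ETL run.
validation_results: List[Dict] = []

# Registered checks: name -> function returning (measured_value, expected_value).
checks: Dict[str, Callable[[], Tuple[float, float]]] = {}

def register(name: str):
    def wrap(fn):
        checks[name] = fn
        return fn
    return wrap

def run_validations(load_date: date) -> None:
    """Run every registered check and record its error rate."""
    for name, fn in checks.items():
        measured, expected = fn()
        rate = abs(measured - expected) / abs(expected) if expected else 0.0
        validation_results.append({
            "load_date": load_date, "check": name,
            "measured": measured, "expected": expected, "error_rate": rate,
        })

# An implementer only writes checks like this one (figures are invented):
@register("fee_total_vs_source")
def fee_total_vs_source():
    source_total = 1_000_000.0   # would come from the source system audit
    dw_total = 1_000_000.0       # would come from the warehouse fact table
    return dw_total, source_total

run_validations(date.today())
for row in validation_results:
    print(row)
```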
Exploring the Essence of ETL VI (Metadata Ramble)
There is nothing particularly exciting about the definition of metadata. It is a very broad concept, usually defined as "metadata is data that describes data", which has a circular feel, like answering "where does Xiao Qiang live?" with "next door to Wang Cai". By this definition, what describes the data that describes the data? More metadata, and then meta-meta-data, and so on. I have also heard metadata described this way: if data is the files in a drawer, then metadata is the labels on the drawers. But then what is the difference between metadata and an index?
Metadata is an abstraction. Philosophers have been abstracting the world since ancient times, trying to find its essence. Abstraction is not a single layer of relationships; it is a gradual process from the concrete to the general. For example: me, man, human, mammal, creature is such an abstraction chain, and this pattern is very familiar in the software industry, since object orientation is exactly such a method of abstraction: it abstracts the things and processes of the world and builds object models from them. Likewise, in object-oriented methods a class is an abstraction of objects, and an interface is an abstraction of classes. So I think it is reasonable to swap "meta" for "abstract": reading "metadata" as "abstracted data" perhaps makes it a little easier to understand.
We often hear phrases like "the leader's speech takes a commanding view and points out the direction for our future work". The idiom behind "commanding view" literally evokes pouring water down from the tenth floor: from high above, it strikes with force. It means looking at things from a certain height, and that height comes from having enough "meta". Design patterns emphasize programming to an interface: you do not deal with the interaction between this object and that object, but with the interaction between this interface and that interface, without caring at all about what happens inside them.
The significance of metadata lies exactly here. Although the discussion above wandered into philosophy, the term must still be understood in the context of software design. I do not know whether other fields use the term "metadata", although I am sure they have similar things. Metadata exists so that software can be designed at a higher level of abstraction. That must be a good thing: flexibility, extensibility, and maintainability all improve, and the structure becomes clearer; the price is too many indirections, so that viewed from the bottom it looks overly complex. Long ago I read the BackOrifice code and was baffled: a simple function was traced from this class to its parent class and then to the parent's parent, and I could not understand why such a simple function was not just implemented in one method of one class. Thinking about it now, I was wrong: that flat style does make the code easier to follow, but the structure would be truly chaotic; it could only ever do what it does now, and any functional extension would render the code useless.
I first encountered the concept of metadata when I started working in 1998; it was then called a metadata-driven system architecture, and later in QiDSS I used the concept again to build QiNavigator. In hindsight, the metadata there was nothing special: a pile of tables describing interface elements, which were then used to generate the interface automatically. In data warehouse systems the concept is much stronger and is an important part of the data warehouse. Yet to this day I still feel that the concept is too vague: looking at actual products, there are some metadata management tools, but from what I can learn of their application they are not used much. The reason it feels vague is that the levels of abstraction are unclear; the key questions are the classification of metadata (classification is itself an abstraction process) and the use of metadata. You can abstract metadata all the way down to 0 and 1, but is that useful for your business? You must abstract to the appropriate level; in the end the question is one of "degree".
So how should the metadata of a data warehouse system be used? Is it only there to make the system run automatically and be easy to manage? To achieve that, there is no need to abstract the system up to taiji, the two forms, and the eight trigrams. The industry has also defined some metadata standards, such as CWM and XMI, which can be drawn on, but I am not proficient in them, so that discussion will have to wait.
Experience summary of ETL