ETL for Data Warehouses: A Practical Summary

ETL stands for Extraction-Transformation-Loading; in Chinese, the three steps are called data extraction, data conversion, and data loading.

Most data warehouse architectures can be summarized as:

Data source --> ODS (Operational Data Store) --> DW (Data Warehouse) --> DM (Data Mart)

ETL runs through every link in this chain.

First, data extraction:

Data extraction can be understood as pulling data from the source systems into the ODS or DW.

1. Source data types:

Relational databases, such as Oracle, MySQL, SQL Server, etc.;

Text files, such as the log files generated by users browsing a website, or data delivered by a business system as flat files;

Other external data, such as manually entered data.

2. Extraction frequency:

Most extractions run once a day, though they can also run hourly or even every minute according to business requirements; of course, you must consider how much load the source database system can withstand.

3. Extraction Strategy:

Personally, I feel this is the most important part of data extraction. Strategies can be divided into full extraction and incremental extraction.

Full extraction suits tables whose data volume is small and whose changes are hard to track, such as relationship tables, dimension tables, and configuration tables.

Incremental extraction is adopted when the data volume is too large for full extraction, or to save time and shorten the extraction window.

Determining the increment is the hardest part of incremental extraction; it typically involves one of the following scenarios:

a) Extract the increment by a time-identification field: the source table is a flow (journal) table with a clear field identifying the current day's data,

such as createtime, updatetime, and so on.
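
A minimal sketch of this strategy, assuming a MySQL-style dialect and an illustrative table src_orders with an updatetime column:

```sql
-- Incremental extraction by timestamp: pull only rows touched since the
-- last successful run (the two dates are the run's watermark and cutoff).
SELECT *
FROM   src_orders
WHERE  updatetime >= '2024-01-01 00:00:00'
AND    updatetime <  '2024-01-02 00:00:00';
```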

b) Extract the increment based on the auto-increment ID recorded at the end of the last extraction: the table has no createtime, but it does have an auto-incrementing field,

such as an auto-increment ID. After each extraction, record the maximum ID;

the next extraction then pulls everything above the ID recorded last time.
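
A sketch under the same assumptions, with a hypothetical etl_watermark table storing the maximum ID seen by the previous run:

```sql
-- Pull rows above the last recorded ID.
SELECT s.*
FROM   src_orders s
WHERE  s.id > (SELECT last_max_id
               FROM   etl_watermark
               WHERE  table_name = 'src_orders');

-- After loading, advance the watermark for the next run.
UPDATE etl_watermark
SET    last_max_id = (SELECT MAX(id) FROM src_orders)
WHERE  table_name = 'src_orders';
```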

c) Obtain the incremental data by analyzing the database logs (for example, MySQL binlog or Oracle redo-log mining): used for relational tables that have neither a time-identification field nor an auto-increment ID;

d) Compare the current day's data with the previous day's data by hash. This strategy is more complex, so it is described here in detail.

Take, for example, a membership table whose primary key is MemberID and whose member status may be updated every day.

After the first extraction, we generate an auxiliary table A containing two fields: the first is MemberID,

and the second is a hash generated from all the remaining fields concatenated together.

At the next extraction, the source table is processed the same way to generate table B. B is then left-joined to A: records whose hash fields differ are changed records, and records with no match in A are new records.

Finally, the corresponding records are extracted from the source table according to the MemberIDs of these two groups of records.
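
A sketch of the comparison, assuming a MySQL-style dialect and illustrative non-key columns (name, status, level); MD5 stands in for whatever hash function is actually used:

```sql
-- Today's snapshot: primary key plus a hash of all other fields.
CREATE TABLE snapshot_b AS
SELECT member_id,
       MD5(CONCAT_WS('|', name, status, level)) AS row_hash
FROM   src_member;

-- Left-join today's snapshot (B) to yesterday's (A):
-- no match in A means a new record; differing hashes mean a change.
SELECT b.member_id
FROM   snapshot_b b
LEFT JOIN snapshot_a a ON a.member_id = b.member_id
WHERE  a.member_id IS NULL
   OR  a.row_hash <> b.row_hash;
```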

e) The incremental data is actively pushed by the source system, e.g. for order tables or transaction tables.

Some business systems are designed so that when an order's status changes, the source table is simply updated in place,

yet the data warehouse needs to record every state an order has passed through.

This is when you have to make changes on the source system side, and database triggers are generally undesirable. The approach I can think of is to modify the business system slightly:

whenever the order status changes, append a record to a flow (journal) table, which can be written to the database or to a log file.
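
A minimal sketch of such a flow table (illustrative names; the business code, not a trigger, does the insert):

```sql
CREATE TABLE order_status_flow (
    flow_id     BIGINT AUTO_INCREMENT PRIMARY KEY,
    order_id    BIGINT      NOT NULL,
    old_status  VARCHAR(20),
    new_status  VARCHAR(20) NOT NULL,
    change_time DATETIME    NOT NULL
);

-- One row per status change; ETL later extracts by change_time or flow_id.
INSERT INTO order_status_flow (order_id, old_status, new_status, change_time)
VALUES (10001, 'PAID', 'SHIPPED', NOW());
```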

Of course, there are other extraction strategies. Which one to adopt depends on the situation of the source data system,

the storage and processing logic for the extracted data in the data warehouse, the extraction time window, and other factors.

Second, data cleaning:

As the name implies, data cleaning is the processing of unnecessary or non-conforming data. Data cleaning is best done during the extraction process,

which saves subsequent computation and storage costs.

When the source data is in a database, the extraction SQL itself can do much of the data cleansing work.

Data cleansing mainly includes the following aspects (a combined SQL sketch follows the list):

1. Null-value handling: according to business needs, null values can be replaced with a specific value or filtered out directly;

2. Validating data correctness: mainly handling data that does not match its business meaning, for example replacing a string found in a numeric field

with 0, filtering out non-date strings in a date field, and so on;

3. Standardizing data formats, for example formatting all dates as YYYY-MM-DD;

4. Data transcoding: a field encoded in the source data is converted, through an associated code table, into a value that expresses its true meaning;

5. Data standardization and unification: for example, the source data may represent male and female in many different ways; at extraction time these are converted directly into the values defined in the model,

giving a uniform representation of gender;

6. Other data cleansing defined by business rules...
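
A combined sketch of the aspects above, done inside the extraction SQL (MySQL-style dialect; all table, column, and code-table names are illustrative):

```sql
SELECT COALESCE(u.nickname, 'unknown')                      AS nickname,      -- 1. null handling
       CASE WHEN u.age REGEXP '^[0-9]+$'
            THEN CAST(u.age AS UNSIGNED) ELSE 0 END         AS age,           -- 2. correctness check
       DATE_FORMAT(u.register_time, '%Y-%m-%d')             AS register_date, -- 3. format standardization
       c.city_name                                          AS city,          -- 4. transcoding via code table
       CASE WHEN u.gender IN ('M', 'm', '1', 'male')
            THEN 'M' ELSE 'F' END                           AS gender         -- 5. unification
FROM   src_user u
LEFT JOIN code_city c ON c.city_code = u.city_code
WHERE  u.register_time IS NOT NULL;                  -- filter rows that fail a business rule
```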

Third, data conversion and loading:

Many people think ETL is finished once the first two parts are done and the data has been loaded into the data warehouse database.

But data conversion and loading are not just a step in source data --> ODS; the ODS --> DW and DW --> DM steps contain even more important and complex ETL processing.

1. What is the ODS?

The ODS (Operational Data Store) is an optional part of the data warehouse architecture.

The ODS has some features of the data warehouse and some features of an OLTP system;

it is "subject-oriented, integrated, current or near-current, and constantly changing" data. --from Baidu Encyclopedia

In practice, most of the time the ODS simply acts as temporary data storage, a data buffer. Generally speaking,

after data is loaded from the source systems into the ODS, it is retained for a period of time; when a problem in the subsequent processing logic requires recalculation,

the data can be obtained directly from the ODS without extracting from the source systems again, which reduces the pressure on them.

In addition, the ODS can provide data directly to the DM or to front-end reports, such as dimension tables or other data that needs no further computation.

The ODS can also serve other purposes, such as storing detail data against a rainy day, etc.

2. Data conversion (refresh):

Data conversion, which more people call data refresh, uses the incremental or full data in the ODS to refresh the tables in the DW.

The tables in the DW are basically created according to a pre-designed model, such as fact tables, dimension tables, and summary tables,

and new data needs to be refreshed into these tables every day.

The programs that refresh these tables are developed at the very beginning; each day you only need to pass in some parameters, such as a date, to run them.
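
A minimal sketch of such a date-driven refresh program as a stored procedure (MySQL syntax; the fact table and ODS table are illustrative):

```sql
DELIMITER //
CREATE PROCEDURE refresh_fact_order (IN p_date DATE)
BEGIN
    -- Make the refresh re-runnable: clear the day being refreshed first.
    DELETE FROM dw_fact_order WHERE order_date = p_date;

    -- Reload that day's data from the ODS.
    INSERT INTO dw_fact_order (order_date, member_id, amount)
    SELECT o.order_date, o.member_id, SUM(o.amount)
    FROM   ods_order o
    WHERE  o.order_date = p_date
    GROUP BY o.order_date, o.member_id;
END //
DELIMITER ;

-- The daily scheduler then simply runs: CALL refresh_fact_order('2024-01-01');
```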

3. Data loading:

Personally, I would call every insert of data into a target table data loading. Whether it is done as Delete+Insert, Truncate+Insert,

or Merge is determined by the business rules, and those rules are embedded in the data extraction and conversion programs.
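
The three patterns side by side, as a sketch (MySQL syntax, where a merge is expressed as INSERT ... ON DUPLICATE KEY UPDATE; table names are illustrative):

```sql
-- Delete+Insert: replace one day's slice, then reload it.
DELETE FROM dm_order_summary WHERE stat_date = '2024-01-01';
INSERT INTO dm_order_summary
SELECT * FROM stg_order_summary WHERE stat_date = '2024-01-01';

-- Truncate+Insert: rebuild the whole table from scratch.
TRUNCATE TABLE dm_order_summary;
INSERT INTO dm_order_summary SELECT * FROM stg_order_summary;

-- Merge (upsert): insert new keys, update existing ones.
INSERT INTO dm_order_summary (stat_date, member_id, amount)
SELECT stat_date, member_id, amount FROM stg_order_summary
ON DUPLICATE KEY UPDATE amount = VALUES(amount);
```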

Fourth, ETL tools:

In traditional-industry data warehouse projects, off-the-shelf ETL tools are mostly used, such as Informatica, DataStage, Microsoft SSIS, and so on.

I have used all three of these tools. Their advantages: a graphical interface, simple development, and clear data flow. Their disadvantages: they are limited and inflexible,

they struggle with large data volumes, and tracking down errors in them is expensive.

Choosing an ETL tool must take into account the environments of the source systems and the data warehouse, and of course the cost. If both the source data systems and the data warehouse use

Oracle, then I think all the ETL can be completed with stored procedures.

In the larger Internet companies, because of the huge data volumes and special requirements, ETL tools are mostly developed in-house

or built through secondary development on open-source tools. In actual work,

a stored procedure, a shell/Perl script, a Java program, and so on can all serve as ETL tools.

Fifth, metadata in the ETL process:

Imagine that, as a newcomer, you take over someone else's work: there is no documentation, no program comments,

and none of the tables or fields in the database carry any comments. Wouldn't you be completely lost?

A business system changes and drops a field, and the data warehouse must adjust accordingly.

How do you know which programs are affected by changing this field?

...

The fields of the source system tables and their meanings, the IPs of the source system databases, the contact persons, the fields of the data warehouse tables and their meanings,

the correspondence between source tables and target tables, the source and target tables of each task, the dependencies between tasks,

the execution status of tasks, and so on: if all this metadata is strictly managed, the problems above cease to be problems...
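
As one illustration (an assumed design, not from the original article), even a single metadata table mapping source columns to target columns and tasks would answer the impact question above:

```sql
CREATE TABLE meta_column_map (
    src_table  VARCHAR(64),
    src_column VARCHAR(64),
    tgt_table  VARCHAR(64),
    tgt_column VARCHAR(64),
    task_name  VARCHAR(64)   -- the ETL task that performs this mapping
);

-- "Which programs are affected if src_orders.status is dropped?"
SELECT DISTINCT task_name
FROM   meta_column_map
WHERE  src_table = 'src_orders' AND src_column = 'status';
```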


Reproduced from: http://superlxw1234.iteye.com/blog/1666960

I'd say this article is solid, practical material; the real techniques are concentrated in it.


Here is some of my own experience regarding the points above:

3. Extraction strategy: for small tables (say, up to 500,000 rows), use full extraction as far as possible; it avoids data omissions and similar errors.

d) Incremental hash comparison: the ETL tool Kettle implements a similar strategy. First do a full load from the source system into the target table; afterwards, pull the full set of primary keys from the source system and compare it with the target table. Keys the source has but the target lacks are inserts; keys the target has but the source lacks are deletes; keys present in both whose contents differ are updates.

d) Example: see "Kettle Introduction (VII): Kettle incremental schemes (I), full comparison to obtain the increment, based on a unique identifier".
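
A sketch of that full primary-key comparison (illustrative tables; it complements the hash approach above by also detecting deletes):

```sql
-- Keys in the source but not the target: rows to insert.
SELECT s.member_id
FROM   src_member s
LEFT JOIN dw_member t ON t.member_id = s.member_id
WHERE  t.member_id IS NULL;

-- Keys in the target but no longer in the source: rows to delete.
SELECT t.member_id
FROM   dw_member t
LEFT JOIN src_member s ON s.member_id = t.member_id
WHERE  s.member_id IS NULL;
```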


Regarding the article's sentence "If both the source data systems and the data warehouse use Oracle, then all the ETL can be completed with stored procedures": I think this roughly holds for the T and L steps, but not for the E step. For tasks like extracting increments from various source systems and batch-committing them into ODS tables, an ETL tool with a visual interface such as Kettle is more convenient to configure and easier to manage.
