The Staging Area
Preparing data, often also called data management, refers to acquiring data, transforming it into information, and ultimately delivering that information to the front-end query interface. The back end does not provide query services; the data warehouse methodology assumes that data access in the back end is strictly forbidden, since querying is the sole purpose of the front end. The back-end part of the data warehouse is often referred to as the staging area. Data staging mainly refers to writing data to disk, and each of the four main ETL steps involves staging data. The following figure shows a data warehouse component architecture diagram.
The Role of the Staging Area
Whether data should be stored in a physical staging area or processed entirely in memory is one of the most fundamental choices in ETL architecture. The efficiency of the ETL processing you develop depends heavily on the ability to balance physical I/O against in-memory processing. Achieving an ideal balance between writing data to staging tables and keeping it in memory is a real challenge and must be considered during optimization. The final decision depends on two conflicting goals:
Getting the data from the data source to the final destination as quickly as possible
Being able to recover without starting from scratch when an error occurs during processing
The staging strategy varies greatly depending on the environment and business requirements. Even if you plan to handle all ETL processing in memory, do not forget that any data warehouse, regardless of its architecture and operating environment, contains some form of staging area. The reasons for staging data before loading it into the data warehouse include the following:
Recoverability
In most enterprise environments, a series of important transformations is performed after data is extracted from the source system. If, for a given table, the amount of transformation is large, then best practice is to stage the data immediately after it is extracted. These staging tables (in the database or the file system) can serve as recovery points: if the transformation process fails, the process can restart from these tables without accessing the source system again. Similarly, if the load process fails, there is no need to re-run the transformations. If data is staged purely for recovery purposes, it should be stored as sequential files in the file system rather than in a database. Staging data for recovery is particularly important when extracting from operational systems, because data in those systems is overwritten and modified.
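As a rough illustration of staging for recovery, the following sketch writes extracted rows to a delimited sequential file and reads them back for the transform step; the file name, delimiter, and column names are hypothetical.

```python
import csv

def stage_extract(rows, staging_path):
    """Write extracted rows (a list of dicts) to a pipe-delimited flat file."""
    if not rows:
        return
    with open(staging_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), delimiter="|")
        writer.writeheader()
        writer.writerows(rows)

def reload_staged(staging_path):
    """Re-read the staged file so transform/load can restart from this
    recovery point without touching the source system again."""
    with open(staging_path, newline="") as f:
        return list(csv.DictReader(f, delimiter="|"))

# Stage the extract immediately, then run transformations against the staged copy.
extracted = [{"order_id": "1001", "status": "SHIPPED"}]
stage_extract(extracted, "orders_extract_20240101.txt")
staged_rows = reload_staged("orders_extract_20240101.txt")
```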
Backup
In general, the sheer volume of data makes it impractical to take reliable data warehouse backups at the database level. As long as the load files have been saved, compressed, and archived, a database failure need not become a disaster. If the staging tables are stored in the file system, they can be compressed into very small files and stored on the network. When data needs to be reloaded into the data warehouse, only the load files need to be decompressed and reloaded.
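A minimal sketch of the compress-and-archive idea, assuming a hypothetical archive directory and load-file path; gzip from the Python standard library stands in for whatever compression tool your environment uses.

```python
import gzip
import os
import shutil

def archive_load_file(load_path, archive_dir):
    """Compress a saved load file and place the compressed copy in the archive."""
    os.makedirs(archive_dir, exist_ok=True)
    gz_path = os.path.join(archive_dir, os.path.basename(load_path) + ".gz")
    with open(load_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path

def restore_load_file(gz_path, restore_path):
    """Decompress an archived load file so it can simply be reloaded."""
    with gzip.open(gz_path, "rb") as src, open(restore_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Typical use after a successful load:
#   archive_load_file("orders_load_20240101.txt", "archive/")
```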
Audits
Many times, the data lineage between the source system and the target system is buried in the ETL code. When the ETL process is audited, the existence of the staging area makes it possible to compare the different stages of the ETL process directly, because the auditor (or programmer) can simply compare the original input file with the output file to verify the transformation logic. Staged data is especially useful when the source system overwrites historical data. Questions about the integrity of data in the warehouse may arise days or even weeks after an event occurs; examining the extracted data in the staging area for the corresponding time period will help restore confidence in the accuracy of the data warehouse.
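A possible audit aid, sketched under the assumption that both the staged input file and the transformed output file are pipe-delimited and share a numeric column worth reconciling; the file paths and the column name are hypothetical.

```python
import csv

def audit_staged_files(input_path, output_path, amount_col="amount"):
    """Compare row counts and a numeric column between the staged input file
    and the transformed output file so an auditor can spot unexplained drift."""
    def summarize(path):
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f, delimiter="|"))
        total = sum(float(r[amount_col]) for r in rows if r.get(amount_col))
        return len(rows), round(total, 2)

    in_rows, in_total = summarize(input_path)
    out_rows, out_total = summarize(output_path)
    return {"row_delta": out_rows - in_rows,
            "amount_delta": round(out_total - in_total, 2)}
```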
The ultimate goals: get the data from the data source to the final target as quickly as possible, and be able to recover without starting from scratch when the process fails.
Designing the Staging Area
The staging area stores data on its way to the final data warehouse presentation area. Sometimes staging-area data is kept to support functions that require historical data; other times it is deleted after each processing run. A staging area that maintains historical information is often referred to as a persistent staging area; one whose data is deleted after each load is transient. Most data staging areas use a mixed mode, with both transient and persistent staging tables.
When designing a staging area, a volumetric worksheet should capture at least the following for each table: table name, update strategy (truncate/insert/update), load frequency, ETL jobs (the operations or programs that act on the staging table or file; when multiple jobs touch a single table, all of them are listed in this field), initial row count, average row length, growth, rows per month, initial size, and so on.
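One way to keep such a worksheet machine-readable is sketched below; the field names mirror the items above, and the sample values are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class StagingTableEstimate:
    table_name: str
    update_strategy: str        # truncate/insert/update
    load_frequency: str
    etl_jobs: list              # every job that touches this table or file
    initial_rows: int
    avg_row_length_bytes: int
    rows_per_month: int

    @property
    def initial_size_bytes(self) -> int:
        return self.initial_rows * self.avg_row_length_bytes

    @property
    def monthly_growth_bytes(self) -> int:
        return self.rows_per_month * self.avg_row_length_bytes

stg_orders = StagingTableEstimate("STG_ORDERS", "truncate/insert", "daily",
                                  ["LOAD_ORDERS"], 1_000_000, 200, 50_000)
```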
This worksheet assumes that staging tables are created in a DBMS; in practice, the staging area usually combines DBMS tables with text files in the file system. In most cases you need flat files outside the DBMS to stage data for higher-speed sequential processing. In that case you need a corresponding file-system size-estimate worksheet. The usual approach is to place the files in the development area and then record the space they occupy as the estimate for the production space allocation.
Data Structures in the ETL System
Flat File
If you do not use a dedicated ETL tool but instead do all ETL tasks in the database with SQL, you need to create DBMS tables to hold all the staged data. If you are using an ETL tool, you can use simple files in the file system to hold staged data. When staged data is stored in the file system as rows and columns, like a database table, we call the files flat files or sequential files. ETL tools and scripting languages can manipulate ASCII flat files as easily as database tables, and in some cases process them faster.
A DBMS incurs significant overhead to manage the data it processes, which is not really important for a simple data staging area. In our experience, operations such as sorting, merging, deleting, and replacing, and other data-migration functions, are much faster outside the DBMS, and there are many specialized utilities for working with text files.
If you use a scripting language to manipulate flat files, you are obliged to provide metadata tables that track the transformations.
The disadvantages of flat files: flat files cannot be indexed for fast lookups; for truncate or insert operations, relational tables are faster than flat files; and when the ETL system reads data from staging tables and needs to filter or join different tables, database storage is the better choice.
Using flat files in the ETL process is most practical when the basic requirements are the following:
1. Staging source data for protection and recovery. If an error occurs in the ETL process after data has been extracted, you must be able to re-run the ETL process without revisiting the source system. Saving the extracted data in flat files is the best way to guarantee recovery without repeated access to the source system.
2. Sorting data. Sorting in the file system can be more efficient than ORDER BY in the database.
3. Filtering. grep can be more efficient than a WHERE clause, even one backed by an index.
4. Replacing or partially replacing text strings, which is more efficient at the operating-system level than in the database.
5. Aggregation. Aggregation can be done in the ETL data stream before the data is loaded into the database, or in the database itself if the data to be aggregated is only available through database filtering. If aggregation is completed outside the database, a specialized sort package is often required; if it is done inside the database, the database's own sorting facilities are used, although it is occasionally still worth exporting large volumes of data and using a sort package.
6. Source data references. In a normalized transaction-processing system there is usually a single reference table that supports several other tables. For example, a common status table can support order status, shipping status, and payment status. Rather than querying the same reference table in the source system repeatedly, a more efficient approach is to extract the reference table into the data staging area once. The ETL tool then looks up the referenced data in the staging area. Most ETL tools can load reference tables into memory and keep them available throughout the ETL process, and accessing a reference table from memory is extremely fast (a minimal sketch follows the list). In addition, using reference tables in the staging area allows many table joins in the source system to be omitted, which makes queries against the source system simpler and more efficient.
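A minimal sketch of item 6, using sqlite3 as a stand-in for the staging database: the reference table is staged once and then held in memory as a dictionary for fast lookups during the ETL run. The table and column names are hypothetical.

```python
import sqlite3

def load_status_reference(staging_conn):
    """Read the staged reference table once and keep it as an in-memory dict."""
    cur = staging_conn.execute("SELECT status_code, status_desc FROM stg_status_ref")
    return dict(cur.fetchall())

# Build a small staged copy of the reference table, then resolve codes from memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_status_ref (status_code TEXT, status_desc TEXT)")
conn.executemany("INSERT INTO stg_status_ref VALUES (?, ?)",
                 [("S", "Shipped"), ("P", "Paid")])
status_lookup = load_status_reference(conn)
order_status_desc = status_lookup.get("S", "Unknown")   # -> "Shipped"
```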
XML Data Sets
XML datasets are often not used in ETL systems to permanently store staging area data, and they are more suitable as standard formats for input and output of ETL systems.
The difference between XML and HTML: XML stores data and metadata in plain-text documents but contains no formatting information; HTML contains data and formatting information but no metadata.
Leaving aside the complexity of deeply nested XML structures, XML is currently the most effective medium for transferring data between incompatible systems, because an XML document provides enough information to generate a complete CREATE TABLE statement for a relational database and to supply the data with the correct column types. XML defines a common language for data sharing, which is its advantage. Its disadvantage is that the XML document structure is too verbose for transmitting very large data volumes. Using XML requires both parties to recognize the possible tags by exchanging a document type definition (DTD); the DTD establishes the metadata understanding needed for the two parties to exchange XML.
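To illustrate the point about CREATE TABLE statements, here is a small sketch that derives one from an XML description; the XML layout used is a hypothetical convention, not a formal DTD.

```python
import xml.etree.ElementTree as ET

doc = """<table name="STG_CUSTOMER">
  <column name="customer_id" type="INTEGER"/>
  <column name="customer_name" type="VARCHAR(100)"/>
</table>"""

root = ET.fromstring(doc)
columns = ", ".join(f"{c.get('name')} {c.get('type')}" for c in root.findall("column"))
create_sql = f"CREATE TABLE {root.get('name')} ({columns})"
# -> CREATE TABLE STG_CUSTOMER (customer_id INTEGER, customer_name VARCHAR(100))
```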
Relational Tables
The staging area data can be stored in a relational DBMS. In particular, using a database table is most appropriate when you are not using specialized ETL tools. Using a database to store a staging area table has the following advantages:
Apparent metadata. The main flaw of flat files is the lack of readily visible metadata. When data is stored in relational tables, the DBMS automatically maintains the technical metadata, and business metadata can easily be attached to the tables. Column names, data types and lengths, and cardinality information are inherited from the database system, and business descriptions of tables and columns are usually added as elements of the DBMS data dictionary.
Relational abilities. Data integrity between entities, or referential integrity, can easily be enforced and maintained in a relational database environment. If you receive data from a non-relational system, it makes sense to stage it in a normalized model before converting it to a dimensional model.
Open repositories. Once the data enters the DBMS, the data can easily be accessed by any SQL-enabled tool (if appropriate permissions are given). Ensuring access to data is critical when conducting quality assurance testing and audits.
DBA support. In many enterprise environments, the DBA team is only responsible for the data in the DBMS. Data outside the database, such as file systems, is typically not maintained by the DBA. If the staging area data is not in the DBMS, the ETL team must undertake tasks such as space allocation, backup and recovery, archiving, and security.
SQL interface. We often need to write SQL to manipulate data and get it into the correct format. SQL is a standard language, easy to write, and powerful; it is probably the most widely mastered programming language in IT. Most database systems provide rich SQL functions that save users manual programming time. Beyond enforcing referential integrity, the ability to use native SQL is the main reason for storing staging-area data in a database environment.
Standalone DBMS Working Tables
Transaction databases are designed for data entry, dimensional models are designed for data output, and the staging area contains elements of both. The reason for using standalone tables is that simple is best. These tables are called standalone because they have no dependencies on other tables in the database; in a transactional environment they would be called orphan tables because they have no relationships with other tables in the model. Because standalone staging tables have no relationships, they are prime candidates for being stored outside a relational database.
Most of the time, the purpose of creating a staging table is to hold data so that it can be manipulated again with SQL or a scripting language. In many cases, especially small data warehouse projects, standalone tables can satisfy all of the staging-area requirements. Although standalone tables do not require normalization, they must not be treated as dump files. Dump files are usually created arbitrarily, with no regard for disk space or query efficiency. Every field of a standalone file or table must have a purpose and a logical definition, and superfluous columns should be left out of any standalone table design.
As with any database table, a reasonable indexing plan must be established and implemented on all standalone tables. Because standalone tables are, most of the time, accessed only by the ETL process, there is no need for the bitmap indexes used in the presentation area; bitmap indexes serve end-user tools and ad hoc queries. In an ETL system, B-tree indexes on single columns or composite columns are used more often.
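A minimal sketch of such an index plan for a standalone staging table, using sqlite3 (whose indexes are B-trees) as a stand-in; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE stg_customer_lookup (
    natural_key   TEXT,
    customer_name TEXT,
    load_date     TEXT)""")
# Plain B-tree indexes on the columns the ETL jobs actually probe; no bitmap
# indexes, because only the ETL process reads this table.
conn.execute("CREATE INDEX ix_stg_cust_nk ON stg_customer_lookup (natural_key)")
conn.execute("CREATE INDEX ix_stg_cust_load ON stg_customer_lookup (load_date, natural_key)")
```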
Third Normal Form Entity/Relationship Models
In practice, we rarely build third-normal-form models of the staging area. In many cases, the data elements arriving here come from several different data sources, have different granularities, and include external data from non-relational sources. In these situations, it is common to eliminate data redundancy and enforce integrity before loading the dimensional data model. The ideal approach is to focus on the individual "problem dimension tables" and check them thoroughly to ensure that the original dirty data has been properly cleaned. Remember that the primary result of normalization is to enforce well-defined many-to-one relationships; the annotations on an entity diagram are neither enforced nor explained unless you manually inspect the graphical description of the model. Be deliberate when modeling, and normalize entities only when necessary.
In general, do not assume that the data staging area must be normalized. The design of ETL processing should be driven by two basic objectives: speed and recoverability. If a staging step in the ETL process serves neither physical data manipulation, nor speed, nor recovery, it should be removed.
Non-Relational Data Sources
An important reason for building a staging-area environment is to integrate non-relational data. Integrating non-relational data usually requires integrity checks, and guaranteeing data integrity is not free: it usually requires physical storage in the data staging area, and the ETL process must be customized to enforce the business rules that a relational source system would maintain automatically. In a relational source, this means there are links between tables, usually parent-child relationships. For example, if an order table has a status column, a value cannot be entered there unless it exists in a separate status reference table; in this scenario, the status table is the parent of the corresponding column in the order table. A parent row in the status table cannot be deleted casually unless all of its children have been deleted first, otherwise the children become orphan records, that is, child records without a parent. Similarly, as long as a foreign key in the order table references a row in the status table, the primary key of that referenced row cannot be deleted. The appearance of orphan records is a sign that referential integrity has been compromised.
Non-relational data sources do not guarantee referential integrity; a non-relational system is essentially a collection of unrelated tables. In many legacy transaction systems, parent-child relationships can be guaranteed only by the application sitting in front of the data. Unfortunately, after years of operation, data integrity in such systems cannot be guaranteed, owing to unavoidable scripted operations or data manipulation performed outside the application. You can be certain that non-relational data sources contain data quality problems of one kind or another.
Unlike a transaction system, the best practice for the data staging area is to perform integrity checks in the ETL process rather than in the database. The difference is that a transaction system expects its data entry to be correct: a manual input error raises an error and prompts the user to re-enter the correct value. The ETL process, in contrast, must know how to handle data exceptions automatically; it cannot simply reject every data integrity error, because nobody will be there to re-enter correct data in time. We need to define business rules for the different data-quality error scenarios and implement them in the ETL process. When bad data passes through the ETL process, sometimes you want to correct it on the fly, sometimes you load it unchanged, sometimes you append a code or a description of the impact and context before loading, and sometimes the data is unacceptable and is rejected outright and written to a reject file for further investigation. Do not overuse reject files: they are meant for dumped data that needs dedicated follow-up processing. When a record lands in a reject file, synchronization between the data warehouse and the production system is broken unless the record is handled before the next major load step runs.
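One of these scenarios, rejecting unacceptable records to a reject file, might look like the following sketch; the valid status set, column names, and file name are hypothetical.

```python
import csv

VALID_STATUS = {"OPEN", "SHIPPED", "PAID"}   # staged copy of the parent status table

def split_orders(order_rows, reject_path):
    """Keep rows whose status has a parent; route orphan records to a reject file."""
    accepted = []
    with open(reject_path, "w", newline="") as f:
        writer = None
        for row in order_rows:
            if row["status"] in VALID_STATUS:
                accepted.append(row)
            else:
                if writer is None:
                    writer = csv.DictWriter(f, fieldnames=list(row.keys()), delimiter="|")
                    writer.writeheader()
                writer.writerow(row)          # orphan record: no parent status
    return accepted

clean = split_orders([{"order_id": "1", "status": "OPEN"},
                      {"order_id": "2", "status": "??"}], "orders_reject.txt")
```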
Basic database referential integrity cannot fully handle each of these scenarios; ETL processes often require hand-coded logic to ensure the successful integration of non-relational data sources.
Surrogate Key Mapping Tables
Surrogate key mapping tables record, for each source system, the mapping from natural keys to the data warehouse surrogate keys, and they are a very effective way to maintain those surrogate keys. These tables are compact and designed for high-speed processing. A mapping table contains only the most recent surrogate key value and the corresponding natural key value from the source system. Because the same dimension can have several sources, a separate column is created in the mapping table for each source's natural key.
A mapping table can be equally efficient whether it is stored in a database or in the file system. If you use a database, you can use the database's sequence generator to create surrogate keys, and with proper indexing, key lookups are very efficient.
Because key mapping tables carry no analytic value, they should not be built in the presentation layer of the data warehouse, nor exposed to end users.
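A minimal sketch of the surrogate key mapping idea, kept entirely in memory for one source system; the natural keys and the starting sequence value are hypothetical.

```python
import itertools

surrogate_seq = itertools.count(start=1)     # stands in for a DBMS sequence generator
key_map = {}                                 # natural key (source system) -> surrogate key

def resolve_surrogate(natural_key):
    """Return the existing surrogate key, or assign the next one to a new natural key."""
    if natural_key not in key_map:
        key_map[natural_key] = next(surrogate_seq)
    return key_map[natural_key]

keys = [resolve_surrogate(nk) for nk in ("CUST-001", "CUST-002", "CUST-001")]
# keys == [1, 2, 1]; after the load, key_map is written back to the mapping table
```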
Impact Analysis
The role of impact analysis is to examine the metadata associated with an object (here, a table or column) and determine how changing that object would affect its content and structure. Changes to data staging area objects can disrupt the data warehouse load process, and allowing arbitrary modification of staging-area objects can be fatal to the success of the project. Once a table has been created in the data staging area, impact analysis must be performed before any change is made to it.
Metadata Capture
The data modeling tool used to design the staging-area database can present metadata visually, and it stores the available metadata in its repository. In addition, the ETL tool generates metadata about its processes and can display all of the transformation processes within it. The metadata types derived from the staging area include:
Data lineage: Perhaps the most interesting metadata in any data warehouse metadata repository is data lineage, also called the logical data map, which describes how data elements are transformed from the original data source to the final data warehouse target.
Business definition: Every table created in the data staging area derives from a business definition. Business definitions can come from many places, including data modeling tools, ETL tools, the database itself, or spreadsheets and Word documents. In any case, they should match the business definitions used in the data warehouse presentation layer to maintain consistency.
Technical definition: In the data staging area especially, technical definitions are more prevalent than business definitions. Remember: if it is not documented, the technical definition does not exist. If the technical definition of a staging-area table is not documented in detail, the table is likely to be rebuilt over and over, producing large amounts of duplicated data in the staging area and eventually a data explosion. A technical definition should describe all the physical attributes of a data element, including its structure, format, and location. Documenting the technical metadata of all staging-area tables minimizes uncertainty and provides high reusability.
Process metadata: The load statistics of every data staging area table must be recorded along with the load statistics of the data warehouse tables. Although information about staging-area loads does not need to be presented to end users, the ETL team needs to know how many records were loaded into each table and whether each process succeeded or failed. Information about the frequency of data refresh is useful to both ETL administrators and end users.
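A minimal sketch of capturing such process metadata per load: each run appends the table name, timestamp, row count, and success or failure to a log file. The log format and file name are hypothetical.

```python
import csv
import datetime
import os

def record_load(log_path, table_name, rows_loaded, succeeded):
    """Append one process-metadata entry for a staging-table load."""
    entry = {
        "table_name": table_name,
        "load_time": datetime.datetime.now().isoformat(timespec="seconds"),
        "rows_loaded": rows_loaded,
        "status": "SUCCESS" if succeeded else "FAILURE",
    }
    write_header = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(entry.keys()), delimiter="|")
        if write_header:
            writer.writeheader()
        writer.writerow(entry)

record_load("etl_process_log.txt", "STG_ORDERS", 125_000, True)
```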