Optimization and Operation
Question 1. Describe the basic types of tables in a data warehouse and the order in which they should be loaded to ensure referential integrity.
The basic types of tables in a data warehouse are the dimension table, the fact table, the sub-dimension table, the bridge table, and so on. The sub-dimension table, which appears in the snowflake model, is handled with the outrigger dimension technique, and the bridge table is used to handle multivalued dimensions or hierarchies.
The tables in the data warehouse have interdependencies, so they must be loaded in a certain order. The basic principles for loading are:
Load the sub-dimension tables first; after they load successfully, load the dimension tables.
After the dimension tables have loaded successfully, load the bridge tables.
Load the fact tables only after the sub-dimension, dimension, and bridge tables have all loaded successfully.
This load sequence is determined by the primary-key/foreign-key relationships between the tables.
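The loading principles above amount to a topological sort over the foreign-key dependency graph. The sketch below shows this with Python's standard-library `graphlib`; the table names and dependency map are illustrative, not from any real schema.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each table lists the tables whose primary
# keys it references via foreign keys, so those tables must load first.
depends_on = {
    "dim_date": set(),                                  # outrigger / sub-dimension
    "dim_customer": {"dim_date"},                       # dimension with an outrigger
    "bridge_customer_group": {"dim_customer"},          # bridge table
    "fact_sales": {"dim_date", "dim_customer", "bridge_customer_group"},
}

# static_order() yields every table after all of its dependencies.
load_order = list(TopologicalSorter(depends_on).static_order())
print(load_order)
```

A scheduler driven this way never attempts to load a fact table before the dimension and bridge tables it references, which is exactly the referential-integrity guarantee described above.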
Question 2. Briefly describe the characteristics of the four levels of ETL technical support work.
After the data warehouse goes online, the ETL team needs to provide technical support to keep the ETL jobs running normally. Typically, this support work is divided into four levels.
1. First-level support is typically a telephone support person, a help-desk style service window. If an error occurs during data migration, or a user discovers a problem with the data, the problem is reported by phone to first-level support. Using the solutions the ETL project team has provided for common problems, the first-level support staff resolve what they can and prevent problems from escalating.
2. Second-level support is typically the system administrators and DBAs. If the first level cannot solve a problem, it is escalated to the second level. Second-level staff are usually technically strong and can handle problems in the hardware infrastructure and software architecture.
3. Third-level support is usually the ETL project leader. If the second level cannot solve a problem, it is escalated to the third level. The ETL project leader should have enough knowledge to solve most problems in the production environment, and can, when necessary, discuss an issue with the developers or an external product vendor in order to find a solution.
4. Fourth-level support is usually the actual ETL developers. If the third level cannot solve a problem, it is escalated to the fourth level. The ETL developers can trace and analyze the code to find a solution. If the problem lies in a vendor's product, the vendor is also asked to provide technical support.
In smaller data warehouse environments, the third and fourth levels can typically be merged, which raises the requirements on the second level. It is not advisable to go to the ETL developers every time a problem occurs. Likewise, first-level support staff should not merely answer the phone: before escalating a problem to the next level, they should do their best to solve it themselves.
Question 3. If the ETL process runs slowly, what steps are needed to find the bottleneck in the ETL system?
Performance problems in an ETL system, with jobs running very slowly, are fairly common. The task then is to locate the system bottleneck step by step.
The first thing to determine is whether the bottleneck is in CPU, memory, I/O, or the network, or whether it is produced by the ETL process itself.
If the environment has no bottleneck, the ETL code needs to be analyzed. Here we can use a process of elimination: isolate the different operations and test them separately. With purely hand-coded ETL, isolating the operations is more troublesome and has to be handled according to how the code is actually written. With an ETL tool, isolation is relatively easy, since current tools generally provide the ability to isolate individual processing steps.
It is best to start the analysis with the extraction operation, then analyze the various calculations, lookups, aggregations, filters, and so on, and finally analyze the load operation.
In practice, the following seven steps can be used to find the bottleneck.
1. Isolate and execute the extraction query.
First isolate the extraction part: remove the transformation and delivery steps and extract the data directly into a flat file. If this step is inefficient, the problem is basically in the extraction SQL. Experience shows that an untuned SQL statement is one of the most common causes of poor ETL performance. If this step shows no problem, move on to step two.
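A minimal sketch of this isolation step, using the standard-library `sqlite3` module as a stand-in source database (the table, query, and file names are all illustrative): the query results are streamed straight to a flat file with no transform or load logic, and only that span is timed.

```python
import csv
import sqlite3
import time

# Stand-in source: an in-memory SQLite table with some sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(10_000)])

# Time the extraction alone: query -> flat file, nothing else.
start = time.perf_counter()
with open("extract_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in conn.execute("SELECT id, amount FROM orders"):
        writer.writerow(row)
elapsed = time.perf_counter() - start
print(f"extraction alone took {elapsed:.3f}s")
```

If this measured time already accounts for most of the total run time, the extraction SQL is the place to tune before looking anywhere else.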
2. Remove the filter conditions.
This applies when extraction is done in full and the filtering happens during ETL processing. Filtering inside the ETL process can sometimes create a bottleneck. Remove the filter first; if this turns out to be the cause, consider filtering the data at extraction time instead.
3. Check the lookup tables.
During ETL processing, reference data is typically loaded into memory to perform code and name lookups and substitutions; these are called lookup tables. A lookup table with too much data can sometimes create a bottleneck. Isolate the lookup step on its own to determine whether the problem is here. Note that the amount of data in a lookup table should be minimized; usually keeping only the natural-key-to-surrogate-key mapping is enough, which reduces unnecessary data I/O.
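The in-memory lookup described above can be sketched as a plain dictionary keyed by the natural key; all names here (`customer_lookup`, `resolve_customer`, the key values) are hypothetical.

```python
# Hypothetical reference mapping loaded into memory once:
# natural key -> surrogate key. Keeping only these two columns
# keeps the lookup table's memory footprint and I/O small.
customer_lookup = {"C001": 1, "C002": 2, "C003": 3}

def resolve_customer(row, lookup, unknown_key=-1):
    """Replace the natural key with a surrogate key; -1 flags unknown keys."""
    row = dict(row)  # copy so the source row is not mutated
    row["customer_sk"] = lookup.get(row.pop("customer_id"), unknown_key)
    return row

fact_rows = [{"customer_id": "C002", "amount": 9.5},
             {"customer_id": "C999", "amount": 4.0}]
resolved = [resolve_customer(r, customer_lookup) for r in fact_rows]
print(resolved)
```

Timing just this substitution loop against the full pipeline run shows whether the lookup step is the bottleneck.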
4. Analyze the sorting and aggregation operations.
Sorting and aggregation are very resource-intensive operations. Isolate this part to determine whether it is causing the performance problem. If so, consider whether the sorting and aggregation can be moved out of the database and the ETL tool and handled at the operating-system level instead.
5. Isolate and analyze each calculation and transformation.
Sometimes the operations in the transformation step also drag down ETL performance. Isolate them one by one to determine where the problem lies, paying attention to operations such as applying default values and converting data types.
6. Isolate the update strategy.
Update operations perform very poorly when the data volume is large. Isolate this part and see whether there is a problem. If a large volume of updates is confirmed as the cause of the performance problem, consider separating the insert, update, and delete processing.
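One way to separate the processing, sketched below with illustrative names: partition the incoming rows into inserts (keys not yet in the target) and updates (keys already present), so that each set can then be run as its own bulk operation instead of one slow row-by-row merge.

```python
# Hypothetical state: keys already present in the target table.
existing_keys = {101, 102}

incoming = [{"id": 101, "amount": 10.0},   # existing key -> update
            {"id": 103, "amount": 7.5}]    # new key      -> insert

# Split the batch once, up front, instead of deciding per row inside
# a single merge statement.
inserts = [r for r in incoming if r["id"] not in existing_keys]
updates = [r for r in incoming if r["id"] in existing_keys]
print(len(inserts), "inserts,", len(updates), "updates")
```

Each partition can then be fed to a bulk insert path and a bulk update path respectively, which is usually far cheaper than interleaving them.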
7. Check the database I/O of the data load.
If none of the previous parts show a problem, the last thing to check is the performance of the target database. Replace the database target with a flat file; if performance improves a great deal, examine the operations in the target database's load process carefully: for example, whether all constraints are disabled, whether all indexes are disabled, and whether a bulk-load utility is used. If performance still does not improve, consider a parallel load strategy.
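As a rough illustration of why the bulk-load path matters, the sketch below uses the standard-library `sqlite3` module as a stand-in target database and compares row-at-a-time inserts against a single bulk `executemany` call; the table name and row counts are arbitrary.

```python
import sqlite3
import time

rows = [(i, i * 2.0) for i in range(20_000)]

def timed_load(use_bulk):
    """Load the sample rows into a fresh in-memory table and time it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, v REAL)")
    start = time.perf_counter()
    if use_bulk:
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)  # bulk path
    else:
        for r in rows:                                         # row-by-row path
            conn.execute("INSERT INTO t VALUES (?, ?)", r)
    conn.commit()
    loaded = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return time.perf_counter() - start, loaded

row_time, n_rows = timed_load(use_bulk=False)
bulk_time, n_bulk = timed_load(use_bulk=True)
print(f"row-by-row: {row_time:.3f}s, bulk: {bulk_time:.3f}s")
```

A real bulk-load utility (with constraints and indexes disabled) widens this gap much further, which is why step 7 checks for its use.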
Question 4. Describe how to evaluate the data loading time of a large ETL job.
Evaluating the data loading time of a large ETL job is a complicated matter. Data loading falls into two categories: initial loading and incremental loading.
When the data warehouse is officially put into use, a first, initial load is required, and the time this initial load needs is generally difficult to predict. During the daily use and maintenance of the data warehouse, an incremental load runs every day, and the volume of data loaded incrementally is much smaller than that of the initial load.
The following uses the initial load as an example of how to evaluate the data loading time of a large ETL job.
To estimate the time of the initial load, divide the whole ETL process into its three parts, extraction, transformation, and loading, and evaluate each part separately.
1. Evaluating the extraction time.
Extraction usually takes up most of the ETL time, and this part is also very difficult to evaluate. To do so, divide the query time into two parts: query response time and data return time. Query response time is the time from the start of query execution to the start of the returned results; data return time is the time from the return of the first record to the return of the last record.
In addition, because the data volume of the initial load is so large, we can select part of the data and use it to evaluate the overall time; in practice, a single partition of the fact table can be selected. The data volume of each partition is generally about the same, so evaluating the time for one partition and multiplying by the number of partitions gives an estimate of the overall time.
2. Evaluating the data transformation time.
Data transformation is usually done in memory and is generally very fast, taking a small proportion of the total time. If this part must be evaluated, the simplest method is to evaluate the extraction time and load time first, then run the entire process and subtract the extraction and load times from the overall time.
3. Evaluating the loading time.
Many factors can affect load time, the two most important of which are indexes and logging.
The load time, like the extraction time, is evaluated by selecting a sample of the data to load, for example 1/200 of it, and multiplying the measured time by 200 to obtain the overall load time.
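The sample-and-scale arithmetic above is trivial but worth writing down explicitly; the measured sample time here is a made-up figure for illustration.

```python
# Analogy-based estimate: load a 1/200 sample, then scale the measured
# time up by the sampling factor. The 95-second figure is illustrative.
scale_factor = 200            # we loaded 1/200 of the data
sample_load_seconds = 95.0    # measured time for the sample load

estimated_total_seconds = sample_load_seconds * scale_factor
estimated_hours = estimated_total_seconds / 3600
print(f"estimated full load: {estimated_total_seconds:.0f}s "
      f"(about {estimated_hours:.1f} hours)")
```

The same calculation applies to the partition-based extraction estimate: measure one partition, multiply by the number of partitions.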
In short, it is difficult to evaluate the data loading time of a large ETL job, and the method used is mainly evaluation by analogy: select part of the data, evaluate it in a reduced run, and scale up to the overall time. When making such an evaluation, note that differences between the configurations of the test environment and the production environment can bias the results. Although a time estimated this way will have errors, it can serve as a reference for the overall load time.