Reprinted ETL architect interview questions
1. What is a logical data mapping and what does it mean to the ETL team?
What is logical data mapping? What role does it play on the ETL project team?
A:
A logical data map describes the data definitions of the source system, the model of the target data warehouse, and the operations and processing rules needed to convert source-system data into the data warehouse. The following information is usually kept in a table or spreadsheet:
Target Table Name:
Target column Name:
Target table Type: Indicates fact table, dimension table, or bracket dimension table.
SCD type: For dimension tables.
Source database name: Instance name or connection string of the source database.
Source Table Name:
Source column Name:
Transformation: Operations that need to be performed on the source data, such as sum(amount).
The logical data mapping should run through the entire data migration project and describes the ETL strategy for the migration. Completing the logical data mapping before the physical data mapping is important to the ETL project team, and the mapping itself serves as metadata. It is best to choose a data migration tool that can generate the logical data mapping for the project.
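For illustration only, such a logical data map could itself be kept as a metadata table along these lines (the table name, column names, and sizes are arbitrary, not a prescribed design):

create table logical_data_map (
    target_table_name   varchar(64),
    target_column_name  varchar(64),
    target_table_type   varchar(32),   -- fact, dimension, or bracket dimension
    scd_type            char(1),       -- slowly changing dimension type, for dimension tables
    source_database     varchar(128),  -- instance name or connection string
    source_table_name   varchar(64),
    source_column_name  varchar(64),
    transformation      varchar(400)   -- e.g. 'sum(amount)'
);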
2. What are the primary goals of the data discovery phase of the Data Warehouse project?
What is the main purpose of the data exploration phase in the data warehouse project?
A:
Before building the logical data mapping, you must first analyze all the source systems. This analysis usually involves two phases: the data discovery phase and anomalous-data detection.
The data discovery phase includes the following:
1. Collect documents and data dictionaries of all source systems.
2. Collect usage information about each source system, such as who uses it, how many people use it each day, and how much storage space it occupies.
3. Determine the system-of-record, that is, the original source of each piece of data.
4. Analyze the data relationships in the source systems through data profiling.
The main purpose of the data discovery phase is to understand the state of the source systems and lay a solid foundation for subsequent data modeling and the logical data mapping.
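As a sketch of the data profiling mentioned above (with invented table and column names), even simple SQL reveals a great deal about a source table:

-- row count, null rate, and distinct values of a key column
select count(*)                    as row_cnt,
       count(customer_id)          as non_null_cnt,
       count(distinct customer_id) as distinct_cnt
from   src_orders;

-- relationship check: orders whose customer does not exist in the customer table
select count(*)
from   src_orders o
where  not exists (select 1 from src_customer c
                   where  c.customer_id = o.customer_id);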
3. How is the system-of-record determined?
How is the system-of-record (the original source of the data) determined?
A:
The key to this question is understanding what the system-of-record is. Like many other concepts in the data warehouse field, it is defined differently by different people. In Kimball's methodology, the system-of-record is the place where data is originally generated, that is, the original source of the data. In large enterprises data is stored redundantly in many places, and as it is moved it may be modified and cleaned, so downstream copies differ from the original source.
The system-of-record plays an important role in building a data warehouse, especially for generating conformed dimensions. The farther downstream of the original source we build the data warehouse, the greater the risk of receiving dirty data.
Architecture
4. What are the four basic data flow steps of an ETL process?
What are the four basic steps of the ETL process?
A:
In Kimball's data warehouse construction method, the ETL process differs from the traditional implementation and is divided into four stages: extract, clean, conform (consistency processing), and deliver, abbreviated ECCD.
1. The main tasks in the extraction phase are:
Read the data model of the source system.
Connect to and access the data of the source system.
Change Data Capture.
Extract data to the data preparation area.
2. The main tasks in the cleaning phase are:
Clean and add column attributes.
Clean and add data structures.
Clean and add data rules.
Add complex business rules.
Create a metadata foundation that describes data quality.
Save the cleaned data to the data preparation area.
3. The main tasks in the consistency processing phase are:
Conform the business labels, that is, the descriptive attributes in dimension tables.
Conform the business measures and performance indicators, which are usually the facts in fact tables.
Remove duplicate data.
Internationalization processing.
Save the data after consistency processing to the data preparation area.
4. The main tasks in the delivery phase are:
Load the dimension tables of star schemas and of snowflaked schemas.
Generate the date dimension.
Load degenerate dimensions.
Load subdimensions.
Load type 1, 2, and 3 slowly changing dimensions.
Handle late-arriving dimensions and late-arriving facts.
Load multi-valued dimensions.
Load dimensions with complex hierarchies.
Load textual facts into dimension tables.
Process the surrogate keys of fact tables.
Load the three fundamental grains of fact table data.
Load and update aggregates.
Load the processed data to the data warehouse.
From the task list, we can see that the ETL process is closely integrated with the data warehouse modeling process. In other words, the ETL system should be designed at the same time as the target table. Generally, the Data Warehouse architect and the ETL system designer are the same person.
5. What are the permissible data structures for the data staging area? Briefly describe the pros and cons of each.
What data structures are allowed in the data preparation area? What are their advantages and disadvantages?
A:
1. Flat files (fixed-format text files).
A flat file is a text file stored on the file system. It holds data in rows and columns much like a database table. This format is often used for data exchange but is not well suited for keeping data long term.
2. XML data sets.
Mostly used for data exchange; not well suited for storing data.
3. Relational database tables.
Ideal for storing data.
4. Independent database tables.
An independent database table is one that has no foreign key constraints with other tables. Such tables are mostly used for data processing.
5. Third normal form (3NF) or relational models.
6. Non-relational data sources.
Non-relational data sources generally include COBOL copybooks, VSAM files, flat files, and spreadsheets.
7. Dimensional models.
8. Atomic fact tables and aggregate fact tables.
9. Surrogate key lookup tables.
6. When should data be set to disk for safekeeping during the ETL?
Briefly describe at which steps of the ETL process data should be written to disk for safekeeping.
A:
Staging means writing data to disk. For safety, and so that the ETL can be restarted conveniently, the data produced by each step should be written to disk in the data preparation area (staging area), either as text files or as relational tables, rather than running the whole ETL through without ever materializing the data.
For example, in the extraction stage we have to connect to the source system. To minimize the impact on the source system, the extracted data should be saved as text files or put into tables in the data preparation area; then, if a later ETL step fails, we can restart from these staged files without touching the source system again.
Extract
7. Describe techniques for extracting from heterogeneous data sources.
Briefly describe the data extraction technology in heterogeneous data sources.
A: In a data warehouse project, the data to be extracted often comes from different data sources whose logical and physical structures may differ; these are called heterogeneous data sources.
When integrating and extracting from heterogeneous data sources, we need to identify all the source systems, analyze them, define data matching logic, establish filtering rules, and generate conformed dimensions.
Because the source data sits on different operating system and database platforms, the extraction method must be chosen according to the actual situation; common approaches include ODBC connections, interface files, and database links (dblink).
8. What is the best approach for handling ERP source data?
What is the best way to extract data from the ERP source system?
A: ERP systems arose to solve the integration of heterogeneous data within the enterprise, which is also the main problem a data warehouse faces. The ERP approach is to build the enterprise's applications (sales, accounting, human resources, inventory, products, and so on) on the same platform and application framework, so that data is handled consistently at the operational application layer; a data warehouse instead applies consistency rules on top of the operational layer. Popular ERP systems include SAP, PeopleSoft, Oracle, Baan, and J.D. Edwards.
If the enterprise has only one ERP system, its data will already be consistent, which makes extraction convenient. If there are other systems besides the ERP, extraction becomes more complicated. Because the data models of today's ERP systems are very complex, with enormous numbers of tables that are hard to understand, building change data capture and extraction directly against the ERP system is very complicated. The best approach is to buy an ETL tool that provides ERP extraction adapters and leave the ERP complexity to the ETL vendor.
9. Explain the pros and cons of communicating with databases natively versus ODBC.
Briefly describe the advantages and disadvantages of directly connecting to the database and connecting to the database using ODBC for communication.
A: There are two methods to connect to a database: direct connection and ODBC connection.
The native approach connects to the database directly, for example from COBOL, PL/SQL, or Transact-SQL. Its advantages are high performance and access to DBMS-specific features; its disadvantage is poor portability.
ODBC is a standard set of interfaces through which applications access databases. Its advantage is flexibility: by swapping the driver and connection settings, different databases can be used. Its disadvantage is lower performance; with ODBC, at least two extra layers sit between the ETL program and the database, the ODBC manager and the ODBC driver. In addition, ODBC cannot use some DBMS-specific features.
10. Describe three Change Data Capture (CDC) practices and the pros and cons of each.
Briefly describe three change data capture (CDC) techniques and the advantages and disadvantages of each.
A:
Change data capture (CDC) is one of the focal points and difficulties of ETL work; it is usually needed for incremental extraction. When capturing changed data, it is best to enlist the source system's DBA; if that is not possible, the ETL project team must detect the changes itself. Some common techniques follow.
1. Use Audit Columns
Audit columns are fields in the table that carry information such as "created date", "modified date", and "modified by". The application updates these fields when it changes rows, or triggers are created to keep them current. The advantage of this approach is that it is convenient and easy to implement; the disadvantage is that if the operational system has no audit columns, the data structures of the existing operational system must be changed so that every table involved in the capture has them. (A small SQL sketch follows after this list.)
2. Database logs
Log-based capture obtains changed data from the logging system the DBMS provides. Its advantage is that it has the least impact on the database and on the operational system that uses it; its disadvantages are that it requires DBMS support and that the log record format must be well understood.
3. Full table scans
Scanning and comparing the full table (or full exported files of the table) can also capture changed data, and it is especially useful for detecting deletes. The advantages are that the approach is straightforward and widely applicable; the disadvantage is low efficiency.
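As an illustration of the audit-column technique, here is a minimal sketch assuming the source table carries a last_update_date column and the ETL keeps its high-water mark in a hypothetical etl_batch_control table (both names are invented):

-- extract only the rows changed since the last successful run
select *
from   src_orders
where  last_update_date > (select last_extract_date
                           from   etl_batch_control
                           where  table_name = 'src_orders');

-- after a successful load, advance the high-water mark
update etl_batch_control
set    last_extract_date = current_timestamp
where  table_name = 'src_orders';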
Data quality
11. What are the four broad categories of data quality checks? Provide an implementation technique for each.
What are the four categories of data quality checks? Provide an implementation technique for each.
A: Data Quality Check is an important step in ETL work. It focuses on four aspects.
1. Correctness (correct)
Check whether data values and their descriptions reflect the real-world things they represent. For example, whether an address description is complete.
2. Unambiguousness (unambiguous)
Check whether a data value and its description have only one meaning, or only one interpretation. For example, two counties with the same name must be distinguishable.
3. Consistency (consistent)
Check whether data values and their descriptions use a single agreed representation. For example, RMB is always coded as 'CNY'.
4. Completeness (complete)
Completeness is checked in two ways. One is whether the value and description of a field are complete, for example whether there are null values. The other is whether the total number of records is complete, for example whether any records were missed because some condition was overlooked.
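Each of the four categories above can often be implemented with simple SQL checks; the following sketch uses invented staging table and column names purely for illustration:

-- correctness: addresses that are missing or obviously incomplete
select count(*) from customer_stage where address is null or length(address) < 5;

-- unambiguousness: place names that resolve to more than one region code
select city_name
from   customer_stage
group  by city_name
having count(distinct region_code) > 1;

-- consistency: currency codes outside the agreed domain
select distinct currency_code
from   order_stage
where  currency_code not in ('CNY', 'USD', 'EUR');

-- completeness: keys that should never be null
select count(*) from order_stage where customer_id is null;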
12. At which stage of the ETL should data be profiled?
Briefly describe at which stage of the ETL data profiling should be done.
A: Data profiling analyzes the content of the source data. It should be completed as early as possible after the project starts, because it has a great influence on design and implementation. Data profiling should begin immediately after requirements gathering.
Profiling not only gives a quantitative picture of the source system's data; it also lays the groundwork for the error event fact table and the audit dimension table that must be built in the ETL system, by supplying the data they need.
13. What are the essential deliverables of the data quality portion of ETL?
What are the core deliverables of data quality in ETL projects?
A: The core deliverables of the data quality section in the ETL project are as follows:
1. Data profiling results
The data profiling result is an analysis of the state of the source system's data, including the number of tables in the source system, the number of fields in each table, and whether the foreign key relationships between tables actually hold; together these reflect the data quality of the source system. The results are used to shape the design and implementation of the data migration and to supply the data needed by the error event fact table and the audit dimension table.
2. Error Event fact table
The error event fact table and a set of related dimension tables are a major deliverable of data quality checking. Its grain is one error found by one data quality check. Related dimensions include the date dimension, a migration (batch) information dimension, and an error event information dimension that records the error type, the source system, the tables involved, and the SQL used for the check. The error event fact table is not exposed to front-end users.
3. Audit dimension table
The audit dimension table is a dimension table that gives end users information about data quality. It describes the source and the quality of the data in the fact tables the user works with.
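As a rough sketch only (these structures are illustrative, not a prescribed design), the two tables might look like this:

create table error_event_fact (
    error_date_key  integer,      -- foreign key to the date dimension
    batch_key       integer,      -- foreign key to the migration/batch dimension
    check_key       integer,      -- foreign key to the error event (check) dimension
    table_name      varchar(64),  -- table in which the error was found
    record_id       varchar(64),  -- identifier of the offending record
    severity        integer
);

create table audit_dim (
    audit_key          integer primary key,
    data_source        varchar(64),
    quality_indicator  varchar(32),  -- e.g. 'complete', 'estimated', 'suspect'
    etl_version        varchar(32),
    load_timestamp     timestamp
);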
14. How can data quality be quantified in the data warehouse?
How can we quantify the data quality in a data warehouse?
A: In data warehouse projects, we usually use anomaly detection to quantify the data quality of the source system. Unless a special data quality survey project team is set up, this work should be completed by the ETL project team. Generally, grouping SQL can be used to check whether the data meets the defined rules of the domain.
For tables with small data volumes, you can directly use SQL statements similar to the following.
select state, count(*) from order_detail group by state
For tables with a large amount of data, sampling is generally used to reduce the data volume before checking for anomalous data, with SQL along the following lines (the divisor controls the sample size; 1000 here is an assumed value):
select a.*
from   employee a,
       (select rownum counter, a.* from employee a) b
where  a.emp_id = b.emp_id
  and  mod(b.counter, trunc((select count(*) from employee) / 1000, 0)) = 0
If you can use a dedicated data profile analysis tool, you can reduce the workload.
Building Mappings
15. What are surrogate keys? Explain how the surrogate key pipeline works.
What is a surrogate key? Briefly describe how the surrogate key pipeline works.
A: When loading a dimension table, a common practice is to assign each dimension record a meaningless integer value and use it as the record's primary key; these integer primary keys are called surrogate keys. Surrogate keys bring many benefits, such as isolating the data warehouse from the operational environment, preserving history, and faster queries.
Surrogate keys are also needed when loading fact tables, to guarantee referential integrity. To make surrogate key substitution efficient, a surrogate key lookup table (surrogate mapping table) is usually kept in the data preparation area; it stores the correspondence between each natural key and its latest surrogate key. When substituting surrogate keys into a fact table, the lookup tables are loaded into memory for speed, and multiple threads can replace the different surrogate keys of the same record in turn, so that a fact record is written to disk only after all of its surrogate keys have been substituted. This substitution process is called the surrogate key pipeline.
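A set-based sketch of the substitution step, assuming a staged fact table that still carries natural keys and lookup tables that map each natural key to its current surrogate key (all names are illustrative); a true pipeline performs the same lookups in memory, often in parallel:

insert into sales_fact (date_key, product_key, customer_key, sales_amount)
select d.date_key,
       p.product_key,        -- surrogate key substituted for the natural product code
       c.customer_key,       -- surrogate key substituted for the natural customer id
       s.sales_amount
from   sales_fact_stage s
join   product_lookup   p on p.product_natural_key  = s.product_code
join   customer_lookup  c on c.customer_natural_key = s.customer_id
join   date_dim         d on d.calendar_date        = s.sale_date;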
16. Why do dates require special treatment during the ETL process?
Why does the date need to be specially processed in the ETL process?
A: In data warehouse projects analysis is the dominant requirement, and analysis by date and time accounts for a large share of it. In operational source systems dates are generally stored as SQL datetime values. If such fields are processed ad hoc with SQL at analysis time, problems arise: efficiency is poor, and different users format dates differently, producing inconsistent reports. Therefore a date dimension table (and, if needed, a time-of-day dimension table) is created during data warehouse modeling, and the commonly used date-related descriptions are stored redundantly in that table.
However, not every date is converted into a foreign key of the date dimension table. The date dimension holds a limited range of dates, and some dates, such as birth dates, may fall before the earliest date it records; such fields can be stored in the data warehouse directly as SQL datetime values. Dates closely tied to the analysis, such as the purchase date, usually are converted into foreign keys of the date dimension table, so that the dimension's uniform descriptive attributes can be used in analysis.
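For example (with invented names), the purchase date is converted to a date dimension key during loading while an out-of-range date such as a birth date is kept as an ordinary datetime column:

select d.date_key   as purchase_date_key,  -- foreign key to the date dimension
       s.birth_date,                       -- kept as a plain datetime column
       s.order_amount
from   order_stage s
left join date_dim d on d.calendar_date = cast(s.purchase_date as date);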
17. Explain the three basic delivery steps for conformed dimensions.
Briefly describe the three basic delivery steps for conformed dimensions.
A: The key to data integration is generating conformed dimensions, through which fact data from different sources can be combined for analysis. Generating a conformed dimension generally takes three steps:
1. Standardizing
The purpose of standardizing is to make the encoding schemes and formats of the different data sources the same, laying the groundwork for the matching step.
2. Matching and deduplication
Matching does two kinds of work. One is matching up the different attributes that different sources hold for the same thing, which makes the data more complete; the other is marking identical data from different sources as duplicates, laying the groundwork for the surviving step (a small sketch of these two steps follows after this list).
3. Surviving
The main purpose of surviving is to select the records and values that become the master data, that is, the conformed dimension data that is finally delivered.
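A small sketch of the standardizing and matching steps, assuming two hypothetical source customer tables that encode gender differently (all names and codes are invented):

-- standardize each source's coding, then keep one row per customer id across sources
select customer_id,
       max(customer_name) as customer_name,
       max(gender_std)    as gender
from (
    select customer_id, customer_name,
           case gender when 'M' then 'Male' when 'F' then 'Female' end as gender_std
    from   crm_customer
    union all
    select customer_id, customer_name,
           case gender when '1' then 'Male' when '2' then 'Female' end as gender_std
    from   erp_customer
) u
group by customer_id;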
18. Name the three fundamental fact grains and describe an ETL approach for each.
Briefly describe the three fundamental fact table grains and how each is handled during ETL.
A: Based on grain, fact tables fall into three types: transaction grain fact tables, periodic snapshot fact tables, and accumulating snapshot fact tables. When designing a fact table, note that a fact table can have only one grain; facts of different grains must not be mixed in the same fact table.
A transaction grain fact table is sourced from the data generated by transaction events, such as sales orders. During ETL, the data is moved directly at its atomic grain.
A periodic snapshot fact table records accumulated business data at regular intervals, such as a daily inventory snapshot. During ETL, the accumulated data is generated at that fixed interval (a small sketch follows below).
An accumulating snapshot fact table records a business process that spans time, from start to finish. During ETL, existing records in the table are progressively updated as the process moves through its steps.
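For instance, a periodic snapshot fact table might be appended to once per day with an aggregation like the following (table and column names are invented for illustration):

insert into inventory_daily_snapshot
    (snapshot_date_key, product_key, warehouse_key, quantity_on_hand)
select d.date_key, i.product_key, i.warehouse_key, sum(i.quantity)
from   inventory_stage i
join   date_dim d on d.calendar_date = current_date
group  by d.date_key, i.product_key, i.warehouse_key;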
19. How are bridge tables delivered to classify groups of dimension records associated to a single fact?
How does a bridge table associate a dimension table with a fact table?
A: The bridge table is a special type of table in dimensional modeling.
When modeling a data warehouse you may encounter a dimension with a hierarchical structure. One way to model it is a parent-child table, in which each record carries a field pointing to its parent record. A parent-child table is especially useful when the hierarchy depth is variable, and it is a compact and effective modeling method; its drawback is that recursive structures are awkward to query with standard SQL.
Unlike the recursive parent-child table, the bridge table represents the hierarchy with redundancy. A bridge table sits between the dimension table and the fact table, and its records enumerate the paths from each node in the hierarchy to every node beneath it. The table structure is roughly:
Parent key
Child key
Number of levels from the parent
Level name
Bottom flag
Top flag
In the bridge table a node has an association record with every node beneath it, so the parent-child relationship is no longer limited to adjacent levels; for example, a first-level node also has a record relating it to a third-level node, and the levels-from-parent column distinguishes how far apart they are. In this way the hierarchy can be queried through these precomputed relationships without recursion.
Of course, the bridge table is not a complete solution; it only makes certain kinds of queries easier, as the sketch below illustrates.
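A minimal sketch of such a bridge table and one query that uses it (the organization names and key values are illustrative; the columns follow the structure listed above):

create table org_hierarchy_bridge (
    parent_org_key      integer,      -- surrogate key of the ancestor organization
    child_org_key       integer,      -- surrogate key of the descendant (or the node itself)
    levels_from_parent  integer,      -- 0 for the node itself, 1 for a direct child, and so on
    level_name          varchar(64),
    bottom_flag         char(1),
    top_flag            char(1)
);

-- total sales for an organization and everything beneath it, without recursion
select sum(f.sales_amount)
from   sales_fact f
join   org_hierarchy_bridge b on b.child_org_key = f.org_key
where  b.parent_org_key = 42;         -- the ancestor node of interest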
20. How does late arriving data affect dimensions and facts? Share techniques for handling each.
What is the impact of late data on fact tables and dimension tables? How can we solve this problem?
A: There are two types of late data: fact table data for late arrival and dimension table data for late arrival.
Late-arriving fact records can be inserted into the corresponding fact table, but some extra processing is needed. First, for dimensions handled as SCD type 2, we must check which version of the dimension record was in effect on the date the fact actually occurred and associate the fact with that historical dimension record. Second, after inserting the fact record, any aggregate fact tables and consolidated fact tables derived from it must be updated accordingly.
Handling late-arriving dimension records is more complicated. First, if the late dimension record is entirely new to the data warehouse, a record must be created in the dimension table and the foreign keys of the fact records that should reference it must be updated. Second, if the late dimension record is a change to an existing dimension, a new record must be created in the dimension table, and the fact rows that fall between this change and the next change must have their dimension foreign keys updated to the new surrogate key.
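A sketch of the second case, after a late type 2 change has been inserted as a new dimension row; the surrogate key, natural key, and dates below are placeholders for illustration only:

-- repoint facts whose dates fall between this late change and the next change
update sales_fact
set    customer_key = 1234                       -- surrogate key of the newly inserted row
where  customer_key in (select customer_key
                        from   customer_dim
                        where  customer_natural_key = 'C-0042')
  and  sale_date >= date '2024-03-01'            -- effective date of the late change
  and  sale_date <  date '2024-03-15';           -- effective date of the next change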
Metadata
21. Describe the different types of ETL metadata and provide examples of each.
Briefly describe the different types of ETL metadata and give an example of each.
A: metadata is a very important topic for the ETL project team and is also an important part of the entire data warehouse project. There is no definite definition of the classification and use of metadata.
Generally, metadata can be divided into three types: business metadata, technical metadata, and process execution metadata.
Business metadata is a description of data from the business perspective. It is usually used to analyze and use data for report tools and front-end users.
Technical metadata is a description of data from a technical perspective. It usually includes some attributes of data, such as the data type, length, or some results after data profile analysis.
Process execution metadata consists of statistics gathered during ETL runs, such as how many records were loaded and how many were rejected.
22. Share acceptable mechanisms for capturing operational metadata.
Briefly describe how to obtain operational metadata.
A: Operational metadata records data migration during the ETL process, such as the last migration date and number of loaded records. This metadata is very important when ETL loading fails.
When an ETL tool is used, operational metadata such as job schedule times, job execution order, and failure handling can be defined in and generated by the tool itself, and data such as the last load date can be kept in tables created for that purpose.
If the ETL is hand-coded, handling operational metadata is more troublesome: it must be captured and stored by the program itself, and the approach varies with the programming method.
23. Offer techniques for sharing business and technical metadata.
Briefly describe how to share business metadata and technical metadata.
A: To share the various kinds of metadata, metadata standards must be established during data warehouse construction and followed during development. These standards cover metadata naming, storage, and sharing rules. For more on metadata standards, see the Common Warehouse Metamodel (CWM).
At the most basic level, enterprises should set standards in the following three aspects.
1. Naming rules
Naming rules should be formulated before the ETL group starts encoding, including database objects such as tables, columns, constraints, indexes, and Other encoding rules. If an enterprise has its own naming rules, the ETL group should comply with the naming rules of the enterprise. When the enterprise naming rules cannot fully meet the requirements, the ETL group can formulate supplementary rules or new rules. Changes to enterprise naming rules must be documented in detail and submitted to relevant departments for review.
2. Architecture
Before the ETL group starts work, the architecture should be designed: for example, whether the ETL engine runs on the same server as the data warehouse or on a separate server; whether the data preparation area is temporary or persistent; whether the data warehouse is based on dimensional modeling or 3NF modeling. These decisions should be documented in detail.
3. Infrastructure
The basic infrastructure of the system should be determined first, for example whether the solution runs on Windows or UNIX. This enterprise infrastructure metadata should be settled before the ETL group starts work, and it too should be documented in detail.
In ETL development, metadata standards are well developed and can be well observed, so that the metadata of a data warehouse can be well shared.
Optimization/operations
24. State the primary types of tables found in a data warehouse and the order in which they must be loaded to enforce referential integrity.
Briefly describe the basic types of tables in a data warehouse and the order in which they should be loaded to preserve referential integrity.
A: The basic types of tables in a data warehouse are dimension tables, fact tables, subdimension tables, and bridge tables. A subdimension table is a snowflaked table produced by the bracket (outrigger) dimension technique; a bridge table is used to handle multi-valued dimensions or hierarchies.
The tables to be loaded in the data warehouse are dependent on each other. Therefore, the tables to be loaded must be loaded in a certain order. Below are some basic Loading Principles:
After the sub-dimension table is successfully loaded, the dimension table is loaded.
After the dimension table is successfully loaded, load the bridge table.
The fact table is loaded after the sub-dimension table, dimension table, and bridge table are loaded successfully.
The loading sequence can be determined by the relationship between the primary and Foreign keys.
(Note: this answer describes the load order for a data warehouse built on the bus architecture.)
25. What are the characteristics of the four levels of the ETL support model?
Briefly describe the features of ETL technical support at four levels.
A: After the data warehouse is launched, the ETL group must provide technical support to ensure the normal operation of ETL. Generally, this technical support is divided into four levels.
1. Level 1 technical support personnel are usually telephone support personnel and belong to the help desk type. If an error occurs during data migration or the user finds a problem with the data, the problem is reported to the first level of technical support by phone. Level 1 support personnel try their best to solve the problems found through solutions provided by the ETL project team and prevent the problem from being upgraded.
2. Level 2 technical support is usually the system administrator and DBA. If the problem cannot be solved at level 1, the problem is reported to level 2. Level 2 personnel are usually technically strong, and problems in both hardware infrastructure and software architecture can be solved.
3. Level 3 technical support is usually the head of the ETL project. If the problem cannot be solved at Level 2, the problem is reported to Level 3. The ETL project owner should have sufficient knowledge to solve most problems in the production environment. The ETL project owner can communicate with developers or external product providers when necessary to find a solution to the problem.
4. The fourth level of technical support is usually the actual ETL developer. If the problem cannot be solved at level 3, the problem is reported to level 4. ETL developers can track and analyze codes and find solutions to problems. If the problem occurs in the application of the product supplier, the supplier must provide technical support.
In a smaller data warehouse environment, Level 3 and Level 4 can be combined. After merging, the requirements for Level 2 will be higher. We do not recommend that you contact the ETL developer every time a problem occurs. Level 1 technical support personnel should not only provide telephone support services, but should do their best to solve the problem before reporting the problem to the next level.
26. What steps do you take to determine the bottleneck of a slow running ETL process?
If the ETL process runs slowly, what steps do you take to locate the ETL system's bottleneck?
A: It is common for an ETL system to run into performance problems and slow down. What we need to do is narrow down the system's bottleneck step by step.
First, determine whether the bottleneck is caused by CPU, memory, I/O, and network, or by the ETL processing process.
If the environment shows no bottleneck, the ETL code itself must be analyzed by elimination: isolate the different operations and test them one at a time. For hand-coded ETL this isolation is harder and has to follow the way the code is structured; ETL tools separate the different processing functions, which makes isolation easier.
It is best to start with the extraction operation, then analyze the transformation steps (calculations, lookups, aggregation, filtering, and so on) in turn, and finally analyze the load operation.
In actual processing, you can follow the seven steps below to find the bottleneck.
1. Isolate and execute the extraction query statement.
Isolate the extraction, remove the transformation and delivery, and extract the data directly to a file. If this step is slow, the extraction SQL is the problem; in practice, unoptimized SQL is the most common cause of poor ETL performance. If this step performs well, go to step 2.
2. Remove filtering conditions.
This applies when the ETL extracts everything and then filters inside the ETL process. Such filtering can become a bottleneck; remove the filter temporarily, and if it proves to be the cause, consider filtering during extraction instead.
3. Rule out lookup-table problems.
Reference data is usually loaded into memory during ETL to translate codes into names; these are the lookup tables. A lookup table that is too large can become a bottleneck. Isolate the lookup tables one by one to see whether any of them is the problem, and keep the data in each lookup table to a minimum, typically just the natural key and the surrogate key, to avoid unnecessary I/O.
4. Analyze sorting and aggregation operations.
Sorting and aggregation are very resource-intensive. Isolate them to determine whether they cause the performance problem; if so, consider taking the sorting and aggregation out of the database or ETL tool and doing them at the operating system level.
5. Isolate and analyze each computing and conversion process.
Sometimes the individual calculations and transformations also hurt ETL performance. Remove them step by step to find the culprit, paying attention to operations such as default value handling and data type conversion.
6. Isolate update policies.
Updates perform very poorly when the data volume is large. Isolate this part to see whether it is the problem; if mass updates are the cause, separate the insert, update, and delete operations.
7. Check the database I/O for data loading.
If none of the previous steps reveals the problem, check the load performance of the target database. Replace the database target with a file: if performance improves greatly, examine the load into the target database carefully, for example whether constraints are disabled, whether all indexes are disabled, and whether a bulk loading utility is used. If performance does not improve, consider a parallel loading strategy.
27. Describe how to estimate the load time of a large ETL job.
Briefly describe how to estimate the load time of a large ETL job.
A: Estimating the load time of a large ETL job is quite complicated. There are two kinds of loading: the initial (first) load and incremental loads.
When a data warehouse goes into production it must be loaded for the first time, and the time the initial load takes is generally hard to predict. In day-to-day operation the data warehouse is loaded incrementally, usually every day, and the incremental volume is much smaller than the initial load.
Next we will take the initial loading as an example to talk about how to evaluate the data loading time of large-scale ETL.
To estimate the loading time of the first load, You need to divide the entire ETL process into three parts: extraction, conversion, and loading. The three parts are evaluated respectively.
1. Evaluate the extraction time.
Extraction usually takes the largest share of ETL time, and its duration is very hard to estimate. To estimate it, the query time can be split into two parts: the query response time and the data return time. The query response time is the time from executing the query until results start to come back; the data return time is the time from the first record returned to the last.
In addition, because the initial load volume is so large, we can select a subset of the data to estimate the overall time. In practice, we can pick one partition of the fact table; since partitions usually hold similar amounts of data, the time measured for one partition multiplied by the number of partitions serves as the overall estimate.
2. Evaluate the data conversion time
Data conversion is usually completed in the memory, generally with a very fast speed, a small proportion of the overall time. To evaluate the time required, the simplest evaluation method is to first evaluate the extraction time and loading time, then run the entire process, and subtract the extraction time and loading time from the overall time.
3. Evaluate the loading time
Loading time may be affected for many reasons. The two most important factors are indexes and logs.
The evaluation of the loading time can also select a part of the data to be loaded, such as 1/200, just like when the extraction time is evaluated. After calculating the time, multiply it by 200 as the overall loading time.
In short, estimating the load time of a large ETL job is very difficult. The main method is estimation by analogy, that is, measuring a subset of the data and extrapolating to the whole. When estimating, note that configuration differences between the test environment and the production environment can skew the result. The time estimate will contain errors, but it still serves as a useful reference for the overall load time.
Real Time ETL
28. Describe the architecture options for implementing real-time ETL.
Briefly describe the architecture options available when implementing real-time ETL.
A: During data warehouse creation, ETL usually adopts the batch processing method. Generally, the batch is run every night.
As data warehouse technology has matured, enterprises have demanded lower data latency from their warehouses, which has led to what is now commonly called real-time ETL. Real-time ETL is a relatively new part of the data warehouse field.
Several technologies are available when you build a data warehouse with a real-time ETL architecture.
1. microbatch ETL (MB-ETL)
The micro-batch processing method is similar to our common ETL processing method, but the processing interval is short, for example, processing once every hour.
2. Enterprise Application Integration (EAI)
EAI is also called functional integration; it usually relies on middleware to exchange data. ETL, by contrast, is usually called data integration.
For systems with high real-time requirements, you can consider using EAI as an ETL tool to provide fast data interaction. However, when the data volume is large, the efficiency of using EAI tools is relatively poor, and the implementation is relatively complicated.
3. CTF (capture, transform and flow)
CTF is a new type of data integration tool. It uses a direct database connection method to provide data in seconds. The disadvantage of CTF is that it can only perform lightweight data integration. The common processing method is to establish a data preparation area and use the CTF tool to connect the source database to the database in the data preparation area. After the data enters the data preparation area, it is migrated to the Data Warehouse after other processing.
4. EII (Enterprise Information Integration)
EII is another new type of data integration software that provides enterprises with real-time reporting. EII works much like CTF, but it does not move data into a data preparation area or a data warehouse; instead, after extraction and transformation the data is loaded directly into the report.
When building a real-time ETL architecture for a data warehouse, you can choose among, or combine, MB-ETL, EAI, CTF, EII, and ordinary ETL.
29. Explain the different real-time approaches and how they can be applied in different business scenarios.
Several different real-time ETL implementation methods and their applicability are briefly described.
A: Real-time data warehouses are not yet mature and success stories are few. The following are some ways a real-time data warehouse architecture can be implemented.
1. EII only
EII technology is used in place of a real-time data warehouse. Data latency can be held to about one minute, but only low-complexity data integration is supported and historical data cannot be kept.
2. EII + static DW
EII technology combined with a non-real-time data warehouse. Data latency can be held to about one minute; integration of the current day's data is of low complexity, while data older than one day can be integrated with high complexity. Historical data is kept.
3. ETL + static DW
Ordinary ETL processing, with data latency within one day. Highly complex data integration is supported and historical data is kept.
4. CTF + real-time partition + static DW
CTF technology is used to build the real-time data warehouse, with a data latency of about 15 minutes. Data integration of lower complexity is supported; historical data is kept.
5. CTF + MB-ETL + real-time partition + static DW
CTF combined with MB-ETL for data movement. Data latency can be held to about one hour; more complex data integration is supported and historical data is kept.
6. MB-ETL + real-time partition + static DW
MB-ETL alone is used to build the real-time data warehouse. Data latency can be held to about one hour; highly complex data integration is supported and historical data is kept.
7. EAI + real-time partition + static DW
EAI technology is used to build the real-time data warehouse. Data latency can be held to about one minute; highly complex data integration is supported and historical data is kept.
The above lists some real-time data warehouse architecture options. It is not very detailed; it is only meant to offer a starting point for readers to research further.
30. Outline some challenges faced by real-time ETL and describe how to overcome them.
The difficulties and solutions of real-time ETL are briefly described.
A: Introducing real-time ETL brings many new problems and challenges to data warehouse construction. Some are listed below; a few have concrete solutions, while others can only be weighed case by case.
1. Continuous ETL processing puts forward higher requirements on system reliability.
2. The interval between discrete snapshot data becomes shorter.
3. Slowly changing dimensions turn into rapidly changing dimensions.
4. Determining the data refresh frequency of the data warehouse.
5. Deciding whether the goal is only report generation or data integration.
6. Choosing between data integration and application integration.
7. Choosing between a point-to-point approach and a centralized approach.
8. Determine the data refresh method of the front-end display tool.