ETL Interview FAQ

Tags: oracle, cursor, snowflake schema

1. The difference between Source Qualifier and Joiner

The Source Qualifier can join n homogeneous data sources, while the Joiner transformation joins two heterogeneous data sources. The former can only join homogeneous data, and the join is executed in the source database; the latter can also join homogeneous data, but it is mainly used to join heterogeneous data sources, and the join is performed in the Informatica cache. To join n heterogeneous data sources, you need n-1 Joiner transformations.

Heterogeneous data source: in a data warehouse project, the data to be extracted often comes from different data sources whose logical and physical structures may differ; these are called heterogeneous data sources.

2. Source Qualifier and Filter components

The Source Qualifier extracts data from source databases and source files; its filter can be applied only to data in source tables, not to flat files. To improve performance, filter out as much data as possible in the Source Qualifier.
The Filter transformation filters data that has already been read by Informatica; for flat files, data can be filtered only with the Filter transformation.

3. The two modes of the Lookup component

A lookup can be cached or uncached; the default is cached. A cached lookup first reads the lookup records into memory. If the lookup table has more than about 1 million records, a cached lookup is not recommended. Cache size estimate: the number of lookup rows multiplied by the bytes per row. For example, 1,000,000 rows at an illustrative 200 bytes each would require roughly 200 MB of cache.

4. Normalizer component for column-to-row conversion
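The Normalizer is an Informatica transformation, but the same column-to-row conversion can be illustrated in plain Oracle SQL with UNPIVOT; this is only an analogy, and the quarterly_sales table and its columns are hypothetical:

-- Hypothetical source table: one row per product, one column per quarter.
-- UNPIVOT turns the four quarter columns into four rows per product.
SELECT product_id, quarter, sales_amt
FROM   quarterly_sales
UNPIVOT (sales_amt FOR quarter IN (q1 AS 'Q1', q2 AS 'Q2', q3 AS 'Q3', q4 AS 'Q4'));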

5. How to handle multiple files: FileList

If it is a batch of files with the same structure, you can use the FileList feature.

Create a list.txt in the source folder, write the file names into it, and then in the Workflow set the Source filetype to Indirect.

 

First build the list with a command:
ls *filename.txt > file_name_list.txt
Then use this list file as your source, and remember that the source filetype is Indirect rather than Direct.

1. Incremental extraction:

1. Using Audit Columns

An audit column is a field in a table that records information such as the creation date, modification date, and modifier. When an application changes the data in that table, it updates these fields, or triggers are created to update them. The advantage of capturing changed data this way is that it is convenient and easy to implement. The disadvantage is that if the operational system lacks the corresponding audit fields, the existing data structures must be changed to ensure that every table involved in the extraction has audit fields.
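A minimal sketch of maintaining an audit column with a trigger, assuming a hypothetical orders table that has a last_modified column:

-- Keep last_modified current on every insert or update (orders is hypothetical).
CREATE OR REPLACE TRIGGER trg_orders_audit
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
  :NEW.last_modified := SYSDATE;
END;
/

The incremental extract can then select only the rows whose last_modified is later than the previous run's high-water mark.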

2. Database Log

DBMS log capture reads changed data from the logging facility provided by the DBMS. The advantage is that it has minimal impact on the database and on the operational systems that access it. The disadvantage is that it requires DBMS support and a good understanding of the log format.
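On Oracle, this approach can be sketched with the LogMiner package (run here from SQL*Plus); the redo log path is a placeholder, and the exact setup depends on the database configuration:

-- Register a redo log file and start LogMiner, using the online catalog as the dictionary.
EXECUTE DBMS_LOGMNR.ADD_LOGFILE(LOGFILENAME => '/path/to/redo01.log', OPTIONS => DBMS_LOGMNR.NEW);
EXECUTE DBMS_LOGMNR.START_LOGMNR(OPTIONS => DBMS_LOGMNR.DICT_FROM_ONLINE_CATALOG);

-- Read the captured changes for a table of interest (ORDERS is hypothetical).
SELECT sql_redo FROM v$logmnr_contents WHERE table_name = 'ORDERS';

EXECUTE DBMS_LOGMNR.END_LOGMNR;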

3. Full Table Scan

Changes can also be captured by scanning and comparing a full table or a full table export file. The advantage of this method is that the idea is clear and it is widely applicable; the disadvantage is poor efficiency.
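A minimal sketch of the comparison, assuming hypothetical tables customers and customers_prev that hold the current and previous full extracts:

-- Rows that are new or changed since the previous extract.
SELECT * FROM customers
MINUS
SELECT * FROM customers_prev;

-- Rows that were deleted since the previous extract.
SELECT * FROM customers_prev
MINUS
SELECT * FROM customers;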

2. The concepts of fact tables and dimension tables and how to design them

Fact table:

- Each data warehouse contains one or more fact tables. Fact tables may contain business sales data, such as data generated by cash-register transactions, and they typically contain a large number of rows

- A fact table generally holds only numbers or flags to be aggregated (counted), such as revenue, quantity, and expense

Dimension table:

- A dimension table can be viewed as a window through which the user analyzes the data. Dimension tables contain the attributes of the fact records in fact tables; some attributes provide descriptive information, while others specify how to summarize fact table data to provide useful information to the analyst. Dimension tables contain hierarchies of attributes that help summarize the data

Granularity (grain) and hierarchy (level):

- Granularity refers to the level of refinement or aggregation of the data stored in the data warehouse. The higher the degree of refinement, the smaller the grain; conversely, the lower the degree of refinement, the larger the grain. Choosing the granularity is an important prerequisite of data warehouse design

- Hierarchy refers to the levels of detail of the data

Three types of models in dimensional modeling:

- Star model (Star Schema)

- Snowflake model (Snowflake Schema)

- Multidimensional model (Multi-Dimension Schema)

Some factors that affect dimensional modeling:

- Security of the data or its presentation

- Complex queries and analysis

Star model (Star Schema):

- The fact table is surrounded by dimension tables, and the dimensions are not joined to further tables

- The star model is a relatively balanced, compromise modeling approach (Oracle BI Apps uses star modeling); a DDL sketch follows below
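A minimal star-schema sketch in Oracle DDL; the sales fact and its two dimensions are hypothetical:

CREATE TABLE dim_date (date_key NUMBER PRIMARY KEY, calendar_date DATE, month_name VARCHAR2(20), year_num NUMBER);
CREATE TABLE dim_product (product_key NUMBER PRIMARY KEY, product_name VARCHAR2(100), category VARCHAR2(50));

-- The fact table references each dimension directly; no dimension joins to a further table.
CREATE TABLE fact_sales (
  date_key    NUMBER REFERENCES dim_date (date_key),
  product_key NUMBER REFERENCES dim_product (product_key),
  sales_amt   NUMBER,
  quantity    NUMBER
);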

Snowflake model (Snowflake Schema):

The fact table is surrounded by multiple dimension tables, and the dimensions themselves are normalized into one or more further levels of tables

The snowflake model is typically used when dealing with large, relatively static hierarchies.
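In a snowflake, the category attribute above would be normalized out of the product dimension into its own table; a hedged sketch continuing the previous hypothetical example:

CREATE TABLE dim_category (category_key NUMBER PRIMARY KEY, category_name VARCHAR2(50));

-- The product dimension now joins to a further table, forming the snowflake.
CREATE TABLE dim_product_sf (
  product_key  NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  category_key NUMBER REFERENCES dim_category (category_key)
);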

Multidimensional model (Multi-Dimension Schema):

A hierarchical database with a single structure (the cube), equivalent to a multidimensional array. It contains summaries of all the data at every level

Requires support from a specific multidimensional database or multidimensional database engine (e.g., Essbase)

Data storage space problem: when a new dimension is added, the amount of data grows exponentially

Types of dimensions:

- Slowly changing dimension (Slowly Changing Dimension)

- Rapidly changing dimension (Rapidly Changing Dimension)

- Large dimension (Huge Dimension) and mini dimension (Mini-Dimension)

- Degenerate dimension (Degenerate Dimension)

Slowly Changing Dimension (SCD):

The content of most dimensions changes over time to varying degrees. For example:

An employee is promoted

A customer changes name or address

How do we deal with these changes in the dimensions?

Three ways of handling slowly changing dimensions are given below (a Type 2 sketch follows the list):

Type 1: update the original record directly

Type 2: mark each record with the start and end dates of its valid period, adding version control

Type 3: add a field to the record to keep the historical value
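A minimal Type 2 sketch in Oracle SQL, assuming a hypothetical dim_customer table with effective_date, end_date, and current_flag columns and a hypothetical surrogate-key sequence: expire the current row, then insert the new version.

-- Close out the current version of the changed customer.
UPDATE dim_customer
SET    end_date = SYSDATE, current_flag = 'N'
WHERE  customer_id = :cust_id AND current_flag = 'Y';

-- Insert the new version with an open-ended validity period.
-- dim_customer_seq is a hypothetical surrogate-key sequence.
INSERT INTO dim_customer
  (customer_key, customer_id, customer_name, effective_date, end_date, current_flag)
VALUES
  (dim_customer_seq.NEXTVAL, :cust_id, :new_name, SYSDATE, DATE '9999-12-31', 'Y');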

Rapidly Changing Dimension:

When a dimension changes very quickly, we treat it as a rapidly changing dimension (depending on the actual frequency of change), for example:

The price of a product, the price of a property, etc.

Changes to such rapidly changing attributes should be captured in the fact table rather than in the dimension

Large dimension (Huge Dimension):

The most interesting dimensions in a data warehouse are the very large ones, such as customer and product. A large enterprise-customer dimension often has millions of records, each with hundreds of fields. Even larger individual-customer dimensions can run to tens of millions of records; these individual-customer dimensions sometimes have more than 10 fields, but most such very large dimensions have few attributes.

Large dimensions require special processing. Because of the large data volume, many data warehousing operations involving large dimensions can be slow and inefficient. You need to adopt efficient design methods, choose the right indexes, or use other optimization techniques to address the following issues:

Populating a large dimension table with data

Browsing performance of unconstrained dimensions, especially those with fewer attributes

Browsing time for multiply-constrained dimension attribute values

Inefficient fact table queries involving large dimension tables

The additional records required to handle Type 2 changes

Mini dimension (Mini-Dimension):

Extract a few fields from a common large dimension to form a separate small dimension; a query can then constrain on the fields of the mini dimension (see the sketch below)

This design significantly improves query efficiency
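A hedged sketch of splitting low-cardinality demographic fields out of a large customer dimension into a mini dimension; all names are hypothetical:

-- The demographic attributes move to a small, separate dimension.
CREATE TABLE dim_demographics (
  demographics_key NUMBER PRIMARY KEY,
  age_band         VARCHAR2(20),
  income_band      VARCHAR2(20),
  gender           CHAR(1)
);

-- The (hypothetical) fact table carries the mini-dimension key,
-- so queries can constrain on the small table alone.
ALTER TABLE fact_sales ADD (demographics_key NUMBER REFERENCES dim_demographics (demographics_key));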

Types of fact tables:

- Transaction-grain fact table (additive facts)

- Periodic snapshot fact table (semi-additive facts)

- Accumulating snapshot fact table (non-additive facts)

- Factless fact table

Transaction-grain fact table (additive facts):

Represents a measurement at a specific point in time and space. Across fact tables of the same grain, you can directly apply SUM, COUNT, and other aggregations to the fact fields

When designing fact tables, note that a fact table can have only one grain; facts of different grains must not be put into the same fact table.

The source of a transaction-grain fact table is the data generated along with a transaction event, such as a sales order. During the ETL process, the atomic-grain data is migrated directly.

Periodic snapshot fact table (semi-additive facts):

A periodic snapshot fact table represents a time period, or a regularly repeating interval. Such tables are ideal for tracking long-running processes, such as bank accounts and other forms of financial statements. The most commonly used financial periodic snapshot fact tables have a grain of one month. The data in a periodic snapshot fact table must conform to that grain (that is, each row must measure activity over the same time period). For a good periodic snapshot fact table, the more facts it carries at its grain, the better.

During the ETL process, cumulative data is generated at a fixed time interval.

Accumulating snapshot fact table (non-additive facts):

The accumulating snapshot fact table describes processes with a definite beginning and end, such as contract fulfillment, policy underwriting, and common workflows. Accumulating snapshots are not suitable for long-running continuous processes, such as tracking bank accounts or describing continuous manufacturing processes such as papermaking.

The grain of an accumulating snapshot fact table is the complete history of an entity from its creation to its current state.

During the ETL process, the records in the table are progressively updated as the business process moves through its steps.

Factless fact table:

The grain of a factless fact table is one row per event measurement. It is used to describe data or events: an event can occur, yet have no specific numeric measure.
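A hedged sketch of a factless fact table that records event occurrences, assuming hypothetical attendance dimension keys:

-- One row per attendance event; there is no numeric measure to store.
CREATE TABLE fact_attendance (
  date_key    NUMBER,
  student_key NUMBER,
  class_key   NUMBER
);

-- Questions are answered by counting events, e.g. attendance per class.
SELECT class_key, COUNT(*) AS attendance_count
FROM   fact_attendance
GROUP  BY class_key;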

The general process of dimensional modeling:

1. Determine the grain of each fact table

2. Determine the attributes of the dimensions

3. Determine the levels of the dimensions

4. Determine the dimensions each fact table needs to be associated with

5. Determine the facts, including pre-calculated facts

6. Determine the slowly changing dimensions

Determine the grain of each fact table

Determine the level of detail of the granular data

This must be considered before modeling begins

Typical grains are an individual transaction, a time-based snapshot, or transactions aggregated along common dimensions

Determine the attributes of the dimensions

Decide whether you need to store both codes and descriptions, only codes, or only descriptive information

Determine which fields' values need to be filtered out and which must be present

Determine the levels of a dimension

For the time dimension, determine the different levels: year, quarter, month, week, day, and so on

For the product dimension, determine the different levels: product category, product subcategory, product, and so on

Note that, in sales for example, the sales geography hierarchy may differ from the actual geographic hierarchy

Determine the dimensions each fact table needs to be associated with

The usual dimensions include common objects such as time, product, policyholder, agent, and geography

Note that each dimension you create must be consistent with the grain of the fact table it connects to

Determine the facts and measures (including pre-calculated facts)

Facts and measures must be determined according to the specific business

Each aggregated fact must be computed during the ETL process

Determine the slowly changing dimensions

Handle each slowly changing dimension according to the requirements

For example:

For a customer dimension that needs no history, use Type 1

For a product dimension that must retain history, use Type 2

Type 3 changes rarely appear in the normal flow of data; instead, they are usually one-off decisions carried out manually by the ETL team

 

3. Partition Tables

Range partition table; list partition table; hash partition table; composite partition table

Range partition table: typically based on dates or numeric values. Range partitioning maps rows to partitions by ranges of the partitioning column's values; a row whose value falls outside all defined ranges cannot be stored. The partitioning key can consist of multiple columns.

PARTITION BY RANGE (sales_date)  -- creates date-based range partitions stored in different tablespaces
(
  PARTITION sal_jan2000 VALUES LESS THAN (TO_DATE('02/01/2000', 'DD/MM/YYYY')) TABLESPACE sal_range_jan2000,
  PARTITION sal_feb2000 VALUES LESS THAN (TO_DATE('03/01/2000', 'DD/MM/YYYY')) TABLESPACE sal_range_feb2000,
  PARTITION sal_mar2000 VALUES LESS THAN (TO_DATE('04/01/2000', 'DD/MM/YYYY')) TABLESPACE sal_range_mar2000,
  PARTITION sal_apr2000 VALUES LESS THAN (TO_DATE('05/01/2000', 'DD/MM/YYYY')) TABLESPACE sal_range_apr2000
);

PARTITION BY specifies the partitioning method

RANGE indicates that partitions are divided by ranges of values

PARTITION pN specifies the name of each partition

VALUES LESS THAN specifies the (exclusive) upper bound of the partition

To add a partition:

ALTER TABLE r
  ADD PARTITION p5 VALUES LESS THAN (xxx) TABLESPACE xx;

List partition table: the advantage of list partitioning is that it groups unordered and unrelated sets of data values in a natural way. However, multi-column partitioning keys are not supported: when you partition a table by list, the partitioning key can be only a single column of the table.
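A hedged example of list partitioning, using a hypothetical sales table partitioned by region:

CREATE TABLE sales_by_region (sale_id NUMBER, region VARCHAR2(20), amount NUMBER)
PARTITION BY LIST (region)
(
  PARTITION p_east  VALUES ('EAST', 'NORTHEAST'),
  PARTITION p_west  VALUES ('WEST', 'SOUTHWEST'),
  PARTITION p_other VALUES (DEFAULT)   -- catch-all partition for any other region value
);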

Hash partition table: a hash partition makes it easy to distribute data, because the syntax is simple and easy to implement.

Hash partitioning does not allow you to control how the data is distributed, because the system uses a hash function to divide it. The partitioning key can consist of multiple columns.

Splitting, dropping, and merging cannot be applied to hash partitions, but hash partitions can be coalesced and added.

There are two ways to create a hash partition table: one specifies the number of partitions, the other specifies the partition names, but the two cannot be used at the same time.

Method one: specify the number of partitions

CREATE TABLE dept2 (deptno NUMBER, deptname VARCHAR2(32))
PARTITION BY HASH (deptno) PARTITIONS 4;

Method two: specify the partition names

CREATE TABLE dept3 (deptno NUMBER, deptname VARCHAR2(32))
PARTITION BY HASH (deptno)
(PARTITION p1 TABLESPACE p1,
 PARTITION p2 TABLESPACE p2);

Composite partitions: a composite partition is partitioned by range, and each partition is subpartitioned by hash

PARTITION BY RANGE (sales_date)
SUBPARTITION BY HASH (salesman_id)
SUBPARTITIONS 4
STORE IN (tbs1, tbs2, tbs3, tbs4)
(
  PARTITION sales_jan2000 VALUES LESS THAN (TO_DATE('02/01/2000', 'DD/MM/YYYY')),
  PARTITION sales_feb2000 VALUES LESS THAN (TO_DATE('03/01/2000', 'DD/MM/YYYY')),
  PARTITION sales_mar2000 VALUES LESS THAN (TO_DATE('04/01/2000', 'DD/MM/YYYY'))
);

 

4. Materialized Views

A materialized view is a database object that contains the results of a query. It can be a local copy of remote data, or be used to create a summary table based on aggregations of a table's data. A materialized view stores data based on remote tables and may also be called a snapshot.
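A hedged example of a summary materialized view, reusing the hypothetical fact_sales table from the star-schema sketch above:

-- Precompute sales totals per product; refreshed in full, on demand.
CREATE MATERIALIZED VIEW mv_sales_by_product
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
AS
SELECT product_key, SUM(sales_amt) AS total_sales
FROM   fact_sales
GROUP  BY product_key;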

5. Cursors

A cursor (CURSOR) is commonly used in relational databases: in a PL/SQL program, a cursor together with a SELECT statement lets you query a table or view and read the result set row by row.

Oracle cursors are divided into explicit cursors and implicit cursors.
Explicit cursor (EXPLICIT CURSOR): a cursor that is explicitly defined in a PL/SQL program and used for queries.
Implicit cursor (IMPLICIT CURSOR): a cursor that the Oracle system allocates automatically when an UPDATE or DELETE statement is executed in a PL/SQL program.
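A minimal explicit-cursor sketch in PL/SQL, assuming a hypothetical emp table:

DECLARE
  CURSOR c_emp IS
    SELECT empno, ename FROM emp;  -- explicit cursor defined over a query
BEGIN
  FOR r IN c_emp LOOP              -- the FOR loop opens, fetches row by row, and closes the cursor
    DBMS_OUTPUT.PUT_LINE(r.empno || ' ' || r.ename);
  END LOOP;
END;
/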
