How indexes are used in the Data Warehouse

Source: Internet
Author: User

The index of the Data warehouse is a tricky issue. If there are too many indexes, the data is inserted quickly but the query response is slow. If too many indexes, the data import is slow and the data is more storage space, but the query responds faster. The role of indexes in a database is to speed up the query, whether it is a traditional database or a data warehouse. This is especially true for tables with large data volumes and for complex queries that design edge joins. Before contacting the data warehouse is relatively small, here is just a little experience to introduce.

Of course, there are parameters such as Data Warehouse type, dimension table and fact table size, partitioning, ad hoc, and so on when creating the Data Warehouse index. These parameters determine the structure of your index. This article focuses on how to index relational tables in a data warehouse, and note the relational tables in the relational database, not the data tables in SSAS.

Dimension index

If you intend to index on the primary key of the dimension table, and the key is a surrogate key, it is not a natural or business key (such as a user name or ID). be careful not to establish a clustered index on the surrogate key of the dimension table or on the column that is being graduated.

A dimension table contains a natural or business key (such as a transaction code or ID), which we call a business key, which comes from the business system. Although the business key may not be unique, it is better to index the marked column (such as a user ID) for a slowly graduated dimension table, such as:

The clustered index in the dimension table of the user and the product is built on the business key, and through such an index, the query speed can be enhanced especially if the keys are used in the where statement. This key value is often used in a where expression to query the dimension data.

Creating a clustered index from a business key avoids lock escalation (for example, row lock to table lock, intent to rank it), because in the ETL process if there is a nonclustered index on the surrogate key and all the rows are added to the end of the file, a lock escalation may occur, if the exclusive lock is escalated from the row lock to the table lock, Then it will cause other read or ETL or general operation blocking or even deadlock, the final program timeout.

In, the date dimension and the time dimension have no external data source or business key. Consider using the YYYYMMDD and HHMMSSSSS formats as the primary key for two tables and establish a clustered index. This value guarantees the index order and simplifies the query scope in the fact table, and this key value also contains the date or time, no longer requires a specific time.

For large, slowly-changing dimension tables (for example, where you need to type new data), you might create a four-part nonclustered index that includes business keys, record start times, record end times, and surrogate keys. For efficiency and to prevent the storage from growing, use include to include the record end time and the surrogate key, as follows:

                              CREATE nonclustered INDEX mydim_coveringindex on  (Naturalkey, recordstartdate)                                                           INCLUDE (Recordenddate, Surrogatekey);

This index is useful for querying and manipulating historical data during the ETL process, reducing columns by nonclustered indexes and reducing unnecessary storage space. The relational database engine can obtain data directly from the index without having to directly access the dimension data, reducing the IO to improve query speed.

If there are other columns in the dimension table for querying, sorting, grouping, you can also create a nonclustered index, just as you would in a transactional database. If you have an embedded hierarchy in a dimension table, such as a class-subclass-Product ID hierarchy in the Product dimension table, consider indexing on the key values of the hierarchy, which significantly improves the data query and does not affect data import.

Indexing on a fact table

Similar to building indexes in dimension tables, it is necessary to consider the conditions of partitioning. You can set up a clustered index on a column of date or mixed date + time. Because BI analysis always uses a date/time component, the fact table contains a date or datetime column, and using a clustered index here will help build the cube. Also for this reason, data records are stored in the order of date or DateTime. The search for history has its advantages. If the fact table has more than one of these columns, you need to index it on the columns that are most frequently queried or built.

If you are partitioning on a date column, you can use a clustered index on that column. When found to create partitions and clustered indexes on the same column and to create indexes on filegroups that save partitioned fact tables, SQL Server automatically partitions the index with the fact table partition (for example, the index will have partition functions and columns that are the same as the fact table). When the index is partitioned by the fact table, the table is automatically aligned with his index, especially if you create partitions or switch partitions frequently, which is a lot easier.

Next, create a nonclustered index on the foreign key of each fact table, and consider a mix of foreign key and date keys, as shown in 1 to establish an index similar to Customerkey + DateKey. Queries with the same foreign key value will be sorted with time, which improves query speed. Note that you should consider maintaining relationship integrity when dealing with foreign keys.

Improving the Index Schema

As time changes, the data warehouse changes to accommodate organizational changes, and the index structure must be changed. Most data warehouses or BI systems are directly connected to relational tables, so you can use relational table-tuned business methods, such as evaluation queries and data blending, to adjust the index accordingly. If the relational data warehouse is used only to represent the SSAS structure, then the index we discussed earlier may not be required. SSAS is more inclined to use the same query over and over again, so you can use the Index Tuning Wizard or fine tune the query. A thorough evaluation of the index in the Data warehouse is strictly forbidden.

Summarize

This article simply describes how the relational data tables of a general data warehouse are indexed, but many times you have to index them based on actual requests, and sometimes you can't use them. Take into account the consumption and time efficiency of many aspects, or continue to be through the requirements of the production environment to change.

How indexes are used in the Data Warehouse

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.