Top 10 best practices for building a large Relational Data Warehouse

Authors: Stuart Ozer, Prem Mehra, and Kevin Cox

Technical reviewers: Lubor Kollar, Thomas Kejser, Denny Lee, Jimmy May, Michael Redman, and Sanjay Mishra

Building a large relational data warehouse is a complex task. This article describes some design techniques for building an efficient large relational data warehouse with SQL Server. Because most large data warehouses use table and index partitioning, many of the recommendations here involve partitioning. These techniques are based largely on experience building large data warehouses on SQL Server 2005.

 

Partition large fact tables

  • Consider partitioning fact tables that are 50-100 GB or larger.
  • Partitioning improves manageability and often improves performance as well.
    • Faster, more granular index maintenance.
    • More flexible backup/restore options.
    • Faster data loading and deletion.
  • Queries run faster when they are restricted to a single partition.
  • Typically, an application date column is used as the partition key of the fact table.
    • Enables a sliding-window maintenance scenario.
    • Enables partition elimination for application queries.
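As a sketch, partitioning a fact table by an integer date key might look like the following. All object names are hypothetical, and the boundary values assume monthly partitions:

```sql
-- Hypothetical monthly partition function and scheme on an integer date key
CREATE PARTITION FUNCTION pfFactDate (INT)
AS RANGE RIGHT FOR VALUES (20060101, 20060201, 20060301, 20060401);

CREATE PARTITION SCHEME psFactDate
AS PARTITION pfFactDate ALL TO ([PRIMARY]);  -- or map each partition to its own filegroup

CREATE TABLE dbo.FactSales (
    DateKey     INT   NOT NULL,  -- partition key: application date as an integer
    ProductKey  INT   NOT NULL,
    SalesAmount MONEY NOT NULL
) ON psFactDate (DateKey);
```

Mapping each partition to its own filegroup (instead of `ALL TO ([PRIMARY])`) is what later enables per-partition backup and read-only filegroup strategies.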

Build a clustered index on the date key of the fact table

  • This makes it more efficient to build cubes or retrieve historical data slices.
  • If you load data during a batch window, build the fact table's clustered index with the ALLOW_ROW_LOCKS = OFF and ALLOW_PAGE_LOCKS = OFF options. This helps speed up table scan operations during queries and helps avoid excessive locking activity during large updates.
  • Build a non-clustered index on each foreign key. This helps "pinpoint queries" extract rows based on a selective dimension predicate. Use filegroups to manage backup/restore and partial database availability.
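A minimal sketch of these index options, assuming a hypothetical partitioned fact table `dbo.FactSales` with a `DateKey` partition key and a `ProductKey` foreign key:

```sql
-- Clustered index on the date key, with row and page locks disabled
CREATE CLUSTERED INDEX cix_FactSales_DateKey
ON dbo.FactSales (DateKey)
WITH (ALLOW_ROW_LOCKS = OFF, ALLOW_PAGE_LOCKS = OFF);

-- One non-clustered index per foreign key, to support pinpoint queries
CREATE NONCLUSTERED INDEX ix_FactSales_ProductKey
ON dbo.FactSales (ProductKey);
```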

Carefully select the partition granularity

  • Most customers use a month, quarter, or year.
  • For efficient deletion, you must delete one complete partition at a time.
  • Loading one complete partition at a time is also faster.
    • Daily partitions may be a good choice when data is loaded daily.
    • However, remember that a table can have at most 1,000 partitions.
  • The partition granularity affects query parallelism.
    • A query that touches a single partition can parallelize up to MAXDOP (maximum degree of parallelism).
    • A query that touches multiple partitions uses one thread per partition, up to MAXDOP partitions in parallel.
  • If you need MAXDOP parallelism (for example, MAXDOP = 4 or greater), avoid queries that commonly touch only two or three partitions.

Design dimension tables properly

  • Use integer surrogate keys for all dimensions other than the date dimension, and use the smallest integer type that fits for each surrogate key. This keeps the fact table as small as possible.
  • Use a meaningful integer date key derived from the DATETIME value (for example, 20060215).
    • Do not use a surrogate key for the date dimension.
    • It is easy to write WHERE clauses against this column that eliminate partitions in the fact table.
  • Build a clustered index on the business key (not the surrogate key) of each dimension table.
    • Supports fast lookups while loading the fact table.
    • Supports fast lookups of existing dimension rows when managing Type 2 slowly changing dimensions.
  • Build a non-clustered primary key index on the surrogate key of each dimension table.
  • Build non-clustered indexes on frequently searched dimension columns.
  • Avoid partitioning dimension tables.
  • Avoid enforcing primary key-foreign key relationships between the fact table and the dimension tables; this allows faster loading. Use Lookup transformations (for example, in SSIS) to ensure integrity, or check integrity at the data source.
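A hypothetical dimension table following these guidelines (all names are illustrative):

```sql
CREATE TABLE dbo.DimProduct (
    ProductKey  INT           NOT NULL,  -- smallest integer surrogate key that fits
    ProductCode NVARCHAR(20)  NOT NULL,  -- business key from the source system
    ProductName NVARCHAR(100) NOT NULL,
    CONSTRAINT pk_DimProduct PRIMARY KEY NONCLUSTERED (ProductKey)
);

-- Clustered index on the business key (non-unique, so Type 2 changes
-- can store multiple versions of the same business key)
CREATE CLUSTERED INDEX cix_DimProduct_ProductCode
ON dbo.DimProduct (ProductCode);

-- Non-clustered index on a frequently searched column
CREATE NONCLUSTERED INDEX ix_DimProduct_ProductName
ON dbo.DimProduct (ProductName);
```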

Write queries that achieve partition elimination

  • Place the query predicate (WHERE condition) directly on the partitioning key (the date dimension key) of the fact table.
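For example, assuming a hypothetical fact table partitioned on an integer `DateKey`, a range predicate on that column lets the optimizer touch only the partitions covering the requested period:

```sql
-- The BETWEEN predicate on DateKey (the partitioning key) enables
-- partition elimination for February 2006
SELECT p.ProductName, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
WHERE f.DateKey BETWEEN 20060201 AND 20060228
GROUP BY p.ProductName;
```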

Use a sliding window to maintain data

  • Maintain a rolling time window of fact data available online: load the newest data and unload the oldest.
  • Always keep an empty partition at each end of the partition range, so that splitting (before loading new data) and merging (after unloading old data) never cause data movement.
  • Avoid splitting or merging populated partitions. It is extremely inefficient: it can generate about four times as much log and cause severe locking.
  • Create the staging table for loading in the filegroup of the partition being loaded.
  • Create the staging table for unloading in the filegroup of the partition being deleted.
  • Switching in the newest partition all at once is fastest, but only when the partition size matches the data load frequency (for example, one partition per day and one load per day).
  • If the partition size does not match the load frequency, load the newest partition incrementally.
  • Various techniques for bulk-loading into partitioned tables are discussed in the white paper at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/loading_bulk_data_partitioned_table.mspx.
  • Unload only one partition at a time.
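A sliding-window cycle can be sketched as follows, assuming a hypothetical monthly partition function `pfFactDate` on an integer date key and staging tables in the matching filegroups. Partition numbers are resolved with the `$PARTITION` function rather than hard-coded:

```sql
-- Load side: split the empty partition at the new end of the range,
-- then switch the fully loaded staging table in (no data movement)
ALTER PARTITION SCHEME psFactDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfFactDate() SPLIT RANGE (20060501);
ALTER TABLE dbo.Stage_200604
    SWITCH TO dbo.FactSales PARTITION $PARTITION.pfFactDate(20060401);

-- Unload side: switch the oldest partition out to a staging table,
-- then merge the now-empty range (again, no data movement)
ALTER TABLE dbo.FactSales
    SWITCH PARTITION $PARTITION.pfFactDate(20060101) TO dbo.Stage_Old;
ALTER PARTITION FUNCTION pfFactDate() MERGE RANGE (20060101);
```

Because both ends of the range are kept empty, the SPLIT and MERGE operations here are metadata-only.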

Load initial data efficiently

  • Use the SIMPLE or BULK_LOGGED recovery model during the initial data load.
  • Create the partitioned fact table with its clustered index.
  • Create a non-indexed staging table for each partition, and a separate source data file for each.
  • Populate the staging tables in parallel.
    • Use multiple BULK INSERT, BCP, or SSIS tasks.
      • If there is no I/O bottleneck, create one load script per CPU and run them in parallel. If I/O bandwidth is the limit, use fewer parallel scripts.
      • Use a batch size of 0, so that each load commits in a single batch.
      • Use TABLOCK.
      • Use BULK INSERT if the source files are flat files on the same server; use BCP or SSIS if data is pushed from a remote machine.
  • Build a clustered index on each staging table, then create the appropriate CHECK constraints. Do not use the SORT_IN_TEMPDB option.
  • SWITCH all the staging tables into the partitioned table.
  • Build the non-clustered indexes on the partitioned table.
  • On a SAN with 14 Gbit/s throughput, 1 TB could be loaded (as a non-indexed table) within one hour on a 64-CPU server. For more information, see the SQLCAT blog post http://blogs.msdn.com/sqlcat/archive/2006/05/19/602142.aspx.
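One of the parallel load scripts might look like the following sketch (file path, delimiters, and object names are hypothetical):

```sql
-- Bulk load one month's file into its non-indexed staging table.
-- BATCHSIZE is omitted, so the whole file commits as a single batch.
BULK INSERT dbo.Stage_200601
FROM 'D:\load\sales_200601.dat'
WITH (TABLOCK, FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

-- After loading: index the staging table, constrain it to its partition
-- range, then switch it into the partitioned fact table
CREATE CLUSTERED INDEX cix_Stage_200601 ON dbo.Stage_200601 (DateKey);
ALTER TABLE dbo.Stage_200601 ADD CONSTRAINT ck_Stage_200601
    CHECK (DateKey >= 20060101 AND DateKey < 20060201);
ALTER TABLE dbo.Stage_200601 SWITCH TO dbo.FactSales PARTITION 1;
```

The CHECK constraint is what proves to SQL Server that every row in the staging table belongs in the target partition, making the SWITCH a metadata-only operation.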

Delete old data efficiently

  • Use partition switching whenever possible.
  • When deleting a large number of rows from a non-partitioned, indexed table:
    • Avoid DELETE FROM ... WHERE ..., because it
      • generates a large number of locks and log records, and
      • takes a long time to roll back if the delete is cancelled.
    • The following approach is usually faster:
      • Insert the rows to keep into a non-indexed table.
      • Build the indexes on that table.
      • Rename the new table to replace the original.
  • Alternatively, delete in batches by repeatedly running a statement like this in a loop:
    DELETE TOP (1000) ...;
    COMMIT;
  • Another approach is to update each row to mark it as deleted, and physically delete the marked rows later during off-peak hours.
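The batch-delete loop above could be fleshed out as follows (table name and predicate are hypothetical):

```sql
-- Delete old rows in small, separately committed batches.
-- In autocommit mode each DELETE commits on its own, keeping the
-- log and lock footprint of any single statement small.
DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (1000) FROM dbo.FactSales WHERE DateKey < 20050101;
    SET @rows = @@ROWCOUNT;
END
```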

Manage statistics manually

  • Statistics on a partitioned table are maintained for the table as a whole.
  • Manually update the statistics on large fact tables after loading new data.
  • Manually update the statistics after rebuilding the index of a partition.
  • If you routinely update statistics after each periodic load, you can turn off autostats on the table.
  • This is very important for optimizing queries that need to read only the newest data.
  • Updating the statistics of small dimension tables after incremental loads may also help performance. Use the FULLSCAN option on dimension tables to get more accurate query plans.
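A post-load maintenance step might look like this sketch (object names are hypothetical):

```sql
-- After each periodic load:
UPDATE STATISTICS dbo.FactSales;                   -- large fact table, sampled
UPDATE STATISTICS dbo.DimProduct WITH FULLSCAN;    -- small dimension table, full scan

-- Since statistics are now refreshed by the load process,
-- turn off automatic statistics updates for the fact table
EXEC sp_autostats 'dbo.FactSales', 'OFF';
```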

Use an efficient backup strategy

  • Backing up an entire very large database can take a significant amount of time.

    • For example, backing up a 2 TB database to a 10-spindle RAID-5 disk array on a SAN may take 2 hours (at a rate of 275 MB/s).
  • Snapshot backups using SAN technology are a good option.
  • Reduce the volume of data that must be backed up regularly.
    • The filegroups holding historical partitions can be marked read-only.
    • After a filegroup is marked read-only, it needs to be backed up only once.
    • After that, back up only the read/write filegroups on a regular schedule.
  • Note that RESTORE operations on read-only filegroups cannot run in parallel.
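This strategy can be sketched as follows, with hypothetical database and filegroup names (marking a filegroup read-only in SQL Server 2005 requires exclusive access to the database):

```sql
-- Mark the filegroup holding a historical partition as read-only
ALTER DATABASE SalesDW MODIFY FILEGROUP FG2005 READ_ONLY;

-- Back up the read-only filegroup once
BACKUP DATABASE SalesDW FILEGROUP = 'FG2005'
    TO DISK = 'E:\backup\SalesDW_FG2005.bak';

-- Thereafter, back up only the read/write filegroups on the regular schedule
BACKUP DATABASE SalesDW READ_WRITE_FILEGROUPS
    TO DISK = 'E:\backup\SalesDW_rw.bak';
```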
