Data Compression Brief


1. Decide which objects to compress

The stored procedure sp_estimate_data_compression_savings estimates the space saved by row compression and by page compression separately.
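
For example, a minimal sketch of estimating page-compression savings (the table name is hypothetical):

-- Estimate savings from PAGE compression for all indexes and partitions of a table
EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'SalesOrderDetail',  -- hypothetical table name
    @index_id         = NULL,                -- NULL = all indexes
    @partition_number = NULL,                -- NULL = all partitions
    @data_compression = 'PAGE';              -- or 'ROW' / 'NONE'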

A table compresses well when it contains the following data patterns:

    • Numeric and fixed-length character columns where most values do not use all the bytes of the type (for example, an int column whose values are mostly below 1000).
    • Nullable columns with many NULL values.
    • Columns with many repeated values, or values that share a common prefix.

Compression is less effective when the table contains the following data patterns:

    • Numeric and fixed-length character columns where most values use all the bytes of the type.
    • Very few duplicate values.
    • Duplicate values that do not share a common prefix.
    • Data stored out of row.
    • FILESTREAM data

2. Evaluate the application load pattern

Compressed pages stay compressed both on disk and in memory. Data is decompressed (not the whole page, only the relevant rows) in two scenarios:

    • When rows are read by filter, sort, or join operations in a query
    • When rows are updated by the application

Decompression consumes CPU, but compression reduces physical and logical IO and improves cache efficiency. For scan operations, the IO saving is considerable; for singleton lookups, it is much smaller.

The CPU overhead of row compression typically does not exceed 10%. If current system resources are sufficient and a 10% CPU increase poses no pressure, enabling row compression on all tables is recommended.
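
A small helper sketch (not from the white paper) that generates the ALTER TABLE statements to row-compress every user table; review the output before executing it:

-- Generate one REBUILD statement per user table
SELECT 'ALTER TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
     + ' REBUILD WITH (DATA_COMPRESSION = ROW);'
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id;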

Page compression is somewhat more expensive than row compression, so deciding whether to use it can be harder. Some simple guidelines help:

    • Start with tables and indexes that are used infrequently.
    • If the system does not have enough CPU headroom, do not use page compression.
    • Filter, join, aggregate, and sort operations work on decompressed data, so compression does not help much with such queries. If the workload consists mainly of very complex queries (multi-table joins, complex aggregations), page compression will not improve performance; its main benefit is the storage saving.
    • In large data warehouses, where scan performance is the focus and storage is expensive, page compression on all tables is recommended if CPU capacity allows.

Two finer-grained metrics can help evaluate which compression to use:

    • U: the percentage of all operations on a given object (table, index, or partition) that are updates. The lower U is, the better the object suits page compression.
    • S: the percentage of all operations on a given object (table, index, or partition) that are scans. The higher S is, the better the object suits page compression.

Query U for all objects in the database with the following script:

SELECT o.name AS [table_name], x.name AS [index_name],
       i.partition_number AS [partition],
       i.index_id AS [index_id], x.type_desc AS [index_type],
       i.leaf_update_count * 100.0 /
           (i.range_scan_count + i.leaf_insert_count
            + i.leaf_delete_count + i.leaf_update_count
            + i.leaf_page_merge_count + i.singleton_lookup_count) AS [percent_update]
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
       + i.leaf_delete_count + i.leaf_update_count
       + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY [percent_update] ASC;

Query S for all objects in the database with the following script:

SELECT o.name AS [table_name], x.name AS [index_name],
       i.partition_number AS [partition],
       i.index_id AS [index_id], x.type_desc AS [index_type],
       i.range_scan_count * 100.0 /
           (i.range_scan_count + i.leaf_insert_count
            + i.leaf_delete_count + i.leaf_update_count
            + i.leaf_page_merge_count + i.singleton_lookup_count) AS [percent_scan]
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) i
JOIN sys.objects o ON o.object_id = i.object_id
JOIN sys.indexes x ON x.object_id = i.object_id AND x.index_id = i.index_id
WHERE (i.range_scan_count + i.leaf_insert_count
       + i.leaf_delete_count + i.leaf_update_count
       + i.leaf_page_merge_count + i.singleton_lookup_count) != 0
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY [percent_scan] DESC;

Both queries use the DMV sys.dm_db_index_operational_stats. This DMV only accumulates values since the SQL Server instance last started, so choose an appropriate time to run them in practice.
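
To see how long the statistics have been accumulating, you can check the instance start time (a small helper query, not part of the white paper):

-- Returns the time the SQL Server instance last started
SELECT sqlserver_start_time FROM sys.dm_os_sys_info;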

Usually U < 20% and S > 75% are reasonable thresholds for enabling page compression. However, for an append-only table that only receives ordered inserts, page compression is appropriate even when S is low.

3. Assess resource requirements

Compression is enabled on tables and indexes with ALTER TABLE ... REBUILD and ALTER INDEX ... REBUILD; in other respects the principles are the same as rebuilding an index. The resources typically required are space, CPU, and IO.
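
For example, a minimal sketch of enabling compression (table and index names are hypothetical):

-- Page-compress a table (rebuilds the heap or clustered index)
ALTER TABLE dbo.SalesOrderDetail
    REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Row-compress a single index online, sorting in tempdb
ALTER INDEX IX_SalesOrderDetail_ProductID ON dbo.SalesOrderDetail
    REBUILD WITH (DATA_COMPRESSION = ROW, ONLINE = ON, SORT_IN_TEMPDB = ON);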

Space requirements: during compression, the compressed and uncompressed copies of the data coexist; the uncompressed copy is dropped and its space freed only after compression completes. If the rebuild is online, the mapping index also requires additional space.

The transaction log space required is determined by whether the operation is online (ONLINE = ON or OFF) and by the database's recovery model.

When SORT_IN_TEMPDB = ON (recommended), the internal mapping index is built in tempdb to track the relationship between old bookmarks and new bookmarks, which is what allows concurrent DML operations. For the version store, tempdb usage is determined by the amount of data involved in concurrent DML and by how long those transactions run.

Typically, the CPU cost of a row compression operation is about 1.5 times that of a plain index rebuild, and page compression costs 2 to 5 times as much. Online mode requires additional CPU as well. Rebuilds and compression can be parallelized, so consider them together with MAXDOP.
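
For example, to cap the parallelism of a compression rebuild (index and table names are hypothetical, the MAXDOP value is illustrative):

-- Limit the page-compression rebuild to 4 CPUs
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
    REBUILD WITH (DATA_COMPRESSION = PAGE, MAXDOP = 4);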

Considerations for Parallelization:

    • When SQL Server creates/rebuilds/compresses an index, the statistics on the first (leftmost) column of the index are used to distribute the parallel work across CPUs. Parallelization therefore helps little when the first column is not very selective, or when data skew is so severe that the first column has few distinct values.
    • Compressing/rebuilding a heap with ONLINE = ON is a single-threaded operation. However, the table scan that precedes the compression/rebuild is parallel and multithreaded.

The following table summarizes the resource costs of compressing and rebuilding a clustered index:

    • X = number of pages before compression or rebuild
    • P = number of pages after compression (P < X)
    • Y = newly inserted and updated page data (applies only to ONLINE = ON with concurrent application modifications)
    • M = size of the mapping index (estimated per the tempdb Capacity Planning white paper)
    • C = CPU time required to rebuild the clustered index

[Image: table of space, CPU, and IO costs for compressing vs. rebuilding a clustered index, expressed in terms of X, P, Y, M, and C]

Here are some reference points when deciding when and how to compress data:

    • Online vs. Offline:

Offline is faster and requires fewer resources, but it locks the table for the duration of the compression operation. Online has some limitations of its own.

    • Compress one table/index/partition vs. multiple operations concurrently:

This depends on current resource headroom: if resources are sufficient, several compression operations in parallel are acceptable; otherwise, preferably one at a time.

    • Order of table compression operations:

Compressing small tables first requires fewer resources and completes quickly, and the space it releases helps subsequent compression operations.

    • SORT_IN_TEMPDB = ON or OFF:

ON is recommended. This lets tempdb hold the mapping index work, which also reduces the space needed in the user database.

Side effects of compression operations:

    • A compression operation includes a rebuild, so fragmentation in the table or index is removed.
    • When compressing a heap that has nonclustered indexes: with ONLINE = OFF the indexes are rebuilt one by one (serially); with ONLINE = ON they are rebuilt concurrently.

4. Maintaining compressed data

How newly inserted data is compressed:

[Image: table showing the compression applied to newly inserted data, by table organization (heap vs. clustered index) and insert method]

* Row-compressed pages in a heap can be converted to page compression by rebuilding the heap with page compression.

** Under page compression, not every page is page-compressed: a page is page-compressed only when the space saved exceeds an internal threshold.

Updating and deleting compressed rows

All updates to rows in a row-compressed table/partition keep the row-compression format. Updates to rows in a page-compressed table/partition do not recalculate the column prefixes and page dictionary every time; they are recalculated only when the number of updates on the page exceeds an internal threshold.

Behavior of secondary data structures:

Table compression | Transaction log | Mapping index (clustered index rebuild) | Sort pages for queries | Version store (SI or RCSI isolation)
ROW               | ROW             | NONE                                    | NONE                   | ROW
PAGE              | ROW             | NONE                                    | NONE                   | ROW

In a page-compressed index, non-leaf-level pages are row compressed

The non-leaf levels of an index are relatively small, so page compression there would not save much space. Non-leaf pages are also accessed frequently, so row compression keeps the per-access decompression cost low.

5. Reclaim free space released by data compression

    1. Do not reclaim it; keep it for future data growth. This is unsuitable for the read-only partitions of a partitioned table (with each partition on a different filegroup): old read-only partitions will not grow after compression, so compressing them frees a lot of space that is worth reclaiming.
    2. DBCC SHRINKFILE (or DBCC SHRINKDATABASE). This can cause heavy fragmentation, and it is a single-threaded operation that can take a long time.
    3. If you are compressing all the tables on a filegroup, create a new filegroup and move the tables and indexes there as you compress them. The move can be done by creating/recreating the clustered index, for example WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON, SORT_IN_TEMPDB = ON) ON [FG_new]; see the sketch after this list. After the data is moved, drop the original filegroup. Note that LOB data cannot be moved to a new filegroup this way.
    4. Create compressed tables on a new filegroup and then import the data into them.
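
A minimal sketch of option 3 (table, index, column, and filegroup names are hypothetical):

-- Recreate the clustered index page-compressed on the new filegroup,
-- moving the table's data there in the same operation
CREATE UNIQUE CLUSTERED INDEX PK_SalesOrderDetail
    ON dbo.SalesOrderDetail (SalesOrderDetailID)
    WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON, SORT_IN_TEMPDB = ON)
    ON [FG_new];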

6. BULK INSERT and data compression

BULK INSERT WITH (TABLOCK) is the fastest way to import data into a compressed table. Obviously, it locks the table.
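
A minimal sketch (file path and table name are hypothetical):

-- TABLOCK takes a table lock for the duration of the load,
-- enabling the fastest load path into the compressed table
BULK INSERT dbo.SalesOrderDetail
FROM 'C:\load\sales.dat'
WITH (TABLOCK);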

When compressing data, consider the ordering of the BULK INSERT relative to the creation of the clustered index:

No. | Method                                                                                             | Comparison
1   | BULK INSERT into an uncompressed heap, then CREATE CLUSTERED INDEX WITH (DATA_COMPRESSION = PAGE) | Time required: 1 < 2 < 3
2   | BULK INSERT into a page-compressed heap, then CREATE CLUSTERED INDEX                              | Space required: 1 > 2 > 3
3   | BULK INSERT into a page-compressed clustered index                                                |
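
A sketch of option 1 from the table (names and path are hypothetical):

-- 1) Load into an uncompressed heap
BULK INSERT dbo.StagingOrders
FROM 'C:\load\orders.dat'
WITH (TABLOCK);

-- 2) Build the clustered index with page compression afterward
CREATE CLUSTERED INDEX CIX_StagingOrders
    ON dbo.StagingOrders (OrderID)
    WITH (DATA_COMPRESSION = PAGE);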

7. Data compression and partitioned table maintenance

1. The SWITCH operation requires that the destination partition (or destination table) have the same compression setting as the source partition.

2. Partitions created by a SPLIT inherit the compression setting of the original partition.

3. In a MERGE operation, the partition that is removed is called the source partition, and the partition that receives its data is called the destination partition:

Destination partition compression | How data is merged into the destination partition
NONE                              | Data is decompressed during the merge
ROW                               | Data is converted to row-compressed format during the merge
PAGE                              | Heap: data is converted to row-compressed format; clustered index: data is converted to page-compressed format

PS: partition MERGE rules

1. With RANGE LEFT, the partition that contains the boundary value is removed; the partition to its right is kept and the data is moved into it.

2. With RANGE RIGHT, the partition that contains the boundary value is removed; the partition to its left is kept and the data is moved into it.
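
A minimal sketch of these operations (function, table, and partition names/values are hypothetical):

-- SWITCH requires matching compression on both sides
ALTER TABLE dbo.Orders_Staging
    REBUILD WITH (DATA_COMPRESSION = PAGE);
ALTER TABLE dbo.Orders SWITCH PARTITION 3 TO dbo.Orders_Staging;

-- MERGE removes a boundary value; for a RANGE LEFT function the
-- partition that owned 20200101 is removed and its rows move right
ALTER PARTITION FUNCTION pf_OrderDate()
MERGE RANGE (20200101);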

8. Data compression and Transparent Data Encryption (TDE)

TDE encrypts data pages as they are written to disk and decrypts them as they are read from disk into memory. Compression and decompression operate on pages in memory, so they always work on decrypted pages. The interaction between the two features is therefore minimal.

Summary

1. This article is a simplified translation and summary of the white paper <Data Compression: Strategy, Capacity Planning and Best Practices>, which is based on SQL Server 2008.

2. Data compression is an undervalued SQL Server feature; personally, I consider it one of the best practices most worth standardizing on.

