Interpreting SQL Server 2014 Updatable Columnstore Indexes: the Storage Mechanism


Overview

SQL Server 2014 is regarded as a revolutionary release of Microsoft's database, with performance improvements greater than in any previous version.

The updatable columnstore index, one of the key features of SQL Server 2014, is a significant contributor to improving query performance. According to Microsoft, in OLAP and statistical reporting systems, report query performance can be up to 10 times faster than in earlier versions of SQL Server.

Let's look at the Updatable Columnstore Index for SQL Server 2014 from the storage perspective.

What is column storage

Microsoft introduced columnstore technology in SQL Server 2012 to improve query performance and better support big data analytics.

The essence of column storage is to split a table by column so that each column is stored separately; the unit of storage changes from a row to a column.

As the figure shows, the 10 columns of the table are distributed across 10 pages: Page1 holds column C1's values for ROW1 through ROWn, Page2 holds column C2's values for ROW1 through ROWn, and so on.

Note: I previously wrote a brief introduction on my blog; see http://www.cnblogs.com/i6first/p/3217584.html

The benefits of doing this are:

    • Better data compression, reducing disk footprint

The more similar the data and the more repeated values it contains, the more effective compression is.
Each column in a column store has the same data type and expresses the same data concept (gender, for example), so its content is highly repetitive, which makes compression more efficient than row compression or page compression.
This not only saves disk space, but also lets more data fit in the same amount of memory, improving query performance.

    • Improved query performance

In a query, we usually only want data from one or a few columns of interest in the table.
With traditional row storage, the query must scan all the data in the table and then filter out the specified columns.
When the table contains many columns and a large amount of data, this is very inefficient.
With column storage, however, each column is stored together and independently of the others, so the query only needs to scan the storage areas of the specified columns and does not have to read the columns it does not need. The scope of the scan is much smaller.

For example, this query:

SELECT C1 FROM MyTable

Because all of MyTable's C1 values are stored together, the query only needs to scan the storage area of the C1 column.

What is a Columnstore index

After you create a columnstore index on a traditional table, the table's data is stored by column.

In SQL Server 2014 there are two types of columnstore index: the nonclustered columnstore index and the clustered columnstore index.

(1) Characteristics of the nonclustered columnstore index

    • As with an ordinary nonclustered index, creating a nonclustered columnstore index creates a copy of the indexed columns, taking up additional disk space; because of data compression, however, the copy is relatively small (a creation example is sketched after the note below)
    • A table with a nonclustered columnstore index is not updatable

Note: in SQL Server 2012 only nonclustered columnstore indexes can be built; clustered columnstore indexes cannot.
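As a rough sketch of how a nonclustered columnstore index is created (the table MyUser1 and the column names C1, C2, C3 are assumptions for illustration only):

-- Create a nonclustered columnstore index on selected columns (a column list is required)
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_MyUser1
    ON MyUser1 (C1, C2, C3);
GO
-- The base table is now read-only; to modify data, disable or drop the index first, e.g.:
-- ALTER INDEX NCCI_MyUser1 ON MyUser1 DISABLE;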

(2) Characteristics of the clustered columnstore index

    • The index pages are the data pages, and the high compression rate greatly reduces disk space usage
    • The table remains updatable
    • The clustered columnstore index must be the only index on the table; if a clustered or nonclustered index already exists, it must be dropped before the columnstore index is created
    • Only available in the Enterprise, Developer, and Evaluation editions
    • Data is not stored in a guaranteed physical order; the index is intended primarily for query performance and high data compression

Note: readers who want more detail on the features and usage limitations of clustered columnstore indexes can refer to the documentation. In addition, unless otherwise stated, the column store described below refers to the storage structure of a table with a clustered columnstore index.

Creating a clustered columnstore index

A clustered columnstore index does not require column names to be specified when it is created; once the index is created, all rows of the table are stored as columns.

CREATE CLUSTERED COLUMNSTORE INDEX <index_name> ON <table_name>
GO
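As a concrete sketch (the index name is an assumption; the table name follows the MyUser1 example used later):

-- Any existing clustered or nonclustered index must be dropped first, for example:
-- DROP INDEX IX_MyUser1_C1 ON MyUser1;   -- hypothetical existing index
CREATE CLUSTERED COLUMNSTORE INDEX CCI_MyUser1 ON MyUser1;
GO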

To better illustrate the conversion from "row table" to "column store" when a clustered columnstore index is created, I made a diagram and explain each numbered step below:

①: First, the table is split into one or more row groups;

In general, a row group must contain at least 102,400 rows and at most 1,048,576 rows before it can be converted to column storage. However, when you create a columnstore index directly on a table, this rule can be "ignored": even if the table has fewer than 102,400 rows, those rows still form a row group that is converted to the column store.

This is easy to understand: when the index is created, SQL Server cannot wait for the table to grow to 102,400 rows before forming the column store.

Here is an example. Table MyUser1 has 102 rows of data, and I created a columnstore index on the table. You can see that there is only one row group, containing 102 rows, and that this row group has already been converted to column storage.

SELECT COUNT(*) AS Rows_count FROM MyUser1

SELECT i.object_id,
       OBJECT_NAME(i.object_id) AS TableName,
       i.name AS IndexName,
       csrowgroups.state_description,
       csrowgroups.row_group_id,
       csrowgroups.total_rows,
       csrowgroups.*,
       100 * (csrowgroups.total_rows - ISNULL(csrowgroups.deleted_rows, 0)) / csrowgroups.total_rows AS PercentFull
FROM sys.indexes AS i
JOIN sys.column_store_row_groups AS csrowgroups
    ON i.object_id = csrowgroups.object_id
   AND i.index_id = csrowgroups.index_id
-- WHERE OBJECT_NAME(i.object_id) = '<table_name>'
ORDER BY OBJECT_NAME(i.object_id), i.name, csrowgroups.row_group_id;

Figure: the table has only one row group, with row_group_id 0. Its state_description is COMPRESSED (the meanings of the state_description values are described in detail below), which indicates that the row group has been stored in column mode.

②: Each row group is split column by column;

When a row group reaches the required size (102,400 to 1,048,576 rows), it is split by column, and each column forms a column block (column segment). Each column block contains all of that column's data for the row group.

In my example, table MyUser1 has 33 columns, so by this definition there are 33 column blocks.

SELECT max_column_id_used FROM sys.tables WHERE object_id = OBJECT_ID('MyUser1')

SELECT i.name,
       p.object_id,
       p.index_id,
       i.type_desc,
       COUNT(*) AS number_of_segments
FROM sys.column_store_segments AS s
INNER JOIN sys.partitions AS p
    ON s.hobt_id = p.hobt_id
INNER JOIN sys.indexes AS i
    ON p.object_id = i.object_id
GROUP BY i.name, p.object_id, p.index_id, i.type_desc;
GO

Figure: the number of columns (max_column_id_used) in the MyUser1 table is 33.

Figure: the number of column segments (number_of_segments) in the MyUser1 table.
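To drill down from the segment count to the individual column segments, a query along these lines against sys.column_store_segments can be used (a sketch; the table name MyUser1 follows the example above):

-- One row per column segment: which column, which segment, how many rows, and its compressed size
SELECT s.column_id,
       s.segment_id,
       s.row_count,
       s.on_disk_size
FROM sys.column_store_segments AS s
INNER JOIN sys.partitions AS p
    ON s.hobt_id = p.hobt_id
WHERE p.object_id = OBJECT_ID('MyUser1')
ORDER BY s.column_id, s.segment_id;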

③: Each column block is compressed and stored on physical disk;

Forming the column blocks in step ② is not the end goal, but a means to it.

Column blocks must be compressed before they are truly stored as columns; according to MSDN, the resulting column store can reduce disk usage by up to 7 times.

In the example below, I created two tables containing exactly the same 1,048,577 rows of data: SimpleTable has a clustered columnstore index, while SimpleTable_NOCCI has no clustered columnstore index, only an ordinary clustered index.
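A minimal sketch of that setup (the schema is hypothetical, including the ID column; only the indexing differs between the two tables):

-- SimpleTable gets a clustered columnstore index
CREATE CLUSTERED COLUMNSTORE INDEX CCI_SimpleTable ON SimpleTable;
GO
-- SimpleTable_NOCCI keeps an ordinary clustered (rowstore) index on a hypothetical ID column
CREATE CLUSTERED INDEX CI_SimpleTable_NOCCI ON SimpleTable_NOCCI (ID);
GO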

We can use sp_spaceused to check the disk space usage of the two tables:

EXEC sp_spaceused 'SimpleTable_NOCCI'
GO
EXEC sp_spaceused 'SimpleTable'
GO

Clearly, SimpleTable_NOCCI uses more than twice as much disk space as SimpleTable.

Insert and BULK INSERT

A row inserted into a table with a clustered columnstore index must go through the process row → row group → column block → column store. This differs from inserting data into a traditional table:

Let's look at the two SQL Server insert methods to understand the process of inserting data into a column store.

(1) INSERT

INSERT is also known as trickle insert; the ordinary INSERT INTO statements we use every day are trickle inserts.

In SQL Server 2014, the rows from each such insert are not written directly to the column store.

Doing so would produce a great deal of index fragmentation, and such fragmented inserts would not compress well, hurting columnstore and query performance.

To hold this newly inserted data, SQL Server 2014 introduces a staging area called the delta store.

Newly inserted rows are stored in the delta store as rows and can be retrieved through a B-tree. In other words, the delta store has the same storage structure as a traditional table: row storage.
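As a minimal sketch (the table and values are hypothetical), a single trickle insert such as the following lands in the delta store rather than in a compressed row group:

INSERT INTO MyUser1 (C1, C2)
VALUES (1, N'example');
-- The row now sits in an OPEN, row-based delta store row group
-- (sys.column_store_row_groups shows a non-NULL delta_store_hobt_id for it)
-- until enough rows accumulate for compression.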

When the number of rows in the delta store reaches the required 1,048,576, the row group is marked as CLOSED and no new data can be inserted into it.

Then a SQL Server 2014 background process, the tuple mover, scans for CLOSED row groups, moves them from the delta store into the column store, and finally marks them as COMPRESSED.
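If you do not want to wait for the background tuple mover, my understanding is that SQL Server 2014 also lets you compress CLOSED row groups on demand by reorganizing the index (the index and table names are assumptions):

-- Ask SQL Server to move CLOSED delta store row groups into the column store now
ALTER INDEX CCI_MyUser1 ON MyUser1 REORGANIZE;
GO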

In the figure, RowGroup1 has reached the maximum row group size and is marked CLOSED, indicating that no new data can be inserted into it.

RowGroup2 has fewer than 1,048,576 rows; even though it already meets the row group minimum, it stays OPEN until it reaches the maximum row group size.

The status of the row group can be queried by the following statement:

SELECT i.object_id,
       OBJECT_NAME(i.object_id) AS TableName,
       csrowgroups.state_description
FROM sys.indexes AS i
JOIN sys.column_store_row_groups AS csrowgroups
    ON i.object_id = csrowgroups.object_id
   AND i.index_id = csrowgroups.index_id
-- WHERE OBJECT_NAME(i.object_id) = '<table_name>'
ORDER BY OBJECT_NAME(i.object_id), i.name, row_group_id;

(2) BULK INSERT

BULK INSERT can be understood as a high-performance insert method, often used for importing large amounts of data; its performance is much better than trickle insert.

(Readers who are interested can verify for themselves which of the two is faster when inserting the same data.)

For the same reason, BULK INSERT puts data into the column store differently than trickle insert does.

When a bulk insert reaches the row group minimum of 102,400 rows, the row group can be stored directly as columns without going through the delta store.

Note the contrast with trickle insert, where forming the column store requires not only a row group reaching 1,048,576 rows, but also passing through the delta store.

This shows that for large batch data imports, BULK INSERT is the preferred method.

In the following example, I import a file t1.txt containing 102,400 rows into the database via BULK INSERT, and you can see that the resulting row group goes directly to the COMPRESSED state.

BULK INSERT BulkInsert
FROM 'd:\temp\t1.txt'

The t1.txt format is as follows:

However, if a bulk insert is larger than one row group but smaller than two, the portion beyond the first full row group is still stored in the delta store.
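For reference, a fuller sketch of the BULK INSERT statement with explicit options; the field and row terminators are assumptions, since the actual format of t1.txt is not reproduced here:

BULK INSERT BulkInsert
FROM 'd:\temp\t1.txt'
WITH (
    FIELDTERMINATOR = ',',    -- assumed column delimiter in t1.txt
    ROWTERMINATOR   = '\n',   -- assumed row delimiter
    TABLOCK                   -- take a table lock for faster bulk loading
);
GO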

Delete and update

Because a table with a clustered columnstore index can contain both the row-based delta store and the column store, deletes and updates are handled differently in the two regions.

(1) Delete

Let's look at the delete operation first:

    • If the deleted row is in the column store, SQL Server only deletes it logically; the physical space it occupies is not released

The delete bitmap in SQL Server 2014 is a structure that tracks every record deleted from the column store; like the delta store, it is row-based and organized as a B-tree.

When a row needs to be deleted, the delete bitmap marks the bit corresponding to that row as deleted; the physical area where the row resides is not actually changed.

This requires every query to scan the delete bitmap first, so that rows marked as deleted, although still present in physical storage, never appear in query results.

    • If the deleted row is in the delta store, this is no different from deleting from traditional row storage: SQL Server removes the data both logically and physically.
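A small sketch showing the effect of a delete against the column store (the predicate is hypothetical):

DELETE FROM MyUser1
WHERE C1 = 1;   -- hypothetical predicate
GO
-- Rows removed from COMPRESSED row groups only increase deleted_rows;
-- the row group keeps its physical space
SELECT row_group_id, state_description, total_rows, deleted_rows
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID('MyUser1');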

(2) Update

Once we understand insert and delete, update is easy to follow.

    • If the update occurs in the column store

SQL Server marks the row as deleted in the delete bitmap and inserts a new row with the updated values into the delta store.

    • If the update occurs in the Delta store

SQL Server updates the data in this row directly in the Delta store.
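Putting the two cases together, a sketch of an update (the predicate and value are hypothetical):

UPDATE MyUser1
SET C2 = N'new value'
WHERE C1 = 2;
-- If the row is in a COMPRESSED row group: it is marked in the delete bitmap
-- and the new version is inserted into the delta store.
-- If the row is in the delta store: it is updated in place.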

Conclusion

Although the columnstore in SQL Server 2014 now supports data updates, that does not mean every report in a production environment will benefit from it.

Column storage is inherently designed for OLAP; its data tends to be static, and even for data import Microsoft recommends using BULK INSERT.

So if the database handles a large volume of inserts, deletes, and updates, using columnstore technology may backfire.
