Introduction to DB2 V10.5 column storage
Before DB2 V10.5, data is stored in rows in a table. Starting with V10.5, DB2 provides support for column storage.
The benefit of row storage in relation to column storage is that it is easier to modify data and is suitable for OLTP transaction applications. However, row storage has performance limitations in some scenarios, such as many queries do not need to read all the columns in one table at a time, and row storage cannot avoid reading to unwanted columns because of the storage mechanism.
The performance of the above scenario can be improved by using column storage. The column store only needs to read the columns of the query in the query statement, thereby reducing the amount of data read. I/O occupies a very important place in the performance of the whole system. The reduction in data volume brings about a reduction in I/O, resulting in a performance boost. In addition, because the data is in column storage, the compression algorithm can more easily compress the data than improve, thus reducing the data storage space. Some compression algorithms can even operate directly on the data without the need for decompression, thus reducing the CPU consumption. In a nutshell, in some scenarios, column storage brings performance improvements and reduced resource consumption.
In some scenarios, column storage does not apply, such as transaction-type OLTP applications. Column storage is mainly suitable for OLAP type applications, such as Data Warehouse, data mining, decision support system, etc. There are some common features in these fields: large amount of data, large proportion of query statements, and more complex queries.
Starting with the DB2 V10.5, the function of column storage is provided, and when a table is created, the user can choose to use row or column storage and, if not specified, the default is row storage. The default value can be modified by dft_table_org database configuration parameters, and setting the value to column changes the default to store as columns.
Most of the commands for row storage remain valid on the column store, such as the primary foreign KEY constraint. The most typical difference between row and column storage is that the column store cannot create an index. No indexes are scary, and in many cases, column stores without indexes are several times or more likely to have performance than indexed row storage.
Indicators for dynamically selecting column storage tables
Whether to select a column store first considers the business type. Only OLAP type applications are suitable for column storage. Most of these applications have the following characteristics: Large amount of data, large proportion of query statements, complex query more.
After defining the business type, you can choose to use the column store based on the following metrics.
Amount of data
Query average amount of data accessed
Query average access larger than
The proportion of the query statement
In practical application, the column storage can be judged to be valid according to the following indexes.
Table size
I/O throughput
Query Statement Response time
Sort time
The details of each indicator are as follows:
Amount of data
The amount of data is one of the important metrics to measure whether or not to use column storage. The greater the amount of data in a database table, the more appropriate the table is to use column storage, especially if the query statement is large and the amount of data accessed is large, the use of column queries can significantly reduce SQL statement query time.
Query average amount of data accessed
The average amount of data accessed by a query is the average number of times a query is accessed, and how many of the data in the database are visited, including the data that is accessed but not actually used and the data actually used. For access to data that is not actually used, a typical example is a conditional column with no index. Row storage can use indexes to improve performance when accessing data, but there is still a performance problem when using indexes to read large amounts of data, compared to column storage. Column storage is ideal for querying a scenario in which the average amount of data accessed is large, and in this case the column storage performance is strong.
Query average access larger than
The average larger than of a query refers to the average number of columns and the proportions of the total columns accessed by all query statements. The smaller the ratio, the more appropriate for column storage. On average, the average larger than access is below the maximum value of 100%. For row storage, whether you access 10% columns or 80% of columns, because the data storage is stored in rows, all need to read out the entire row of data, virtually bring invalid I/O. Column storage, which is stored in columns, can read only the columns it needs to access, thereby reducing I/O and improving performance.
The proportion of the query statement
The proportion of the query statement is the proportion of the query statement and all statements, all of which include adding, deleting, changing, checking, etc. Column storage is more appropriate for OLAP applications, so the proportion of query statements is critical to successful use of column storage. The closer to 100% means the more appropriate for column storage.
Table size
The same data is not necessarily the same size in database tables, because the current database supports compression. Table size directly reflects the compression ratio. The larger the table, the greater the amount of data that the table has. The more data volume, the more effective compression. From an implementation perspective, column storage implementations are more efficient than row storage, so using column storage takes up less space.
Data access throughput
Column storage because of the high compression efficiency, and can reduce unnecessary column data read, so, after the column storage use, unit time, the amount of data needs to be accessed, thus reducing the data access throughput.
Query Statement Response time
Query response time is one of the most direct criteria for determining whether a column store is valid or not. Compare the response time of the row store and column store query statement to see if the conversion to the column store is valid. After a row store is converted to a column store, the response time for a query statement on a table that is appropriate for the column store is reduced significantly.