Row and column storage

Source: Internet
Author: User

A traditional row database stores a complete set of data rows in a data page. In this way, the following problems occur when you query large data volumes

1, in the case of no index, will be all out of a row, the query will use a lot of IO2, although the index and materialized views can be quickly ranked, but also need to spend a lot of time but if you need to process the query to use most of the data columns, this way on disk IO is more efficient.

In general, OLTP (online Transaction processing, on-line transaction processing) applications are appropriate for this approach.

A query of an OLAP type might require access to millions of or even billions of rows of data, and the query tends to concern only a few columns of data. For example, for the top 20 products that sell the highest this year, this query only cares about three data columns: Time (date), item (item), and sales (sales amount). Other data columns of the product, such as the URL of the product, the description of the product, the store to which it belongs, are meaningless for this query.

For example, a column database holds the values of the same data column together. When you insert a data row, the values for each data column of that row are also stored in different places. The column database in the example above only needs to read the data columns that store "time, goods, and sales", while the row database needs to read all the data columns. Therefore, the column-type database greatly improves the efficiency of OLAP large data volume query. Of course, a column database is not omnipotent, and each time a data row is read, the values of each data column are read separately from each other and then merged together to form a data row. Therefore, a column database does not apply if the amount of data involved in each query is small or if the majority of queries require the entire row of data.

  

Many of the column databases also support column groups (called Locality group in the columns group,bigtable system) to store the values of multiple data columns that are frequently accessed together. If the data column being read belongs to the same column group, the column database can read the values of multiple data columns at once from the same place, avoiding the merging of multiple columns of data. A column group is a row-and-line hybrid storage model that can meet both OLTP and OLAP query requirements.

Because of the high degree of data duplication in the same data column, there is a significant advantage in column database compression. For example, Google bigtable column database compression can be more than 15 times times the compression rate. In addition, specific index optimizations can be made for column storage. For example, the sex column has only two values, "male" and "female", which can be used to create a bitmap index on this column:

As shown, "male" corresponds to a bitmap of 100101, indicating that the 1th, 4, 6 row values are "male", "female" corresponds to a bitmap of 011010, which means that the 2nd, 3, 5 row values are "female". If you need to find the number of males or females, you only need to count the number of occurrences of 1 in the corresponding bitmap. In addition, after the establishment of bitmap index 0 and 1 of the high repetition, you can use a special encoding method to compress it.

  

The following conclusions are drawn:

Columnstore: Each column is stored separately and the data is indexed.

Only access to the columns involved, if we want to access a single column (such as name) will be quite quick.

A row of data contains one column or more columns, and each column is a single cell to store the data. Row storage, however, is to store a row of data as a whole.

In the world of Hana, there is not only columnstore, but also row-type storage. So the reader does not have to ask

When should I use row-type storage? When should I use column storage?

Row-based storage is recommended if you spend most of your time focusing on the contents of the entire table, rather than just a few columns, and you are concerned that you do not need to pass any aggregation operations. The reason is that refactoring each row of data (that is, the decompression process) is a heavy burden for Hana.

column storage, for example, you are concerned about a few columns of content, or have frequent aggregation needs, through the aggregation after the data analysis of the table.

The details are summarized as follows:

Choose Hana Columnstore

When the calculation is based on a column or a smaller column

Always focus on a table of several columns rather than the whole table data

When a data table has a very high number of columns

When a data table has very many rows of data and requires a clustered operation

There is a lot of duplicate data in the Data table column, which is advantageous for highly compressed

Choose Hana Row-type storage

Focus on the entire table or need to update the data frequently

Need to read entire rows of data frequently

No aggregation operations required, or fast query requirements

There are not many data rows in the data table itself

Data table columns themselves have too many unique data

HBase storage Google BigTable's open source implementation HDFs and HBase comparison 1, have good fault tolerance and Scalability 2, HDFs for batch processing, but does not support random lookups, not suitable for incremental data, does not support data update hbase is a good addition to HDFs Table Features 1, large: can store billions of rows, million columns 2, no mode: the same table can have different columns 3, sparse: Empty columns do not occupy storage space 4, multiple versions: Each cell has a version number if stored by row:

Store by column

Storage mode

Put table name, Rowkey, column family: column name, value

Logical View:

Data Model:

The HBase schema can have multiple tables, each consisting of multiple column family with a dynamic column

Physical model Each column family is stored in a separate file on HDFs Rowkey and version have a copy of the control not saved in each column family, the placeholder does not have

Row and column storage

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.