From NSM to Parquet: the evolution of storage structures


To optimize the performance of the various tools built on top of MapReduce (MR), a number of storage formats have emerged alongside Hadoop's built-in ones: RCFile, which optimizes Hive's performance, and Parquet, which together with Impala implements the features of Google's Dremel (in some respects even a superset of them), among others. Today, let's study the evolution of data storage in HDFS.

Data Placement Structure

Data placement structure, as the name implies, is how data is placed and stored in HDFS. This placement structure is very important for query tools such as Hive, because it has a direct impact on the implementation and performance of the Hive query engine. From Hive's perspective, the data placement structure defines how the logical view of a relational table in Hive is mapped onto the physical storage of HDFS block data.


At a higher level, data placement matters not only for distributed systems such as HDFS, but also for traditional databases, NoSQL stores, and other systems.


In general, there are three types of data placement structures:

• Horizontal row storage structure

• Vertical column storage structure

• Hybrid storage structure

Here's a look at the pros and cons of these three storage approaches.

Horizontal row storage structure

Row storage is the most traditional storage layout; the classic model is NSM (the N-ary Storage Model), and its advantages and disadvantages are both obvious. The advantage is that data loads very quickly, because all of a row's data is stored together, and it adapts well to a variety of dynamic workloads. The downsides are:

• Because reading unnecessary columns cannot be avoided, it cannot provide high-performance queries over massive amounts of data.

• It is hard to achieve high compression, because columns of different data types are stored intermixed.

A typical implementation of an NSM page consists of a page header, a body (the row data), and a trailer (the offset of each row within the current page). This also points to another disadvantage: the cache is filled with useless data, because unneeded columns are loaded on every access.
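To make that layout concrete, here is a minimal sketch of an NSM-style slotted page in Java; the class and field names are hypothetical rather than any particular engine's implementation. Rows are appended into the body, and the trailer records each row's starting offset, so locating a row is cheap, but reading it always pulls in every column.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of an NSM (n-ary storage model) page:
// rows grow forward from the header; the trailer holds one offset per row.
public class NsmPage {
    private final ByteBuffer body;                            // serialized rows, back to back
    private final List<Integer> trailer = new ArrayList<>();  // row start offsets

    public NsmPage(int pageSize) {
        this.body = ByteBuffer.allocate(pageSize);
    }

    // Append a whole row (all columns together, as NSM requires).
    public boolean appendRow(byte[] row) {
        int trailerBytes = (trailer.size() + 1) * 4; // 4 bytes per offset slot
        if (body.position() + row.length + trailerBytes > body.capacity()) {
            return false; // page full
        }
        trailer.add(body.position()); // remember where this row starts
        body.put(row);
        return true;
    }

    // Random access: jump via the trailer offset, then read the entire
    // row -- even if the query needs only one of its columns.
    public byte[] readRow(int rowId) {
        int start = trailer.get(rowId);
        int end = (rowId + 1 < trailer.size()) ? trailer.get(rowId + 1) : body.position();
        byte[] row = new byte[end - start];
        ByteBuffer view = body.duplicate();
        view.position(start);
        view.get(row);
        return row;
    }
}
```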


Let's look at two examples of NSM. The first is MySQL's InnoDB engine. InnoDB's tablespace mimics concepts from Oracle, and its page structure (page layout, also called the row format) is close to the NSM model. We focus primarily on the page structure; other row-oriented databases use similar storage structures.



The second is a Hadoop example. The two formats we use most often, TextFile and SequenceFile, are both stored by row. TextFile is convenient to read but wasteful of space, and gzip-compressed text does not support splitting into blocks. Hadoop's own SequenceFile is a binary format that does support splittable compression codecs. Also, when we write a file to HDFS it is divided into blocks; a read pulls in a whole block and then parses it by delimiter, feeding the rows one by one to the map task. (HDFS is designed for streaming large files; there is no random row access as in an RDBMS, so a block needs no trailer section to hold row offsets.)
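As a concrete example, here is a minimal sketch that writes a block-compressed SequenceFile through the standard Hadoop API (Hadoop 2.x-style options; the output path is just an example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

// Write a block-compressed SequenceFile: rows are stored as key/value
// records, and block compression keeps the file splittable.
public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/demo.seq"); // example output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("row-" + i));
            }
        }
    }
}
```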


Vertical column storage structure

Column storage compensates for the two disadvantages of row storage: it reads only the data of the target columns and provides a high compression ratio. But its downsides are:

• Because the data is stored fragmented by column, queries can incur a high tuple reconstruction overhead.

• Updates, deletes, and other modifications to the data are cumbersome.

The typical implementation is the DSM (Decomposition Storage Model), which was proposed as early as 1985.
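Here is a minimal sketch of the DSM idea in Java, with hypothetical names: each attribute becomes its own sub-relation keyed by a surrogate row id, so a scan touches only the columns it needs, while reconstructing a full row means stitching the sub-relations back together.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of DSM: a table (name, age) is decomposed into one sub-relation
// per attribute; the position in each list acts as the surrogate row id.
public class DsmDemo {
    static List<String> names = new ArrayList<>(); // sub-relation: (rowId, name)
    static List<Integer> ages = new ArrayList<>(); // sub-relation: (rowId, age)

    public static void main(String[] args) {
        insert("alice", 30);
        insert("bob", 25);

        // Column scan: only the 'age' sub-relation is read.
        long sum = 0;
        for (int age : ages) sum += age;
        System.out.println("avg age = " + (double) sum / ages.size());

        // Tuple reconstruction: a full row must fetch every sub-relation
        // at the same row id -- the extra "join" overhead DSM pays.
        int rowId = 1;
        System.out.println("row " + rowId + " = (" + names.get(rowId)
                + ", " + ages.get(rowId) + ")");
    }

    static void insert(String name, int age) {
        names.add(name); // every attribute goes to its own column store
        ages.add(age);
    }
}
```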


(Figure: what column storage looks like in HDFS.)


Many NoSQL databases today are column stores, and the most typical example in the Hadoop ecosystem is HBase. (To be added: the HFile structure, and how HBase addresses the performance overhead of column storage.)

Hybrid storage structure

PAX (the Partition Attributes Across model) is a typical hybrid implementation, combining the concrete models of the two traditional storage methods above.


To quote an article:

NSM (N-ary Storage Model). NSM stores all of a record's attributes together, so when a query touches only a few attributes, the cache is filled with data it does not need and cache utilization is low.

DSM (Decomposition Storage Model). The column storage model is not a fresh concept: it was proposed as early as 1985, and around 2005 it became widely applied to data analysis. Data use, especially analytical use, often touches only a subset of the attributes in a record. To reduce IO consumption, the "decomposition storage model" was proposed: DSM vertically partitions a relation into n sub-relations, and an attribute is accessed only when it is needed. For queries involving multiple attributes, extra overhead is required to join the sub-relations back together.

PAX (Partition Attributes Across). PAX is a mixed record layout within a page, combining NSM and DSM to avoid unnecessary accesses to main memory. PAX first stores as many whole records on a page as NSM would; within each page, it then stores values attribute by attribute, DSM-style, in minipages. In sequential scans, PAX makes full use of cache resources. At the same time, all fields of a record stay on the same page, so reconstructing a record involves only the minipages of that page and never crosses pages. For multi-attribute queries, PAX beats DSM, because DSM needs more cross-page reconstruction time.
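Here is a minimal sketch of the PAX idea in Java, with hypothetical names: one page holds a batch of whole records, but inside the page each attribute lives in its own minipage, so a column scan stays cache-friendly and record reconstruction never leaves the page.

```java
// Sketch of a PAX page for a two-attribute relation (name, age):
// the page holds whole records (like NSM), but groups values by
// attribute into minipages (like DSM) inside the page.
public class PaxPage {
    private final String[] nameMinipage; // minipage for attribute 'name'
    private final int[] ageMinipage;     // minipage for attribute 'age'
    private int count = 0;

    public PaxPage(int capacity) {
        this.nameMinipage = new String[capacity];
        this.ageMinipage = new int[capacity];
    }

    public boolean insert(String name, int age) {
        if (count == nameMinipage.length) return false; // page full
        nameMinipage[count] = name; // each attribute goes to its own minipage
        ageMinipage[count] = age;
        count++;
        return true;
    }

    // Column scan: touches only one minipage, so the cache is not
    // polluted with the other attribute's values.
    public long sumAges() {
        long sum = 0;
        for (int i = 0; i < count; i++) sum += ageMinipage[i];
        return sum;
    }

    // Record reconstruction: both minipages are on this same page,
    // so no cross-page work is needed (unlike DSM).
    public String reconstruct(int rowId) {
        return "(" + nameMinipage[rowId] + ", " + ageMinipage[rowId] + ")";
    }
}
```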

In the hybrid storage model, we can view all data as triples of key / value / description (column name). The KV model lets you organize your data's storage in whatever pattern you want: if your application always accesses data by row (for example, it always reads most of one user's data), you can organize data with the same key together (which is effectively NSM); if an application always runs aggregate queries, you can organize data together by description (column name), which is a DSM or PAX implementation.

Record Columnar File (RCFile) applies the PAX storage model to mix row- and column-oriented storage: the data is first partitioned horizontally, then vertically, which guarantees that all columns of the same row stay on the same node.


RCFile is built on HDFS; a table can span multiple blocks, and within each block the data is organized into row groups. Each row group contains a sync marker (used to separate row groups), a metadata header, and the column-stored table data. The metadata header and the table data are compressed independently: the metadata header uses the RLE (run-length encoding) algorithm, while the table data uses the gzip algorithm together with the lazy decompression technique. RCFile supports only appending data.
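As an illustration of lazy decompression, here is a minimal sketch (not RCFile's actual code): each column of a row group stays gzip-compressed until a query actually touches it, so columns the query skips never pay the decompression cost.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of RCFile-style lazy decompression: a column's bytes are kept
// gzip-compressed and only inflated on first access.
public class LazyColumn {
    private final byte[] compressed;
    private byte[] decompressed; // filled on first access only

    public LazyColumn(byte[] raw) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        this.compressed = bos.toByteArray();
    }

    public byte[] get() throws Exception {
        if (decompressed == null) { // decompress lazily, at most once
            try (GZIPInputStream gz =
                    new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                decompressed = gz.readAllBytes();
            }
        }
        return decompressed;
    }

    public static void main(String[] args) throws Exception {
        LazyColumn name = new LazyColumn("alice,bob,carol".getBytes());
        LazyColumn age = new LazyColumn("30,25,41".getBytes());
        // A query on 'age' never pays to decompress 'name'.
        System.out.println(new String(age.get()));
    }
}
```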

Parquet

Parquet is a joint project of Cloudera and Twitter. It implements the data model defined in the Dremel paper, which can represent nested records in a two-dimensional table stored by column, while also supporting row-oriented query engines such as Pig and Hive. Parquet's storage structure is similar to RCFile's: a row group contains multiple columns (column chunks), each column chunk is composed of pages, and each item in a page consists of a repetition level, a definition level, and a value.
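For a feel of the format in practice, here is a minimal sketch that writes a Parquet file through the parquet-avro helper (a parquet-mr 1.x-style builder; the path and schema are just examples):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Write one record to a column-stored Parquet file via an Avro schema.
public class ParquetDemo {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"int\"},"
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        try (ParquetWriter<GenericRecord> writer =
                AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/users.parquet"))
                        .withSchema(schema)
                        .withCompressionCodec(CompressionCodecName.SNAPPY)
                        .build()) {
            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1);
            user.put("name", "alice");
            writer.write(user); // buffered into row groups, written by column
        }
    }
}
```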


Parquet uses a variety of encoding and compression techniques. First, a column with a small number of distinct values can be dictionary encoded (dictionary encoding), for example when there are fewer than 50,000 distinct values; this beats heavyweight algorithms such as gzip, LZO, and Snappy in both compression and speed. In addition, small integers, such as dictionary-encoded column values and the repetition and definition levels, can be bit packed (bit packing), saving them with the fewest bits capable of holding those small integers. Finally, on top of the first two methods, RLE (run-length encoding) can compress further; it works especially well for the definition levels of relatively sparse columns.
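To make the level encoding concrete, here is a minimal sketch, not Parquet's actual implementation, of the two ideas combined: the bit width is the fewest bits that can hold the maximum level, and runs of a repeated level collapse into (run length, value) pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the RLE idea applied to repetition/definition levels:
// a run of identical small integers becomes one (runLength, value) pair,
// and 'value' needs only bitWidth(maxLevel) bits when packed.
public class LevelRleDemo {
    public static void main(String[] args) {
        int[] definitionLevels = {1, 1, 1, 1, 0, 1, 1, 1}; // mostly non-null
        int maxLevel = 1;

        // Bit packing: the fewest bits that can hold any level value.
        int bitWidth = 32 - Integer.numberOfLeadingZeros(Math.max(maxLevel, 1));
        System.out.println("bit width = " + bitWidth); // 1 bit per value here

        // Run-length encode the levels.
        List<int[]> runs = new ArrayList<>();
        int run = 1;
        for (int i = 1; i <= definitionLevels.length; i++) {
            if (i < definitionLevels.length
                    && definitionLevels[i] == definitionLevels[i - 1]) {
                run++;
            } else {
                runs.add(new int[]{run, definitionLevels[i - 1]});
                run = 1;
            }
        }
        for (int[] r : runs) {
            System.out.println("run: length=" + r[0] + " value=" + r[1]);
        }
        // 8 levels collapse to 3 runs; for sparser columns the win is larger.
    }
}
```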


References

1. RCFile: A Fast and Space-Efficient Data Placement Structure

2. A Multi-Resolution Block Storage Model for Database Design

3. Data Page Layouts for Relational Databases on Deep Memory Hierarchies

4. InnoDB Internals: InnoDB File Formats and Source Code Structure

5. Parquet: An Open Columnar Storage for Hadoop
