RCFile and ORC File

RCFile

Previously, I had heard that RCFile can skip unnecessary columns when reading data, instead of reading an entire row and then picking out the required fields. A Hive query such as select a, b from tableA where c = 1 is therefore executed relatively efficiently. Out of curiosity, I read the RCFile paper (RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems) and recorded these notes.

First, big data query and processing have the following requirements:

  1. Fast data loading
  2. Fast data query and processing
  3. Efficient storage that reduces storage space usage
  4. Strong adaptability to dynamic query patterns

In traditional database systems, there are three data storage methods:

  1. Horizontal row storage structure: Row storage keeps an entire row, with all of its columns, together. This is the most common layout. It adapts well to dynamic queries: for example, select a from tableA and select a, b, c, d, e, f, g from tableA have similar overhead, since both read every row and then extract the required columns. Because all data belonging to the same row sits on the same HDFS block, the cost of reconstructing a row is low. However, there are two major weaknesses: a) when a row contains many columns but only a few are needed, all columns must still be read before the needed ones are extracted, which greatly reduces query efficiency; b) when compressing across multiple columns, the compression ratio is limited because different columns have different data types and value ranges.

  2. Vertical column storage structure: Column storage stores each column separately, or groups several columns into a column group. It avoids reading unnecessary columns during query execution, and because values in the same column share one data type and a narrower value range than a mix of columns, the compression ratio can be relatively high. However, row reconstruction is expensive, especially when the columns of a row do not reside on the same HDFS block. For example, if column A is read from the first datanode, column B from the second, and column C from the third, recombining them into rows requires a large amount of network and computing overhead.

  3. Hybrid PAX storage structure: PAX mixes row storage and column storage. It was designed for traditional databases to improve CPU cache utilization and cannot be used directly on HDFS, but RCFile inherits its idea: first partition the data horizontally by row, then store each partition by column.

RCFile Design and Implementation
  1. Data layout: According to the HDFS structure, a table can span multiple HDFS blocks. Within each HDFS block, RCFile organizes data in units called Row Groups: a table consists of multiple Row Groups of the same size, and one HDFS block can hold several Row Groups. Each Row Group has three parts. The first is a sync marker, used to separate two consecutive Row Groups within an HDFS block. The second is the Row Group's metadata header, which records how many rows the Row Group contains, how many bytes each column occupies, and how many bytes each field in each column occupies. The third is the actual data, stored column by column. (A small illustrative sketch of this layout and the write path follows the list.)

  2. Data compression: The metadata header is compressed with RLE (Run Length Encoding). Because the per-field byte counts recorded for a column are often identical and appear repeatedly in sequence, the compression ratio is high. For the actual data, each column is compressed independently.

  3. Data append: RCFile maintains an in-memory buffer for each column, called a column holder. When a record is appended, it is split into its columns and each value is appended to the corresponding column holder; the metadata header is updated at the same time. Memory usage can be controlled by limiting the number of buffered records or the size of the column holders. When either limit is exceeded, the metadata header is compressed first, then each column holder is compressed, and the Row Group is written to HDFS.

  4. Lazy decompression: When processing a Row Group, RCFile reads only the metadata header and the required columns, preserving the I/O advantage of column storage. Furthermore, even a column that appears in the query may be read into memory without being decompressed; it is decompressed only once it is certain that its data is actually needed, hence "lazy decompression". For example, for select a from tableA where b = 1, column b must be decompressed; but if no value of b in a Row Group equals 1, column a of that Row Group never needs to be decompressed, because the whole Row Group is skipped.

  5. Row Group size: If the Row Group is too small, the advantages of column storage cannot be fully exploited, but making it too large also causes problems. First, the experiments in the paper show that once the Row Group exceeds a certain threshold, a larger size hardly improves the compression ratio. Second, a larger Row Group reduces the benefit of lazy decompression. Take select a from tableA where b = 1 as an example: if a large Row Group happens to contain one row with b = 1, column a of that group must be decompressed, and the metadata header is used to locate the value of a in the matching row; with smaller Row Groups, any group in which no row satisfies b = 1 never needs column a decompressed at all. Finally, the paper recommends setting the Row Group size to 4 MB.
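
To make the layout and write path above concrete, here is a minimal, purely illustrative Java sketch. It is not Hive's actual RCFile implementation: the class names (RowGroupWriter, ColumnHolder), the sync-marker constant, and the flush threshold (borrowing the paper's 4 MB suggestion) are invented for this note, and the compression steps are only indicated in comments.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Illustrative model of an RCFile-style row group writer (not Hive's real code).
    public class RowGroupWriter {
        static final int ROW_GROUP_TARGET_BYTES = 4 * 1024 * 1024; // paper's 4 MB suggestion

        // One in-memory buffer per column: the "column holder".
        static class ColumnHolder {
            final ByteArrayOutputStream data = new ByteArrayOutputStream();
            final List<Integer> fieldLengths = new ArrayList<>(); // recorded in the metadata header

            void append(byte[] field) throws IOException {
                data.write(field);
                fieldLengths.add(field.length);
            }
        }

        final ColumnHolder[] holders;
        int bufferedRows = 0;

        RowGroupWriter(int numColumns) {
            holders = new ColumnHolder[numColumns];
            for (int i = 0; i < numColumns; i++) holders[i] = new ColumnHolder();
        }

        // Split a record into columns and append each field to its column holder.
        void append(byte[][] row) throws IOException {
            for (int c = 0; c < holders.length; c++) holders[c].append(row[c]);
            bufferedRows++;
            if (bufferedBytes() >= ROW_GROUP_TARGET_BYTES) flushRowGroup();
        }

        int bufferedBytes() {
            int total = 0;
            for (ColumnHolder h : holders) total += h.data.size();
            return total;
        }

        // Emit sync marker + metadata header + column data, then reset the buffers.
        void flushRowGroup() throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream(); // stands in for the HDFS output stream
            DataOutputStream dos = new DataOutputStream(out);
            dos.writeLong(0xCAFEBABECAFEBABEL);  // sync marker separating consecutive row groups
            dos.writeInt(bufferedRows);          // metadata header: number of rows ...
            for (ColumnHolder h : holders) {
                dos.writeInt(h.data.size());     // ... bytes per column ...
                for (int len : h.fieldLengths) { // ... bytes per field; a real writer RLE-compresses these
                    dos.writeInt(len);
                }
            }
            for (ColumnHolder h : holders) {
                dos.write(h.data.toByteArray()); // actual data, column by column (each column compressed separately)
            }
            for (int i = 0; i < holders.length; i++) holders[i] = new ColumnHolder();
            bufferedRows = 0;
        }
    }

For a query like select a from tableA where c = 1, a reader would use the per-column byte counts in this header to read only the needed columns of each row group, which is exactly the skipping behavior described above.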

ORC File

ORC File (Optimized Row Columnar file) makes a number of optimizations on top of RCFile to overcome some of its limitations; for details, refer to the ORC documentation.
Compared with RCFile, ORC File has the following advantages (a short write sketch using the ORC Java API follows the list):

  • Each task outputs only one file, which reduces the load on the NameNode;
  • Support for complex data types, such as datetime, decimal, and the nested types (struct, list, map, and union);
  • Lightweight index data stored within the file;
  • Block-mode compression based on the data type:
    a. integer columns are encoded with run-length encoding;
    b. string columns are encoded with dictionary encoding;
  • The same file can be read in parallel by multiple independent RecordReaders;
  • Files can be split without scanning for markers;
  • The memory needed for reading and writing is bounded;
  • Metadata is stored with Protocol Buffers, so columns can be added or removed.
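
To illustrate a few of these points concretely (the richer type system and the per-type encodings), here is a hedged write sketch using the Apache ORC Java core API; the file path, the two-column schema, and the generated values are arbitrary assumptions for this note, and the encodings named in the comments are what ORC typically chooses on its own rather than something this code sets.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical two-column schema; ORC also supports struct, list, map, union, decimal, timestamp, ...
            TypeDescription schema = TypeDescription.fromString("struct<a:bigint,b:string>");
            Writer writer = OrcFile.createWriter(new Path("/tmp/tableA.orc"),
                    OrcFile.writerOptions(conf).setSchema(schema));

            VectorizedRowBatch batch = schema.createRowBatch();
            LongColumnVector a = (LongColumnVector) batch.cols[0];   // integer column: typically run-length encoded
            BytesColumnVector b = (BytesColumnVector) batch.cols[1]; // string column: typically dictionary encoded
            for (int r = 0; r < 100_000; r++) {
                int row = batch.size++;
                a.vector[row] = r;
                byte[] value = ("state-" + (r % 50)).getBytes("UTF-8"); // low-cardinality strings favor the dictionary
                b.setRef(row, value, 0, value.length);
                if (batch.size == batch.getMaxSize()) { // flush a full batch to the writer
                    writer.addRowBatch(batch);
                    batch.reset();
                }
            }
            if (batch.size != 0) {
                writer.addRowBatch(batch);
            }
            writer.close(); // writes the file footer and postscript
        }
    }

All rows go through a single Writer, which is also why each task produces only one output file.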
File Structure

An ORC file contains groups of row data called stripes. In addition, the file footer holds auxiliary information, and at the very end of the file a section called the postscript stores the compression parameters and the size of the compressed footer.

By default, a stripe is 250 MB in size. Large stripes allow efficient, sequential reads from HDFS.

The file footer contains the list of stripes in the file, the number of rows per stripe, and the data type of each column. It also contains column-level aggregates such as count, min, max, and sum. (The ORC documentation includes a diagram of this file layout.)
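
A small read-side sketch, again using the Apache ORC Java API, shows how this footer-level metadata (stripe directory, row counts, column statistics) can be inspected; it assumes the hypothetical /tmp/tableA.orc file from the write sketch above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.ColumnStatistics;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.StripeInformation;

    public class OrcFooterSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Reader reader = OrcFile.createReader(new Path("/tmp/tableA.orc"),
                    OrcFile.readerOptions(conf));

            // Footer-level information: schema, row count, compression, stripe directory.
            System.out.println("schema      : " + reader.getSchema());
            System.out.println("rows        : " + reader.getNumberOfRows());
            System.out.println("compression : " + reader.getCompressionKind());
            for (StripeInformation stripe : reader.getStripes()) {
                System.out.println("stripe at offset " + stripe.getOffset()
                        + ", rows=" + stripe.getNumberOfRows()
                        + ", dataLength=" + stripe.getDataLength());
            }

            // Column-level aggregates stored in the footer (count, min/max, sum, ...).
            ColumnStatistics[] stats = reader.getStatistics();
            for (int c = 0; c < stats.length; c++) {
                System.out.println("column " + c + ": " + stats[c]);
            }
        }
    }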

Stripe Structure

Each stripe contains index data, row data, and a stripe footer. The stripe footer holds a directory of stream locations, which is used during table scans.

The index data contains the minimum and maximum values for each column, as well as row positions within each column. The row index entries provide offsets that make it possible to seek to the correct compression block and to the right row within the decompressed block.

Through the row index, many rows can be skipped during a fast read of a stripe, even though the stripe itself is large. By default, an index entry is recorded every 10,000 rows, so rows are skipped in strides of 10,000.

Because predicate filtering can skip so many rows, sorting the table on secondary keys can greatly reduce execution time. For example, if the table's primary partition is the transaction date, the data can be sorted within each partition by state, zip code, and last name, so a search for records from one state skips the records of all other states. A sketch of this kind of predicate-driven skipping follows.
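
As a sketch of how a reader can take advantage of these indexes, here is a hedged example using ORC's SearchArgument (predicate pushdown) through the Java API. The predicate a = 1 stands in for the paper's where b = 1 example, applied to the numeric column of the hypothetical schema used earlier; note that the pushed-down predicate only lets ORC skip whole row groups and stripes whose min/max statistics rule out a match, so the surviving batches can still contain non-matching rows that the caller must filter.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.hadoop.hive.ql.io.sarg.PredicateLeaf;
    import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
    import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;
    import org.apache.orc.OrcFile;
    import org.apache.orc.Reader;
    import org.apache.orc.RecordReader;

    public class OrcPredicateSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Reader reader = OrcFile.createReader(new Path("/tmp/tableA.orc"),
                    OrcFile.readerOptions(conf));

            // Roughly "where a = 1": ORC compares the predicate against the min/max
            // values in the row index and skips row groups that cannot contain a match.
            SearchArgument sarg = SearchArgumentFactory.newBuilder()
                    .startAnd()
                    .equals("a", PredicateLeaf.Type.LONG, 1L)
                    .end()
                    .build();

            RecordReader rows = reader.rows(
                    reader.options().searchArgument(sarg, new String[]{"a"}));

            VectorizedRowBatch batch = reader.getSchema().createRowBatch();
            long batches = 0;
            while (rows.nextBatch(batch)) {
                batches++; // each surviving batch still needs row-level filtering on a
            }
            rows.close();
            System.out.println("batches read after row-group skipping: " + batches);
        }
    }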
