This article introduces the RCFile storage structure used in Facebook's data analysis system. RCFile combines the advantages of row storage and column storage and plays an important role in large-scale data analysis in the MapReduce environment.
Facebook introduced its data warehouse, Hive, at the 2010 ICDE (IEEE International Conference on Data Engineering). Hive stores massive amounts of data in the Hadoop system and provides a set of database-like mechanisms for storing and processing that data. It uses an SQL-like language to manage and process data automatically: after a statement is parsed and converted, Hive generates Hadoop-based MapReduce tasks and completes the data processing by executing them. Figure 1 shows the structure of the Hive data warehouse system.
Figure 1. System structure of the Hive data warehouse
MapReduce-based data warehouses play an important role in ultra-large-scale data analysis. For typical web service providers, these analyses help them quickly understand dynamic user behavior and changing user needs. The data storage structure is one of the key factors affecting data warehouse performance. The file formats commonly used in Hadoop, such as TextFile for text and SequenceFile for binary data, both belong to the row storage model. The paper "RCFile: A Fast and Space-Efficient Data Placement Structure in MapReduce-Based Warehouse Systems", published by Facebook engineers, introduces an efficient data storage structure, RCFile (Record Columnar File), and applies it to Facebook's data warehouse, Hive. Compared with the storage structures of traditional databases, RCFile more effectively meets the four key requirements of MapReduce-based data warehouses: fast data loading, fast query processing, highly efficient storage space utilization, and strong adaptivity to highly dynamic workload patterns.
Data Warehouse requirements
Based on Facebook's system characteristics and user data analysis, a data warehouse in the MapReduce computing environment has four key requirements for its data storage structure.
Fast data loading
For Facebook's production data warehouse, it is critical to load (write) data quickly. About 20 TB of data is uploaded to Facebook's data warehouse every day. Because network and disk traffic during data loading may interfere with normal query execution, it is necessary to shorten the data loading time.
Fast Query Processing
To meet real-time website requests and support the heavy read load generated by highly concurrent users submitting queries, query response time is critical. This requires the underlying storage structure to maintain high-speed query processing as the number of queries grows.
Highly efficient storage space utilization
Rapidly growing user activity always demands scalable storage capacity and computing power, and limited disk space requires that the storage of massive data be managed sensibly. In practice, this means maximizing disk space utilization.
Strong adaptivity to highly dynamic workload patterns
The same data set is made available to users of different applications and analyzed in a variety of ways. Some analyses are routine processes executed periodically in a fixed pattern, while others are ad-hoc queries issued from internal platforms. Most workloads do not follow any regular pattern. This requires the underlying system to remain highly adaptive to unpredictable, dynamic data processing under limited storage space, rather than being tuned for a few special workload patterns.
MapReduce storage strategies
To design and implement an efficient data storage structure for a MapReduce-based data warehouse, the key challenge is to meet the above four requirements in the MapReduce computing environment. In traditional database systems, three data storage structures have been widely studied: the row store, the column store, and the PAX hybrid store. Each has its own characteristics, but simply porting these database-oriented storage structures to a MapReduce-based data warehouse system does not satisfy all of the requirements well.
Row store
As shown in Figure 2, the advantages of a Hadoop-based row storage structure are fast data loading and strong adaptivity to dynamic workloads, because row storage guarantees that all fields of the same record are located on the same cluster node, that is, in the same HDFS block. However, the disadvantages of the row store are also obvious. It does not support fast query processing: when a query targets only a few columns of a wide table, it cannot skip the columns it does not need. In addition, because values from columns with very different data are stored interleaved, it is difficult for a row store to achieve a high compression ratio, so storage space utilization does not improve much. Although entropy encoding and exploiting column correlations can yield a better compression ratio, the resulting complicated data layout increases decompression overhead.
Figure 2. Example of row storage inside an HDFS block
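To make the row layout concrete, here is a minimal Java sketch (my own illustration, not actual TextFile or SequenceFile code): records are written field by field, one record after another, so a scan that needs only column A still has to read every byte of every record.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal sketch of a row layout: all fields of a record are stored contiguously,
// so a query on one column cannot skip the bytes of the other columns.
public class RowLayoutSketch {
    public static byte[] writeRows(String[][] records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String[] record : records) {      // record after record...
            for (String field : record) {      // ...all fields of that record together
                out.writeUTF(field);
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String[][] table = { {"a1", "b1", "c1", "d1"}, {"a2", "b2", "c2", "d2"} };
        // Reading only column A still means scanning the whole byte stream.
        System.out.println(writeRows(table).length + " bytes, all of which a column-A scan must read");
    }
}
```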
Column store
Figure 3 shows an example of a table stored on HDFS by column group. In this example, columns A and B are stored in the same column group, while columns C and D are each stored in a separate column group. During a query, the column store can avoid reading unnecessary columns and can compress the similar data within a column to achieve a high compression ratio. However, because tuple reconstruction carries a high overhead, it cannot provide fast query processing on a Hadoop system: a column store cannot guarantee that all fields of the same record are stored on the same cluster node. In the example of Figure 3, the four fields of a record are stored in three HDFS blocks located on different nodes, so reconstructing a record causes a large amount of data to be transferred across the cluster network. Although grouping multiple columns together can reduce this overhead, it does not adapt well to highly dynamic workloads: unless all column groups are created in advance based on the possible queries, a query needing an unforeseen combination of columns may have to reconstruct records from two or more column groups. In addition, because columns overlap across groups, column grouping can create redundant copies of column data, which lowers storage utilization.
Figure 3. Example of column-group storage across HDFS blocks
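The following small Java sketch illustrates the column-group idea from Figure 3 (the grouping and the data values are illustrative, not taken from any real system): columns A and B share one group while C and D each get their own, and rebuilding a single record has to touch all three groups, which in a cluster means three HDFS blocks, possibly on different nodes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Each map entry stands for one column group stored in its own HDFS block; the
// value is an array of columns, each column an array of field values.
public class ColumnGroupSketch {
    public static void main(String[] args) {
        Map<String, String[][]> groups = new LinkedHashMap<>();
        groups.put("block 1 (A,B)", new String[][] { {"a1", "a2"}, {"b1", "b2"} });
        groups.put("block 2 (C)",   new String[][] { {"c1", "c2"} });
        groups.put("block 3 (D)",   new String[][] { {"d1", "d2"} });

        // Reconstructing record 0 needs a value from every group -- three
        // cross-block (and potentially cross-node) reads in a real cluster.
        int row = 0;
        StringBuilder record = new StringBuilder();
        for (Map.Entry<String, String[][]> group : groups.entrySet()) {
            for (String[] column : group.getValue()) {
                record.append(column[row]).append(' ');
            }
        }
        System.out.println("record 0 = " + record.toString().trim()); // a1 b1 c1 d1
    }
}
```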
PAX hybrid store
The PAX storage model (used in the Data Morphing storage technique) uses a hybrid layout to improve CPU cache performance. For the fields of a record that come from different columns, PAX places them all on one disk page. Within each disk page, PAX uses a mini-page to store all the fields belonging to each column, and a page header to store pointers to the mini-pages. Like the row store, PAX adapts well to varied, dynamic queries. However, it does not meet the requirements of large distributed systems for high storage space utilization and fast query processing, for three reasons. First, PAX does not compress data; this has little to do with cache optimization, but it is critical for large-scale data processing systems (PAX does, however, make column-dimension data compression possible). Second, PAX cannot improve I/O performance because it does not change the actual page content; this limitation makes it hard to speed up the large-scale data scans behind fast query processing. Third, PAX uses a fixed-size page as the basic unit of data organization; in a massive data processing system, this does not store data fields of widely varying sizes effectively.

This article describes the implementation of the RCFile data storage structure on the Hadoop system. It emphasizes three points. First, a table stored in RCFile is horizontally partitioned into multiple row groups, and each row group is then vertically partitioned so that each column is stored separately. Second, within each row group RCFile applies column-dimension data compression and provides a lazy decompression technique to avoid unnecessary column decompression during query execution. Third, RCFile supports a flexible row group size, which must be chosen as a trade-off between compression performance and query performance.
RCFile design and implementation
The RCFile (Record Columnar File) storage structure follows the design principle of "partition horizontally first, then partition vertically", an idea borrowed from PAX. It combines the advantages of the row store and the column store. First, RCFile guarantees that the data of the same row is located on the same node, so the cost of tuple reconstruction is very low. Second, like a column store, RCFile can apply column-dimension data compression and skip unnecessary column reads. Figure 4 shows an example of RCFile storage within HDFS blocks.
Figure 4. Example of RCFile storage within HDFS blocks
Data format
RCFile is designed and implemented on top of the HDFS distributed file system. As Figure 4 shows, RCFile stores a table in the following data format.
RCFile sits on the HDFS architecture, so a table can occupy multiple HDFS blocks.
Within each HDFS block, RCFile organizes records into row groups. That is, all the records stored in an HDFS block are divided into several row groups. All row groups of a table are the same size, and an HDFS block holds one or more row groups.
A row group consists of three parts. The first part is a synchronization marker at the head of the row group, used to separate two consecutive row groups within an HDFS block. The second part is the row group's metadata header, which stores information about the row group's contents: the number of records in the row group, the number of bytes in each column, and the number of bytes in each field of each column. The third part is the table data section, that is, the actual column-store data, in which all fields of the same column are stored consecutively. As shown in Figure 4, all fields of column A are stored first, then all fields of column B, and so on.
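The sketch below restates this three-part layout as a plain Java value class. The field names are my own shorthand chosen for illustration; they are not classes or fields from the RCFile source code.

```java
// Illustrative value class mirroring the three parts of a row group described above.
public class RowGroupSketch {
    // Part 1: sync marker that separates consecutive row groups inside an HDFS block.
    public byte[] syncMarker;

    // Part 2: metadata header -- how many records the group holds, the total bytes
    // of each column, and the byte length of every individual field in each column.
    public int recordCount;
    public int[] bytesPerColumn;   // one entry per column
    public int[][] fieldLengths;   // [column][record] -> field length in bytes

    // Part 3: table data section -- the actual column-store data, all fields of
    // column A first, then all fields of column B, and so on; each column is
    // compressed independently.
    public byte[][] compressedColumns; // one compressed buffer per column
}
```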
Compression Mode
Within each RCFile row group, the metadata header and the table data section are compressed separately.
All metadata headers are compressed with the RLE (Run-Length Encoding) algorithm. Because the length values of all fields in the same column are stored sequentially in this part, the RLE algorithm can find long runs of repeated values, especially for columns with fixed field lengths.
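A tiny run-length encoder illustrates why the metadata header compresses so well: with a fixed-width column, every per-field length is identical, so a long run collapses into a single (value, count) pair. This is only a sketch of the idea, not RCFile's actual RLE code.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal run-length encoder over field-length values.
public class RleSketch {
    /** Encodes values as (value, runLength) pairs. */
    public static List<int[]> encode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        for (int i = 0; i < values.length; ) {
            int j = i;
            while (j < values.length && values[j] == values[i]) j++;
            runs.add(new int[] { values[i], j - i });
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        int[] fieldLengths = {8, 8, 8, 8, 8, 8}; // fixed-width column: every field is 8 bytes
        // Collapses to a single run: 8 x 6
        for (int[] run : encode(fieldLengths)) {
            System.out.println(run[0] + " x " + run[1]);
        }
    }
}
```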
The table data section is not compressed as a single unit; instead, each column is compressed independently using the gzip algorithm. RCFile uses the relatively heavyweight gzip algorithm to obtain a good compression ratio, and it does not use RLE here because the column data is not sorted. Moreover, thanks to the lazy decompression strategy, RCFile does not need to decompress every column when processing a row group, so gzip's relatively high decompression cost is reduced.
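The sketch below uses the JDK's gzip stream to show the "one compressed unit per column" idea. RCFile itself goes through Hadoop's compression codec interfaces, so this is only an illustration of the layout, not the real code path.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

// Each column of a row group is compressed as an independent unit, so a reader can
// later decompress only the columns a query actually touches.
public class ColumnCompressSketch {
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[][] columns = {
            "a1a2a3a4".getBytes("UTF-8"),   // column A of one row group
            "b1b2b3b4".getBytes("UTF-8"),   // column B of the same row group
        };
        for (byte[] col : columns) {
            System.out.println(col.length + " raw bytes -> " + gzip(col).length + " gzip bytes");
        }
    }
}
```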
Although RCFile currently uses the same compression algorithm for all columns of the table data, it might be better to compress different columns with different algorithms. One piece of future work for RCFile is to choose the best compression algorithm according to each column's data type and value distribution.
Data appending
RCFile does not support arbitrary data writes; it provides only an append interface, because the underlying HDFS currently supports appending data only at the end of a file. The appending process works as follows.
RCFile creates and maintains an in-memory column holder for each column. When a record is appended, all of its fields are scattered, and each field is appended to its corresponding column holder. In addition, RCFile records each field's metadata in the metadata header.
RCFile provides two parameters to control how many records are cached in memory before being flushed to disk: one limits the number of records, the other limits the size of the memory cache.
RCFile first compresses the metadata header and writes it to disk, then compresses each column holder in turn and writes the compressed column holders out as one row group in the underlying file system.
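Putting these steps together, here is a hedged sketch of the append path: one in-memory column holder per column, a record-count limit and a memory limit, and a flush that would emit the buffered columns as one row group. The class and method names are illustrative, not the real RCFile.Writer API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Illustrative appender: buffers records column by column and flushes a row group
// when either the record limit or the memory limit is reached.
public class RowGroupAppenderSketch {
    private final ByteArrayOutputStream[] columnHolders;
    private final int recordLimit;       // flush after this many buffered records
    private final int memoryLimitBytes;  // ...or once the buffers grow past this size
    private int bufferedRecords = 0;

    public RowGroupAppenderSketch(int columns, int recordLimit, int memoryLimitBytes) {
        this.columnHolders = new ByteArrayOutputStream[columns];
        for (int i = 0; i < columns; i++) columnHolders[i] = new ByteArrayOutputStream();
        this.recordLimit = recordLimit;
        this.memoryLimitBytes = memoryLimitBytes;
    }

    /** Scatters one record: field i goes to column holder i. */
    public void append(byte[][] record) throws IOException {
        for (int i = 0; i < record.length; i++) {
            columnHolders[i].write(record[i]);
        }
        bufferedRecords++;
        if (bufferedRecords >= recordLimit || bufferedBytes() >= memoryLimitBytes) {
            flushRowGroup();
        }
    }

    private int bufferedBytes() {
        int total = 0;
        for (ByteArrayOutputStream holder : columnHolders) total += holder.size();
        return total;
    }

    private void flushRowGroup() {
        // A real writer would compress the metadata header, then compress each
        // column holder, and append the row group to the file; here we just reset.
        for (ByteArrayOutputStream holder : columnHolders) holder.reset();
        bufferedRecords = 0;
    }
}
```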
Data reading and lazy decompression
In the MapReduce framework, a mapper processes each row group of an HDFS block sequentially. When processing a row group, RCFile does not need to read the entire row group into memory.
Instead, it reads only the metadata header and the columns required by the given query, so it can skip unnecessary columns and gain the I/O advantage of column storage. For example, if the table TBL (C1, C2, C3, C4) has four columns and the query is "select C1 from TBL where C4 = 1", then for each row group RCFile reads only the contents of columns C1 and C4. After the metadata header and the required column data are loaded into memory, they need to be decompressed. The metadata header is always decompressed and kept in memory until RCFile processes the next row group. However, RCFile does not decompress all of the loaded columns; instead, it uses a lazy decompression technique.
Lazy decompression means that a column is not decompressed in memory until RCFile has determined that the column's data will actually be useful for query execution. Lazy decompression is very useful because queries carry a variety of where conditions. If no record in a row group can satisfy the where condition, RCFile does not decompress the other columns for that row group. For example, in the query above, column C4 is decompressed in every row group; however, if a particular row group contains no C4 field with the value 1, there is no need to decompress column C1 for that row group.
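The sketch below mimics lazy decompression for the query above: the predicate column C4 of a row group is always decompressed, but C1 is decompressed only if at least one record in that group can satisfy the condition. The string-based column encoding is purely illustrative, not how RCFile stores values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Lazy decompression sketch for "select C1 from TBL where C4 = 1".
public class LazyDecompressSketch {
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) { gz.write(s.getBytes("UTF-8")); }
        return out.toByteArray();
    }

    static String gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) out.write(buf, 0, n);
            return out.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        // One row group with its columns compressed independently (comma-separated values).
        byte[] compressedC1 = gzip("x,y,z");
        byte[] compressedC4 = gzip("0,2,5");            // no field equals 1 in this group

        String[] c4 = gunzip(compressedC4).split(",");   // predicate column: always decompressed
        boolean anyMatch = false;
        for (String v : c4) anyMatch |= v.equals("1");

        if (anyMatch) {
            String[] c1 = gunzip(compressedC1).split(","); // only now pay the cost of C1
            System.out.println("projected " + c1.length + " values from C1");
        } else {
            System.out.println("row group skipped: C1 never decompressed");
        }
    }
}
```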
Row group size
I/O performance is RCFile's main focus, so RCFile needs row groups that are large and of adjustable size. The row group size involves the following considerations.
A large row group compresses data more effectively than a small one. However, according to observations of Facebook's daily workloads, once the row group size reaches a certain threshold, increasing it further does not improve the compression ratio under the gzip algorithm.
Enlarging the row group improves compression efficiency and reduces storage consumption, so a small row group is not recommended when shrinking storage space is a strong requirement. Note that once the row group size exceeds 4 MB, the compression ratio stays roughly constant.
Although enlarging the row group reduces the table's storage size, it may hurt read performance, because a larger row group weakens the gains from lazy decompression. It also occupies more memory, which affects other MapReduce jobs running concurrently. Weighing storage space against query efficiency, Facebook chose 4 MB as the default row group size; the parameter is, of course, configurable.
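As a usage note, the hypothetical appender sketch from the "Data appending" section above could be configured with the 4 MB default mentioned here. The record limit below is an arbitrary placeholder; the only number taken from the article is the 4 MB figure.

```java
// Reuses the illustrative RowGroupAppenderSketch class defined earlier.
public class RowGroupSizeExample {
    public static void main(String[] args) throws Exception {
        int fourMb = 4 * 1024 * 1024;   // Facebook's reported default row group size
        RowGroupAppenderSketch writer =
                new RowGroupAppenderSketch(/*columns=*/4, /*recordLimit=*/100_000, /*memoryLimitBytes=*/fourMb);
        writer.append(new byte[][] { "a1".getBytes(), "b1".getBytes(), "c1".getBytes(), "d1".getBytes() });
        // A larger memory limit generally improves the compression ratio but uses
        // more memory and reduces how much work lazy decompression can skip.
    }
}
```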
Summary
This article has briefly introduced the RCFile storage structure, which is widely used in Facebook's data analysis system Hive. First, RCFile offers data loading speed and workload adaptivity equivalent to row storage. Second, RCFile's read optimization avoids unnecessary column reads during table scans; tests show that in most cases it outperforms the other structures. Third, RCFile uses column-dimension compression, which effectively improves storage space utilization.
To improve storage space utilization, data produced by Facebook's product lines has been stored in the RCFile structure since 2010, and data previously stored in the SequenceFile/TextFile structures is also being migrated to the RCFile format. In addition, Yahoo has integrated RCFile into Pig's data analysis system, and RCFile is being used in another Hadoop-based data management system, Howl (http://wiki.apache.org/pig/howl). Moreover, according to the Hive development community, RCFile has been successfully integrated into other MapReduce-based data analysis platforms. There is good reason to believe that RCFile, as a data storage standard, will continue to play an important role in large-scale data analysis in the MapReduce environment.
From: http://blog.csdn.net/wanghai__/article/details/6409680