How Facebook's data warehouse expanded to 300 PB

(Original article: Scaling the Facebook data warehouse to 300 PB; this article is a translation of the original.)

Facebook's storage scalability challenges in the data warehouse are unique. Our Hive-based data warehouse stores more than 300 PB of data and keeps growing by many terabytes every day; the volume of data it holds tripled over the past year. Given this growth trend, storage efficiency is the most important concern for our data warehouse infrastructure, both now and going forward.

We have made many innovations to improve the storage efficiency of the data warehouse: building a dedicated data center for cold data, applying RAID-like techniques in HDFS to keep data highly available while reducing the replication overhead, and compressing data before it is written to HDFS to shrink its footprint. Hive is the most widely used system at Facebook for transforming large volumes of raw logs; Facebook's Hive is a query engine built on the Corona Map-Reduce framework that processes data and builds the data warehouse tables. In this article we discuss the evolution of the Hive table storage format, with the main goal of making Hive tables compress as efficiently as possible.

RCFile

When data in our data warehouse is loaded into a table, the first storage format we used was Facebook's own Record-Columnar File format (RCFile). RCFile is a hybrid row-column storage format: it supports row-based access while providing the compression efficiency of column storage. The core idea is to split a Hive table horizontally into row groups and then split each row group vertically by column, so that the data of each column within a row group is stored contiguously on disk.

When all the columns of a row group have been written to disk, RCFile compresses the data column by column using a codec such as zlib or LZO. When reading, it applies a lazy decompression policy: if a query touches only some of the columns in a table, RCFile skips the decompression and deserialization of the columns that are not needed. Measured on a representative sample of tables in our data warehouse, RCFile delivers roughly a 5x compression ratio.
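As a rough illustration of this idea (not RCFile's actual on-disk format), the sketch below splits rows into row groups, stores each column of a group contiguously, compresses each column independently with zlib, and decompresses only the columns a query actually touches. All names and sizes here are hypothetical.

```python
import json
import zlib

def write_row_groups(rows, columns, rows_per_group=4):
    """Hypothetical RCFile-like layout: horizontal row groups,
    each stored column by column and compressed per column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        group = {
            col: zlib.compress(json.dumps([r[col] for r in chunk]).encode())
            for col in columns
        }
        groups.append(group)
    return groups

def read_columns(groups, wanted_columns):
    """Lazy decompression: only the requested columns are inflated."""
    result = {col: [] for col in wanted_columns}
    for group in groups:
        for col in wanted_columns:
            result[col].extend(json.loads(zlib.decompress(group[col])))
    return result

rows = [{"user_id": i, "country": "US" if i % 2 else "BR", "spend": i * 1.5}
        for i in range(10)]
groups = write_row_groups(rows, ["user_id", "country", "spend"])
# A query over 'country' alone never decompresses 'user_id' or 'spend'.
print(read_columns(groups, ["country"]))
```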

Beyond RCFile, what is the next step?

As the volume of data in the warehouse kept growing, engineers on the team began to investigate techniques for improving compression efficiency. The focus was on column-level encodings such as run-length encoding, dictionary encoding, and frame-of-reference (numeric) encoding, which reduce logical redundancy at the column level before general-purpose compression is applied. We also experimented with new column types (for example, JSON is widely used inside Facebook, and storing JSON in a structured form both supports efficient queries and reduces the redundancy of storing JSON metadata repeatedly). Our experiments showed that, used appropriately, column-level encodings can significantly improve the compression ratio over plain RCFile.
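For intuition, here is a minimal, hypothetical sketch of the three column-level encodings named above, applied to small in-memory columns; real implementations operate on serialized byte streams inside the file writer.

```python
def run_length_encode(values):
    """[5, 5, 5, 9, 9] -> [(5, 3), (9, 2)]"""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

def dictionary_encode(values):
    """['US', 'BR', 'US'] -> (['US', 'BR'], [0, 1, 0])"""
    dictionary, ids, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        ids.append(index[v])
    return dictionary, ids

def frame_of_reference_encode(values):
    """Store a base plus small deltas: [1000, 1003, 1001] -> (1000, [0, 3, 1])"""
    base = min(values)
    return base, [v - base for v in values]

print(run_length_encode([5, 5, 5, 9, 9]))
print(dictionary_encode(["US", "BR", "US", "US"]))
print(frame_of_reference_encode([1000, 1003, 1001]))
```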

At the same time, Hortonworks was exploring similar ways to improve the Hive storage format. Hortonworks' engineering team designed and implemented ORCFile (the storage format together with its read/write interfaces), and that work helped us design and implement a new storage format for Facebook's data warehouse.

ORCFile

When Hive data is written to disk in the ORCFile format, it is divided into a series of large stripes; a stripe plays a role similar to a row group in RCFile. Within each stripe, ORCFile first encodes the data of each column and then compresses the whole stripe, column by column, with the zlib compression algorithm. String columns are dictionary encoded, and all rows within a stripe are encoded together for each column. Each stripe also stores an index entry for every 10,000 rows, recording the minimum and maximum value of each column; during filter-based queries these statistics allow ranges of rows that cannot match to be skipped.
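A hypothetical sketch of how per-range min/max statistics let a reader skip data during a filtered scan (the field names below are illustrative, not ORCFile's actual metadata layout):

```python
ROWS_PER_INDEX_ENTRY = 10_000  # matches the index granularity described above

def build_index(column_values):
    """Record min/max for every ROWS_PER_INDEX_ENTRY rows of one column."""
    index = []
    for start in range(0, len(column_values), ROWS_PER_INDEX_ENTRY):
        chunk = column_values[start:start + ROWS_PER_INDEX_ENTRY]
        index.append({"start": start, "min": min(chunk), "max": max(chunk)})
    return index

def ranges_to_scan(index, predicate_low, predicate_high):
    """Keep only the row ranges whose [min, max] can overlap the predicate."""
    return [e for e in index
            if e["max"] >= predicate_low and e["min"] <= predicate_high]

values = list(range(100_000))           # a monotonically increasing column
idx = build_index(values)
# WHERE col BETWEEN 42_000 AND 43_500 touches only one 10,000-row range.
print(ranges_to_scan(idx, 42_000, 43_500))
```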

Beyond the compression gains, a significant advantage of the new format is that columns and rows are located by offsets, so the end of a row no longer needs to be marked with a delimiter. In RCFile, by contrast, certain ASCII values are reserved as delimiters and therefore cannot appear in the data stream. In addition, the query engine can use the file-level stripe and per-column metadata to optimize query execution.

Adaptive column Encoding

When we started testing ORCFile in the data warehouse, we found that some Hive tables compressed well while others actually grew in size, so on a representative set of tables the overall compression gain was not significant. If a column's data has high entropy, dictionary encoding causes it to expand, so dictionary encoding every string column by default is inappropriate. There are two ways to decide whether a column should be dictionary encoded: statically, through column metadata specified by the user, or dynamically, by examining the column values at run time and choosing the encoding then. We chose the latter because it is compatible with the large number of tables that already exist in our data warehouse.

We ran many tests looking for ways to maximize the compression ratio without hurting ORCFile write performance. Since strings dominate our largest tables, and roughly 80% of the columns in the data warehouse are strings, optimizing their compression mattered most. We modified the ORCFile writer to set a threshold on the number of distinct values a column may have within a stripe, and to dictionary encode the column for that stripe only when doing so improves compression. We also sample the column values and inspect the character set they use: if the character set is small, a general-purpose compressor such as zlib already achieves a good compression ratio, so dictionary encoding is unnecessary and can even be counterproductive.
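The decision might look roughly like the sketch below; the thresholds and function names are hypothetical and chosen only to illustrate the two checks described above.

```python
def should_dictionary_encode(column_values,
                             max_distinct_ratio=0.5,   # hypothetical threshold
                             small_charset_size=16):   # hypothetical threshold
    """Decide per stripe whether a string column is worth dictionary encoding."""
    distinct = set(column_values)
    # Too many distinct values: the dictionary itself would bloat the stripe.
    if len(distinct) > max_distinct_ratio * len(column_values):
        return False
    # Small character set: zlib alone already compresses this column well.
    charset = set()
    for v in column_values:
        charset.update(v)
    if len(charset) <= small_charset_size:
        return False
    return True

print(should_dictionary_encode(["New York", "São Paulo", "Mumbai", "Tokyo"] * 1000))
# True: few distinct values, varied characters, so the dictionary pays off.
print(should_dictionary_encode([f"id-{i}" for i in range(4000)]))
# False: high entropy, dictionary encoding would expand the data.
print(should_dictionary_encode(["US", "BR", "IN", "DE"] * 1000))
# False: tiny character set, general-purpose compression already handles it.
```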

For integer data we considered both run-length encoding and dictionary encoding. In most cases run-length encoding is only slightly better than the general-purpose compressor, but when a column consists of a small number of distinct values, dictionary encoding wins. Based on this result, we use dictionary encoding rather than run-length encoding for integer data as well. Applying these encodings to both string and numeric columns gives ORCFile a high compression ratio.
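A small, hypothetical back-of-the-envelope comparison of the effect described above: for an integer column drawn from only a few distinct values but without long runs, a dictionary plus small codes is far more compact than run lengths. The byte costs assumed below are illustrative.

```python
def rle_size(values):
    """Approximate run-length size: one (value, count) pair per run, 8 bytes each."""
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    return runs * 8

def dict_size(values):
    """Approximate dictionary size: 8 bytes per distinct value plus 1 byte per code."""
    return len(set(values)) * 8 + len(values)

# Few distinct values, interleaved so runs stay short (e.g. status codes).
column = [200, 404, 200, 500, 200, 404] * 10_000
print("run-length bytes:", rle_size(column))   # 480,000: every value starts a new run
print("dictionary bytes:", dict_size(column))  # 60,024: 3 entries plus one 1-byte code per value
```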

We tested many other ways to raise the compression ratio. One idea worth mentioning is adaptive run-length encoding: a heuristic that applies run-length encoding only when it actually improves compression. In the open-source version of ORCFile, this policy is part of a scheme that chooses among several encoding algorithms for integer data. It improves the compression ratio, but at the cost of write performance. We also studied how stripe size affects the compression ratio and, somewhat unexpectedly, found that enlarging the stripes does not increase it much: as the stripe grows, the dictionary gains more entries, which increases the number of bytes needed for each encoded column value. As a result, the savings from storing fewer dictionaries with larger stripes are smaller than expected.

Write Performance

Because at our scale the speed of writing data affects query performance, we made many improvements to the open-source ORCFile writer to raise write throughput. The key ones were eliminating redundant or unnecessary work and optimizing memory usage.

The most important improvement to the ORCFile writer concerns how the sorted dictionary is built. The open-source writer keeps the dictionary sorted by storing it in a red-black tree, so even for a column that turns out to be unsuitable for dictionary encoding, every new key costs O(log(n)) to insert. Storing the dictionary in a memory-efficient hash map instead, and sorting it only when actually needed, cut the dictionary's memory footprint by 30% and, more importantly, improved write performance by 1.4x. To allow the dictionary to be resized quickly, it was originally backed by an array of byte arrays, but dictionary elements are accessed very frequently, so we switched to the Slice class from the Airlift library for its efficient memory copying, which improved write performance by a further 20%-30%.
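The idea can be sketched as follows (the class and method names are hypothetical; the real writer is Java code operating on byte slices): keep the dictionary in a hash map during writes and sort the keys only when a stripe is flushed with dictionary encoding, instead of paying for a sorted structure on every insert.

```python
class StripeDictionary:
    """Hypothetical sketch: unsorted hash map during writes, sorted only at flush."""

    def __init__(self):
        self._ids = {}      # key -> dense id, assigned in insertion order
        self._keys = []     # id -> key

    def add(self, key):
        """O(1) average insert, instead of O(log n) into a sorted tree."""
        if key not in self._ids:
            self._ids[key] = len(self._keys)
            self._keys.append(key)
        return self._ids[key]

    def flush_sorted(self):
        """Sort once, only if this stripe ends up dictionary encoded.
        Returns the sorted keys and a remapping from old ids to sorted ids."""
        order = sorted(range(len(self._keys)), key=lambda i: self._keys[i])
        remap = {old_id: new_id for new_id, old_id in enumerate(order)}
        return [self._keys[i] for i in order], remap

d = StripeDictionary()
codes = [d.add(v) for v in ["US", "BR", "US", "IN", "BR"]]
sorted_keys, remap = d.flush_sorted()
print(sorted_keys, [remap[c] for c in codes])  # ['BR', 'IN', 'US'] [2, 0, 2, 1, 0]
```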

Deciding the encoding for every column of every stripe consumes computing resources. Since the values of a given column tend to repeat across stripes, we found little benefit in re-evaluating dictionary encoding for every stripe. We therefore changed the writer to decide a column's encoding from a subset of stripes and to reuse that decision for the corresponding column in subsequent stripes; in other words, if the writer concludes that dictionary encoding does not pay off for a column, it skips dictionary encoding for that column in the stripes that follow. Finally, because of the compression improvements in Facebook's version of ORCFile, we could lower the zlib compression level we had been using, which improved write performance by another 20%.
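A hypothetical sketch of that policy: evaluate the encoding on the first few stripes of a column and lock in the result for the rest, instead of re-running the check per stripe. Here `decide` stands in for a per-stripe heuristic like the one sketched earlier, and the names and sample size are illustrative.

```python
def choose_column_encodings(stripes, columns, decide, sample_stripes=2):
    """Decide each column's encoding from the first `sample_stripes` stripes,
    then reuse that decision for all remaining stripes of the column."""
    decisions = {}
    for col in columns:
        sample = [v for stripe in stripes[:sample_stripes] for v in stripe[col]]
        decisions[col] = "DICTIONARY" if decide(sample) else "DIRECT"
    return decisions

# Toy data: 'country' repeats heavily, 'request_id' is nearly unique.
stripes = [
    {"country": ["US", "BR", "US"] * 100, "request_id": [f"r{i}" for i in range(300)]},
    {"country": ["IN", "US", "IN"] * 100, "request_id": [f"r{i+300}" for i in range(300)]},
    {"country": ["US"] * 300,             "request_id": [f"r{i+600}" for i in range(300)]},
]
decide = lambda values: len(set(values)) < 0.5 * len(values)
print(choose_column_encodings(stripes, ["country", "request_id"], decide))
# {'country': 'DICTIONARY', 'request_id': 'DIRECT'}
```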

Read Performance

When discussing read performance, the first thing that comes to mind is lazy column decompression. Consider a query that reads many columns but filters on only one of them. Without lazy decompression, every column is read and decompressed in full. Ideally, only the filter column is decompressed and decoded up front, so the query does not waste time decompressing and decoding data that the filter later discards.

To achieve this, Facebook's ORCFile reader uses the index strides in the ORCFile format to implement lazy decompression and lazy decoding. In the example above, the column carrying the filter condition is fully decompressed and decoded, while for the other columns the reader first consults the corresponding index inside the stripe (operating only on the stripe's metadata) and then decompresses and decodes only the row ranges selected by that index. In our tests, this change made simple filtered (SELECT) queries run three times faster on Facebook's ORCFile than on the open-source version. Facebook's ORCFile also outperforms RCFile on these simple filter queries, since RCFile lacks lazy decoding of the extra data.
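A hypothetical sketch of the reading pattern: the filter column is decoded first, matching positions are grouped by index stride, and only the strides of the other columns that contain matches are decompressed and decoded. The stride size and helper names are illustrative, not the real reader's API.

```python
import json
import zlib

STRIDE = 10_000  # rows per index stride, matching the granularity described earlier

def compress_column(values):
    """Compress a column in independent stride-sized chunks so that
    individual strides can be decompressed on their own."""
    return [zlib.compress(json.dumps(values[i:i + STRIDE]).encode())
            for i in range(0, len(values), STRIDE)]

def lazy_filtered_read(filter_chunks, other_chunks, predicate):
    """Decode the filter column fully, then touch only the strides of the
    other column that contain at least one matching row."""
    results = []
    for stride_no, chunk in enumerate(filter_chunks):
        filter_values = json.loads(zlib.decompress(chunk))
        hits = [i for i, v in enumerate(filter_values) if predicate(v)]
        if not hits:
            continue  # this stride of the other column is never decompressed
        other_values = json.loads(zlib.decompress(other_chunks[stride_no]))
        results.extend(other_values[i] for i in hits)
    return results

country = ["US"] * 30_000 + ["BR"] * 30_000
spend = list(range(60_000))
rows = lazy_filtered_read(compress_column(country), compress_column(spend),
                          lambda c: c == "BR")
print(len(rows), rows[:3])  # 30000 [30000, 30001, 30002]
```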

Summary

With these improvements in place, using ORCFile for data warehouse data improves the compression ratio by 5 to 8 times compared with RCFile. On a set of typical queries and data selected from the warehouse, we also found that Facebook's ORCFile writer is three times faster than the open-source version.

We have already applied the new storage format to many tables holding tens of petabytes of data; converting them from RCFile to ORCFile has reclaimed tens of petabytes of storage capacity. We are rolling the format out to the remaining tables in the data warehouse, which improves both storage efficiency and read/write performance. Facebook's ORCFile code is open source, and we are working closely with the open-source community to merge these improvements into Apache Hive.

What's next?

We have many ideas for further improving ORCFile's compression ratio and read/write efficiency, including support for new compression codecs such as LZ4HC, using different compression algorithms and compression levels for different columns, storing more statistics and exposing them to the query engine, and adopting the open-source community's work on predicate pushdown. There are other improvement ideas as well, such as reducing the logical redundancy between source and derived tables, sampling cold data sets, and adding native Hive data types for values that are commonly needed but currently stored as strings.

Many colleagues in Facebook's Analytics Infrastructure group took part in the ORCFile work, including the authors of this article: Pamela Vagata, Kevin Wilfong, and Sambavi Muthukrishnan. We would also like to thank our colleagues at Hortonworks for their cooperation and help.

Original article: How Facebook's data warehouse expanded to 300 PB. Thanks to the original author for sharing.
