Learning notes: The Hadoop optimization experience of Twitter's Core Data Libraries team

1. Source

Streaming Hadoop performance optimization at scale, lessons learned at Twitter

(Data Platform @ Twitter)

2. Reflections

2.1 Overview

This talk, from Twitter's Core Data Libraries team, describes the performance-analysis approach the team uses for offline Hadoop jobs and the problems and optimizations it uncovered: using the JVM/HotSpot profiler (-Xprof) to analyze the method-invocation costs of Hadoop jobs, the surprisingly high cost of the Hadoop Configuration object, and the high cost of object serialization/deserialization during the sort in the MapReduce phase, together with optimizations drawn from real production scenarios.

It also introduces Apache Parquet, a column-oriented storage format, and shows how column projection and predicate push-down are applied to skip unneeded columns and records, greatly improving compression ratios and serialization/deserialization performance.
Pure substance, no fluff.

2.2 Optimization Summary

1) Profile! (-Xprof) Performance optimization should rely on analysis, not on guesswork!
2) Serialization is expensive, and Hadoop does a lot of it!
3) Choose the storage format (row-oriented or column-oriented) according to the data access pattern!
4) Use column projection.
5) In the MapReduce sort phase, sorting overhead is high; use raw comparators to reduce it.
Note: sorting with ordinary (object-level) comparators triggers serialization/deserialization of the keys.
6) I/O is not necessarily the bottleneck. When necessary, trade more I/O for less CPU computation.

The JVM/HotSpot built-in profiler (-Xprof) has the following advantages (a minimal way to enable it for map/reduce tasks is sketched after this list):
1) Low overhead (it uses stack sampling).
2) It reveals the most expensive method invocations.
3) It writes its results directly to the task logs on standard output (stdout).
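For illustration only (this is my sketch, not code from the talk; -Xprof is a legacy HotSpot flag that newer JDKs no longer ship, and the mapreduce.* property names assume Hadoop 2.x), the profiler can be turned on for the task JVMs roughly like this:

```java
// Minimal sketch: enable HotSpot's -Xprof sampler for map and reduce task JVMs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EnableXprof {
    public static Job newProfiledJob(Configuration conf) throws Exception {
        // Assumption: Hadoop 2.x property names and a JDK that still accepts -Xprof.
        conf.set("mapreduce.map.java.opts", "-Xmx1g -Xprof");
        conf.set("mapreduce.reduce.java.opts", "-Xmx1g -Xprof");
        // The flat profile is printed to stdout when each task JVM exits,
        // so it lands in the per-task stdout log.
        return Job.getInstance(conf, "job-with-xprof");
    }
}
```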

2.3 Configuration objects for Hadoop

1) The overhead of Hadoop's Configuration object is surprisingly high.
2) Conf operations look like cheap HashMap operations, but they are not.
3) The constructor reads, decompresses, and parses an XML file from disk.
4) get() calls trigger regular-expression evaluation and variable substitution.
5) If these methods are called inside a loop, or over and over for each record, the overhead is high. Some Hadoop jobs spend 30% of their time in Configuration-related methods! (A truly unexpected cost; a sketch of hoisting such calls out of the hot path follows.)
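A minimal illustrative sketch (mine, not the talk's code; the property name and mapper types are made up): read Configuration values once in setup() and reuse them, instead of calling get() for every record in map():

```java
// Sketch: hoist Configuration lookups out of the per-record path.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedConfMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private String delimiter;   // cached once per task, not once per record

    @Override
    protected void setup(Context context) {
        // One get() here: the regex evaluation and variable substitution run once.
        delimiter = context.getConfiguration().get("myjob.delimiter", ",");   // hypothetical key
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Anti-pattern: calling context.getConfiguration().get("myjob.delimiter") here, per record.
        String[] fields = value.toString().split(delimiter);
        context.write(new Text(fields[0]), new LongWritable(1L));
    }
}
```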

In short, there is no profile (-xprof) technology, it is impossible to obtain the above insight, can not easily find the opportunity and direction of optimization, need to use the profile technology to know I/O and CPU who is the real bottleneck.

2.4 Compression of intermediate results
    • Xprof revealed that the compression and decompression operations in the spill thread consume a lot of time.
    • Intermediate results are temporary.
    • Replacing LZO level 3 with LZ4 reduced the intermediate data by more than 30%, allowing it to be read faster.
    • This sped some big jobs up by 150%. (A sketch of switching map-output compression to LZ4 follows this list.)
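For illustration only (my sketch; the property names assume Hadoop 2.x and the bundled Lz4Codec), map-output (spill) compression can be switched to LZ4 roughly like this:

```java
// Sketch: compress intermediate (map output / spill) data with LZ4.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Lz4Codec;

public class Lz4MapOutput {
    public static void configure(Configuration conf) {
        // Assumption: Hadoop 2.x property names; older releases use mapred.* equivalents.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      Lz4Codec.class, CompressionCodec.class);
    }
}
```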

2.5 Serialization and deserialization of records are the most expensive operations in a Hadoop job!

2.6 Record serialization is CPU-intensive; compared with it, the I/O is nothing!

2.7 How can the CPU overhead of serialization/deserialization be eliminated or reduced?

2.7.1 Use Hadoop's RawComparator API (to compare serialized elements)

Cost analysis: as the slides show for the map and reduce phases of Hadoop MapReduce, the keys of the map output are deserialized at this stage just so they can be sorted.

These deserialization operations are expensive, especially for complex, non-primitive keys, which are very common.

Hadoop provides the RawComparator API for comparing serialized (raw) data at the byte level.

Unfortunately, you have to implement the custom comparator yourself (a minimal sketch follows).
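As a hedged illustration (my own sketch, not Twitter's code; the key layout of two big-endian longs is an assumption), here is a composite key plus a raw comparator that orders the serialized bytes without deserializing anything:

```java
// Sketch: a composite key serialized as two big-endian longs, plus a raw
// comparator that compares the serialized buffers directly.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class EventKey implements WritableComparable<EventKey> {
    private long userId;
    private long timestamp;

    @Override public void write(DataOutput out) throws IOException {
        out.writeLong(userId);      // 8 bytes, big-endian
        out.writeLong(timestamp);   // 8 bytes, big-endian
    }
    @Override public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        timestamp = in.readLong();
    }
    @Override public int compareTo(EventKey o) {
        int c = Long.compare(userId, o.userId);
        return c != 0 ? c : Long.compare(timestamp, o.timestamp);
    }

    /** Raw comparator: reads the longs straight out of the serialized buffers. */
    public static class Raw extends WritableComparator {
        public Raw() { super(EventKey.class); }
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int c = Long.compare(readLong(b1, s1), readLong(b2, s2));
            return c != 0 ? c : Long.compare(readLong(b1, s1 + 8), readLong(b2, s2 + 8));
        }
    }

    // Register the raw comparator so the shuffle sort never deserializes keys
    // (job.setSortComparatorClass(EventKey.Raw.class) would work as well).
    static { WritableComparator.define(EventKey.class, new Raw()); }
}
```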

Now, provided the data is serialized in a suitable format, the byte streams themselves are easy to compare.
Scala has a very slick API for this, and Scala macros can generate these comparators for tuples, case classes, Thrift objects, primitives, Strings, and similar data structures.

How does that work? First, define a dense, easy-to-compare serialization (byte) format.

Next, generate a comparison method that exploits this data format (an order-preserving encoding of this kind is sketched below).
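The slides show macro-generated Scala; as a stand-in illustration (my own sketch, not the talk's code), the core idea of an order-preserving dense encoding for an (int, String) key is to flip the int's sign bit and store the string's raw UTF-8 bytes last, so that plain unsigned byte comparison of the buffers matches the logical key order:

```java
// Sketch: an order-preserving dense encoding for an (int, String) key.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class OrderedEncoding {
    public static byte[] encode(int id, String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + utf8.length);
        // Flipping the sign bit makes signed int order match unsigned byte order.
        buf.putInt(id ^ 0x80000000);
        // The string is the last field, so no length prefix is needed.
        buf.put(utf8);
        return buf.array();
    }

    /** Unsigned, lexicographic byte comparison; no deserialization required. */
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```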

Benchmarked against the pre-optimization approach: a speed-up of up to 150%!
On to the next optimization!

2.7.2 Use column projection

Do not read the columns you do not need:

1) Use Apache Parquet (a columnar file format).

2) Or, with row-oriented storage, use a custom deserializer that skips the unwanted fields.

In column-oriented storage, an entire column is stored contiguously (unlike row-oriented storage, where a column's values are scattered across the rows).

As the slides illustrate, column-oriented storage keeps values of the same type next to each other, which makes them easy to compress.

With LZO + Parquet, files are 2x smaller!

2.7.3 Apache Parquet

1) Column-based storage, which makes column projection efficient.
2) Columns can be read from disk on demand.
3) More importantly: only the required columns need to be deserialized! (A hedged projection sketch follows this list.)
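The talk used Parquet through its Thrift bindings in Scalding; as a rough stand-in (my own sketch using parquet-mr's example Group API, with a made-up file path and column names), column projection looks roughly like this:

```java
// Sketch: read only two columns of a Parquet file; the remaining columns are
// neither decompressed nor deserialized.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ProjectionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical projection schema: a subset of the file's full schema.
        conf.set(ReadSupport.PARQUET_READ_SCHEMA,
                 "message projected { required int64 user_id; optional binary name (UTF8); }");
        try (ParquetReader<Group> reader =
                 ParquetReader.builder(new GroupReadSupport(), new Path(args[0]))
                              .withConf(conf)
                              .build()) {
            for (Group g = reader.read(); g != null; g = reader.read()) {
                System.out.println(g.getLong("user_id", 0));
            }
        }
    }
}
```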

The benchmark slides show the effect: the fewer columns are read, the bigger Parquet's advantage, and only at around 40 columns does LZO Thrift become the more efficient option.

    • When all columns are read, Parquet is generally slower than row-oriented storage.
    • Parquet is a dense format: its read performance depends on the number of columns in the schema, and even null columns take time to read.
    • The row-oriented (Thrift) format is sparse, so its read performance depends on the number of columns actually present in the data, and nulls cost nothing to read.

Skipping the fields you do not need works as follows:

    • This does not reduce the I/O overhead, since the whole row is still read,
    • but only the fields of interest are decoded into objects,
    • and the CPU time spent decoding a string is much higher than the cost of reading its bytes from disk and skipping over them! (A field-skipping sketch follows this list.)
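As a hedged illustration of the idea (my own sketch using libthrift directly, not Twitter's deserializer; the field id and type are assumptions), a reader can walk a binary-encoded Thrift record, decode only the one field it cares about, and byte-skip the rest:

```java
// Sketch: manually scan a TBinaryProtocol-encoded Thrift struct and decode
// only field id 2 (assumed to be a string), skipping all other fields.
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TField;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.protocol.TProtocolUtil;
import org.apache.thrift.protocol.TType;
import org.apache.thrift.transport.TMemoryInputTransport;

public class SkipFieldsExample {
    public static String readOnlyField2(byte[] serialized) throws Exception {
        TProtocol in = new TBinaryProtocol(new TMemoryInputTransport(serialized));
        String wanted = null;
        in.readStructBegin();
        while (true) {
            TField f = in.readFieldBegin();
            if (f.type == TType.STOP) break;
            if (f.id == 2 && f.type == TType.STRING) {
                wanted = in.readString();          // decode only the field we want
            } else {
                TProtocolUtil.skip(in, f.type);    // hop over everything else at the byte level
            }
            in.readFieldEnd();
        }
        in.readStructEnd();
        return wanted;
    }
}
```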

The slides compare several column-projection schemes: Parquet-Thrift still has plenty of room for optimization, and Parquet is the faster option as long as fewer than roughly 13 columns are selected (the benchmark schema is fairly flat and most columns are populated).

    • You can also use a predicate push-down strategy so that Parquet skips data records that do not satisfy the filter condition.
    • Parquet stores statistics for each chunk of records, so in some scenarios an entire chunk can be skipped just by reading its statistics. (A push-down sketch follows this list.)
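For illustration (my own sketch with parquet-mr's filter2 API; the column name and threshold are made up), a push-down predicate can be attached to a job like this:

```java
// Sketch: predicate push-down with parquet-mr. Non-matching records are dropped
// during the read, and row-group statistics let whole chunks be skipped.
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class PushDownExample {
    public static void configure(Job job) {
        // Keep only records where the (hypothetical) event_ts column exceeds a threshold.
        FilterPredicate pred =
            FilterApi.gt(FilterApi.longColumn("event_ts"), 1500000000L);
        ParquetInputFormat.setFilterPredicate(job.getConfiguration(), pred);
    }
}
```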

Note: in the slides, the left image shows column projection alone, the middle image shows the predicate push-down filter, and the right image shows the two combined. You can see that many fields are skipped, which dramatically improves serialization/deserialization efficiency.

The results of push-down filtering + Parquet are shown in the slides.

2.8 Conclusion

What a great company Twitter is!
The larger the cluster and the more Hadoop jobs it runs, the more noticeable the effect of the optimizations above!

Copyright notice: this is an original article by the blogger and may not be reproduced without permission.
