Learning notes: The Hadoop optimization experience of Twitter's Core Data Libraries team

1. Source

Streaming Hadoop performance optimization at scale, lessons learned at Twitter

(Data Platform @ Twitter)

2. Reflections

2.1 Overview

This talk, from Twitter's Core Data Libraries team, describes the performance-analysis approach the team uses for offline Hadoop jobs and the problems and optimizations it uncovered: using the JVM/HotSpot profiler (-Xprof) to analyze the method-invocation costs of Hadoop jobs, the surprisingly high cost of the Hadoop Configuration object, and the high cost of object serialization/deserialization during the sort in the MapReduce phase, together with optimizations drawn from real production scenarios.

It also introduces Apache Parquet, a column-oriented storage format, and shows how column projection and predicate push-down are applied to skip unneeded columns and records, greatly improving compression ratios and serialization/deserialization performance.
Pure substance, no fluff.

2.2 Optimization Summary

1) Profile! (-Xprof) Performance optimization should rely on analysis, not on guesswork!
2) Serialization is expensive, and Hadoop does a lot of it!
3) Choose the storage format (row-oriented or column-oriented) according to the data access pattern!
4) Use column projection.
5) In the MapReduce sort phase, sorting overhead is high; use raw comparators to reduce it.
Note: sorting with ordinary (object-level) comparators triggers serialization/deserialization of the keys.
6) I/O is not necessarily the bottleneck. When necessary, trade more I/O for less CPU computation.

The JVM/HotSpot built-in profiler (-Xprof) has the following advantages (a minimal way to enable it for map/reduce tasks is sketched after this list):
1) Low overhead (it uses stack sampling).
2) It reveals the most expensive method invocations.
3) It writes its results directly to the task logs on standard output (stdout).
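For illustration only (this is my sketch, not code from the talk; -Xprof is a legacy HotSpot flag that newer JDKs no longer ship, and the mapreduce.* property names assume Hadoop 2.x), the profiler can be turned on for the task JVMs roughly like this:

```java
// Minimal sketch: enable HotSpot's -Xprof sampler for map and reduce task JVMs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EnableXprof {
    public static Job newProfiledJob(Configuration conf) throws Exception {
        // Assumption: Hadoop 2.x property names and a JDK that still accepts -Xprof.
        conf.set("mapreduce.map.java.opts", "-Xmx1g -Xprof");
        conf.set("mapreduce.reduce.java.opts", "-Xmx1g -Xprof");
        // The flat profile is printed to stdout when each task JVM exits,
        // so it lands in the per-task stdout log.
        return Job.getInstance(conf, "job-with-xprof");
    }
}
```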

2.3 Configuration objects for Hadoop

1) The overhead of Hadoop's Configuration object is surprisingly high.
2) Conf operations look like cheap HashMap operations, but they are not.
3) The constructor reads, decompresses, and parses an XML file from disk.
4) get() calls trigger regular-expression evaluation and variable substitution.
5) If these methods are called inside a loop, or over and over for each record, the overhead is high. Some Hadoop jobs spend 30% of their time in Configuration-related methods! (A truly unexpected cost; a sketch of hoisting such calls out of the hot path follows.)
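A minimal illustrative sketch (mine, not the talk's code; the property name and mapper types are made up): read Configuration values once in setup() and reuse them, instead of calling get() for every record in map():

```java
// Sketch: hoist Configuration lookups out of the per-record path.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedConfMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private String delimiter;   // cached once per task, not once per record

    @Override
    protected void setup(Context context) {
        // One get() here: the regex evaluation and variable substitution run once.
        delimiter = context.getConfiguration().get("myjob.delimiter", ",");   // hypothetical key
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Anti-pattern: calling context.getConfiguration().get("myjob.delimiter") here, per record.
        String[] fields = value.toString().split(delimiter);
        context.write(new Text(fields[0]), new LongWritable(1L));
    }
}
```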

In short, there is no profile (-xprof) technology, it is impossible to obtain the above insight, can not easily find the opportunity and direction of optimization, need to use the profile technology to know I/O and CPU who is the real bottleneck.

2.4 Compression of intermediate results
    • Xprof revealed that the compression and decompression operations in the spill thread consume a lot of time.
    • Intermediate results are temporary.
    • Replacing LZO level 3 with LZ4 reduced the intermediate data by more than 30%, allowing it to be read faster.
    • This sped some big jobs up by 150%. (A sketch of switching map-output compression to LZ4 follows this list.)
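For illustration only (my sketch; the property names assume Hadoop 2.x and the bundled Lz4Codec), map-output (spill) compression can be switched to LZ4 roughly like this:

```java
// Sketch: compress intermediate (map output / spill) data with LZ4.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Lz4Codec;

public class Lz4MapOutput {
    public static void configure(Configuration conf) {
        // Assumption: Hadoop 2.x property names; older releases use mapred.* equivalents.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      Lz4Codec.class, CompressionCodec.class);
    }
}
```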

2.5 Serialization and deserialization of records are the most expensive operations in a Hadoop job!

2.6 Record serialization is CPU-intensive; compared with it, the I/O is nothing!

2.7 How can the CPU overhead of serialization/deserialization be eliminated or reduced?

2.7.1 Use Hadoop's RawComparator API (to compare serialized elements)

Cost analysis: as the slides show for the map and reduce phases of Hadoop MapReduce, the keys of the map output are deserialized at this stage just so they can be sorted.

These deserialization operations are expensive, especially for complex, non-primitive keys, which are very common.

Hadoop provides the RawComparator API for comparing serialized (raw) data at the byte level.

Unfortunately, you have to implement the custom comparator yourself (a minimal sketch follows).
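As a hedged illustration (my own sketch, not Twitter's code; the key layout of two big-endian longs is an assumption), here is a composite key plus a raw comparator that orders the serialized bytes without deserializing anything:

```java
// Sketch: a composite key serialized as two big-endian longs, plus a raw
// comparator that compares the serialized buffers directly.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class EventKey implements WritableComparable<EventKey> {
    private long userId;
    private long timestamp;

    @Override public void write(DataOutput out) throws IOException {
        out.writeLong(userId);      // 8 bytes, big-endian
        out.writeLong(timestamp);   // 8 bytes, big-endian
    }
    @Override public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        timestamp = in.readLong();
    }
    @Override public int compareTo(EventKey o) {
        int c = Long.compare(userId, o.userId);
        return c != 0 ? c : Long.compare(timestamp, o.timestamp);
    }

    /** Raw comparator: reads the longs straight out of the serialized buffers. */
    public static class Raw extends WritableComparator {
        public Raw() { super(EventKey.class); }
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int c = Long.compare(readLong(b1, s1), readLong(b2, s2));
            return c != 0 ? c : Long.compare(readLong(b1, s1 + 8), readLong(b2, s2 + 8));
        }
    }

    // Register the raw comparator so the shuffle sort never deserializes keys
    // (job.setSortComparatorClass(EventKey.Raw.class) would work as well).
    static { WritableComparator.define(EventKey.class, new Raw()); }
}
```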

Now, provided the data is serialized in a suitable format, the byte streams themselves are easy to compare.
Scala has a very slick API for this, and Scala macros can generate these comparators for tuples, case classes, Thrift objects, primitives, Strings, and similar data structures.

How does that work? First, define a dense, easy-to-compare serialization (byte) format.

Next, generate a comparison method that exploits this data format (an order-preserving encoding of this kind is sketched below).
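The slides show macro-generated Scala; as a stand-in illustration (my own sketch, not the talk's code), the core idea of an order-preserving dense encoding for an (int, String) key is to flip the int's sign bit and store the string's raw UTF-8 bytes last, so that plain unsigned byte comparison of the buffers matches the logical key order:

```java
// Sketch: an order-preserving dense encoding for an (int, String) key.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class OrderedEncoding {
    public static byte[] encode(int id, String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + utf8.length);
        // Flipping the sign bit makes signed int order match unsigned byte order.
        buf.putInt(id ^ 0x80000000);
        // The string is the last field, so no length prefix is needed.
        buf.put(utf8);
        return buf.array();
    }

    /** Unsigned, lexicographic byte comparison; no deserialization required. */
    public static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```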

Benchmarked against the pre-optimization approach: a speed-up of up to 150%!
On to the next optimization!

2.7.2 Use column projection

Do not read the columns you do not need:

1) Use Apache Parquet (a columnar file format).

2) Or, with row-oriented storage, use a custom deserializer that skips the unwanted fields.

In column-oriented storage, an entire column is stored contiguously (unlike row-oriented storage, where a column's values are scattered across the rows).

As the slides illustrate, column-oriented storage keeps values of the same type next to each other, which makes them easy to compress.

With LZO + Parquet, files are 2x smaller!

2.7.3 Apache Parquet

1) Column-based storage, which makes column projection efficient.
2) Columns can be read from disk on demand.
3) More importantly: only the required columns need to be deserialized! (A hedged projection sketch follows this list.)
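The talk used Parquet through its Thrift bindings in Scalding; as a rough stand-in (my own sketch using parquet-mr's example Group API, with a made-up file path and column names), column projection looks roughly like this:

```java
// Sketch: read only two columns of a Parquet file; the remaining columns are
// neither decompressed nor deserialized.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ProjectionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical projection schema: a subset of the file's full schema.
        conf.set(ReadSupport.PARQUET_READ_SCHEMA,
                 "message projected { required int64 user_id; optional binary name (UTF8); }");
        try (ParquetReader<Group> reader =
                 ParquetReader.builder(new GroupReadSupport(), new Path(args[0]))
                              .withConf(conf)
                              .build()) {
            for (Group g = reader.read(); g != null; g = reader.read()) {
                System.out.println(g.getLong("user_id", 0));
            }
        }
    }
}
```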

The benchmark slides show the effect: the fewer columns are read, the bigger Parquet's advantage, and only at around 40 columns does LZO Thrift become the more efficient option.

    • When all columns are read, Parquet is generally slower than row-oriented storage.
    • Parquet is a dense format: its read performance depends on the number of columns in the schema, and even null columns take time to read.
    • The row-oriented (Thrift) format is sparse, so its read performance depends on the number of columns actually present in the data, and nulls cost nothing to read.

Skipping the fields you do not need works as follows:

    • This does not reduce the I/O overhead, since the whole row is still read,
    • but only the fields of interest are decoded into objects,
    • and the CPU time spent decoding a string is much higher than the cost of reading its bytes from disk and skipping over them! (A field-skipping sketch follows this list.)
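As a hedged illustration of the idea (my own sketch using libthrift directly, not Twitter's deserializer; the field id and type are assumptions), a reader can walk a binary-encoded Thrift record, decode only the one field it cares about, and byte-skip the rest:

```java
// Sketch: manually scan a TBinaryProtocol-encoded Thrift struct and decode
// only field id 2 (assumed to be a string), skipping all other fields.
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TField;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.protocol.TProtocolUtil;
import org.apache.thrift.protocol.TType;
import org.apache.thrift.transport.TMemoryInputTransport;

public class SkipFieldsExample {
    public static String readOnlyField2(byte[] serialized) throws Exception {
        TProtocol in = new TBinaryProtocol(new TMemoryInputTransport(serialized));
        String wanted = null;
        in.readStructBegin();
        while (true) {
            TField f = in.readFieldBegin();
            if (f.type == TType.STOP) break;
            if (f.id == 2 && f.type == TType.STRING) {
                wanted = in.readString();          // decode only the field we want
            } else {
                TProtocolUtil.skip(in, f.type);    // hop over everything else at the byte level
            }
            in.readFieldEnd();
        }
        in.readStructEnd();
        return wanted;
    }
}
```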

The slides compare several column-projection schemes: Parquet-Thrift still has plenty of room for optimization, and Parquet is the faster option as long as fewer than roughly 13 columns are selected (the benchmark schema is fairly flat and most columns are populated).

    • You can also use a predicate push-down strategy so that Parquet skips data records that do not satisfy the filter condition.
    • Parquet stores statistics for each chunk of records, so in some scenarios an entire chunk can be skipped just by reading its statistics. (A push-down sketch follows this list.)
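For illustration (my own sketch with parquet-mr's filter2 API; the column name and threshold are made up), a push-down predicate can be attached to a job like this:

```java
// Sketch: predicate push-down with parquet-mr. Non-matching records are dropped
// during the read, and row-group statistics let whole chunks be skipped.
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class PushDownExample {
    public static void configure(Job job) {
        // Keep only records where the (hypothetical) event_ts column exceeds a threshold.
        FilterPredicate pred =
            FilterApi.gt(FilterApi.longColumn("event_ts"), 1500000000L);
        ParquetInputFormat.setFilterPredicate(job.getConfiguration(), pred);
    }
}
```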

Note: in the slides, the left image shows column projection alone, the middle image shows the predicate push-down filter, and the right image shows the two combined. You can see that many fields are skipped, which dramatically improves serialization/deserialization efficiency.

The results of push-down filtering + Parquet are shown in the slides.

2.8 Conclusion

What a great company Twitter is!
The larger the cluster and the more Hadoop jobs it runs, the more noticeable the effect of the optimizations above!

Copyright notice: this is an original article by the blogger and may not be reproduced without permission.
