The SequenceFile compression type (CompressionType) takes one of three values, NONE, RECORD, or BLOCK, specified by the configuration item io.seqfile.compression.type:
NONE: records are not compressed.
RECORD: only values are compressed, each record's value separately.
BLOCK: sequences of records are compressed together in blocks; compression is performed once the buffered key and value bytes reach a threshold, specified by the configuration item io.seqfile.compress.blocksize (default 1000000 bytes).
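The practical difference between RECORD and BLOCK can be sketched with plain java.util.zip, without the Hadoop classes themselves: compressing many small values individually pays per-stream zlib overhead and gives the compressor no cross-record context, while compressing them together in one block does. This is only an analogy of the two modes, not SequenceFile's actual code; the class and method names below are mine.

```java
import java.util.Random;
import java.util.zip.Deflater;

public class RecordVsBlock {
    static final int RECORDS = 1000;
    static final int RECORD_SIZE = 200;

    // Compress data with a fresh DEFLATE stream and return the compressed size.
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // Deterministic low-entropy test values (bytes 0..15), RECORDS x RECORD_SIZE.
    static byte[][] makeValues() {
        Random rnd = new Random(42);
        byte[][] values = new byte[RECORDS][RECORD_SIZE];
        for (byte[] v : values) {
            for (int j = 0; j < v.length; j++) {
                v[j] = (byte) rnd.nextInt(16);
            }
        }
        return values;
    }

    // RECORD-style: each value compressed in its own stream.
    static int perRecordTotal() {
        int total = 0;
        for (byte[] v : makeValues()) {
            total += deflatedSize(v);
        }
        return total;
    }

    // BLOCK-style: all values concatenated and compressed in one stream.
    static int blockTotal() {
        byte[][] values = makeValues();
        byte[] block = new byte[RECORDS * RECORD_SIZE];
        for (int i = 0; i < values.length; i++) {
            System.arraycopy(values[i], 0, block, i * RECORD_SIZE, RECORD_SIZE);
        }
        return deflatedSize(block);
    }

    public static void main(String[] args) {
        System.out.println("RECORD-style total: " + perRecordTotal() + " bytes");
        System.out.println("BLOCK-style total:  " + blockTotal() + " bytes");
    }
}
```

On data like this, the block-style total comes out noticeably smaller, which matches the benchmark results further down where BLOCK produces smaller files than RECORD.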
The compression algorithm used by RECORD and BLOCK is determined by the CompressionOption specified when the SequenceFile.Writer is created. The CompressionOption carries the compression codec, which defaults to org.apache.hadoop.io.compress.DefaultCodec when not specified; its underlying compression library is zlib. Other codecs (GzipCodec, Lz4Codec, SnappyCodec, Bzip2Codec) are not compared here.
When DefaultCodec implements zlib compression, it can use either libhadoop.so (the native library provided by the Hadoop framework) or the java.util.zip library. The following shows how the Hadoop native library or the Java zip library gets selected.
SequenceFile uses the org.apache.hadoop.io.compress.DefaultCodec compression method by default, which uses the DEFLATE compression algorithm.
When DefaultCodec creates its compressor, it calls ZlibFactory.getZlibCompressor(conf), whose implementation is the following snippet:

    return (isNativeZlibLoaded(conf)) ?
        new ZlibCompressor(conf) :
        new BuiltInZlibDeflater(ZlibFactory.getCompressionLevel(conf).compressionLevel());
When the native zlib library can be loaded, the ZlibCompressor class is used; otherwise the BuiltInZlibDeflater class is used, which is implemented on top of the Java java.util.zip.Deflater class.
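Since BuiltInZlibDeflater is essentially a thin subclass of java.util.zip.Deflater, the pure-Java fallback path can be approximated without Hadoop on the classpath. The sketch below is illustrative only (the class and method names are mine, not Hadoop's); it compresses with Deflater at a chosen level and verifies the roundtrip with Inflater:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PureJavaZlib {
    // Compress with java.util.zip.Deflater at the given level,
    // mirroring what the non-native (builtin) path does.
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a zlib stream with java.util.zip.Inflater.
    static byte[] decompress(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hadoop sequencefile zlib roundtrip ".repeat(100).getBytes("UTF-8");
        byte[] packed = compress(data, Deflater.DEFAULT_COMPRESSION);
        byte[] unpacked = decompress(packed);
        System.out.println("original=" + data.length
                + " compressed=" + packed.length
                + " roundtrip ok=" + java.util.Arrays.equals(data, unpacked));
    }
}
```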
isNativeZlibLoaded checks whether the NativeCodeLoader class has loaded the Hadoop native library; the code is as follows:
    // Try to load native hadoop library and set fallback flag appropriately
    if (LOG.isDebugEnabled()) {
      LOG.debug("Trying to load the custom-built native-hadoop library...");
    }
    try {
      System.loadLibrary("hadoop");
      LOG.debug("Loaded the native-hadoop library");
      nativeCodeLoaded = true;
    } catch (Throwable t) {
      // Ignore failure to load
      if (LOG.isDebugEnabled()) {
        LOG.debug("Failed to load native-hadoop with error: " + t);
        LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
      }
    }
    if (!nativeCodeLoaded) {
      LOG.warn("Unable to load native-hadoop library for your platform... " +
          "using builtin-java classes where applicable");
    }
Here System.loadLibrary("hadoop") searches for libhadoop.so on Linux.
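The same load-and-fallback check can be reproduced standalone. On a machine without libhadoop.so on java.library.path, the load typically fails and the method below returns false, mirroring how NativeCodeLoader leaves its flag unset. The class name NativeProbe is mine, not Hadoop's:

```java
public class NativeProbe {
    // Attempt to load a native library the way NativeCodeLoader does,
    // returning success/failure instead of setting a static flag.
    static boolean tryLoad(String name) {
        try {
            System.loadLibrary(name); // searches java.library.path for lib<name>.so on Linux
            return true;
        } catch (Throwable t) {
            // Mirror NativeCodeLoader: swallow the failure and fall back.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean loaded = tryLoad("hadoop");
        System.out.println("native-hadoop loaded: " + loaded);
        if (!loaded) {
            System.out.println("java.library.path=" + System.getProperty("java.library.path"));
        }
    }
}
```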
Summary: when the native Hadoop library cannot be loaded, Hadoop compresses the SequenceFile with the java.util.zip.Deflater class; when the native library can be loaded, the native library is used.
The following compares the performance difference between using and not using the native Hadoop library.
Without the native Hadoop library, the JVM runtime parameter java.library.path does not include the native library path:

java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

With the native Hadoop library, the native library path is appended:

java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:$HADOOP_HOME/lib/native
Virtual machine cluster:

500,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:
Native lib disabled: 32689 ms, 114.07 MB after compression
Native lib enabled: 30625 ms, 114.07 MB after compression

500,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes:
Native lib disabled: 11354 ms, 101.17 MB after compression
Native lib enabled: 10699 ms, 101.17 MB after compression

Physical machine cluster:

500,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:
Native lib disabled: 21953 ms, 114.07 MB after compression
Native lib enabled: 24742 ms, 114.07 MB after compression

1,000,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:
Native lib disabled: 48555 ms, 228.14 MB after compression
Native lib enabled: 45770 ms, 228.14 MB after compression

1,000,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes, zlib compression level BEST_SPEED:
Native lib disabled: 44872 ms, 228.14 MB after compression
Native lib enabled: 51582 ms, 228.14 MB after compression

1,000,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes, zlib compression level BEST_SPEED:
Native lib disabled: 14374 ms, 203.54 MB after compression
Native lib enabled: 14639 ms, 203.54 MB after compression

1,000,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes, zlib compression level DEFAULT_COMPRESSION:
Native lib disabled: 15397 ms, 203.54 MB after compression
Native lib enabled: 13669 ms, 203.54 MB after compression
Analyzing the test results, the conclusion is:

Across different compression types, different data volumes, and different zlib compression levels, there is little performance difference between compressing with the Hadoop native library and compressing with the Java zip library.
Other native compression codecs (GzipCodec, Lz4Codec, SnappyCodec, Bzip2Codec) will be tried later.
Analysis of the compression types and compression libraries of Hadoop SequenceFile.Writer