Analysis of the compression type and compression library of Hadoop SequenceFile.Writer


The SequenceFile compression type (CompressionType) has three values, NONE, RECORD, and BLOCK, selected by the configuration item io.seqfile.compression.type (a short configuration fragment follows the list):

NONE: records are not compressed.

RECORD: only the value of each record is compressed, and each value is compressed separately.

BLOCK: sequences of records are compressed together in blocks. Compression is performed once the buffered key and value bytes reach a threshold specified by the configuration item io.seqfile.compress.blocksize, which defaults to 1000000 bytes.
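For reference, both configuration items can be set on a Hadoop Configuration object. A minimal fragment, with example values (BLOCK and the 1000000-byte default threshold):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Compression type: NONE, RECORD, or BLOCK
conf.set("io.seqfile.compression.type", "BLOCK");
// Byte threshold of buffered keys/values before a block is compressed
conf.setInt("io.seqfile.compress.blocksize", 1000000);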


The compression algorithm used by RECORD and BLOCK is determined by the CompressionOption passed when the SequenceFile.Writer is created. The CompressionOption carries the compression codec, which defaults to org.apache.hadoop.io.compress.DefaultCodec when none is specified; its underlying compression library is zlib. Other codecs such as GzipCodec, Lz4Codec, SnappyCodec, and BZip2Codec are not compared here.
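For illustration, a minimal fragment showing the compression option passed to SequenceFile.createWriter with an explicit DefaultCodec; the path and key/value classes are arbitrary examples, not taken from the original:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.DefaultCodec;

Configuration conf = new Configuration();
// CompressionOption: RECORD compression with the zlib-backed DefaultCodec;
// another codec (e.g. GzipCodec) could be passed here instead.
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(new Path("/tmp/example.seq")),   // example path
    SequenceFile.Writer.keyClass(BytesWritable.class),
    SequenceFile.Writer.valueClass(BytesWritable.class),
    SequenceFile.Writer.compression(CompressionType.RECORD, new DefaultCodec()));
writer.close();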

When DefaultCodec performs zlib compression, it can use either libhadoop.so (the native library shipped with the Hadoop framework) or the java.util.zip library. The following explains how Hadoop chooses between the native library and the Java zip library.


SequenceFile uses org.apache.hadoop.io.compress.DefaultCodec by default, which implements the DEFLATE compression algorithm.

DefaultCodec calls ZlibFactory.getZlibCompressor(conf) when creating the compressor; the implementation is the following code snippet:

return (isNativeZlibLoaded(conf)) ?
    new ZlibCompressor(conf) :
    new BuiltInZlibDeflater(ZlibFactory.getCompressionLevel(conf).compressionLevel());

When the native zlib library has been loaded, the ZlibCompressor class is used as the compressor; otherwise the BuiltInZlibDeflater class is used, which is implemented on top of the Java java.util.zip.Deflater class.
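A quick way to see which implementation is in effect is to ask ZlibFactory directly. A small verification sketch (the class name is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;

public class ZlibCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // true when the native hadoop library and its zlib binding have been loaded
    System.out.println("native zlib loaded: " + ZlibFactory.isNativeZlibLoaded(conf));
    // ZlibCompressor when the native path is taken, BuiltInZlibDeflater otherwise
    Compressor compressor = ZlibFactory.getZlibCompressor(conf);
    System.out.println("compressor class: " + compressor.getClass().getName());
  }
}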


isNativeZlibLoaded in turn checks whether the NativeCodeLoader class has loaded the Hadoop native library; the code is as follows:

// Try to load native hadoop library and set fallback flag appropriately
if (LOG.isDebugEnabled()) {
  LOG.debug("Trying to load the custom-built native-hadoop library...");
}
try {
  System.loadLibrary("hadoop");
  LOG.debug("Loaded the native-hadoop library");
  nativeCodeLoaded = true;
} catch (Throwable t) {
  // Ignore failure to load
  if (LOG.isDebugEnabled()) {
    LOG.debug("Failed to load native-hadoop with error: " + t);
    LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
  }
}
if (!nativeCodeLoaded) {
  LOG.warn("Unable to load native-hadoop library for your platform... " +
           "using builtin-java classes where applicable");
}

Here System.loadLibrary("hadoop") searches for libhadoop.so on Linux.
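As a one-line illustration of how the bare name maps to the file name searched for on java.library.path:

// The JVM maps the bare name "hadoop" to the platform-specific file name.
System.out.println(System.mapLibraryName("hadoop"));   // "libhadoop.so" on Linux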

Summary: when the native Hadoop library cannot be loaded, Hadoop compresses the SequenceFile with the java.util.zip.Deflater class; when the native library can be loaded, the native library is used.


The following compares the performance of compression with and without the native Hadoop library.
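The original benchmark code is not included; the following is a minimal sketch of the kind of driver implied by the results below (random 10-byte keys, random 200-byte values, elapsed time measured around the writes). The class name, output path, and record count are placeholders:

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SeqFileCompressionBench {
  public static void main(String[] args) throws Exception {
    int recordCount = 500_000;      // "50w" records, as in the tests below
    Configuration conf = new Configuration();
    Random random = new Random();
    byte[] key = new byte[10];      // random 10-byte key
    byte[] value = new byte[200];   // random 200-byte value

    long start = System.currentTimeMillis();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/tmp/seqfile-bench.seq")),   // placeholder path
        SequenceFile.Writer.keyClass(BytesWritable.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(CompressionType.RECORD, new DefaultCodec()))) {
      for (int i = 0; i < recordCount; i++) {
        random.nextBytes(key);
        random.nextBytes(value);
        writer.append(new BytesWritable(key), new BytesWritable(value));
      }
    }
    System.out.println("elapsed: " + (System.currentTimeMillis() - start) + " ms");
  }
}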


Without the native Hadoop library, the JVM parameter java.library.path does not include the path to the native library:

java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

With the native Hadoop library, the Hadoop native library path is appended:

java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:$HADOOP_HOME/lib/native
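To confirm at runtime which of the two cases applies, the library path and the NativeCodeLoader flag can be printed. A small sketch (the class name is arbitrary):

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeLibCheck {
  public static void main(String[] args) {
    System.out.println("java.library.path = " + System.getProperty("java.library.path"));
    // true only if libhadoop.so was found on java.library.path and loaded successfully
    System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
  }
}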

Virtual machine cluster:

500,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:

Native lib disabled: 32689 ms, 114.07 MB after compression
Native lib enabled: 30625 ms, 114.07 MB after compression

500,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes:

Native lib disabled: 11354 ms, 101.17 MB after compression
Native lib enabled: 10699 ms, 101.17 MB after compression

Physical machine cluster:

500,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:

Native lib disabled: 21953 ms, 114.07 MB after compression
Native lib enabled: 24742 ms, 114.07 MB after compression

1,000,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:

Native lib disabled: 48555 ms, 228.14 MB after compression
Native lib enabled: 45770 ms, 228.14 MB after compression

1,000,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes, zlib compression level BEST_SPEED:

Native lib disabled: 44872 ms, 228.14 MB after compression
Native lib enabled: 51582 ms, 228.14 MB after compression

1,000,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes, zlib compression level BEST_SPEED:

Native lib disabled: 14374 ms, 203.54 MB after compression
Native lib enabled: 14639 ms, 203.54 MB after compression

1,000,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes, zlib compression level DEFAULT_COMPRESSION:

Native lib disabled: 15397 ms, 203.54 MB after compression
Native lib enabled: 13669 ms, 203.54 MB after compression
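For reference, the zlib compression level used in the BEST_SPEED and DEFAULT_COMPRESSION runs above can be set on the Configuration through ZlibFactory before the writer is created. A minimal fragment (the CompressionLevel enum lives in ZlibCompressor):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.zlib.ZlibCompressor.CompressionLevel;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;

Configuration conf = new Configuration();
// Stored in the configuration and read back by ZlibFactory.getCompressionLevel(conf)
ZlibFactory.setCompressionLevel(conf, CompressionLevel.BEST_SPEED);
// ... then pass conf to SequenceFile.createWriter(...) ...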

Analyzing the test results, the conclusions are as follows:

Across the different compression types, data volumes, and zlib compression levels tested, there is little difference between compressing with the Hadoop native library and compressing with the Java zip library.

The other native compression codecs (GzipCodec, Lz4Codec, SnappyCodec, BZip2Codec) will be tried later.



