The SequenceFile compression type (CompressionType) takes one of three values, NONE, RECORD, or BLOCK, specified by the configuration item io.seqfile.compression.type:
NONE: records are not compressed.
RECORD: only values are compressed, each record's value separately.
BLOCK: sequences of records are compressed together in blocks; compression is performed once the buffered key and value bytes reach a threshold, specified by the configuration item io.seqfile.compress.blocksize (default 1000000 bytes).
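The practical difference between RECORD and BLOCK can be sketched with plain java.util.zip, without the Hadoop classes themselves: compressing many small values individually pays per-stream zlib overhead and gives the compressor no cross-record context, while compressing them together in one block does. This is only an analogy of the two modes, not SequenceFile's actual code; the class and method names below are mine.

```java
import java.util.Random;
import java.util.zip.Deflater;

public class RecordVsBlock {
    static final int RECORDS = 1000;
    static final int RECORD_SIZE = 200;

    // Compress data with a fresh DEFLATE stream and return the compressed size.
    static int deflatedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // Deterministic low-entropy test values (bytes 0..15), RECORDS x RECORD_SIZE.
    static byte[][] makeValues() {
        Random rnd = new Random(42);
        byte[][] values = new byte[RECORDS][RECORD_SIZE];
        for (byte[] v : values) {
            for (int j = 0; j < v.length; j++) {
                v[j] = (byte) rnd.nextInt(16);
            }
        }
        return values;
    }

    // RECORD-style: each value compressed in its own stream.
    static int perRecordTotal() {
        int total = 0;
        for (byte[] v : makeValues()) {
            total += deflatedSize(v);
        }
        return total;
    }

    // BLOCK-style: all values concatenated and compressed in one stream.
    static int blockTotal() {
        byte[][] values = makeValues();
        byte[] block = new byte[RECORDS * RECORD_SIZE];
        for (int i = 0; i < values.length; i++) {
            System.arraycopy(values[i], 0, block, i * RECORD_SIZE, RECORD_SIZE);
        }
        return deflatedSize(block);
    }

    public static void main(String[] args) {
        System.out.println("RECORD-style total: " + perRecordTotal() + " bytes");
        System.out.println("BLOCK-style total:  " + blockTotal() + " bytes");
    }
}
```

On data like this, the block-style total comes out noticeably smaller, which matches the benchmark results further down where BLOCK produces smaller files than RECORD.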
The compression algorithm used by RECORD and BLOCK is determined by the CompressionOption specified when the SequenceFile.Writer is created. The CompressionOption carries the compression codec, which defaults to org.apache.hadoop.io.compress.DefaultCodec when not specified; its underlying compression library is zlib. Other codecs (GzipCodec, Lz4Codec, SnappyCodec, Bzip2Codec) are not compared here.
When DefaultCodec implements zlib compression, it can use either libhadoop.so (the native library provided by the Hadoop framework) or the java.util.zip library. The following shows how the Hadoop native library or the Java zip library gets selected.
SequenceFile uses the org.apache.hadoop.io.compress.DefaultCodec compression method by default, which uses the DEFLATE compression algorithm.
When DefaultCodec creates its compressor, it calls ZlibFactory.getZlibCompressor(conf), whose implementation is the following snippet:

    return (isNativeZlibLoaded(conf)) ?
        new ZlibCompressor(conf) :
        new BuiltInZlibDeflater(ZlibFactory.getCompressionLevel(conf).compressionLevel());
When the native zlib library can be loaded, the ZlibCompressor class is used; otherwise the BuiltInZlibDeflater class is used, which is implemented on top of the Java java.util.zip.Deflater class.
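Since BuiltInZlibDeflater is essentially a thin subclass of java.util.zip.Deflater, the pure-Java fallback path can be approximated without Hadoop on the classpath. The sketch below is illustrative only (the class and method names are mine, not Hadoop's); it compresses with Deflater at a chosen level and verifies the roundtrip with Inflater:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PureJavaZlib {
    // Compress with java.util.zip.Deflater at the given level,
    // mirroring what the non-native (builtin) path does.
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a zlib stream with java.util.zip.Inflater.
    static byte[] decompress(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hadoop sequencefile zlib roundtrip ".repeat(100).getBytes("UTF-8");
        byte[] packed = compress(data, Deflater.DEFAULT_COMPRESSION);
        byte[] unpacked = decompress(packed);
        System.out.println("original=" + data.length
                + " compressed=" + packed.length
                + " roundtrip ok=" + java.util.Arrays.equals(data, unpacked));
    }
}
```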
isNativeZlibLoaded checks whether the NativeCodeLoader class has loaded the Hadoop native library; the code is as follows:
    // Try to load native hadoop library and set fallback flag appropriately
    if (LOG.isDebugEnabled()) {
      LOG.debug("Trying to load the custom-built native-hadoop library...");
    }
    try {
      System.loadLibrary("hadoop");
      LOG.debug("Loaded the native-hadoop library");
      nativeCodeLoaded = true;
    } catch (Throwable t) {
      // Ignore failure to load
      if (LOG.isDebugEnabled()) {
        LOG.debug("Failed to load native-hadoop with error: " + t);
        LOG.debug("java.library.path=" + System.getProperty("java.library.path"));
      }
    }
    if (!nativeCodeLoaded) {
      LOG.warn("Unable to load native-hadoop library for your platform... " +
          "using builtin-java classes where applicable");
    }
Here System.loadLibrary("hadoop") searches for libhadoop.so on Linux.
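The same load-and-fallback check can be reproduced standalone. On a machine without libhadoop.so on java.library.path, the load typically fails and the method below returns false, mirroring how NativeCodeLoader leaves its flag unset. The class name NativeProbe is mine, not Hadoop's:

```java
public class NativeProbe {
    // Attempt to load a native library the way NativeCodeLoader does,
    // returning success/failure instead of setting a static flag.
    static boolean tryLoad(String name) {
        try {
            System.loadLibrary(name); // searches java.library.path for lib<name>.so on Linux
            return true;
        } catch (Throwable t) {
            // Mirror NativeCodeLoader: swallow the failure and fall back.
            return false;
        }
    }

    public static void main(String[] args) {
        boolean loaded = tryLoad("hadoop");
        System.out.println("native-hadoop loaded: " + loaded);
        if (!loaded) {
            System.out.println("java.library.path=" + System.getProperty("java.library.path"));
        }
    }
}
```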
Summary: when the native Hadoop library cannot be loaded, Hadoop compresses the SequenceFile with the java.util.zip.Deflater class; when the native library can be loaded, the native library is used.
The following compares the performance difference between using and not using the native Hadoop library.
Without the native Hadoop library, the JVM runtime parameter java.library.path does not include the native library path:

java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib

With the native Hadoop library, the native library path is appended:

java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib:$HADOOP_HOME/lib/native
Virtual machine cluster:

500,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:
Native lib disabled: 32689 ms, 114.07 MB after compression
Native lib enabled: 30625 ms, 114.07 MB after compression

500,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes:
Native lib disabled: 11354 ms, 101.17 MB after compression
Native lib enabled: 10699 ms, 101.17 MB after compression

Physical machine cluster:

500,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:
Native lib disabled: 21953 ms, 114.07 MB after compression
Native lib enabled: 24742 ms, 114.07 MB after compression

1,000,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes:
Native lib disabled: 48555 ms, 228.14 MB after compression
Native lib enabled: 45770 ms, 228.14 MB after compression

1,000,000 records, compression type RECORD, key is 10 random bytes, value is 200 random bytes, zlib compression level BEST_SPEED:
Native lib disabled: 44872 ms, 228.14 MB after compression
Native lib enabled: 51582 ms, 228.14 MB after compression

1,000,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes, zlib compression level BEST_SPEED:
Native lib disabled: 14374 ms, 203.54 MB after compression
Native lib enabled: 14639 ms, 203.54 MB after compression

1,000,000 records, compression type BLOCK, key is 10 random bytes, value is 200 random bytes, zlib compression level DEFAULT_COMPRESSION:
Native lib disabled: 15397 ms, 203.54 MB after compression
Native lib enabled: 13669 ms, 203.54 MB after compression
Analyzing the test results, the conclusion is:

Across different compression types, different data volumes, and different zlib compression levels, there is little performance difference between compressing with the Hadoop native library and compressing with the Java zip library.
Other native compression codecs (GzipCodec, Lz4Codec, SnappyCodec, Bzip2Codec) will be tried later.
Analysis of the compression types and compression libraries of Hadoop SequenceFile.Writer