Learning About the Hadoop Compression/Decompression Architecture


The compression/decompression module is another major part of the Hadoop Common IO module. In everyday work we rarely deal with compression tools directly; our mental model usually stops at familiar formats such as zip, gzip, and bzip2, each with its own trade-off between compression ratio and speed. After reading this article, you should have a working picture of the Hadoop compression framework.

Compression is crucial for data transfer, and it also greatly improves storage efficiency. The compression algorithms currently supported in Hadoop are: 1. gzip; 2. bzip2; 3. snappy; 4. default, the system's default algorithm (DefaultCodec, based on zlib). Each of these tools is exposed through an object called CompressionCodec. Let's take a look at this class:

/**
 * This class encapsulates a streaming compression/decompression pair.
 */
public interface CompressionCodec {

  CompressionOutputStream createOutputStream(OutputStream out) throws IOException;

  CompressionOutputStream createOutputStream(OutputStream out,
                                             Compressor compressor) throws IOException;

  Class<? extends Compressor> getCompressorType();

  Compressor createCompressor();

  CompressionInputStream createInputStream(InputStream in) throws IOException;

  CompressionInputStream createInputStream(InputStream in,
                                           Decompressor decompressor) throws IOException;

  Class<? extends Decompressor> getDecompressorType();

  Decompressor createDecompressor();

  String getDefaultExtension();
}
This interface defines quite a few methods. I group them into two categories:

One is the construction of Compressor and Decompressor objects;

the other is the construction of CompressionInputStream and CompressionOutputStream, the compressed input and output streams (both categories appear in the sketch below).
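
To show how the two categories fit together, here is a minimal write-side sketch. It is my own illustration, not from the article; the output file name is hypothetical, and DefaultCodec is used because it falls back to a pure-Java deflater when native libraries are absent:

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // instantiate the codec reflectively, as Hadoop itself does
    CompressionCodec codec =
        ReflectionUtils.newInstance(DefaultCodec.class, conf);
    Compressor compressor = codec.createCompressor();      // first category
    try (OutputStream out = codec.createOutputStream(      // second category
             new FileOutputStream("demo" + codec.getDefaultExtension()),
             compressor)) {
      out.write("hello hadoop".getBytes());
    }
  }
}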

In fact, the two categories are closely related, because many operations of the compressed input and output streams are themselves implemented on top of the compressors and decompressors. Each concrete compression algorithm extends these base types. The class hierarchy is fairly large (the source article showed a class diagram here):


We can see that each Codec subclass supplies its compressor/decompressor implementations and the constructors for its compressed input and output streams. The codec classes themselves are stored in a codec factory and accessed through a unified interface:

public class CompressionCodecFactory {

  public static final Log LOG =
      LogFactory.getLog(CompressionCodecFactory.class.getName());

  /**
   * A map from the reversed filename suffixes to the codecs.
   * This is probably overkill, because the maps should be small, but it
   * automatically supports finding the longest matching suffix.
   * All the codec classes are kept in this map; CompressionCodec is the
   * base type, and inherited subclasses can be added.
   */
  private SortedMap<String, CompressionCodec> codecs = null;

During initialization, the desired compression algorithm types can be added according to the Configuration:

/**
 * Find the codecs specified in the config value io.compression.codecs
 * and register them. Defaults to gzip and zip.
 * Initializes the codec factory according to the Configuration;
 * the gzip and default codec classes are added when nothing is configured.
 */
public CompressionCodecFactory(Configuration conf) {
  codecs = new TreeMap<String, CompressionCodec>();
  List<Class<? extends CompressionCodec>> codecClasses = getCodecClasses(conf);
  if (codecClasses == null) {
    // no codec classes configured: add GzipCodec and DefaultCodec
    addCodec(new GzipCodec());
    addCodec(new DefaultCodec());
  } else {
    Iterator<Class<? extends CompressionCodec>> itr = codecClasses.iterator();
    while (itr.hasNext()) {
      CompressionCodec codec = ReflectionUtils.newInstance(itr.next(), conf);
      addCodec(codec);
    }
  }
}
A compression codec can then be fetched from the factory, looked up by file name. This resembles the Flyweight pattern: codec objects are reused rather than recreated.

/**
 * Find the relevant compression codec for the given file based on its
 * filename suffix.
 * @param file the filename to check
 * @return the codec object
 */
public CompressionCodec getCodec(Path file) {
  CompressionCodec result = null;
  if (codecs != null) {
    String filename = file.getName();
    String reversedFilename = new StringBuffer(filename).reverse().toString();
    SortedMap<String, CompressionCodec> subMap = codecs.headMap(reversedFilename);
    if (!subMap.isEmpty()) {
      String potentialSuffix = subMap.lastKey();
      // fetch the corresponding CompressionCodec
      if (reversedFilename.startsWith(potentialSuffix)) {
        result = codecs.get(potentialSuffix);
      }
    }
  }
  return result;
}
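
To make the factory concrete, here is a minimal usage sketch. It is my own illustration, not from the article: the io.compression.codecs property and the codec class names are as in the Hadoop codebase, but the file path is hypothetical.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // optional: configure the codec list explicitly
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
        + "org.apache.hadoop.io.compress.GzipCodec,"
        + "org.apache.hadoop.io.compress.BZip2Codec");
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    Path path = new Path("/data/input.gz");          // hypothetical file
    CompressionCodec codec = factory.getCodec(path); // matched via the ".gz" suffix
    FileSystem fs = FileSystem.get(conf);
    try (InputStream in = (codec == null)
            ? fs.open(path)                             // unknown suffix: raw bytes
            : codec.createInputStream(fs.open(path))) { // decompress on the fly
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        System.out.write(buf, 0, n);
      }
    }
  }
}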
 
The following describes how the compressors and decompressors are implemented. Of the algorithms listed above, I have picked zlib/gzip as the example underlying implementation. Compressors all start by implementing the following interface:

/**
 * Specification of a stream-based 'Compressor' which can be
 * plugged into a {@link CompressionOutputStream} to compress data.
 * This is modelled after {@link java.util.zip.Deflater}.
 */
public interface Compressor {

  /**
   * Sets input data for compression.
   * This should be called whenever needsInput() returns true,
   * indicating that more input data is required.
   * (Feeds in the data to be compressed.)
   *
   * @param b Input data
   * @param off Start offset
   * @param len Length
   */
  public void setInput(byte[] b, int off, int len);

  /**
   * Returns true if the input data buffer is empty and
   * setInput() should be called to provide more input.
   * (Determines whether more data can be fed into the buffer.)
   */
  public boolean needsInput();

  /**
   * Sets a preset dictionary for compression. A preset dictionary
   * is used when the history buffer can be predetermined.
   *
   * @param b Dictionary data bytes
   * @param off Start offset
   * @param len Length
   */
  public void setDictionary(byte[] b, int off, int len);

  /**
   * Returns the number of uncompressed bytes input so far.
   */
  public long getBytesRead();

  /**
   * Returns the number of compressed bytes output so far.
   */
  public long getBytesWritten();

  /**
   * When called, indicates that compression should end
   * with the current contents of the input buffer.
   * (Marks the end of input.)
   */
  public void finish();

  /**
   * Returns true if the end of the compressed data output stream
   * has been reached, i.e. no un-drained compressed data remains.
   */
  public boolean finished();

  /**
   * Fills the specified buffer with compressed data. Returns the actual
   * number of bytes of compressed data. A return value of 0 indicates that
   * needsInput() should be called in order to determine if more input
   * data is required.
   * (Compresses the buffered input and copies it into the output buffer.)
   *
   * @param b Buffer for the compressed data
   * @param off Start offset of the data
   * @param len Size of the buffer
   * @return The actual number of bytes of compressed data
   */
  public int compress(byte[] b, int off, int len) throws IOException;

  /**
   * Resets the compressor so that a new set of input data can be processed.
   */
  public void reset();

  /**
   * Closes the compressor and discards any unprocessed input.
   * Typically called at the very end.
   */
  public void end();

  /**
   * Prepares the compressor to be used in a new stream with settings
   * defined in the given Configuration.
   *
   * @param conf Configuration from which new settings are fetched
   */
  public void reinit(Configuration conf);
}
Every method here is critical, because the subsequent compression and decompression operations are built on these functions. Let's look first at the compression-related fields of the zlib implementation:

public class ZlibCompressor implements Compressor {

  // default direct buffer size: 64 KB
  private static final int DEFAULT_DIRECT_BUFFER_SIZE = 64 * 1024;

  // HACK - Use this as a global lock in the JNI layer
  private static Class clazz = ZlibCompressor.class;

  private long stream;

  /**
   * The compression level: can favor lossless best compression,
   * or fast compression for efficiency.
   */
  private CompressionLevel level;

  /**
   * The compression strategy, e.g. the common Huffman coding,
   * the FILTERED strategy, or others.
   */
  private CompressionStrategy strategy;

  /**
   * The compressed header format, which usually carries a checksum;
   * NO_HEADER can be chosen as well.
   */
  private final CompressionHeader windowBits;
There are also buffer fields, defining the uncompressed buffer, the compressed buffer, and so on:

private int directBufferSize;
private byte[] userBuf = null;
private int userBufOff = 0, userBufLen = 0;

// uncompressed data buffer
private Buffer uncompressedDirectBuf = null;
private int uncompressedDirectBufOff = 0, uncompressedDirectBufLen = 0;

// compressed data buffer
private Buffer compressedDirectBuf = null;

// end-of-input flag and end-of-compression flag
private boolean finish, finished;
The default ZlibCompressor is constructed as follows:

/**
 * Creates a new compressor with the default compression level.
 * Compressed data will be generated in ZLIB format.
 * In the default constructor, the compression level and strategy
 * are all default values.
 */
public ZlibCompressor() {
  this(CompressionLevel.DEFAULT_COMPRESSION,
       CompressionStrategy.DEFAULT_STRATEGY,
       CompressionHeader.DEFAULT_HEADER,
       DEFAULT_DIRECT_BUFFER_SIZE);
}
This delegates to a larger overloaded constructor:

public ZlibCompressor(CompressionLevel level, CompressionStrategy strategy,
                      CompressionHeader header, int directBufferSize) {
  this.level = level;
  this.strategy = strategy;
  this.windowBits = header;
  stream = init(this.level.compressionLevel(),
                this.strategy.compressionStrategy(),
                this.windowBits.windowBits());

  // set the direct buffer size, 64 * 1024 bytes by default
  this.directBufferSize = directBufferSize;

  // allocate two direct buffers of the same 64 KB size
  uncompressedDirectBuf = ByteBuffer.allocateDirect(directBufferSize);
  compressedDirectBuf = ByteBuffer.allocateDirect(directBufferSize);

  // move the compressed buffer's position to its limit,
  // so it initially reports no readable data
  compressedDirectBuf.position(directBufferSize);
}
The key point above is the pair of 64 KB direct buffers. From this we can infer the compression pipeline: user input is copied into uncompressedDirectBuf, the compress() method moves it, compressed, into compressedDirectBuf, and from there it is finally copied out to the caller's buffer.
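
One detail worth isolating is that final position(directBufferSize) call in the constructor: moving the position up to the limit makes remaining() report 0, which the rest of the class treats as "no compressed data pending". A standalone demonstration of just that Buffer behavior (my own sketch, plain JDK, no Hadoop involved):

import java.nio.Buffer;
import java.nio.ByteBuffer;

public class PositionTrickDemo {
  public static void main(String[] args) {
    Buffer buf = ByteBuffer.allocateDirect(64 * 1024);
    // freshly allocated: position = 0, limit = capacity
    System.out.println(buf.remaining()); // 65536
    // mimic the constructor: push position up to the limit
    buf.position(64 * 1024);
    System.out.println(buf.remaining()); // 0 -> "nothing to drain"
  }
}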

The steps for compressing a file are as follows:

1. Call setInput() to feed in the data to be compressed.

2. Call needsInput() to check whether more data can be fed. If not, call compress() to drain the already-compressed data before feeding any more.

3. Repeat the two steps above until all input has been supplied, then call finish() to mark the end of input.

4. Finally, keep calling compress() to drain the compressed data, until finished() reports that nothing remains in the compressed buffer.
(Flowchart of the compression loop not preserved in the source.)
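
In code form, the loop a caller runs against any Compressor looks roughly like this. This is my own sketch of the protocol just described, not Hadoop's CompressorStream; the buffer sizes and stream parameters are assumptions:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.Compressor;

public class CompressorDriver {
  public static void compressStream(Compressor compressor,
                                    InputStream rawIn,
                                    OutputStream compressedOut) throws IOException {
    byte[] inBuf = new byte[4096];
    byte[] outBuf = new byte[4096];
    int read;
    while ((read = rawIn.read(inBuf)) != -1) {
      compressor.setInput(inBuf, 0, read);      // step 1: feed input
      while (!compressor.needsInput()) {        // step 2: drain before feeding more
        int n = compressor.compress(outBuf, 0, outBuf.length);
        compressedOut.write(outBuf, 0, n);
      }
    }
    compressor.finish();                        // step 3: mark end of input
    while (!compressor.finished()) {            // step 4: drain what remains
      int n = compressor.compress(outBuf, 0, outBuf.length);
      compressedOut.write(outBuf, 0, n);
    }
  }
}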

With the process clear, we can see how each step is actually implemented. First, the setInput() method:

public synchronized void setInput(byte[] b, int off, int len) {
  if (b == null) {
    throw new NullPointerException();
  }
  if (off < 0 || len < 0 || off > b.length - len) {
    throw new ArrayIndexOutOfBoundsException();
  }

  // record the user buffer in the member variables
  this.userBuf = b;
  this.userBufOff = off;
  this.userBufLen = len;
  setInputFromSavedData();

  // Reinitialize zlib's output direct buffer:
  // reset the compressed buffer's limit and position
  compressedDirectBuf.limit(directBufferSize);
  compressedDirectBuf.position(directBufferSize);
}
The key call inside it is setInputFromSavedData():

synchronized void setInputFromSavedData() {
  uncompressedDirectBufOff = 0;
  // update the uncompressed buffer length to the user's buffer length
  uncompressedDirectBufLen = userBufLen;
  if (uncompressedDirectBufLen > directBufferSize) {
    // cap it at the direct buffer size
    uncompressedDirectBufLen = directBufferSize;
  }

  // Reinitialize zlib's input direct buffer
  uncompressedDirectBuf.rewind();
  // copy the user data into uncompressedDirectBuf
  ((ByteBuffer) uncompressedDirectBuf).put(userBuf, userBufOff,
                                           uncompressedDirectBufLen);

  // Note how much data is being fed to zlib:
  // advance the user buffer offset
  userBufOff += uncompressedDirectBufLen;
  // and shrink the remaining user buffer length
  userBufLen -= uncompressedDirectBufLen;
}
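
To see the offset/length bookkeeping concretely, suppose a caller hands setInput() 100 KB at once. A toy trace of the updates, using plain ints rather than real buffers (my own illustration):

public class SavedDataTrace {
  public static void main(String[] args) {
    final int directBufferSize = 64 * 1024;
    int userBufOff = 0;
    int userBufLen = 100 * 1024; // 100 KB handed to setInput() in one call
    while (userBufLen > 0) {
      // what setInputFromSavedData() stages per round
      int staged = Math.min(userBufLen, directBufferSize);
      userBufOff += staged;
      userBufLen -= staged;
      System.out.println("staged " + staged + " bytes, " + userBufLen + " left");
    }
    // prints: staged 65536 bytes, 36864 left
    //         staged 36864 bytes, 0 left
  }
}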
So the user data is copied into uncompressedDirectBuf, capped at the direct buffer size; any leftover stays in the user buffer for a later round. Next, let's look at how needsInput() decides whether more input can be accepted:

public boolean needsInput() {
  // Consume remaining compressed data?
  if (compressedDirectBuf.remaining() > 0) {
    // data remains in the compressed buffer;
    // it must be drained before new input is accepted
    return false;
  }

  // Check if zlib has consumed all input
  if (uncompressedDirectBufLen <= 0) {
    // Check if we have consumed all user-input
    if (userBufLen <= 0) {
      // all previously buffered data has been handed over
      return true;
    } else {
      // saved user data remains; stage the next chunk
      setInputFromSavedData();
    }
  }
  return false;
}
It goes through several layers of checks. The first gate is that any data still sitting in the compressed buffer must be drained before new input is accepted, and any previously saved user input must be consumed as well. Reading this code dispels a misconception found in some materials: that needsInput() simply checks whether the uncompressed buffer is full. That sounds plausible, but it is not what the code does; if it were just a fullness check, the second branch would be meaningless and the method would simply return false.

Next comes the end-of-input marker; the operation is simple:

public synchronized void finish() {
  // set the end-of-input flag to true
  finish = true;
}
Then comes the key compress() operation:

public synchronized int compress(byte[] b, int off, int len)
    throws IOException {
  if (b == null) {
    throw new NullPointerException();
  }
  if (off < 0 || len < 0 || off > b.length - len) {
    throw new ArrayIndexOutOfBoundsException();
  }

  int n = 0;

  // Check if there is compressed data left in the compressed buffer
  n = compressedDirectBuf.remaining();
  if (n > 0) {
    n = Math.min(n, len);
    // drain it into the caller's output buffer
    ((ByteBuffer) compressedDirectBuf).get(b, off, n);
    return n;
  }

  // Re-initialize zlib's output direct buffer:
  // the compressed buffer is empty, so reset it
  compressedDirectBuf.rewind();
  compressedDirectBuf.limit(directBufferSize);

  // Compress data: the native method compresses the data in the
  // uncompressed buffer and writes it into the compressed buffer
  n = deflateBytesDirect();
  compressedDirectBuf.limit(n);

  // Get at most 'len' bytes
  n = Math.min(n, len);
  // copy the compressed data out of the compressed buffer
  ((ByteBuffer) compressedDirectBuf).get(b, off, n);

  return n;
}
There are many operations here, but they reduce to two cases:

1. If there is still data in the compressed buffer, drain compressedDirectBuf into the caller's output buffer b, and the call ends there.

2. If the compressed buffer is empty, call deflateBytesDirect() to compress the pending input, then perform case 1 to drain the freshly compressed data, and the call ends.

Looking up the key deflateBytesDirect() method, we find this:

private native static void initIDs();
private native static long init(int level, int strategy, int windowBits);
private native static void setDictionary(long strm, byte[] b, int off,
                                         int len);
private native int deflateBytesDirect();
private native static long getBytesRead(long strm);
private native static long getBytesWritten(long strm);
private native static void reset(long strm);
private native static void end(long strm);

So we arrive at a set of native methods: the lower-level implementation is reached through JNI. Even without reading the C code, we can reasonably infer that deflateBytesDirect() moves data between the two direct buffers while compressing it. The last method to examine determines whether compression has ended:

public synchronized boolean finished() {
  // Check if 'zlib' says it's 'finished' and
  // all compressed data has been consumed:
  // i.e. compression has ended and no un-drained data remains
  return (finished && compressedDirectBuf.remaining() == 0);
}
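
Since the Compressor Javadoc says it is modelled after java.util.zip.Deflater, the same setInput/finish/deflate/finished cycle can be watched in pure JDK code, with no JNI plumbing of our own. A small self-contained example (my own, for comparison only):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class DeflaterCycleDemo {
  public static void main(String[] args) {
    byte[] input = "hadoop compression, hadoop compression".getBytes();
    Deflater deflater = new Deflater();
    deflater.setInput(input);                 // cf. Compressor.setInput()
    deflater.finish();                        // cf. Compressor.finish()

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[256];
    while (!deflater.finished()) {            // cf. Compressor.finished()
      int n = deflater.deflate(buf);          // cf. Compressor.compress()
      out.write(buf, 0, n);
    }
    deflater.end();                           // cf. Compressor.end()
    System.out.println(input.length + " bytes in, " + out.size() + " bytes out");
  }
}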

Those are the main steps of the compression operation. The decompression operation is similar, so I won't expand the analysis here.

