Hadoop Compression and Decompression

1 Compression

Generally speaking, the data a computer processes contains some redundancy, and at the same time there is correlation within the data, especially between neighboring values. Data can therefore be stored using a special encoding that differs from the original one, so that it occupies less storage space; this process is called compression. The counterpart of compression is decompression, the process of restoring compressed data from the special encoding back to the original data.

Compression is widely used in large-scale data processing. Compressing data files effectively reduces the space required to store them and speeds up their transfer over the network or to and from disk. In Hadoop, compression is applied to file storage, to the data exchanged between the map phase and the reduce phase, and so on.
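As a rough illustration of where compression can be switched on (this sketch is not part of the original article), the code below enables compression for both the intermediate map output and the final job output through a Configuration object. The mapreduce.* property names are the ones used by Hadoop 2.x, and the class name CompressionConfSketch is made up for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class CompressionConfSketch {
    public static Configuration buildConf() {
        Configuration conf = new Configuration();
        // Compress the intermediate data exchanged between the map and reduce phases
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        // Compress the final output files that the job writes to HDFS
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        conf.setClass("mapreduce.output.fileoutputformat.compress.codec",
                GzipCodec.class, CompressionCodec.class);
        return conf;
    }
}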

There are many ways to compress data, and different kinds of data call for different compression methods. For special data such as sound and images, lossy compression can be used, accepting the loss of some information during compression in exchange for a higher compression ratio. For data such as music that already has its own fairly specialized encoding, dedicated compression algorithms designed for those specific encodings can be applied.

2 Introduction to Hadoop Compression

As a general-purpose data processing platform, Hadoop is mainly concerned with compression speed and the splittability of compressed files when choosing a compression scheme.

All compression algorithms trade off time against space: faster compression and decompression speeds usually come at the cost of more space (a lower compression ratio). For example, when compressing data with the gzip command, the user can choose different options to favor either speed or space: the option -1 means speed first, while -9 means best space and yields the highest compression ratio. Note that the compression and decompression speeds of some algorithms can differ considerably: gzip and zip are general-purpose compression tools that are relatively balanced in their time/space trade-off, whereas bzip2 compresses more effectively than gzip and zip but is slower, and its decompression speed is faster than its compression speed.
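The same trade-off can be observed directly from Java code. The small sketch below is an illustration added here, using the standard java.util.zip.Deflater rather than any Hadoop class (the class name DeflateLevelDemo is made up); it compresses the same buffer at level 1 (BEST_SPEED) and level 9 (BEST_COMPRESSION) and prints the resulting sizes.

import java.util.Arrays;
import java.util.zip.Deflater;

public class DeflateLevelDemo {
    public static void main(String[] args) {
        byte[] input = new byte[1 << 20];   // 1 MB of highly redundant data
        Arrays.fill(input, (byte) 'a');
        System.out.println("level 1 (BEST_SPEED)       : " + deflate(input, Deflater.BEST_SPEED) + " bytes");
        System.out.println("level 9 (BEST_COMPRESSION) : " + deflate(input, Deflater.BEST_COMPRESSION) + " bytes");
    }

    // Compress the buffer at the given level and return the compressed size
    private static int deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }
}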

When using MapReduce to process compressed files, the splittability of the compressed file must be considered. Suppose we need to process a 1GB text file stored on HDFS with a block size of 64MB; the file is stored as 16 blocks, and the corresponding MapReduce job will divide it into 16 input splits, each processed by an independent map task. If, however, the file is a gzip-compressed file (of the same size), the MapReduce job cannot divide it into 16 splits, because it is impossible to start decompressing from an arbitrary point within a gzip data stream. If the file is compressed in bzip2 format instead, the MapReduce job can split the input along the compressed blocks inside the file and start decompressing at the beginning of each block: a bzip2-compressed file provides a 48-bit synchronization marker between blocks, so bzip2 supports splitting.
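As a minimal sketch of how splittability can be checked programmatically (added here for illustration; it assumes Hadoop 2.x, where BZip2Codec implements the SplittableCompressionCodec interface, and the file paths are made up), the following code infers the codec from the file name and reports whether it supports splitting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        for (String name : new String[] { "/data/log.gz", "/data/log.bz2" }) {
            // Infer the codec from the file name extension
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> "
                    + (codec == null ? "no codec" : codec.getClass().getSimpleName())
                    + ", splittable: " + splittable);
        }
    }
}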

Table 3-2 lists some common compression formats that can be used with Hadoop and their characteristics.

Table 3-2 Compression formats supported by Hadoop

In order to support multiple compression and decompression algorithms, Hadoop introduces codecs (encoder/decoders). Like the Hadoop serialization framework, the codec mechanism is also designed using the abstract factory pattern. The codecs currently supported by Hadoop are shown in Table 3-3.

Table 3-3 Compression algorithms and their codecs

The compression and decompression tools for a given compression method can all be obtained through its corresponding codec.
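For example, a minimal sketch (added here, not part of the original example; the class name CodecToolsSketch is made up) can instantiate DefaultCodec, which implements the DEFLATE algorithm, and ask it for its compressor, decompressor and default file extension:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecToolsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Instantiate the codec reflectively, as CompressionCodecFactory does internally
        CompressionCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);

        // The codec hands out the matching compression and decompression tools
        Compressor compressor = codec.createCompressor();
        Decompressor decompressor = codec.createDecompressor();

        System.out.println("compressor   : " + compressor.getClass().getName());
        System.out.println("decompressor : " + decompressor.getClass().getName());
        System.out.println("extension    : " + codec.getDefaultExtension());
    }
}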

3 Hadoop Compression API application example

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecTest {

    public static void main(String[] args) throws Exception {
        compress("org.apache.hadoop.io.compress.BZip2Codec");
//        compress("org.apache.hadoop.io.compress.GzipCodec");
//        compress("org.apache.hadoop.io.compress.Lz4Codec");
//        compress("org.apache.hadoop.io.compress.SnappyCodec");
//        uncompress("text");
//        uncompress1("hdfs://master:9000/user/hadoop/text.gz");
    }

    // Compress a file with the codec named by codecClassName
    public static void compress(String codecClassName) throws Exception {
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils
                .newInstance(codecClass, conf);

        // Input and output are both HDFS paths
        FSDataInputStream in = fs.open(new Path("/test.log"));
        FSDataOutputStream outputStream = fs.create(new Path("/test1.bz2"));

        System.out.println("compress start !");

        // Create the compression output stream and copy the input through it
        CompressionOutputStream out = codec.createOutputStream(outputStream);
        IOUtils.copyBytes(in, out, conf);
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
        System.out.println("compress ok !");
    }

    // Decompress a gzip file and print its contents to the console
    public static void uncompress(String fileName) throws Exception {
        Class<?> codecClass = Class
                .forName("org.apache.hadoop.io.compress.GzipCodec");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils
                .newInstance(codecClass, conf);
        FSDataInputStream inputStream = fs
                .open(new Path("/user/hadoop/text.gz"));
        // Decompress the file and write the data to the console
        InputStream in = codec.createInputStream(inputStream);
        IOUtils.copyBytes(in, System.out, conf);
        IOUtils.closeStream(in);
    }

    // Use the file name extension to infer the codec and decompress the file
    public static void uncompress1(String uri) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.out.println("No codec found for " + uri);
            System.exit(1);
        }
        String outputUri = CompressionCodecFactory.removeSuffix(uri,
                codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(out);
            IOUtils.closeStream(in);
        }
    }
}

Original link: http://my.oschina.net/mkh/blog/335297
