Compression and decompression mechanisms in HDFS

Overview

We can compress data files before storing them in HDFS to save storage space. However, when processing compressed files with MapReduce, you must consider whether the compression format is splittable. Hadoop currently supports the following compression formats:

| Compression format | UNIX tool | Algorithm | File extension | Supports multiple files | Splittable |
| --- | --- | --- | --- | --- | --- |
| DEFLATE | none | DEFLATE | .deflate | No | No |
| gzip | gzip | DEFLATE | .gz | No | No |
| ZIP | zip | DEFLATE | .zip | Yes | Yes |
| bzip2 | bzip2 | bzip2 | .bz2 | No | Yes |
| LZO | lzop | LZO | .lzo | No | No |
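Note that several of the formats above (DEFLATE, gzip, ZIP) share the same underlying DEFLATE algorithm. As a quick illustration outside of Hadoop, the JDK's standard `java.util.zip` package can round-trip gzip data; the `GzipRoundTrip` class below is a minimal sketch written for this article, not part of any library:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Compress raw bytes into the gzip container (DEFLATE algorithm inside).
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress gzip bytes back to the original content.
    public static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream gz =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello hdfs compression".getBytes("UTF-8");
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, "UTF-8"));
    }
}
```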
To support multiple compression/decompression algorithms, Hadoop introduces codec (encoder/decoder) classes, listed in the following table.

| Compression format | Codec class |
| --- | --- |
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
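The example below passes one of these codec class names on the command line; Hadoop's `ReflectionUtils.newInstance` essentially loads the class by its fully qualified name and instantiates it. A minimal pure-JDK sketch of that pattern follows; `CodecLoader` is a hypothetical helper written for this article, and `java.util.zip.Deflater` merely stands in for a Hadoop codec class:

```java
// Sketch of name-based class loading, the mechanism behind
// ReflectionUtils.newInstance(codecClass, conf) in the Hadoop example.
// CodecLoader is a hypothetical helper, not a Hadoop class.
public class CodecLoader {
    public static Object newInstanceByName(String className) throws Exception {
        // Load the class by its fully qualified name,
        // then invoke its no-argument constructor.
        Class<?> clazz = Class.forName(className);
        return clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // java.util.zip.Deflater stands in for a codec class such as
        // org.apache.hadoop.io.compress.BZip2Codec.
        Object codec = newInstanceByName("java.util.zip.Deflater");
        System.out.println(codec.getClass().getName());
    }
}
```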
Hadoop compression/decompression API example: assume we have a file part-r-00000 in a local Linux folder, and we now use the Hadoop compression/decompression API to compress and decompress it.
```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class Hello {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException {
        if (args[0].equals("compress")) {
            compress(args[1], "org.apache.hadoop.io.compress." + args[2]);
        } else if (args[0].equals("decompress")) {
            decompress(args[1]);
        } else {
            System.err.println("Error!\nUsage: hadoop jar hello.jar [compress] [filename] [compress type]");
            System.err.println("\tor [decompress] [filename]");
            return;
        }
        System.out.println("done");
    }

    /**
     * filename is the original file to compress; method is the codec
     * class name to use (such as BZip2Codec).
     */
    public static void compress(String filename, String method)
            throws ClassNotFoundException, IOException {
        System.out.println("[" + new Date() + "]: Enter compress");
        File fileIn = new File(filename);
        InputStream in = new FileInputStream(fileIn);

        Class<?> codecClass = Class.forName(method);
        Configuration conf = new Configuration();
        // Find the corresponding codec by class name.
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        // Append the codec's default file extension to the output name.
        File fileOut = new File(filename + codec.getDefaultExtension());
        fileOut.delete();
        OutputStream out = new FileOutputStream(fileOut);
        CompressionOutputStream cout = codec.createOutputStream(out);

        System.out.println("[" + new Date() + "]: Start compressing");
        // Set the buffer to 5 MB; do not close the streams inside copyBytes.
        IOUtils.copyBytes(in, cout, 1024 * 1024 * 5, false);
        System.out.println("[" + new Date() + "]: Compressing finished");

        in.close();
        cout.close();
    }

    /** filename is the compressed file to decompress. */
    public static void decompress(String filename)
            throws FileNotFoundException, IOException {
        System.out.println("[" + new Date() + "]: Enter decompress");
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Infer the codec from the file extension.
        CompressionCodec codec = factory.getCodec(new Path(filename));
        if (codec == null) {
            System.out.println("Cannot find codec for file " + filename);
            return;
        }

        File fout = new File(filename + ".decoded");
        InputStream cin = codec.createInputStream(new FileInputStream(filename));
        OutputStream out = new FileOutputStream(fout);

        System.out.println("[" + new Date() + "]: Start decompressing");
        IOUtils.copyBytes(cin, out, 1024 * 1024 * 5, false);
        System.out.println("[" + new Date() + "]: Decompressing finished");

        cin.close();
        out.close();
    }
}
```

The ReflectionUtils class is defined in the org.apache.hadoop.util package, and IOUtils in the org.apache.hadoop.io package.
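IOUtils.copyBytes in the example above performs a plain buffered stream copy. A minimal pure-JDK equivalent is sketched below; the `StreamCopy` class name is hypothetical, written for this article:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {
    // Buffered copy from in to out, analogous to
    // IOUtils.copyBytes(in, out, bufSize, false): the streams are
    // left open for the caller to close, as in the Hadoop example.
    public static void copyBytes(InputStream in, OutputStream out, int bufSize)
            throws IOException {
        byte[] buf = new byte[bufSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayInputStream in =
                new ByteArrayInputStream("part-r-00000 contents".getBytes("UTF-8"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyBytes(in, out, 1024 * 1024 * 5); // same 5 MB buffer as the example
        System.out.println(out.toString("UTF-8"));
    }
}
```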

We package the Hello.java class as hello.jar and place it in the same directory as part-r-00000. Running `hadoop jar hello.jar compress part-r-00000 BZip2Codec` generates the file part-r-00000.bz2; running `hadoop jar hello.jar decompress part-r-00000.bz2` then generates the file part-r-00000.bz2.decoded.
