Overview
We can compress data files before storing them in HDFS to save storage space. However, when using MapReduce to process compressed files, you must consider whether the compression format is splittable. Hadoop currently supports the following compression formats:
| Compression format | UNIX tool | Algorithm | File extension | Supports multiple files | Splittable |
|---|---|---|---|---|---|
| DEFLATE | None | DEFLATE | .deflate | No | No |
| gzip | gzip | DEFLATE | .gz | No | No |
| ZIP | zip | DEFLATE | .zip | Yes | Yes |
| bzip2 | bzip2 | bzip2 | .bz2 | No | Yes |
| LZO | lzop | LZO | .lzo | No | No |
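Note that gzip and DEFLATE use the same underlying algorithm; a gzip file is essentially a single DEFLATE stream with a header and trailer, which is why it cannot be split: decompression must begin at byte 0. A minimal JDK-only sketch (using `java.util.zip`, not the Hadoop codec classes) illustrates the round trip:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] original = "hello hadoop compression".getBytes(StandardCharsets.UTF_8);

        // compress: gzip wraps a single DEFLATE stream with a header and trailer
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(original);
        }

        // decompress: reading must start from byte 0 -- there is no mid-stream
        // entry point, which is why a plain .gz file is not splittable
        byte[] back;
        try (GZIPInputStream in =
                new GZIPInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            back = in.readAllBytes();
        }
        System.out.println(new String(back, StandardCharsets.UTF_8));
    }
}
```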
To support multiple compression/decompression algorithms, Hadoop provides a codec (encoder/decoder) class for each format, as shown in the following table.
| Compression format | Codec |
|---|---|
| DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
| gzip | org.apache.hadoop.io.compress.GzipCodec |
| bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
| Snappy | org.apache.hadoop.io.compress.SnappyCodec |
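Hadoop's `CompressionCodecFactory` (used in the example below) resolves a codec from a file's extension. The idea can be sketched with a plain JDK map; the mapping here simply mirrors the table above and is not Hadoop's actual registry:

```java
import java.util.Map;

public class CodecLookup {
    // extension-to-codec mapping, mirroring the table above (illustration only)
    static final Map<String, String> CODECS = Map.of(
            ".deflate", "org.apache.hadoop.io.compress.DefaultCodec",
            ".gz", "org.apache.hadoop.io.compress.GzipCodec",
            ".bz2", "org.apache.hadoop.io.compress.BZip2Codec",
            ".snappy", "org.apache.hadoop.io.compress.SnappyCodec");

    // look up a codec class name by the file's extension
    static String codecFor(String filename) {
        int dot = filename.lastIndexOf('.');
        return dot < 0 ? null : CODECS.get(filename.substring(dot));
    }

    public static void main(String[] args) {
        System.out.println(codecFor("part-r-00000.bz2"));
    }
}
```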
Hadoop compression/decompression API example. Suppose we have a file part-r-00000 in a local Linux directory; the following program uses the Hadoop compression/decompression API to compress and decompress it.
```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class Hello {
    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {
        if (args[0].equals("compress")) {
            compress(args[1], "org.apache.hadoop.io.compress." + args[2]);
        } else if (args[0].equals("decompress")) {
            decompress(args[1]);
        } else {
            System.err.println("Error!\nUsage: hadoop jar Hello.jar [compress] [filename] [compress type]");
            System.err.println("\t\tor [decompress] [filename]");
            return;
        }
        System.out.println("done");
    }

    /**
     * @param filename the original file to compress
     * @param method   the fully qualified codec class to use (e.g. BZip2Codec)
     */
    public static void compress(String filename, String method)
            throws ClassNotFoundException, IOException {
        System.out.println("[" + new Date() + "]: Enter compress");
        File fileIn = new File(filename);
        InputStream in = new FileInputStream(fileIn);
        Class<?> codecClass = Class.forName(method);
        Configuration conf = new Configuration();
        // find the corresponding codec by class name
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // name the output file using the codec's default extension
        File fileOut = new File(filename + codec.getDefaultExtension());
        fileOut.delete();
        OutputStream out = new FileOutputStream(fileOut);
        CompressionOutputStream cout = codec.createOutputStream(out);
        System.out.println("[" + new Date() + "]: Start compressing");
        IOUtils.copyBytes(in, cout, 1024 * 1024 * 5, false); // 5 MB buffer
        System.out.println("[" + new Date() + "]: Compressing finished");
        in.close();
        cout.close();
    }

    /**
     * @param filename the file to decompress
     */
    public static void decompress(String filename)
            throws FileNotFoundException, IOException {
        System.out.println("[" + new Date() + "]: Enter decompress");
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // infer the codec from the file's extension
        CompressionCodec codec = factory.getCodec(new Path(filename));
        if (null == codec) {
            System.out.println("Cannot find codec for file " + filename);
            return;
        }
        File fileOut = new File(filename + ".decoded");
        InputStream cin = codec.createInputStream(new FileInputStream(filename));
        OutputStream out = new FileOutputStream(fileOut);
        System.out.println("[" + new Date() + "]: Start decompressing");
        IOUtils.copyBytes(cin, out, 1024 * 1024 * 5, false);
        System.out.println("[" + new Date() + "]: Decompressing finished");
        cin.close();
        out.close();
    }
}
```
The ReflectionUtils class is defined in the org.apache.hadoop.util package, and IOUtils in the org.apache.hadoop.io package.
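The compress() method locates a codec by its fully qualified class name via Class.forName plus ReflectionUtils.newInstance. The same reflective pattern can be shown with the JDK alone, here using java.util.zip.Deflater as a stand-in for a Hadoop codec class:

```java
public class ReflectInstance {
    public static void main(String[] args) throws Exception {
        // look up a class by its fully qualified name, then instantiate it --
        // the same pattern compress() uses with ReflectionUtils.newInstance
        Class<?> cls = Class.forName("java.util.zip.Deflater");
        Object codec = cls.getDeclaredConstructor().newInstance();
        System.out.println(codec.getClass().getName());
    }
}
```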
Package the Hello.java class as Hello.jar and put it in the same directory as part-r-00000. Then compress:

    hadoop jar Hello.jar compress part-r-00000 BZip2Codec

As you can see, a part-r-00000.bz2 file is generated. Then decompress:

    hadoop jar Hello.jar decompress part-r-00000.bz2

which produces a part-r-00000.bz2.decoded file.