If you reprint this article, please credit the source: http://blog.csdn.net/lastsweetop/article/details/9173061
All source code is on GitHub: https://github.com/lastsweetop/styhadoop
Introduction: "codec" is a portmanteau of the words coder and decoder. The CompressionCodec interface defines the compression and decompression API; here we look at the codec classes, the implementations of CompressionCodec for the common compression formats. The implementations bundled with Hadoop are org.apache.hadoop.io.compress.DefaultCodec (DEFLATE, .deflate), org.apache.hadoop.io.compress.GzipCodec (gzip, .gz) and org.apache.hadoop.io.compress.BZip2Codec (bzip2, .bz2).
A CompressionCodec can be used in two directions. Compression: call the createOutputStream(OutputStream out) method to wrap a raw stream in a CompressionOutputStream. Decompression: call the createInputStream(InputStream in) method to obtain a CompressionInputStream. Sample compression code:
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-25
 */
public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
    }
}
The program takes the name of a CompressionCodec implementation class as a command-line argument, instantiates that class via ReflectionUtils, wraps standard output in a compressed stream through the CompressionCodec interface, copies standard input into the compressed stream with IOUtils.copyBytes, and finally calls finish() on the stream to complete the compression. Let's try it on the command line:
echo "Hello lastsweetop" | ~/hadoop/bin/hadoop com.sweetop.styhadoop.StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
This compresses "Hello lastsweetop" with the GzipCodec class and pipes the result through the gunzip tool to decompress it again.
Let's take a look at the output:
[exec] 13/06/26 20:01:53 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/26 20:01:53 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] Hello lastsweetop
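The reason plain gunzip can decode GzipCodec's output is that GzipCodec writes the standard gzip container format. A JDK-only sketch of the same round trip (no Hadoop classes involved, just java.util.zip):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] original = "Hello lastsweetop".getBytes(StandardCharsets.UTF_8);

        // Compress, producing the standard gzip container format.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzOut = new GZIPOutputStream(compressed)) {
            gzOut.write(original);
        }

        // Decompress the same bytes; any gzip tool (gunzip, zcat) could do this too.
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (GZIPInputStream gzIn = new GZIPInputStream(
                new ByteArrayInputStream(compressed.toByteArray()))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzIn.read(buf)) != -1) {
                restored.write(buf, 0, n);
            }
        }

        System.out.println(restored.toString("UTF-8"));
    }
}
```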
Using CompressionCodecFactory to decompress a file. To read a compressed file you first have to work out from the file extension which codec to use; see part (7) of this Hadoop in-depth series for the extension-to-format mapping. There is a simpler way, though: CompressionCodecFactory has already done this for you. Pass a path to its getCodec method and it returns the matching codec. Let's look at the code:
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-26
 */
public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.out.println("No codec found for " + uri);
            System.exit(1);
        }
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
Pay attention to the removeSuffix method: it is a static method that strips the codec's default extension from the file name, and the resulting path is used as the output path for decompression. The set of codecs that CompressionCodecFactory can find is limited; by default there are only three: org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec and org.apache.hadoop.io.compress.DefaultCodec. If you want other codecs to be recognized, you need to change the io.compression.codecs property to register them.

Native libraries come up more and more, and Hadoop's codecs are no exception. A native library can improve performance considerably: for example, the gzip native library cuts decompression time by about 50% and compression time by about 10%. Not every codec has both a Java and a native implementation, however; some codecs exist only as native libraries. On Linux, Hadoop ships prebuilt 32-bit and 64-bit native libraries; take a look:
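As an illustration of registering extra codecs, the io.compression.codecs property can be set in core-site.xml. The fourth entry below is an assumption for the example: com.hadoop.compression.lzo.LzoCodec comes from the separate hadoop-lzo package, which must be installed before it can be listed here.

```xml
<!-- core-site.xml: codecs that CompressionCodecFactory will recognize. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec</value>
</property>
```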
[hadoop@namenode native]$ pwd
/home/hadoop/hadoop/lib/native
[hadoop@namenode native]$ ls -ls
total 8
4 drwxrwxrwx 2 root root 4096 Nov 14 2012 Linux-amd64-64
4 drwxrwxrwx 2 root root 4096 Nov 14 2012 Linux-i386-32
On any other platform you need to compile the native libraries yourself; refer to the Hadoop build documentation for the detailed steps. The launcher script locates them like this:
if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" -o -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
  if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib
  fi
fi
Hadoop finds the matching native library and loads it automatically, so you normally do not need to care about these settings. Sometimes, though, you may not want the native library, for example when debugging a bug in it; in that case set hadoop.native.lib to false. If you do a lot of compression and decompression with the native libraries, consider using CodecPool, which works a bit like a database connection pool: it lets you reuse compressor objects instead of creating them over and over.
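A minimal configuration fragment for disabling the native library, assuming the Hadoop 1.x property name mentioned above:

```xml
<!-- core-site.xml: force the pure-Java codec implementations (e.g. for debugging). -->
<property>
  <name>hadoop.native.lib</name>
  <value>false</value>
</property>
```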
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-27
 */
public class PooledStreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        Compressor compressor = null;
        try {
            compressor = CodecPool.getCompressor(codec);
            CompressionOutputStream out = codec.createOutputStream(System.out, compressor);
            IOUtils.copyBytes(System.in, out, 4096, false);
            out.finish();
        } finally {
            CodecPool.returnCompressor(compressor);
        }
    }
}
The code is easy to follow: obtain a Compressor object with CodecPool's getCompressor method, which takes the codec as its argument; pass that compressor to createOutputStream; and after use, hand it back with returnCompressor. The output looks like this:
[exec] 13/06/27 12:00:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/27 12:00:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] 13/06/27 12:00:06 INFO compress.CodecPool: Got brand-new compressor
[exec] Hello lastsweetop
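The reason pooling pays off is that a compressor wraps stateful (often native zlib) resources that are expensive to create. The same idea can be illustrated with the JDK's own Deflater, which also holds zlib state and can be reset and reused instead of re-created; this is a JDK-only sketch of the reuse pattern, not Hadoop's CodecPool itself:

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ReuseDeflater {
    // Compress input with an existing Deflater; reset() makes it reusable,
    // mirroring what CodecPool achieves by recycling Compressor objects.
    static byte[] compress(Deflater deflater, byte[] input) {
        deflater.reset();              // clear state left over from the previous use
        deflater.setInput(input);
        deflater.finish();
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Deflater shared = new Deflater();  // created once, reused for every call
        byte[] first = compress(shared, "Hello lastsweetop".getBytes("UTF-8"));
        System.out.println(first.length > 0);

        byte[] second = compress(shared, "Hello again".getBytes("UTF-8"));

        // Verify the second compression still round-trips after reuse.
        Inflater inflater = new Inflater();
        inflater.setInput(second);
        byte[] restored = new byte[1024];
        int len = inflater.inflate(restored);
        System.out.println(new String(restored, 0, len, "UTF-8"));

        shared.end();                     // release native zlib resources
        inflater.end();
    }
}
```

The key point is that one Deflater serves many compressions; CodecPool applies the same recycling to Hadoop's Compressor objects.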