In-depth Hadoop Research: (8) -- Codec


Please indicate the source when reprinting: http://blog.csdn.net/lastsweetop/article/details/9173061

All source code is on GitHub: https://github.com/lastsweetop/styhadoop

Introduction: "codec" is a portmanteau of the words coder and decoder. The CompressionCodec interface defines the compression and decompression methods, and the codecs discussed here are the classes that implement CompressionCodec for particular compression formats. By default these include:

DEFLATE - org.apache.hadoop.io.compress.DefaultCodec
gzip - org.apache.hadoop.io.compress.GzipCodec
bzip2 - org.apache.hadoop.io.compress.BZip2Codec

CompressionCodec can be used in two ways, to compress or to decompress data. Compression: use the createOutputStream(OutputStream out) method to obtain a CompressionOutputStream object. Decompression: use the createInputStream(InputStream in) method to obtain a CompressionInputStream object. Compression sample code:
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-25
 */
public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
    }
}
The program accepts the name of a CompressionCodec implementation class as a command-line argument, instantiates that class via ReflectionUtils, wraps the standard output stream in a compressed stream through the CompressionCodec interface, copies the standard input stream to the compressed stream with the copyBytes() method of IOUtils, and finally calls finish() on the compressed stream to complete the compression. Let's look at the command line:
echo "Hello lastsweetop" | ~/hadoop/bin/hadoop com.sweetop.styhadoop.StreamCompressor  org.apache.hadoop.io.compress.GzipCodec | gunzip -

This compresses "Hello lastsweetop" with the GzipCodec class and decompresses it with the gunzip tool.

Let's take a look at the output:
[exec] 13/06/26 20:01:53 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/26 20:01:53 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] Hello lastsweetop
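The wrap-write-finish pattern that StreamCompressor applies to Hadoop's CompressionOutputStream is the same idea as the JDK's own gzip streams. Here is a minimal, Hadoop-free sketch of the full round trip; the GzipRoundTrip class and its helper names are my own, not part of the article's code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    // Compress a byte array with gzip: the JDK analogue of wrapping a
    // stream in CompressionOutputStream and calling finish()/close().
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Decompress gzip bytes: the analogue of reading back through a
    // CompressionInputStream obtained from createInputStream().
    static byte[] decompress(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] gz = compress("Hello lastsweetop".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(decompress(gz), StandardCharsets.UTF_8));
    }
}
```

Like the command-line example, the data survives the compress/decompress round trip unchanged; Hadoop's gzip output is byte-compatible with gunzip for the same reason.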
Using CompressionCodecFactory to decompress a file. To read a compressed file, you must first determine which codec to use from the file extension; refer to In-depth Hadoop Research: (7) -- Compression for the mapping. Of course, there is a simpler way: CompressionCodecFactory has already done this for you. Pass a Path to its getCodec() method and you get back the corresponding codec. Let's look at the code:
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-26
 */
public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.out.println("No codec found for " + uri);
            System.exit(1);
        }
        String outputUri =
                CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}

Pay attention to the removeSuffix() method. It is a static method that strips the file suffix, and the resulting path is then used as the output path for decompression. The codecs that CompressionCodecFactory can find are also limited: by default there are only three, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, and org.apache.hadoop.io.compress.DefaultCodec. If you want to add other codecs, you need to change the io.compression.codecs property and register the codec there.

Native libraries. Native libraries have come up several times already, and Hadoop's codecs are no exception: native libraries can greatly improve performance. For example, the native gzip library speeds up decompression by about 50% and compression by about 10%. However, not every codec has both a Java implementation and a native one; some codecs are available only as a native library. On Linux, Hadoop ships with pre-compiled 32-bit and 64-bit native libraries. Let's take a look:
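The extension lookup and suffix stripping that CompressionCodecFactory performs can be sketched with plain JDK code. The CodecLookup class below is hypothetical, for illustration only; it mirrors the three default extension-to-codec mappings and the removeSuffix() behavior:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CodecLookup {
    // Default extension-to-codec mapping, mirroring the three codecs
    // CompressionCodecFactory knows about out of the box.
    static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    // Pick a codec class name by file extension, or null when none matches,
    // just as getCodec() returns null for an unknown suffix.
    static String getCodec(String path) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // Strip a known suffix, as removeSuffix() does to build the output path.
    static String removeSuffix(String path, String suffix) {
        return path.endsWith(suffix)
                ? path.substring(0, path.length() - suffix.length())
                : path;
    }

    public static void main(String[] args) {
        System.out.println(getCodec("/data/file.gz"));       // GzipCodec
        System.out.println(removeSuffix("/data/file.gz", ".gz")); // /data/file
    }
}
```

This also shows why FileDecompressor must handle a null codec: any path whose suffix is not registered simply has no match.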

[hadoop@namenode native]$ pwd
/home/hadoop/hadoop/lib/native
[hadoop@namenode native]$ ls -ls
total 8
4 drwxrwxrwx 2 root root 4096 Nov 14  2012 Linux-amd64-64
4 drwxrwxrwx 2 root root 4096 Nov 14  2012 Linux-i386-32

If you are on another platform, you need to compile the native libraries yourself; refer to the compilation instructions for the detailed steps. Hadoop locates the native library like this:

if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" -o -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
  if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib
  fi
fi
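The script's decision logic can be sketched in Java as well. The NativeLibPath class below is hypothetical and covers only the build/native and lib/native branches (the libhadoop.a fallback is omitted); the boolean flags stand in for the script's "-d" directory-exists tests:

```java
import java.util.ArrayList;
import java.util.List;

public class NativeLibPath {
    // Mirror of the shell logic: prefer build/native, then append
    // lib/native if it is also present (joined with ":" like a PATH).
    static String resolve(boolean hasBuildNative, boolean hasLibNative,
                          String home, String platform) {
        List<String> parts = new ArrayList<>();
        if (hasBuildNative) {
            parts.add(home + "/build/native/" + platform + "/lib");
        }
        if (hasLibNative) {
            parts.add(home + "/lib/native/" + platform);
        }
        return String.join(":", parts);
    }

    public static void main(String[] args) {
        // Typical installed layout: only lib/native exists.
        System.out.println(resolve(false, true, "/home/hadoop/hadoop", "Linux-amd64-64"));
    }
}
```

For a stock installation, only lib/native exists, so JAVA_LIBRARY_PATH ends up pointing at the platform directory shown in the listing above.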

Hadoop finds the corresponding native library and loads it automatically, so you normally do not need to care about these settings. But sometimes you may not want to use the native library, for example when debugging certain bugs; in that case, set the hadoop.native.lib property to false. If you use the native library for a lot of compression and decompression, consider using CodecPool, which works somewhat like a connection pool, so that you do not have to create compressor objects over and over.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-27
 */
public class PooledStreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        Compressor compressor = null;
        try {
            compressor = CodecPool.getCompressor(codec);
            CompressionOutputStream out = codec.createOutputStream(System.out, compressor);
            IOUtils.copyBytes(System.in, out, 4096, false);
            out.finish();
        } finally {
            CodecPool.returnCompressor(compressor);
        }
    }
}

The code is easy to understand: obtain a Compressor object with the getCompressor() method of CodecPool, which takes the codec as its argument; pass the Compressor to createOutputStream(); and after use, hand it back with returnCompressor(). The output is as follows:

[exec] 13/06/27 12:00:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/27 12:00:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] 13/06/27 12:00:06 INFO compress.CodecPool: Got brand-new compressor
[exec] Hello lastsweetop
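The reuse pattern behind CodecPool ("Got brand-new compressor" is logged only when no pooled instance is available) can be illustrated with a tiny generic pool. This Pool class is a hypothetical sketch of the idea, not Hadoop's implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

public class Pool<T> {
    private final Deque<T> free = new ArrayDeque<>();
    private int created = 0;

    // Hand out a previously returned instance if one exists; otherwise
    // create a brand-new one (the "Got brand-new compressor" case).
    public synchronized T get(Supplier<T> factory) {
        if (!free.isEmpty()) {
            return free.pop();
        }
        created++;
        return factory.get();
    }

    // Give the instance back so the next caller can reuse it,
    // like CodecPool.returnCompressor().
    public synchronized void release(T obj) {
        free.push(obj);
    }

    public synchronized int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        Pool<StringBuilder> pool = new Pool<>();
        StringBuilder a = pool.get(StringBuilder::new);
        pool.release(a);
        StringBuilder b = pool.get(StringBuilder::new);
        System.out.println(a == b);              // true: reused, not re-created
        System.out.println(pool.createdCount()); // 1
    }
}
```

Releasing in a finally block, as PooledStreamCompressor does, is what keeps the pool effective: a Compressor that is never returned can never be reused.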
