Hadoop Compression Codec

Brief introduction

"Codec" is a portmanteau of the two words "coder" and "decoder". CompressionCodec defines the compression and decompression interface, and the codecs we are talking about here are the classes that implement the CompressionCodec interface for particular compression formats. The following table lists these classes:
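The original table did not survive; for reference, the usual format-to-codec mapping in Hadoop of this era (the LZO codec ships separately for licensing reasons) is:

Compression format    Codec class
DEFLATE               org.apache.hadoop.io.compress.DefaultCodec
gzip                  org.apache.hadoop.io.compress.GzipCodec
bzip2                 org.apache.hadoop.io.compress.BZip2Codec
LZO                   com.hadoop.compression.lzo.LzopCodec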

Compressing and decompressing with CompressionCodec

CompressionCodec has two methods that make compression and decompression easy:

Compression: obtain a CompressionOutputStream object through the createOutputStream(OutputStream out) method.

Decompression: obtain a CompressionInputStream object through the createInputStream(InputStream in) method.

Sample code for compression

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-25
 * Time: 10:09
 */
public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
    }
}

The program takes the class name of a CompressionCodec implementation from the command line, instantiates that class via ReflectionUtils, and calls the CompressionCodec interface's createOutputStream method to wrap standard output in a compression stream. The standard input stream is then copied into the compression stream by IOUtils.copyBytes, and finally the stream's finish() method is called to complete the compression.

Then run it from the command line:

echo "Hello lastsweetop" | ~/hadoop/bin/hadoop com.sweetop.styhadoop.StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -

This uses the GzipCodec class to compress "Hello lastsweetop" and then decompresses the result with the gunzip tool.

Let's take a look at the output:

[exec] 13/06/26 20:01:53 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/26 20:01:53 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] Hello lastsweetop


Decompressing with CompressionCodecFactory

If you want to read a compressed file, you first have to determine which codec to use from the file extension. There is a more convenient way, though: CompressionCodecFactory does this for you. Pass a path to its getCodec() method and it returns the corresponding codec. Let's look at the source code:

package com.sweetop.styhadoop;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Get the file system
        FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
        // Build the input path
        Path inputPath = new Path(uri);
        // Create a CompressionCodecFactory object
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Determine the compression format (codec) of the file
        CompressionCodec codec = factory.getCodec(inputPath);
        // If no matching codec exists, exit
        if (codec == null) {
            System.out.println("No codec found for " + uri);
            System.exit(1);
        }
        // Strip the file's suffix and use the result as the decompressed output path
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        // Define the input/output streams
        InputStream in = null;
        OutputStream out = null;
        try {
            // Create the input/output streams
            in = codec.createInputStream(fileSystem.open(inputPath));
            out = fileSystem.create(new Path(outputUri));
            // Decompress
            IOUtils.copyBytes(in, out, conf);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
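By analogy with the StreamCompressor run above, a hypothetical invocation would look like this (the file name is a placeholder; its .gz suffix is what lets the factory pick GzipCodec):

~/hadoop/bin/hadoop com.sweetop.styhadoop.FileDecompressor /user/hadoop/input.txt.gz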

Note the removeSuffix method: it is a static method that strips the compression suffix from the file name, and we use the result as the output path for decompression. The codecs that CompressionCodecFactory can find are limited; by default there are only three: org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, and org.apache.hadoop.io.compress.DefaultCodec. If you want to add another codec, you need to set the io.compression.codecs property to register it, as sketched below.
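A minimal sketch of registering an extra codec programmatically, assuming the third-party LZO codec class com.hadoop.compression.lzo.LzopCodec is on your classpath (in practice this property is usually set in core-site.xml instead):

Configuration conf = new Configuration();
// List the default codecs plus the extra one; CompressionCodecFactory scans this property.
conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
      + "org.apache.hadoop.io.compress.GzipCodec,"
      + "org.apache.hadoop.io.compress.BZip2Codec,"
      + "com.hadoop.compression.lzo.LzopCodec");
// This factory can now resolve .lzo files as well.
CompressionCodecFactory factory = new CompressionCodecFactory(conf);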

Native Library

Native libraries are an increasingly common concept, and HDFS codecs are no exception: a native library can improve performance considerably. With the native gzip library, for example, decompression is about 50% faster and compression about 10% faster than with the Java implementation. But not every codec has a native library, and some codecs have only a native library. Let's look at the following table:
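The original table was lost; the per-codec support in the Hadoop version this article covers breaks down roughly as follows:

Compression format    Java implementation    Native implementation
DEFLATE               Yes                    Yes
gzip                  Yes                    Yes
bzip2                 Yes                    No
LZO                   No                     Yes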

Under Linux, Hadoop ships with prebuilt 32-bit and 64-bit native libraries. Let's take a look:

$ pwd
/home/hadoop/hadoop/lib/native
$ ls -ls
total 8
4 drwxrwxrwx 2 root root 4096 Nov Linux-amd64-64
4 drwxrwxrwx 2 root root 4096 Nov Linux-i386-32

If you are on a different platform, you will need to compile the library yourself; detailed steps can be found here: http://wiki.apache.org/hadoop/NativeHadoop
The Java native library path can be specified via java.library.path. The Hadoop startup script in the bin directory already sets it, but if you don't use that script, you need to specify the path yourself, as illustrated after the script excerpt below.

if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" -o -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
  if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib
  fi
fi
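If you launch the JVM without that script, you can pass the path explicitly. An illustrative invocation, where the jar and main class names are placeholders:

java -Djava.library.path=${HADOOP_HOME}/lib/native/Linux-amd64-64 -cp myjob.jar com.example.MyJob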

Hadoop finds the appropriate native library and loads it automatically, so you normally don't need to worry about these settings. But sometimes you may not want to use the native library, for example when debugging a bug in it; in that case you can disable it by setting hadoop.native.lib to false.
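A minimal sketch of doing this in code, assuming the property name used by the Hadoop version this article targets (newer releases renamed it to io.native.lib.available):

Configuration conf = new Configuration();
// Fall back to the pure-Java codec implementations instead of the native ones.
conf.setBoolean("hadoop.native.lib", false);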

If you do a lot of compression and decompression with the native library, consider using CodecPool. It works a bit like a connection pool, so that you don't have to create compressor and decompressor objects frequently.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamCompressor {
    public static void main(String[] args) throws ClassNotFoundException {
        String codecClassName = args[0];
        // Get the codec class object for reflection
        Class<?> codecClass = Class.forName(codecClassName);
        // Create the configuration
        Configuration conf = new Configuration();
        // Instantiate the codec via reflection
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // The compressor to borrow from the pool
        Compressor compressor = null;
        try {
            // Obtain a compressor from the CodecPool
            compressor = CodecPool.getCompressor(codec);
            // Create the compressed output stream using the pooled compressor
            CompressionOutputStream out = codec.createOutputStream(System.out, compressor);
            // Compress
            IOUtils.copyBytes(System.in, out, 4096, false);
            // Finish writing the compressed data
            out.finish();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Return the compressor to the pool
            CodecPool.returnCompressor(compressor);
        }
    }
}

The code is easy to follow: a Compressor object is obtained through CodecPool's getCompressor() method, which takes the codec as its argument; the compressor is then passed to createOutputStream. After use, it is put back into the pool with returnCompressor().
The output results are as follows:

[exec] 13/06/27 12:00:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/27 12:00:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] 13/06/27 12:00:06 INFO compress.CodecPool: Got brand-new compressor
[exec] Hello lastsweetop
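CodecPool pools Decompressor objects in the same way, via getDecompressor() and returnDecompressor(). The following counterpart class is a hypothetical sketch (it does not appear in the original article): it reads compressed data from standard input and writes the decompressed bytes to standard output.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamDecompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        Decompressor decompressor = null;
        try {
            // Borrow a pooled decompressor instead of creating a new one
            decompressor = CodecPool.getDecompressor(codec);
            CompressionInputStream in = codec.createInputStream(System.in, decompressor);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Return the decompressor so it can be reused
            CodecPool.returnDecompressor(decompressor);
        }
    }
}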



Originally from: http://blog.csdn.net/lastsweetop/article/details/9173061

Code from: https://github.com/lastsweetop/styhadoop
