Hadoop Compression Codec

Brief introduction

"Codec" is a portmanteau of the two words "coder" and "decoder". CompressionCodec defines the compression and decompression interface, and the codecs we are talking about here are the classes that implement the CompressionCodec interface for particular compression formats. The following table lists these classes:
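The original table did not survive; for reference, the usual format-to-codec mapping in Hadoop of this era (the LZO codec ships separately for licensing reasons) is:

Compression format    Codec class
DEFLATE               org.apache.hadoop.io.compress.DefaultCodec
gzip                  org.apache.hadoop.io.compress.GzipCodec
bzip2                 org.apache.hadoop.io.compress.BZip2Codec
LZO                   com.hadoop.compression.lzo.LzopCodec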

Compressing and decompressing with CompressionCodec

CompressionCodec has two methods that make compression and decompression easy:

Compression: obtain a CompressionOutputStream object through the createOutputStream(OutputStream out) method.

Decompression: obtain a CompressionInputStream object through the createInputStream(InputStream in) method.

Sample code for compression

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-25
 * Time: 10:09
 */
public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
    }
}

The program takes the class name of a CompressionCodec implementation from the command line, instantiates that class via ReflectionUtils, and calls the CompressionCodec interface's createOutputStream method to wrap standard output in a compression stream. The standard input stream is then copied into the compression stream by IOUtils.copyBytes, and finally the stream's finish() method is called to complete the compression.

Then run it from the command line:

echo "Hello lastsweetop" | ~/hadoop/bin/hadoop com.sweetop.styhadoop.StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -

This uses the GzipCodec class to compress "Hello lastsweetop" and then decompresses the result with the gunzip tool.

Let's take a look at the output:

[exec] 13/06/26 20:01:53 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/26 20:01:53 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] Hello lastsweetop


Decompressing with CompressionCodecFactory

If you want to read a compressed file, you first have to determine which codec to use from the file extension. There is a more convenient way, though: CompressionCodecFactory does this for you. Pass a path to its getCodec() method and it returns the corresponding codec. Let's look at the source code:

package com.sweetop.styhadoop;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Get the file system
        FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
        // Build the input path
        Path inputPath = new Path(uri);
        // Create a CompressionCodecFactory object
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Determine the compression format (codec) of the file
        CompressionCodec codec = factory.getCodec(inputPath);
        // If no matching codec exists, exit
        if (codec == null) {
            System.out.println("No codec found for " + uri);
            System.exit(1);
        }
        // Strip the file's suffix and use the result as the decompressed output path
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        // Define the input/output streams
        InputStream in = null;
        OutputStream out = null;
        try {
            // Create the input/output streams
            in = codec.createInputStream(fileSystem.open(inputPath));
            out = fileSystem.create(new Path(outputUri));
            // Decompress
            IOUtils.copyBytes(in, out, conf);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
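By analogy with the StreamCompressor run above, a hypothetical invocation would look like this (the file name is a placeholder; its .gz suffix is what lets the factory pick GzipCodec):

~/hadoop/bin/hadoop com.sweetop.styhadoop.FileDecompressor /user/hadoop/input.txt.gz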

Note the removeSuffix method: it is a static method that strips the compression suffix from the file name, and we use the result as the output path for decompression. The codecs that CompressionCodecFactory can find are limited; by default there are only three: org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, and org.apache.hadoop.io.compress.DefaultCodec. If you want to add another codec, you need to set the io.compression.codecs property to register it, as sketched below.
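A minimal sketch of registering an extra codec programmatically, assuming the third-party LZO codec class com.hadoop.compression.lzo.LzopCodec is on your classpath (in practice this property is usually set in core-site.xml instead):

Configuration conf = new Configuration();
// List the default codecs plus the extra one; CompressionCodecFactory scans this property.
conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
      + "org.apache.hadoop.io.compress.GzipCodec,"
      + "org.apache.hadoop.io.compress.BZip2Codec,"
      + "com.hadoop.compression.lzo.LzopCodec");
// This factory can now resolve .lzo files as well.
CompressionCodecFactory factory = new CompressionCodecFactory(conf);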

Native Library

Native libraries are an increasingly common concept, and HDFS codecs are no exception: a native library can improve performance considerably. With the native gzip library, for example, decompression is about 50% faster and compression about 10% faster than with the Java implementation. But not every codec has a native library, and some codecs have only a native library. Let's look at the following table:
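The original table was lost; the per-codec support in the Hadoop version this article covers breaks down roughly as follows:

Compression format    Java implementation    Native implementation
DEFLATE               Yes                    Yes
gzip                  Yes                    Yes
bzip2                 Yes                    No
LZO                   No                     Yes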

Under Linux, Hadoop ships with prebuilt 32-bit and 64-bit native libraries. Let's take a look:

$ pwd
/home/hadoop/hadoop/lib/native
$ ls -ls
total 8
4 drwxrwxrwx 2 root root 4096 Nov Linux-amd64-64
4 drwxrwxrwx 2 root root 4096 Nov Linux-i386-32

If you are on a different platform, you will need to compile the library yourself; detailed steps can be found here: http://wiki.apache.org/hadoop/NativeHadoop
The Java native library path can be specified via java.library.path. The Hadoop startup script in the bin directory already sets it, but if you don't use that script, you need to specify the path yourself, as illustrated after the script excerpt below.

if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" -o -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
  if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib
  fi
fi
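If you launch the JVM without that script, you can pass the path explicitly. An illustrative invocation, where the jar and main class names are placeholders:

java -Djava.library.path=${HADOOP_HOME}/lib/native/Linux-amd64-64 -cp myjob.jar com.example.MyJob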

Hadoop finds the appropriate native library and loads it automatically, so you normally don't need to worry about these settings. But sometimes you may not want to use the native library, for example when debugging a bug in it; in that case you can disable it by setting hadoop.native.lib to false.
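A minimal sketch of doing this in code, assuming the property name used by the Hadoop version this article targets (newer releases renamed it to io.native.lib.available):

Configuration conf = new Configuration();
// Fall back to the pure-Java codec implementations instead of the native ones.
conf.setBoolean("hadoop.native.lib", false);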

If you do a lot of compression and decompression with the native library, consider using CodecPool. It works a bit like a connection pool, so that you don't have to create compressor and decompressor objects frequently.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamCompressor {
    public static void main(String[] args) throws ClassNotFoundException {
        String codecClassName = args[0];
        // Get the codec class object for reflection
        Class<?> codecClass = Class.forName(codecClassName);
        // Create the configuration
        Configuration conf = new Configuration();
        // Instantiate the codec via reflection
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // The compressor to borrow from the pool
        Compressor compressor = null;
        try {
            // Obtain a compressor from the CodecPool
            compressor = CodecPool.getCompressor(codec);
            // Create the compressed output stream using the pooled compressor
            CompressionOutputStream out = codec.createOutputStream(System.out, compressor);
            // Compress
            IOUtils.copyBytes(System.in, out, 4096, false);
            // Finish writing the compressed data
            out.finish();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Return the compressor to the pool
            CodecPool.returnCompressor(compressor);
        }
    }
}

The code is easy to follow: a Compressor object is obtained through CodecPool's getCompressor() method, which takes the codec as its argument; the compressor is then passed to createOutputStream. After use, it is put back into the pool with returnCompressor().
The output results are as follows:

[exec] 13/06/27 12:00:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/27 12:00:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] 13/06/27 12:00:06 INFO compress.CodecPool: Got brand-new compressor
[exec] Hello lastsweetop
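CodecPool pools Decompressor objects in the same way, via getDecompressor() and returnDecompressor(). The following counterpart class is a hypothetical sketch (it does not appear in the original article): it reads compressed data from standard input and writes the decompressed bytes to standard output.

package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.Decompressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamDecompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        Decompressor decompressor = null;
        try {
            // Borrow a pooled decompressor instead of creating a new one
            decompressor = CodecPool.getDecompressor(codec);
            CompressionInputStream in = codec.createInputStream(System.in, decompressor);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Return the decompressor so it can be reused
            CodecPool.returnDecompressor(decompressor);
        }
    }
}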



Originally from: http://blog.csdn.net/lastsweetop/article/details/9173061

Code from: https://github.com/lastsweetop/styhadoop
