Brief introduction
"Codec" is a portmanteau of "coder" and "decoder". In Hadoop, CompressionCodec defines the interface for compression and decompression, and the codecs we are talking about here are the classes that implement the CompressionCodec interface, one per compression format. The main implementations are org.apache.hadoop.io.compress.DefaultCodec (DEFLATE), GzipCodec (gzip), and BZip2Codec (bzip2); LzopCodec (LZO) is available as a separate package.
Compressing and decompressing with CompressionCodec
CompressionCodec provides two convenient methods for compression and decompression:
Compression: the createOutputStream(OutputStream out) method returns a CompressionOutputStream.
Decompression: the createInputStream(InputStream in) method returns a CompressionInputStream.
Sample code for compression
[Java]
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * Created with IntelliJ IDEA.
 * User: lastsweetop
 * Date: 13-6-25
 * Time: 10:09
 */
public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        // Instantiate the codec by reflection
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // Wrap standard output in a compressed stream
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();
    }
}
The program takes the name of a CompressionCodec implementation class from the command line, instantiates it with ReflectionUtils, and calls the codec's createOutputStream() method to wrap standard output in a compressed stream. IOUtils.copyBytes() then copies standard input into the compressed stream, and finally finish() is called on the CompressionOutputStream to flush the remaining compressed data.
Run it from the command line:
[Shell]
echo "Hello lastsweetop" | ~/hadoop/bin/hadoop com.sweetop.styhadoop.StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
This uses the GzipCodec class to compress "Hello lastsweetop" and then decompresses the result with the gunzip tool. The output looks like this:
[Shell]
[exec] 13/06/26 20:01:53 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/26 20:01:53 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] Hello lastsweetop
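The stream GzipCodec produces is ordinary gzip data, which is why gunzip can read it. As an aside that does not involve Hadoop at all, the same round trip can be sketched with the plain JDK's java.util.zip classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] original = "Hello lastsweetop".getBytes(StandardCharsets.UTF_8);

        // Compress: wrap an output stream, write, then close to flush the gzip trailer
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzOut = new GZIPOutputStream(compressed)) {
            gzOut.write(original);
        }

        // Decompress: wrap an input stream over the compressed bytes and read them back
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (GZIPInputStream gzIn = new GZIPInputStream(
                new ByteArrayInputStream(compressed.toByteArray()))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzIn.read(buf)) != -1) {
                restored.write(buf, 0, n);
            }
        }

        System.out.println(new String(restored.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Because the formats match, data compressed by GzipCodec on one side can be decompressed by GZIPInputStream (or gunzip) on the other.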
Decompressing with CompressionCodecFactory
To read a compressed file, you first have to work out which codec to use from the file extension. There is a convenient way to do this: CompressionCodecFactory has done the work for you. Pass a path to its getCodec() method and it returns the matching codec. Let's look at the source code:
[Java]
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Get the file system
        FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
        // Build the input path
        Path inputPath = new Path(uri);
        // Create a CompressionCodecFactory object
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Infer the compression format of the file from its extension
        CompressionCodec codec = factory.getCodec(inputPath);
        // Exit if no matching compression format exists
        if (codec == null) {
            System.out.println("No codec found for " + uri);
            System.exit(1);
        }
        // Strip the file's compression suffix; use the result as the decompression output path
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        // Declare the input/output streams
        InputStream in = null;
        OutputStream out = null;
        try {
            // Create the input/output streams
            in = codec.createInputStream(fileSystem.open(inputPath));
            out = fileSystem.create(new Path(outputUri));
            // Decompress
            IOUtils.copyBytes(in, out, conf);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}
Note the removeSuffix() method: it is a static method that strips the compression suffix from the file name, and we then use the result as the output path for the decompressed file. The set of codecs that CompressionCodecFactory can find is limited; by default there are only three: org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, and org.apache.hadoop.io.compress.DefaultCodec. If you want to add other codecs, you need to change the io.compression.codecs property and register the codec there.
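Registering an extra codec is done in core-site.xml. A sketch of what the property looks like, assuming the commonly used hadoop-lzo package is on the classpath (the com.hadoop.compression.lzo.LzopCodec class name is an assumption from that package, not something this article's environment confirms):

```xml
<!-- core-site.xml: append the extra codec to the default list -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
```

After this, CompressionCodecFactory.getCodec() will also recognize files with the new codec's extension.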
Native Library
The native-library idea is more and more common, and the HDFS codecs are no exception: a native library can improve performance substantially. With native gzip, for example, decompression is about 50% faster and compression about 10% faster than the pure-Java implementation. But not every codec has a native implementation, and some codecs have only a native one. Roughly, the situation is:
DEFLATE: Java implementation yes, native implementation yes
gzip: Java yes, native yes
bzip2: Java yes, native no
LZO: Java no, native yes
Under Linux, Hadoop ships precompiled 32-bit and 64-bit native libraries; have a look:
[Shell]
[[email protected] native]$ pwd
/home/hadoop/hadoop/lib/native
[[email protected] native]$ ls -ls
total 8
4 drwxrwxrwx 2 root root 4096 Nov linux-amd64-64
4 drwxrwxrwx 2 root root 4096 Nov linux-i386-32
For other platforms you will need to compile the libraries yourself; detailed steps are described at http://wiki.apache.org/hadoop/NativeHadoop
The Java native library path is specified by the java.library.path system property. The Hadoop startup script in the bin directory already sets it; if you do not use that script, you need to set it in your own program.
[Shell]
if [ -d "${HADOOP_HOME}/build/native" -o -d "${HADOOP_HOME}/lib/native" -o -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
  if [ -d "$HADOOP_HOME/build/native" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
  fi
  if [ -d "${HADOOP_HOME}/lib/native" ]; then
    if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
      JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    else
      JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native/${JAVA_PLATFORM}
    fi
  fi
  if [ -e "${HADOOP_PREFIX}/lib/libhadoop.a" ]; then
    JAVA_LIBRARY_PATH=${HADOOP_PREFIX}/lib
  fi
fi
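If you launch the JVM yourself rather than through this script, the path has to be supplied with -Djava.library.path=... on the java command line. A minimal sketch for checking what the running JVM actually sees (the property name is standard Java; the Hadoop directory in the comment is just an illustrative value):

```java
public class LibraryPathCheck {
    public static void main(String[] args) {
        // java.library.path is where System.loadLibrary() searches for native code
        String path = System.getProperty("java.library.path");
        System.out.println("java.library.path = " + path);

        // When launching by hand, pass the Hadoop native directory explicitly, e.g.:
        //   java -Djava.library.path=/home/hadoop/hadoop/lib/native/linux-amd64-64 ...
        if (path == null || path.isEmpty()) {
            System.out.println("No native library path set; native codecs will not load.");
        }
    }
}
```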
Hadoop finds the corresponding native library and loads it automatically, so you normally don't need to care about these settings. But sometimes you may not want to use the native library, for example when debugging a bug; you can disable it by setting hadoop.native.lib to false.
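A sketch of disabling the native library in core-site.xml (the same hadoop.native.lib property can also be passed with -D on the command line):

```xml
<!-- core-site.xml: force the pure-Java codec implementations -->
<property>
  <name>hadoop.native.lib</name>
  <value>false</value>
</property>
```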
If you do a lot of compression and decompression with the native library, consider using CodecPool. It works a bit like a connection pool, so you don't have to create compressor and decompressor objects over and over.
[Java]
package com.sweetop.styhadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;

public class PooledStreamCompressor {
    public static void main(String[] args) throws ClassNotFoundException {
        String codecClassName = args[0];
        // Get the class object of the codec for reflection
        Class<?> codecClass = Class.forName(codecClassName);
        // Create the configuration
        Configuration conf = new Configuration();
        // Instantiate the codec by reflection
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // Declare the compressor
        Compressor compressor = null;
        try {
            // Borrow a compressor from the CodecPool
            compressor = CodecPool.getCompressor(codec);
            // Wrap standard output in a compressed stream that uses the pooled compressor
            CompressionOutputStream out = codec.createOutputStream(System.out, compressor);
            // Compress
            IOUtils.copyBytes(System.in, out, 4096, false);
            // Finish the compressed stream
            out.finish();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Return the compressor to the pool
            CodecPool.returnCompressor(compressor);
        }
    }
}
The code is easy to follow: the Compressor object is obtained from the pool with CodecPool's getCompressor() method, which takes the codec as an argument, and the compressor is then passed to createOutputStream(). After use, it is returned to the pool with returnCompressor(). The output looks like this:
[Shell]
[exec] 13/06/27 12:00:06 INFO util.NativeCodeLoader: Loaded the native-hadoop library
[exec] 13/06/27 12:00:06 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
[exec] 13/06/27 12:00:06 INFO compress.CodecPool: Got brand-new compressor
[exec] Hello lastsweetop
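CodecPool itself needs Hadoop on the classpath, but the get/return pattern it implements can be sketched with plain JDK classes. This is an illustrative analogy only, a tiny pool of java.util.zip.Deflater objects, not Hadoop's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.Deflater;

public class TinyDeflaterPool {
    private final Deque<Deflater> pool = new ArrayDeque<>();

    // Like CodecPool.getCompressor(): reuse an idle object if one exists
    public synchronized Deflater get() {
        Deflater d = pool.poll();
        return (d != null) ? d : new Deflater();
    }

    // Like CodecPool.returnCompressor(): reset the object and park it for reuse
    public synchronized void release(Deflater d) {
        d.reset();
        pool.push(d);
    }

    public static void main(String[] args) {
        TinyDeflaterPool pool = new TinyDeflaterPool();
        Deflater first = pool.get();   // brand-new compressor
        pool.release(first);
        Deflater second = pool.get();  // the same instance, reused
        System.out.println("reused: " + (first == second));
    }
}
```

The "Got brand-new compressor" log line above corresponds to the case where the pool was empty and a new object had to be created; on later calls a pooled instance is handed back instead.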
Originally from: http://blog.csdn.net/lastsweetop/article/details/9173061
Code from: Https://github.com/lastsweetop/styhadoop