CompressionCodecFactory in Hadoop Compression


1. CompressionCodecFactory Introduction

When reading a compressed file, you often do not know in advance which algorithm was used to compress it, and without that information the file cannot be decompressed. In Hadoop, CompressionCodecFactory solves this: its getCodec() method maps a file name extension to the corresponding CompressionCodec class. For example, passing README.txt.gz to getCodec() yields the GzipCodec class. For background on compression in Hadoop, see my post "Hadoop compression": http://www.cnblogs.com/robert-blue/p/4155710.html
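Conceptually, the factory just matches the file name's suffix against a table of registered codecs. The sketch below illustrates that idea in plain Java; the class, method, and table here are hypothetical simplifications, not the Hadoop API (Hadoop builds the real table from the codec classes registered in the configuration).

```java
import java.util.HashMap;
import java.util.Map;

public class CodecLookupSketch {

    // Hypothetical extension-to-codec table, for illustration only.
    static final Map<String, String> CODECS = new HashMap<>();
    static {
        CODECS.put(".gz", "GzipCodec");
        CODECS.put(".deflate", "DefaultCodec");
        CODECS.put(".bz2", "BZip2Codec");
    }

    // Roughly what CompressionCodecFactory.getCodec(Path) does:
    // find a registered codec whose suffix matches the file name.
    static String getCodec(String fileName) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // no codec registered for this extension
    }

    // Roughly what CompressionCodecFactory.removeSuffix(path, suffix) does.
    static String removeSuffix(String fileName, String suffix) {
        if (fileName.endsWith(suffix)) {
            return fileName.substring(0, fileName.length() - suffix.length());
        }
        return fileName;
    }

    public static void main(String[] args) {
        System.out.println(getCodec("README.txt.gz"));            // GzipCodec
        System.out.println(removeSuffix("README.txt.gz", ".gz")); // README.txt
    }
}
```

The real factory handles multi-part extensions and unknown suffixes the same way: if no codec matches, getCodec() returns null, which is why the example program below checks for null before proceeding.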

Example: decompressing a file using a codec inferred from the file name extension

package cn.roboson.codec;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

/*
 * Infer the CompressionCodec via CompressionCodecFactory:
 * 1. Upload a .gz file to Hadoop
 * 2. Infer the compression algorithm from the file name suffix
 * 3. Decompress the uploaded file into the same directory
 */
public class StreamCompressor02 {

    public static void main(String[] args) {

        Configuration conf = new Configuration();
        conf.addResource("core-site.xml");

        try {
            FileSystem fs = FileSystem.get(conf);

            // Local file
            String localsrc = "/home/roboson/Desktop/README.txt.gz";
            Path localPath = new Path(localsrc);

            // Destination path
            String hadoopdsc = "/roboson/README.txt.gz";
            Path hadoopPath = new Path(hadoopdsc);

            // List the files under /roboson before copying
            FileStatus[] files = fs.listStatus(new Path("/roboson/"));
            System.out.println("Before copying:");
            for (FileStatus fileStatus : files) {
                System.out.println(fileStatus.getPath());
            }

            // Copy the local file into the Hadoop file system
            fs.copyFromLocalFile(localPath, hadoopPath);

            // List the files under /roboson after copying
            files = fs.listStatus(new Path("/roboson/"));
            System.out.println("After copying:");
            for (FileStatus fileStatus : files) {
                System.out.println(fileStatus.getPath());
            }

            // Create a CompressionCodecFactory instance to infer the codec
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);

            // Infer the codec to use for decompression from the file name
            CompressionCodec codec = factory.getCodec(hadoopPath);
            if (codec == null) {
                System.out.println("No codec found for this file");
                System.exit(1);
            }

            /*
             * 1. CompressionCodecFactory.removeSuffix() returns the file name with
             *    the compression suffix stripped: README.txt.gz -> README.txt
             * 2. CompressionCodec.getDefaultExtension() returns the default extension
             *    of the compression algorithm, e.g. ".gz" for gzip
             */
            String uncodecUrl = CompressionCodecFactory.removeSuffix(hadoopdsc,
                    codec.getDefaultExtension());
            System.out.println("Codec's default extension: " + codec.getDefaultExtension());
            System.out.println("File name after decompression: " + uncodecUrl);

            // Create the decompressed file in Hadoop
            FSDataOutputStream out = fs.create(new Path(uncodecUrl));

            // Open the input stream and wrap it with the codec's createInputStream(),
            // which decompresses the data as it is read
            FSDataInputStream in = fs.open(new Path(hadoopdsc));
            CompressionInputStream codecIn = codec.createInputStream(in);

            // Copy the decompressed input stream to the output stream
            IOUtils.copyBytes(codecIn, out, conf, true);

            // List the files under /roboson after decompression
            files = fs.listStatus(new Path("/roboson/"));
            System.out.println("After decompression:");
            for (FileStatus fileStatus : files) {
                System.out.println(fileStatus.getPath());
            }

            // Print the decompressed content
            System.out.println("Decompressed content:");
            in = fs.open(new Path(uncodecUrl));
            IOUtils.copyBytes(in, System.out, conf, true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Run result: the program prints the /roboson directory listing before copying, after copying, and after decompression, followed by the content of the decompressed README.txt.

2. Native Libraries

What is a native library? It is a library implemented in platform-specific (native) code. Java itself ships with compression and decompression implementations, but the gzip codec used above is not the Java implementation: it relies on the native gzip support that comes with Linux, where gzip is a common compression tool. Hadoop itself ships with prebuilt native compression libraries for 32-bit and 64-bit Linux (located in the lib/native directory). For other platforms, the libraries must be compiled following the instructions on the Hadoop wiki.

The native library location can be specified with the Java system property java.library.path, which the Hadoop scripts in the bin directory set for you. By default, Hadoop searches for the native libraries appropriate to the platform it is running on and loads them automatically if they are found, so normally no settings need to be changed to use them. For performance, it is best to use the native libraries for compression and decompression, because they are more efficient. So which compression formats in Hadoop have native implementations, and which have Java implementations?
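A quick way to see where the JVM will look for native libraries is to print the property mentioned above; this check uses only the standard library (inside a Hadoop program you could instead call org.apache.hadoop.util.NativeCodeLoader.isNativeCodeLoaded(), shown here only in a comment):

```java
public class NativeLibPathCheck {
    public static void main(String[] args) {
        // The JVM searches these directories when loading native libraries;
        // Hadoop's launch scripts append lib/native to this property.
        String path = System.getProperty("java.library.path");
        System.out.println("java.library.path = " + path);

        // Inside a Hadoop process you could instead ask Hadoop directly:
        // org.apache.hadoop.util.NativeCodeLoader.isNativeCodeLoaded()
    }
}
```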

Compression library implementations:

Compression format | Java implementation | Native implementation
-------------------|---------------------|----------------------
DEFLATE            | Yes                 | Yes
gzip               | Yes                 | Yes
bzip2              | Yes                 | No
LZO                | No                  | Yes
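The "Java implementation" column for gzip means the format can be handled with nothing but the JDK: java.util.zip provides GZIPOutputStream and GZIPInputStream. A minimal round-trip sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class JavaGzipRoundTrip {

    // Compress a byte array with the JDK's pure-Java gzip implementation
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress a gzip-compressed byte array
    static byte[] gunzip(byte[] data) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello hadoop compression".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

This is the fallback Hadoop's gzip codec uses when no native library is loaded; the native (zlib-based) path exists purely for speed, and the output formats are interchangeable.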
