1. CompressionCodecFactory Introduction
When reading a compressed file, you may not know which compression algorithm was used to compress it, and without that knowledge the file cannot be decompressed. In Hadoop, CompressionCodecFactory solves this: its getCodec() method maps a file name extension to the corresponding CompressionCodec class. For example, passing README.txt.gz to getCodec() yields the GzipCodec class. For background on compression in Hadoop, see my blog post "Hadoop compression": http://www.cnblogs.com/robert-blue/p/4155710.html
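As a minimal sketch of just this lookup (the class name here is illustrative), inferring the codec is a single call against a Configuration-backed factory:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Minimal sketch: map a file name extension to its CompressionCodec via getCodec()
public class CodecLookup {

    // Returns the simple class name of the codec inferred for the file, or "none"
    public static String codecNameFor(String file) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(new Path(file));
        return codec == null ? "none" : codec.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        // A .gz extension maps to the gzip codec; an unknown extension maps to nothing
        System.out.println(codecNameFor("README.txt.gz"));
        System.out.println(codecNameFor("README.txt"));
    }
}
```

Note that getCodec() returns null when no registered codec matches the extension, so the caller must handle that case, as the full example below does.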
Example: Extracting a file by using a codec inferred from the file name extension
package cn.roboson.codec;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

/*
 * Infer the CompressionCodec with CompressionCodecFactory
 * 1. First upload a .gz file to Hadoop
 * 2. Infer the compression algorithm from the file name extension
 * 3. Decompress the uploaded compressed file into the same directory
 */
public class StreamCompressor02 {

    public static void main(String[] args) {

        Configuration conf = new Configuration();
        conf.addResource("core-site.xml");

        try {
            FileSystem fs = FileSystem.get(conf);

            // Local file
            String localsrc = "/home/roboson/Desktop/README.txt.gz";
            Path localpath = new Path(localsrc);

            // Destination path
            String hadoopdsc = "/roboson/README.txt.gz";
            Path hadooppath = new Path(hadoopdsc);

            // List the files under /roboson before copying
            FileStatus[] files = fs.listStatus(new Path("/roboson/"));
            System.out.println("Before copying:");
            for (FileStatus filestatus : files) {
                System.out.println(filestatus.getPath());
            }

            // Copy the local file into the Hadoop file system
            fs.copyFromLocalFile(localpath, hadooppath);

            // List the files under /roboson after copying
            files = fs.listStatus(new Path("/roboson/"));
            System.out.println("After copying:");
            for (FileStatus filestatus : files) {
                System.out.println(filestatus.getPath());
            }

            // Get a CompressionCodecFactory instance to infer the compression algorithm
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);

            // Infer from the factory the CompressionCodec class used for decompression
            CompressionCodec codec = factory.getCodec(hadooppath);
            if (codec == null) {
                System.out.println("No codec found for the file");
                System.exit(1);
            }

            /*
             * 1. CompressionCodecFactory.removeSuffix() returns the file name with the
             *    compression suffix removed: calling it on README.txt.gz returns README.txt
             * 2. CompressionCodec.getDefaultExtension() returns the default extension of
             *    the compression algorithm, e.g. ".gz" for gzip
             */
            String uncodecurl = CompressionCodecFactory.removeSuffix(hadoopdsc, codec.getDefaultExtension());
            System.out.println("Extension of the compression algorithm: " + codec.getDefaultExtension());
            System.out.println("File name generated after decompression: " + uncodecurl);

            // Create the decompressed file in Hadoop
            FSDataOutputStream out = fs.create(new Path(uncodecurl));

            // Open the input stream and decompress it with the codec's createInputStream()
            FSDataInputStream in = fs.open(new Path(hadoopdsc));
            CompressionInputStream codecin = codec.createInputStream(in);

            // Copy the decompressed input stream to the output stream
            IOUtils.copyBytes(codecin, out, conf, true);

            // List the files under /roboson after decompression
            files = fs.listStatus(new Path("/roboson/"));
            System.out.println("After decompression:");
            for (FileStatus filestatus : files) {
                System.out.println(filestatus.getPath());
            }

            // Print the decompressed content
            System.out.println("Decompressed content:");
            in = fs.open(new Path(uncodecurl));
            IOUtils.copyBytes(in, System.out, conf, true);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Operation Result:
2. Native Class Library
What is a native class library? It is a library implemented in native (platform) code. Java also ships with its own compression and decompression implementations, but the gzip used above is not the Java implementation; it is the one the Linux system comes with, and as we all know gzip is a common compression tool on Linux. Hadoop itself ships with prebuilt native compression libraries for 32-bit and 64-bit Linux (located in the lib/native directory). For other platforms, the libraries need to be compiled according to the instructions on the Hadoop wiki.
The native library location can be specified with the Java system property java.library.path, which the Hadoop scripts in the bin directory set for you. By default, Hadoop searches for the native library matching the platform it runs on and loads it automatically if found, so you do not need to change any settings to use the native libraries. For performance, it is best to use the native libraries for compression and decompression, because they are more efficient. So which compression formats in Hadoop have native implementations, and which have Java implementations?
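To see whether the native library was actually picked up at runtime, Hadoop exposes NativeCodeLoader; the small helper below is an illustrative sketch around it (newer Hadoop releases also provide a `hadoop checknative` command for the same purpose):

```java
import org.apache.hadoop.util.NativeCodeLoader;

// Report whether Hadoop found and loaded its native code library
public class NativeCheck {

    public static String report() {
        return "Native hadoop library loaded: " + NativeCodeLoader.isNativeCodeLoaded();
    }

    public static void main(String[] args) {
        // Prints true only if a matching lib/native build was found for this platform
        System.out.println(report());
    }
}
```

Whether this prints true or false depends entirely on the platform and on java.library.path, which is why the scripts in bin set it up before launching the JVM.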
Compression library implementations:

| Compression format | Java implementation | Native implementation |
|--------------------|---------------------|-----------------------|
| DEFLATE            | Yes                 | Yes                   |
| gzip               | Yes                 | Yes                   |
| bzip2              | Yes                 | No                    |
| LZO                | No                  | Yes                   |
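The format support can also be observed through a default Configuration: the factory registers codecs for DEFLATE, gzip and bzip2 out of the box, while LZO (distributed separately for licensing reasons) resolves to nothing unless its codec is added via the io.compression.codecs property. The class below is an illustrative sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Show which codec a default-configured factory infers for common extensions
public class CodecTable {

    public static String codecFor(String file) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(new Path(file));
        return codec == null ? "none registered" : codec.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        for (String name : new String[] { "a.deflate", "a.gz", "a.bz2", "a.lzo" }) {
            System.out.println(name + " -> " + codecFor(name));
        }
    }
}
```

On a stock installation the .lzo lookup returns "none registered"; after adding an LZO codec jar and listing its class in io.compression.codecs, the same lookup would resolve to that codec.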