Compression algorithms in Hadoop


Common data compression algorithms

File compression brings two main advantages: it reduces the space needed to store files, and it speeds up data transfer. In a Hadoop big-data context both matter a great deal, so it is worth examining how Hadoop handles file compression. Hadoop supports many compression formats; their characteristics are summarized below.

The LZO and LZ4 codecs are not bundled with Hadoop 1.x.

1. DEFLATE is a lossless data compression algorithm that combines LZ77 with Huffman coding; its reference implementation is the zlib library. Gzip is a format and tool built on top of the DEFLATE algorithm.
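As an aside, the JDK's built-in java.util.zip package exposes this same DEFLATE algorithm, so a minimal lossless round trip can be sketched without any Hadoop dependency (the class and variable names here are illustrative, not from the original article):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {
    public static void main(String[] args) throws Exception {
        byte[] input = "Hadoop compression: DEFLATE combines LZ77 and Huffman coding."
                .getBytes("UTF-8");

        // Compress with DEFLATE (the same algorithm zlib implements)
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[1024];
        int compressedLen = deflater.deflate(buf);
        deflater.end();

        // Decompress and verify the round trip is lossless
        Inflater inflater = new Inflater();
        inflater.setInput(buf, 0, compressedLen);
        byte[] out = new byte[1024];
        int restoredLen = inflater.inflate(out);
        inflater.end();

        System.out.println(new String(out, 0, restoredLen, "UTF-8"));
    }
}
```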
2. Compression is always a trade between space and time: you can optimize for faster compression or for a smaller output. Many tools expose this as a level parameter, where 1 favors speed and 9 favors space. Taking gzip as an example, the following asks for faster compression: gzip -1 file
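The same 1-to-9 level knob exists on the JDK's java.util.zip.Deflater, so the trade-off can be measured directly. This sketch (class and method names are illustrative, not from the original article) compresses the same input at level 1 and level 9 and prints the resulting sizes:

```java
import java.util.zip.Deflater;

public class LevelDemo {
    public static void main(String[] args) {
        // Moderately redundant input, so the level actually matters
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) sb.append("record-").append(i).append(';');
        byte[] input = sb.toString().getBytes();

        System.out.println("level 1 (speed): " + compressedSize(input, Deflater.BEST_SPEED));
        System.out.println("level 9 (space): " + compressedSize(input, Deflater.BEST_COMPRESSION));
    }

    // Compress the input at the given level and return the compressed size
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length * 2];
        int len = deflater.deflate(buf);
        deflater.end();
        return len;
    }
}
```

Level 9 produces a smaller output than level 1 on this input, at the cost of more CPU time.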
3. Gzip is moderate in both time and space. Bzip2 compresses more effectively than gzip but runs slower; bzip2 decompresses faster than it compresses, yet it is still the slowest of the common formats, even though its compression ratio is clearly the best. Snappy and LZ4 decompress significantly faster than LZO.
4. Splittable indicates whether the compression format can be split, that is, whether it supports reading from an arbitrary offset. Whether compressed data can be consumed by MapReduce, and whether it can be segmented, is critical.
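To make the splittable point concrete: a plain gzip stream can only be decoded from its beginning, so a mapper handed the middle of a .gz file has nothing it can work with. This sketch using the JDK's GZIPInputStream (illustrative JDK code, not Hadoop code) demonstrates the failure:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        // Build a gzip-compressed byte stream in memory
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        for (int i = 0; i < 10000; i++) gz.write(("line " + i + "\n").getBytes());
        gz.close();
        byte[] compressed = bos.toByteArray();

        // Reading from the beginning works...
        new GZIPInputStream(new ByteArrayInputStream(compressed)).read();

        // ...but starting mid-stream (as a second mapper would) fails,
        // because the gzip header and DEFLATE state are only at the start
        try {
            int half = compressed.length / 2;
            new GZIPInputStream(new ByteArrayInputStream(
                    compressed, half, compressed.length - half)).read();
        } catch (java.io.IOException e) {
            System.out.println("cannot start mid-stream: " + e.getMessage());
        }
    }
}
```

Bzip2, by contrast, writes block markers that allow a reader to resynchronize mid-file, which is why it is splittable.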
Four compression formats see the most use in Hadoop: LZO, gzip, Snappy, and bzip2. The following compares the characteristics of these four formats.

Codec implementation classes

CompressionCodec, defined in the org.apache.hadoop.io.compress package, is the interface for compression and decompression. The classes below implement this interface.

CompressionCodec methods
CompressionCodec has one method for each direction:
compress: obtain a CompressionOutputStream from the createOutputStream(OutputStream out) method
decompress: obtain a CompressionInputStream from the createInputStream(InputStream in) method

The following example puts this to use. First, dd copies a file in blocks of a specified size, applying any requested conversions during the copy.

Generate a 512 MB test file:

[root@… liguodong]# dd if=/dev/zero of=data bs=1024k count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 0.557151 s, 964 MB/s
[root@… liguodong]# ll data
-rw-r--r-- 1 root root 536870912 Jun  5 19:11 data
[root@… liguodong]# pwd
/liguodong
[root@… liguodong]# ls
codec.jar  data  dir  jni
package compress;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.ReflectionUtils;

public class Test {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // 1. Create the configuration and the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "Codec");

        // 2. Set the jar so the packaged job can be run
        job.setJarByClass(Test.class);

        // Pick the codec class; swap in GzipCodec to produce a .gz file instead
        String codecClassName = "org.apache.hadoop.io.compress.BZip2Codec";
        // String codecClassName = "org.apache.hadoop.io.compress.GzipCodec";
        Class<?> clsClass = Class.forName(codecClassName);
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(clsClass, configuration);

        String inputFile = "/liguodong/data";
        // Append the codec's default extension, e.g. ".bz2"
        String outFile = inputFile + codec.getDefaultExtension();

        FileOutputStream fileOut = new FileOutputStream(outFile);
        CompressionOutputStream out = codec.createOutputStream(fileOut);
        FileInputStream in = new FileInputStream(inputFile);
        IOUtils.copyBytes(in, out, 4096, false);
        in.close();
        out.close();
    }
}

Package it into a jar, codec.jar, and run it:

[root@… liguodong]# yarn jar codec.jar
INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
[root@… liguodong]# ls
codec.jar  data  data.bz2  data.gz  dir  jni
[root@… liguodong]# ll
total 524824
-rw-r--r-- 1 root root 536870912 Jun  5 19:11 data
-rw-r--r-- 1 root root       402 Jun  5       data.bz2
-rw-r--r-- 1 root root    521844 Jun  5       data.gz

The 512 MB file of zeros shrinks to 402 bytes with bzip2 and 521,844 bytes with gzip.
How do I choose a compression algorithm?

1. Use a container file format with built-in compression that supports splitting, such as SequenceFile, RCFile, or Avro files.
2. Use a compression format that supports splitting, for example bzip2, or LZO with an index.
3. Split the file into blocks in advance and compress each block separately; splittability is then no longer a concern.
4. Do not compress the files at all. Storing a large data file in a compression format that does not support splitting is inappropriate, because non-local processing of it is very inefficient.
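Option 3 above can be sketched with plain JDK gzip: compress fixed-size blocks independently, so that any block can be decompressed, and therefore processed by a mapper, on its own. The class names and the 64 KB block size here are illustrative assumptions, not from the original article:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BlockCompress {
    static final int BLOCK_SIZE = 64 * 1024; // stand-in for an HDFS-block-sized chunk

    // Compress each fixed-size block on its own, so every block is an
    // independent gzip stream that can be decompressed in isolation
    static List<byte[]> compressInBlocks(byte[] data) throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, data.length - off);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data, off, len);
            }
            blocks.add(bos.toByteArray());
        }
        return blocks;
    }

    // Decompress a single block without touching any of the others
    static byte[] decompressBlock(byte[] block) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(block))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[200_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 251);
        List<byte[]> blocks = compressInBlocks(data);
        System.out.println(blocks.size() + " independently decompressible blocks");
    }
}
```

The trade-off is a little compression-ratio loss at block boundaries in exchange for parallel processing.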

