Compression algorithms in Hadoop


Common data compression algorithms

File compression brings two main advantages: it reduces the space needed to store files, and it speeds up data transfer. In a Hadoop big-data context both matter a great deal, so it is worth examining how Hadoop handles file compression. Hadoop supports many compression formats; their characteristics are summarized below.

The LZO and LZ4 codecs are not bundled with Hadoop 1.x.

1. DEFLATE is a lossless data compression algorithm that combines LZ77 with Huffman coding; its reference implementation is the zlib library. Gzip is a format and tool built on top of the DEFLATE algorithm.
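As an aside, the JDK's built-in java.util.zip package exposes this same DEFLATE algorithm, so a minimal lossless round trip can be sketched without any Hadoop dependency (the class and variable names here are illustrative, not from the original article):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {
    public static void main(String[] args) throws Exception {
        byte[] input = "Hadoop compression: DEFLATE combines LZ77 and Huffman coding."
                .getBytes("UTF-8");

        // Compress with DEFLATE (the same algorithm zlib implements)
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[1024];
        int compressedLen = deflater.deflate(buf);
        deflater.end();

        // Decompress and verify the round trip is lossless
        Inflater inflater = new Inflater();
        inflater.setInput(buf, 0, compressedLen);
        byte[] out = new byte[1024];
        int restoredLen = inflater.inflate(out);
        inflater.end();

        System.out.println(new String(out, 0, restoredLen, "UTF-8"));
    }
}
```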
2. Compression is always a trade between space and time: you can optimize for faster compression or for a smaller output. Many tools expose this as a level parameter, where 1 favors speed and 9 favors space. Taking gzip as an example, the following asks for faster compression: gzip -1 file
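The same 1-to-9 level knob exists on the JDK's java.util.zip.Deflater, so the trade-off can be measured directly. This sketch (class and method names are illustrative, not from the original article) compresses the same input at level 1 and level 9 and prints the resulting sizes:

```java
import java.util.zip.Deflater;

public class LevelDemo {
    public static void main(String[] args) {
        // Moderately redundant input, so the level actually matters
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) sb.append("record-").append(i).append(';');
        byte[] input = sb.toString().getBytes();

        System.out.println("level 1 (speed): " + compressedSize(input, Deflater.BEST_SPEED));
        System.out.println("level 9 (space): " + compressedSize(input, Deflater.BEST_COMPRESSION));
    }

    // Compress the input at the given level and return the compressed size
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length * 2];
        int len = deflater.deflate(buf);
        deflater.end();
        return len;
    }
}
```

Level 9 produces a smaller output than level 1 on this input, at the cost of more CPU time.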
3. Gzip is moderate in both time and space. Bzip2 compresses more effectively than gzip but runs slower; bzip2 decompresses faster than it compresses, yet it is still the slowest of the common formats, even though its compression ratio is clearly the best. Snappy and LZ4 decompress significantly faster than LZO.
4. Splittable indicates whether the compression format can be split, that is, whether it supports reading from an arbitrary offset. Whether compressed data can be consumed by MapReduce, and whether it can be segmented, is critical.
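To make the splittable point concrete: a plain gzip stream can only be decoded from its beginning, so a mapper handed the middle of a .gz file has nothing it can work with. This sketch using the JDK's GZIPInputStream (illustrative JDK code, not Hadoop code) demonstrates the failure:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SplitDemo {
    public static void main(String[] args) throws Exception {
        // Build a gzip-compressed byte stream in memory
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gz = new GZIPOutputStream(bos);
        for (int i = 0; i < 10000; i++) gz.write(("line " + i + "\n").getBytes());
        gz.close();
        byte[] compressed = bos.toByteArray();

        // Reading from the beginning works...
        new GZIPInputStream(new ByteArrayInputStream(compressed)).read();

        // ...but starting mid-stream (as a second mapper would) fails,
        // because the gzip header and DEFLATE state are only at the start
        try {
            int half = compressed.length / 2;
            new GZIPInputStream(new ByteArrayInputStream(
                    compressed, half, compressed.length - half)).read();
        } catch (java.io.IOException e) {
            System.out.println("cannot start mid-stream: " + e.getMessage());
        }
    }
}
```

Bzip2, by contrast, writes block markers that allow a reader to resynchronize mid-file, which is why it is splittable.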
Four compression formats see the most use in Hadoop: LZO, gzip, Snappy, and bzip2. The following compares the characteristics of these four formats.

Codec implementation classes

CompressionCodec, defined in the org.apache.hadoop.io.compress package, is the interface for compression and decompression. The classes below implement this interface.

CompressionCodec methods
CompressionCodec has one method for each direction:
compress: obtain a CompressionOutputStream from the createOutputStream(OutputStream out) method
decompress: obtain a CompressionInputStream from the createInputStream(InputStream in) method

The following example puts this to use. First, dd copies a file in blocks of a specified size, applying any requested conversions during the copy.

Generate a 512 MB test file:

[root@… liguodong]# dd if=/dev/zero of=data bs=1024k count=512
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 0.557151 s, 964 MB/s
[root@… liguodong]# ll data
-rw-r--r-- 1 root root 536870912 Jun  5 19:11 data
[root@… liguodong]# pwd
/liguodong
[root@… liguodong]# ls
codec.jar  data  dir  jni
package compress;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.ReflectionUtils;

public class Test {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // 1. Create the configuration and the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "Codec");

        // 2. Set the jar so the packaged job can be run
        job.setJarByClass(Test.class);

        // Pick the codec class; swap in GzipCodec to produce a .gz file instead
        String codecClassName = "org.apache.hadoop.io.compress.BZip2Codec";
        // String codecClassName = "org.apache.hadoop.io.compress.GzipCodec";
        Class<?> clsClass = Class.forName(codecClassName);
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(clsClass, configuration);

        String inputFile = "/liguodong/data";
        // Append the codec's default extension, e.g. ".bz2"
        String outFile = inputFile + codec.getDefaultExtension();

        FileOutputStream fileOut = new FileOutputStream(outFile);
        CompressionOutputStream out = codec.createOutputStream(fileOut);
        FileInputStream in = new FileInputStream(inputFile);
        IOUtils.copyBytes(in, out, 4096, false);
        in.close();
        out.close();
    }
}

Package it into a jar, codec.jar, and run it:

[root@… liguodong]# yarn jar codec.jar
INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
[root@… liguodong]# ls
codec.jar  data  data.bz2  data.gz  dir  jni
[root@… liguodong]# ll
total 524824
-rw-r--r-- 1 root root 536870912 Jun  5 19:11 data
-rw-r--r-- 1 root root       402 Jun  5       data.bz2
-rw-r--r-- 1 root root    521844 Jun  5       data.gz

The 512 MB file of zeros shrinks to 402 bytes with bzip2 and 521,844 bytes with gzip.
How do I choose a compression algorithm?

1. Use a container file format with built-in compression that supports splitting, such as SequenceFile, RCFile, or Avro files.
2. Use a compression format that supports splitting, for example bzip2, or LZO with an index.
3. Split the file into blocks in advance and compress each block separately; splittability is then no longer a concern.
4. Do not compress the files at all. Storing a large data file in a compression format that does not support splitting is inappropriate, because non-local processing of it is very inefficient.
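Option 3 above can be sketched with plain JDK gzip: compress fixed-size blocks independently, so that any block can be decompressed, and therefore processed by a mapper, on its own. The class names and the 64 KB block size here are illustrative assumptions, not from the original article:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BlockCompress {
    static final int BLOCK_SIZE = 64 * 1024; // stand-in for an HDFS-block-sized chunk

    // Compress each fixed-size block on its own, so every block is an
    // independent gzip stream that can be decompressed in isolation
    static List<byte[]> compressInBlocks(byte[] data) throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, data.length - off);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data, off, len);
            }
            blocks.add(bos.toByteArray());
        }
        return blocks;
    }

    // Decompress a single block without touching any of the others
    static byte[] decompressBlock(byte[] block) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(block))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[200_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 251);
        List<byte[]> blocks = compressInBlocks(data);
        System.out.println(blocks.size() + " independently decompressible blocks");
    }
}
```

The trade-off is a little compression-ratio loss at block boundaries in exchange for parallel processing.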

