Hadoop and HDFS data compression format

Source: Internet
Author: User

1. General criteria for Cloudera data compression

General guidelines

  • Whether data is compressed, and which compression format is used, can have a significant impact on performance. The two most important places to consider data compression are MapReduce jobs and data stored in HBase. In most cases, the principles are similar for both.
  • You need to balance the processing power required to compress and decompress the data, the disk I/O required to read and write the data, and the network bandwidth required to send the data across the network. Correctly balancing these factors depends on the characteristics of your cluster, your data, and your usage patterns.
    • Compression is not recommended if the data is already compressed (for example, images in JPEG format). In fact, the resulting file can end up larger than the original.
    • GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZIP is usually a good choice for cold data that is accessed infrequently; Snappy or LZO are better suited for hot data that is accessed frequently.
    • BZIP2 can produce higher compression than GZIP for some file types, but compressing and decompressing is slower. HBase does not support BZIP2 compression.
    • Snappy usually performs better than LZO. It is worth running tests to see whether you detect a noticeable difference.
    • For MapReduce, if you need the compressed data to be splittable, the BZIP2, LZO, and Snappy formats are splittable, but GZIP is not. Splittability is not relevant for HBase data.
    • For MapReduce, you can compress the intermediate data, the output, or both. Adjust the parameters you provide to the MapReduce job accordingly. The following example compresses both the intermediate data and the output; the MRv2 command is shown first, followed by the MRv1 command, and a Java driver equivalent is sketched after the commands.
MRv2:

hadoop jar hadoop-examples-.jar sort "-Dmapreduce.compress.map.output=true" "-Dmapreduce.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapreduce.output.compress=true" "-Dmapreduce.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output

MRv1:

hadoop jar hadoop-examples-.jar sort "-Dmapred.compress.map.output=true" "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
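For reference, here is a minimal sketch of doing the same thing from a Java MRv2 driver instead of with -D flags. The class name CompressedJobDriver is hypothetical, the job relies on the default identity mapper and reducer for brevity, and the map-output property names used are the ones current in MRv2 releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: compress both the intermediate map output and the
// final job output with gzip, mirroring the -D options shown above.
public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate (map) output with gzip.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "gzip compressed job");
        job.setJarByClass(CompressedJobDriver.class);
        // The default identity mapper and reducer are used for brevity.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Compress the final job output with gzip.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}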
2. Hadoop compression implementation analysis

Introduction to Compression

As a general-purpose data processing platform, Hadoop has to process large amounts of data for every job. Compressing the data in a Hadoop system optimizes disk usage and speeds up data transfer over disk and network, thereby improving the efficiency with which the system processes data. When using compression, the main considerations are compression speed and the splittability of the compressed files. To summarize, the advantages of using compression are as follows:

  1. It saves the disk space that the data consumes;
  2. It speeds up data transfer over disk and network, thereby increasing the system's processing speed.

Compression format

    1. Hadoop automatically recognizes the compression format if the compressed file has the corresponding extension (such as .lzo, .gz, .bz2, etc.).
    2. Hadoop automatically selects the corresponding decompression codec based on the extension; this process is handled entirely by Hadoop, and we only need to make sure the input compressed file has the right extension (a sketch that uses this mechanism directly follows the example command below).
    3. The compression formats supported by Hadoop are detailed in the following table:
Compression format | Tool  | Algorithm | Extension | Multiple files | Splittable
DEFLATE            | None  | DEFLATE   | .deflate  | No             | No
Gzip               | gzip  | DEFLATE   | .gz       | No             | No
Zip                | zip   | DEFLATE   | .zip      | Yes            | Yes, at file boundaries
Bzip2              | bzip2 | BZIP2     | .bz2      | No             | Yes
LZO                | lzop  | LZO       | .lzo      | No             | Yes
    4. If the compressed file does not have a recognized extension, you need to specify the input format when running the MapReduce job, for example:
hadoop jar /usr/home/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar -file /usr/home/hadoop/hello/mapper.py -mapper /usr/home/hadoop/hello/mapper.py -file /usr/home/hadoop/hello/reducer.py -reducer /usr/home/hadoop/hello/reducer.py -input lzotest -output result4 -jobconf mapred.reduce.tasks=1 -inputformat org.apache.hadoop.mapred.LzoTextInputFormat
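The extension-based codec selection described in points 1 and 2 is exposed to applications through Hadoop's CompressionCodecFactory. Below is a minimal sketch of using it directly; the path /user/demo/data.gz is hypothetical.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Minimal sketch: let Hadoop pick the codec from the file extension and
// read the decompressed content, just as MapReduce input handling does.
public class ReadCompressedFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/data.gz");      // hypothetical path
        FileSystem fs = FileSystem.get(conf);

        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(path); // chosen by extension, null if unknown

        InputStream in = (codec == null)
                ? fs.open(path)                          // no known extension: read as-is
                : codec.createInputStream(fs.open(path));

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}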

Performance comparison

    1. The compression ratio, compression speed, and decompression speed of various compression algorithms under Hadoop are shown in the following table:
Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed
Gzip                  | 8.3 GB             | 1.8 GB               | 17.5 MB/s         | 58 MB/s
Bzip2                 | 8.3 GB             | 1.1 GB               | 2.4 MB/s          | 9.5 MB/s
LZO-best              | 8.3 GB             | 2 GB                 | 4 MB/s            | 60.6 MB/s
LZO                   | 8.3 GB             | 2.9 GB               | 49.3 MB/s         | 74.6 MB/s

From this we can conclude:
1) Bzip2 clearly achieves the best compression ratio, but its compression speed is slow; it is splittable.
2) Gzip's compression ratio is not as good as Bzip2's, but its compression and decompression speeds are fast; it does not support splitting.
3) LZO's compression ratio is worse than Bzip2's and Gzip's, but its compression and decompression speeds are the fastest, and it supports splitting.

Splittability is very important in Hadoop because it affects how many map tasks are launched when a job executes, which in turn affects the job's execution efficiency.

All compression algorithms exhibit a time-space tradeoff: faster compression and decompression speeds usually come at the cost of more space. Which compression format to use should be chosen according to business needs.

3. Application of the four compression formats used in Hadoop

The four compression formats most commonly used in Hadoop at present are LZO, gzip, Snappy, and bzip2. Based on practical experience, the author introduces the advantages, disadvantages, and application scenarios of these four formats, so that in practice you can choose among them according to the actual situation.

1. gzip compression

    • Advantages:
      1. High compression ratio, and relatively fast compression/decompression speed;
      2. Hadoop itself supports it: processing gzip files in an application is exactly the same as processing plain text;
      3. A Hadoop native library is available;
      4. Most Linux systems ship with the gzip command, which is easy to use.
    • Disadvantages: split is not supported.
    • Application scenarios:
      1. When each file compresses to within 130 MB (about the size of one block), consider the gzip format. For example, compress one day's or one hour's logs into a single gzip file, and run the MapReduce program over multiple gzip files to achieve concurrency. A sketch of writing such a file is shown after this list.
      2. Hive programs, streaming programs, and MapReduce programs written in Java handle gzip exactly like plain text; the original programs do not need to be modified after compression.
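As a sketch of scenario 1 above: the snippet below writes one day's log lines to HDFS as a single .gz file through GzipCodec, so that the Hive, streaming, or Java MapReduce programs mentioned in scenario 2 can read it without modification. The path and the sample record are hypothetical.

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Minimal sketch: write a day's worth of log lines to HDFS as a single .gz file.
public class WriteGzipLog {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/logs/2015-01-01.gz");   // hypothetical path

        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        try (OutputStream os = codec.createOutputStream(fs.create(out))) {
            os.write("2015-01-01 00:00:01 INFO sample log line\n"
                    .getBytes(StandardCharsets.UTF_8));         // hypothetical record
        }
    }
}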

2. LZO compression

    • Advantages:
      1. Relatively fast compression/decompression speed and a reasonable compression ratio;
      2. Supports split; it is the most popular compression format in Hadoop;
      3. Supports the Hadoop native library;
      4. The lzop command can be installed on Linux and is easy to use.
    • Disadvantages:
      1. The compression ratio is lower than gzip's;
      2. Hadoop itself does not support it; it needs to be installed;
      3. Files in LZO format require some special handling in the application (an index must be built to support split, and the InputFormat must be specified as the LZO format), as in the sketch after this list.
    • Application scenarios:
      Large text files that are still larger than 200 MB after compression are worth considering, and the larger the single file, the more pronounced LZO's advantages.
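The special handling mentioned in the disadvantages above is usually done with the hadoop-lzo package. Assuming hadoop-lzo is installed (the jar path and HDFS path below are hypothetical), the split index can be built with the indexer that it ships with, for example:

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/demo/big_file.lzo

The MapReduce job then needs to use hadoop-lzo's LZO-aware input format (com.hadoop.mapreduce.LzoTextInputFormat in the new API) instead of the default TextInputFormat, so that the index is honored and the file is actually split.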

3. Snappy compression

    • Advantages:
      1. Fast compression speed and a reasonable compression ratio;
      2. Supports the Hadoop native library.
    • Disadvantages:
      1. Split is not supported;
      2. The compression ratio is lower than gzip's;
      3. Hadoop itself does not support it; it needs to be installed;
      4. There is no corresponding command on Linux systems.
    • Application scenarios:
      1. When the map output of a MapReduce job is relatively large, use it as the compression format for the intermediate data between map and reduce;
      2. Or use it for the output of one MapReduce job that serves as the input of another MapReduce job, as in the sketch after this list.
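A sketch of the two scenarios above, assuming the Snappy native libraries are installed on the cluster. The class and method names are hypothetical; only the Hadoop property names, the codec class, and the SequenceFile settings matter here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical helpers covering the two Snappy scenarios above.
public class SnappyJobSettings {

    // Scenario 1: compress the intermediate map output with Snappy.
    public static void useSnappyForMapOutput(Configuration conf) {
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
    }

    // Scenario 2: write the job output as a Snappy block-compressed SequenceFile,
    // which a follow-up job can read directly with SequenceFileInputFormat.
    public static void useSnappySequenceFileOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);
    }
}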

4. bzip2 compression

    • Advantages:
      1. Supports split;
      2. Very high compression ratio, higher than gzip's;
      3. Hadoop itself supports it, but it does not support the native library;
      4. The bzip2 command ships with Linux systems and is easy to use.
    • Disadvantages:
      1. Compression/decompression speed is slow;
      2. The native library is not supported.
    • Application scenarios:
      1. Suitable when speed requirements are not high but a high compression ratio is needed; it can be used as the output format of a MapReduce job (see the sketch after this list);
      2. Or when the output data is relatively large and the processed data needs to be compressed and archived to reduce disk space, and the data will be used infrequently afterwards;
      3. Or when you want to compress a single large text file to reduce storage space while still needing split support and compatibility with existing applications (that is, the applications do not need to be modified).
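A sketch of scenario 1 above: the same output-compression calls as in the earlier gzip driver, but with BZip2Codec so the archived text output remains splittable. The helper name is hypothetical.

import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical helper: compress a job's text output with bzip2 so the archived
// files take little space but can still be split by later MapReduce jobs.
public class Bzip2JobSettings {
    public static void useBzip2TextOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    }
}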

5. Comparison of the characteristics of the four compression formats

Compression format | Splittable | Native | Compression ratio | Speed           | Built into Hadoop     | Linux command | Must the application be modified after switching to this format?
Gzip               | No         | Yes    | Very high         | Relatively fast | Yes, usable directly  | Yes           | No; handled the same as plain text
LZO                | Yes        | Yes    | Relatively high   | Fast            | No, must be installed | Yes           | Yes; an index must be built and the input format specified
Snappy             | No         | Yes    | Relatively high   | Fast            | No, must be installed | No            | No; handled the same as plain text
Bzip2              | Yes        | No     | Highest           | Slow            | Yes, usable directly  | Yes           | No; handled the same as plain text

At present, CDH clusters generally offer hadoop-lzo as an optional installation, and the UCloud cluster now has LZO integrated.
