Hadoop and HDFS data compression format

Source: Internet
Author: User

1. General criteria for Cloudera data compression

General guidelines

  • Whether data is compressed, and which compression format is used, can have a significant impact on performance. The two most important places to consider data compression are MapReduce jobs and data stored in HBase. In most cases, the principles are similar for both.
  • You need to balance the processing power required to compress and decompress the data, the disk I/O required to read and write the data, and the network bandwidth required to send the data across the network. Correctly balancing these factors depends on the characteristics of your cluster, your data, and your usage patterns.
    • Compression is not recommended if the data is already compressed (for example, images in JPEG format). In fact, the resulting file can end up larger than the original.
    • GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio. GZIP is usually a good choice for cold data that is accessed infrequently; Snappy or LZO are better suited for hot data that is accessed frequently.
    • BZIP2 can produce higher compression than GZIP for some file types, but compressing and decompressing is slower. HBase does not support BZIP2 compression.
    • Snappy usually performs better than LZO. It is worth running tests to see whether you detect a noticeable difference.
    • For MapReduce, if you need the compressed data to be splittable, the BZIP2, LZO, and Snappy formats are splittable, but GZIP is not. Splittability is not relevant for HBase data.
    • For MapReduce, you can compress the intermediate data, the output, or both. Adjust the parameters you provide to the MapReduce job accordingly. The following example compresses both the intermediate data and the output; the MRv2 command is shown first, followed by the MRv1 command, and a Java driver equivalent is sketched after the commands.
MRv2:

hadoop jar hadoop-examples-.jar sort "-Dmapreduce.compress.map.output=true" "-Dmapreduce.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapreduce.output.compress=true" "-Dmapreduce.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output

MRv1:

hadoop jar hadoop-examples-.jar sort "-Dmapred.compress.map.output=true" "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output
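For reference, here is a minimal sketch of doing the same thing from a Java MRv2 driver instead of with -D flags. The class name CompressedJobDriver is hypothetical, the job relies on the default identity mapper and reducer for brevity, and the map-output property names used are the ones current in MRv2 releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: compress both the intermediate map output and the
// final job output with gzip, mirroring the -D options shown above.
public class CompressedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate (map) output with gzip.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "gzip compressed job");
        job.setJarByClass(CompressedJobDriver.class);
        // The default identity mapper and reducer are used for brevity.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Compress the final job output with gzip.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}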
2. Hadoop compression implementation analysis

Introduction to Compression

As a general-purpose data processing platform, Hadoop has to process large amounts of data for every job. Compressing the data in a Hadoop system optimizes disk usage and speeds up data transfer over disk and network, thereby improving the efficiency with which the system processes data. When using compression, the main considerations are compression speed and the splittability of the compressed files. To summarize, the advantages of using compression are as follows:

  1. It saves the disk space that the data consumes;
  2. It speeds up data transfer over disk and network, thereby increasing the system's processing speed.

Compression format

    1. Hadoop automatically recognizes the compression format if the compressed file has the corresponding extension (such as .lzo, .gz, .bz2, etc.).
    2. Hadoop automatically selects the corresponding decompression codec based on the extension; this process is handled entirely by Hadoop, and we only need to make sure the input compressed file has the right extension (a sketch that uses this mechanism directly follows the example command below).
    3. The compression formats supported by Hadoop are detailed in the following table:
Compression format | Tool  | Algorithm | Extension | Multiple files | Splittable
DEFLATE            | None  | DEFLATE   | .deflate  | No             | No
Gzip               | gzip  | DEFLATE   | .gz       | No             | No
Zip                | zip   | DEFLATE   | .zip      | Yes            | Yes, at file boundaries
Bzip2              | bzip2 | BZIP2     | .bz2      | No             | Yes
LZO                | lzop  | LZO       | .lzo      | No             | Yes
    4. If the compressed file does not have a recognized extension, you need to specify the input format when running the MapReduce job, for example:
hadoop jar /usr/home/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar -file /usr/home/hadoop/hello/mapper.py -mapper /usr/home/hadoop/hello/mapper.py -file /usr/home/hadoop/hello/reducer.py -reducer /usr/home/hadoop/hello/reducer.py -input lzotest -output result4 -jobconf mapred.reduce.tasks=1 -inputformat org.apache.hadoop.mapred.LzoTextInputFormat
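The extension-based codec selection described in points 1 and 2 is exposed to applications through Hadoop's CompressionCodecFactory. Below is a minimal sketch of using it directly; the path /user/demo/data.gz is hypothetical.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Minimal sketch: let Hadoop pick the codec from the file extension and
// read the decompressed content, just as MapReduce input handling does.
public class ReadCompressedFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/data.gz");      // hypothetical path
        FileSystem fs = FileSystem.get(conf);

        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(path); // chosen by extension, null if unknown

        InputStream in = (codec == null)
                ? fs.open(path)                          // no known extension: read as-is
                : codec.createInputStream(fs.open(path));

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}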

Performance comparison

    1. The compression ratio, compression speed, and decompression speed of various compression algorithms under Hadoop are shown in the following table:
Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed
Gzip                  | 8.3 GB             | 1.8 GB               | 17.5 MB/s         | 58 MB/s
Bzip2                 | 8.3 GB             | 1.1 GB               | 2.4 MB/s          | 9.5 MB/s
LZO-best              | 8.3 GB             | 2 GB                 | 4 MB/s            | 60.6 MB/s
LZO                   | 8.3 GB             | 2.9 GB               | 49.3 MB/s         | 74.6 MB/s

From this we can conclude:
1) Bzip2 clearly achieves the best compression ratio, but its compression speed is slow; it is splittable.
2) Gzip's compression ratio is not as good as Bzip2's, but its compression and decompression speeds are fast; it does not support splitting.
3) LZO's compression ratio is worse than Bzip2's and Gzip's, but its compression and decompression speeds are the fastest, and it supports splitting.

Splittability is very important in Hadoop because it affects how many map tasks are launched when a job executes, which in turn affects the job's execution efficiency.

All compression algorithms exhibit a time-space tradeoff: faster compression and decompression speeds usually come at the cost of more space. Which compression format to use should be chosen according to business needs.

3. Application of the four compression formats used in Hadoop

The four compression formats most commonly used in Hadoop at present are LZO, gzip, Snappy, and bzip2. Based on practical experience, the author introduces the advantages, disadvantages, and application scenarios of these four formats, so that in practice you can choose among them according to the actual situation.

1. gzip compression

    • Advantages:
      1. High compression ratio, and relatively fast compression/decompression speed;
      2. Hadoop itself supports it: processing gzip files in an application is exactly the same as processing plain text;
      3. A Hadoop native library is available;
      4. Most Linux systems ship with the gzip command, which is easy to use.
    • Disadvantages: split is not supported.
    • Application scenarios:
      1. When each file compresses to within 130 MB (about the size of one block), consider the gzip format. For example, compress one day's or one hour's logs into a single gzip file, and run the MapReduce program over multiple gzip files to achieve concurrency. A sketch of writing such a file is shown after this list.
      2. Hive programs, streaming programs, and MapReduce programs written in Java handle gzip exactly like plain text; the original programs do not need to be modified after compression.
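As a sketch of scenario 1 above: the snippet below writes one day's log lines to HDFS as a single .gz file through GzipCodec, so that the Hive, streaming, or Java MapReduce programs mentioned in scenario 2 can read it without modification. The path and the sample record are hypothetical.

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

// Minimal sketch: write a day's worth of log lines to HDFS as a single .gz file.
public class WriteGzipLog {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/demo/logs/2015-01-01.gz");   // hypothetical path

        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        try (OutputStream os = codec.createOutputStream(fs.create(out))) {
            os.write("2015-01-01 00:00:01 INFO sample log line\n"
                    .getBytes(StandardCharsets.UTF_8));         // hypothetical record
        }
    }
}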

2. LZO compression

    • Advantages:
      1. Relatively fast compression/decompression speed and a reasonable compression ratio;
      2. Supports split; it is the most popular compression format in Hadoop;
      3. Supports the Hadoop native library;
      4. The lzop command can be installed on Linux and is easy to use.
    • Disadvantages:
      1. The compression ratio is lower than gzip's;
      2. Hadoop itself does not support it; it needs to be installed;
      3. Files in LZO format require some special handling in the application (an index must be built to support split, and the InputFormat must be specified as the LZO format), as in the sketch after this list.
    • Application scenarios:
      Large text files that are still larger than 200 MB after compression are worth considering, and the larger the single file, the more pronounced LZO's advantages.
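The special handling mentioned in the disadvantages above is usually done with the hadoop-lzo package. Assuming hadoop-lzo is installed (the jar path and HDFS path below are hypothetical), the split index can be built with the indexer that it ships with, for example:

hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/demo/big_file.lzo

The MapReduce job then needs to use hadoop-lzo's LZO-aware input format (com.hadoop.mapreduce.LzoTextInputFormat in the new API) instead of the default TextInputFormat, so that the index is honored and the file is actually split.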

3. Snappy compression

    • Advantages:
      1. Fast compression speed and a reasonable compression ratio;
      2. Supports the Hadoop native library.
    • Disadvantages:
      1. Split is not supported;
      2. The compression ratio is lower than gzip's;
      3. Hadoop itself does not support it; it needs to be installed;
      4. There is no corresponding command on Linux systems.
    • Application scenarios:
      1. When the map output of a MapReduce job is relatively large, use it as the compression format for the intermediate data between map and reduce;
      2. Or use it for the output of one MapReduce job that serves as the input of another MapReduce job, as in the sketch after this list.
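A sketch of the two scenarios above, assuming the Snappy native libraries are installed on the cluster. The class and method names are hypothetical; only the Hadoop property names, the codec class, and the SequenceFile settings matter here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical helpers covering the two Snappy scenarios above.
public class SnappyJobSettings {

    // Scenario 1: compress the intermediate map output with Snappy.
    public static void useSnappyForMapOutput(Configuration conf) {
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
    }

    // Scenario 2: write the job output as a Snappy block-compressed SequenceFile,
    // which a follow-up job can read directly with SequenceFileInputFormat.
    public static void useSnappySequenceFileOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);
    }
}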

4. bzip2 compression

    • Advantages:
      1. Supports split;
      2. Very high compression ratio, higher than gzip's;
      3. Hadoop itself supports it, but it does not support the native library;
      4. The bzip2 command ships with Linux systems and is easy to use.
    • Disadvantages:
      1. Compression/decompression speed is slow;
      2. The native library is not supported.
    • Application scenarios:
      1. Suitable when speed requirements are not high but a high compression ratio is needed; it can be used as the output format of a MapReduce job (see the sketch after this list);
      2. Or when the output data is relatively large and the processed data needs to be compressed and archived to reduce disk space, and the data will be used infrequently afterwards;
      3. Or when you want to compress a single large text file to reduce storage space while still needing split support and compatibility with existing applications (that is, the applications do not need to be modified).
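A sketch of scenario 1 above: the same output-compression calls as in the earlier gzip driver, but with BZip2Codec so the archived text output remains splittable. The helper name is hypothetical.

import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical helper: compress a job's text output with bzip2 so the archived
// files take little space but can still be split by later MapReduce jobs.
public class Bzip2JobSettings {
    public static void useBzip2TextOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    }
}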

5. Comparison of the characteristics of the four compression formats

Compression format | Splittable | Native | Compression ratio | Speed           | Built into Hadoop     | Linux command | Must the application be modified after switching to this format?
Gzip               | No         | Yes    | Very high         | Relatively fast | Yes, usable directly  | Yes           | No; handled the same as plain text
LZO                | Yes        | Yes    | Relatively high   | Fast            | No, must be installed | Yes           | Yes; an index must be built and the input format specified
Snappy             | No         | Yes    | Relatively high   | Fast            | No, must be installed | No            | No; handled the same as plain text
Bzip2              | Yes        | No     | Highest           | Slow            | Yes, usable directly  | Yes           | No; handled the same as plain text

At present, CDH clusters generally offer hadoop-lzo as an optional installation, and the UCloud cluster now has LZO integrated.
