Detailed description of Hadoop's use of compression in MapReduce


Hadoop's support for compressed files

Hadoop supports transparent identification of compression formats: when a MapReduce job reads compressed input, Hadoop automatically decompresses the files, so our tasks run without having to deal with compression themselves.

If a compressed file has the extension of a supported compression format (such as .lzo, .gz, or .bz2), Hadoop selects the matching decoder based on that extension and uses it to decompress the file.

The compression formats Hadoop supports are summarized in the following table:

Compression format   Tool    Algorithm   File extension   Multiple files   Splittable
DEFLATE              none    DEFLATE     .deflate         No               No
gzip                 gzip    DEFLATE     .gz              No               No
ZIP                  zip     DEFLATE     .zip             Yes              Yes, at file boundaries
bzip2                bzip2   bzip2       .bz2             No               Yes
LZO                  lzop    LZO         .lzo             No               No (yes with the splittable LZO support described below)
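
As an illustration of the extension-based codec selection described above, here is a minimal sketch that decompresses a file by looking up its codec with CompressionCodecFactory, the same lookup MapReduce performs for compressed input. The class name and the input path are hypothetical; the Hadoop API calls are from org.apache.hadoop.io.compress and org.apache.hadoop.io.

import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];  // e.g. an HDFS path such as /user/hadoop/input.gz (hypothetical)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // The codec is chosen from the file extension, e.g. .gz -> GzipCodec.
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }

        // Drop the compression suffix to name the decompressed output file.
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}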

If a compressed file has no extension, you must specify the input format explicitly when running the MapReduce job, for example:

hadoop jar /usr/home/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar \
    -file /usr/home/hadoop/hello/mapper.py -mapper /usr/home/hadoop/hello/mapper.py \
    -file /usr/home/hadoop/hello/reducer.py -reducer /usr/home/hadoop/hello/reducer.py \
    -input lzotest -output result4 \
    -jobconf mapred.reduce.tasks=1 \
    -inputformat org.apache.hadoop.mapred.LzoTextInputFormat

The compression ratios and speeds of the various compression algorithms available in Hadoop are as follows:

Compression algorithm   Original file size   Compressed file size   Compression speed   Decompression speed
gzip                    8.3 GB               1.8 GB                 17.5 MB/s           58 MB/s
bzip2                   8.3 GB               1.1 GB                 2.4 MB/s            9.5 MB/s
LZO-best                8.3 GB               2 GB                   4 MB/s              60.6 MB/s
LZO                     8.3 GB               2.9 GB                 49.3 MB/s           74.6 MB/s

Advantages and disadvantages of the various Hadoop compression algorithms

When considering how to compress data that will be processed by MapReduce, it is important to check whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file is stored as 16 blocks, and a MapReduce job using this file as input creates 16 input splits (a "split" is the unit of input given to a map task; a "block" is the unit of HDFS storage). Each split is processed as the input of an independent map task.

Now suppose instead that the file is gzip-compressed and that its compressed size is 1 GB. As before, HDFS stores it as 16 blocks. However, creating a split for each block won't work, because it is impossible to start reading at an arbitrary point in a gzip stream, so a map task cannot read the data of one block independently of the others. Gzip stores its compressed data in DEFLATE format, and DEFLATE stores data as a series of compressed blocks. The problem is that the start of each block is not marked in any way that would let a reader positioned at an arbitrary point in the stream locate the beginning of the next block and synchronize itself with the stream. For this reason, gzip does not support splitting.

In this case, MapReduce does the right thing and does not split the gzip file, since it knows the input is gzip-compressed (from the file extension) and that gzip does not support splitting. This works, but at the cost of locality: a single map task processes all 16 HDFS blocks, most of which will not be local to it. And with fewer map tasks, the job is split at a coarser granularity, so it may take longer to run.

In our hypothetical example, an LZO file would have the same problem, because the underlying compression format gives the reader no way to synchronize itself with the stream. A bzip2 file, however, does provide a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting.

The situation is slightly different for collections of files. ZIP is an archive format, so it can combine multiple files into a single ZIP archive. Each file is compressed separately, and the locations of all the files are stored in a directory at the end of the ZIP archive. This property means a ZIP file supports splitting at file boundaries, with each split containing one or more files from the archive.

Which compression format should we use in MapReduce?

Which format to use depends on the application. Do you want the fastest compression speed, or the best space savings? In general, you should try the different strategies and benchmark them with a representative dataset. For large, unbounded files such as log files, the options are as follows:

Store uncompressed files.

Use a compression format that supports splitting, such as bzip2.

Split the file into chunks in the application and compress each chunk separately, using any supported compression format (it does not matter whether it is splittable). Choose the chunk size so that the compressed chunks come out approximately the size of an HDFS block.

Use a sequence file, which supports both compression and splitting (see the sketch after this list).

For large files, do not compress the whole file with a format that does not support splitting, because doing so sacrifices locality and makes MapReduce applications much less efficient.
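
As a sketch of the sequence-file option, the following minimal example writes a block-compressed sequence file; the class name, output path, and records are hypothetical, while the SequenceFile API is from org.apache.hadoop.io. Files written this way remain splittable because SequenceFile inserts sync markers between blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/hadoop/logs.seq");  // hypothetical output path

        // BLOCK compression compresses groups of records together, giving a
        // better ratio than RECORD compression while staying splittable.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, IntWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        try {
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("log line " + i));
            }
        } finally {
            writer.close();
        }
    }
}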

Hadoop supports splittable LZO compression

Using the LZO compression algorithm in Hadoop reduces both data size and disk read/write time, and storing compressed data in HDFS lets the cluster hold more data, extending its useful life. Moreover, because MapReduce jobs usually have an I/O bottleneck, storing compressed data means fewer I/O operations and more efficient jobs.

However, using compression on Hadoop is not without problems. First, some compression formats cannot be split for parallel processing, such as gzip. Second, other formats support splitting but decompress so slowly that the job bottleneck shifts to the CPU, such as bzip2.

A compression algorithm that can be split for parallel processing while remaining very fast would be ideal. LZO, used as described below, fits this role.

An LZO-compressed file is composed of many small blocks (about 256 KB), so Hadoop jobs can be split along block boundaries. LZO was also designed with efficiency in mind: it decompresses about twice as fast as gzip, saving a great deal of disk read/write time. Its compression ratio is not as good as gzip's: the compressed files are roughly 50% larger than their gzip equivalents, but they still take 20%-50% less space than the uncompressed files, so jobs can run much faster overall.

For a Hadoop LZO configuration guide, see http://www.tech126.com/hadoop-lzo/.

How to use compression in MapReduce

1. Compression of input files

If the input files are compressed, they are decompressed automatically as they are read by MapReduce, with the file extension determining which codec to use.

2. Compression of MapReduce job output

To compress the output of a MapReduce job, set the mapred.output.compress property to true in the job configuration, and set the mapred.output.compression.codec property to the class name of the compression codec you intend to use.

If the output is a series of sequence files, you can also set the mapred.output.compression.type property to control the compression type. The default value is RECORD, which compresses individual records; changing it to BLOCK, which compresses groups of records together, is recommended because it gives a better compression ratio.
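
The same properties can also be set from Java. Here is a minimal sketch using the old (org.apache.hadoop.mapred) API of this Hadoop generation; the class and method name of the wrapper are hypothetical, the static helpers are from FileOutputFormat and SequenceFileOutputFormat.

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class OutputCompressionConfig {
    public static void configure(JobConf conf) {
        // Equivalent to mapred.output.compress = true
        FileOutputFormat.setCompressOutput(conf, true);
        // Equivalent to mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        // For sequence file output: mapred.output.compression.type = BLOCK
        SequenceFileOutputFormat.setOutputCompressionType(conf, SequenceFile.CompressionType.BLOCK);
    }
}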

3. Compression of map output

Even when a MapReduce application reads and writes uncompressed data, it can benefit from compressing the intermediate output of the map phase. Because map output is written to disk and transferred across the network to the reducer nodes, using a fast codec such as LZO can yield better performance simply because there is less data to transfer. The code below shows the configuration properties for enabling map output compression and setting the compression codec.

 
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
Native compression libraries

For performance, it is best to use a native library for compression and decompression. In one test, using the native gzip library cut decompression time by 50% and compression time by about 10% compared with the built-in Java implementation. The following table shows which compression formats have Java and native implementations. Not every format has a native implementation (bzip2, for example, does not), and some have only a native implementation (such as LZO).

Compression format   Java implementation   Native implementation
DEFLATE              Yes                   Yes
gzip                 Yes                   Yes
bzip2                Yes                   No
LZO                  No                    Yes

Hadoop ships with prebuilt 32-bit and 64-bit Linux native compression libraries in the lib/native directory. For other platforms, you need to compile the libraries yourself; for details, see the Hadoop wiki at http://wiki.apache.org/hadoop/NativeHadoop.

The native libraries are picked up through the Java system property java.library.path. The hadoop script in the bin directory sets this property for you; if you do not use this script, you need to set the property in your application.

By default, Hadoop searches for the native library matching the platform it is running on and loads it automatically if found. This means you can use the native libraries without changing any configuration settings. In some cases, however, you may want to disable them, for example when debugging a compression problem. To do so, set the property hadoop.native.lib to false, which ensures that the built-in Java equivalents are used (where they are available).
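
For example, a minimal sketch (the class name is hypothetical; the property name is the one used by this generation of Hadoop) that disables the native libraries from code and reports whether a native library was found on the current platform:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeLibCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Force the built-in Java codec implementations (where they exist),
        // e.g. while debugging a compression problem.
        conf.setBoolean("hadoop.native.lib", false);
        System.out.println("Native library loaded: " + NativeCodeLoader.isNativeCodeLoaded());
    }
}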

 

References:

http://groups.google.com/group/hadoopors/browse_thread/thread/1765ac12cb3a584a

http://www.itivy.com/arch/archive/2011/12/10/hadoop-codec-usage.html

http://www.itivy.com/arch/archive/2011/12/10/hadoop-mapreduce-compression.html

http://hi.chinaunix.net/?uid-9976001-action-viewspace-itemid-45130
