When considering how to compress data that will be processed by MapReduce, it is important to know whether the compression format supports splitting. Consider an uncompressed file stored in HDFS whose size is 1 GB, with an HDFS block size of 64 MB. The file is stored as 16 blocks, and a MapReduce job that uses this file as input creates 16 input splits (a "split" is the chunk of input handed to one map task; a "block" is the unit of storage in HDFS). Each split is processed as the input of an independent map task.
Now suppose the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS stores it as 16 blocks. However, creating a split per block is useless here, because it is impossible to start reading at an arbitrary point in a gzip stream, so a map task cannot read the data in its block independently of the other blocks. The gzip format stores the compressed data using DEFLATE, and DEFLATE stores the data as a series of compressed blocks. The problem is that the start of each block is not marked in any way that would let a reader positioned at an arbitrary point in the stream locate the beginning of the next block and synchronize itself with the stream. For this reason, gzip does not support splitting.
In this case, MapReduce does not split the gzip file: it knows the input is gzip-compressed (from the file extension) and that the gzip format does not support splitting. This works, but at the expense of locality: a single map task processes all 16 HDFS blocks, most of which will not be local to that map. Also, with fewer map tasks the job is split at a coarser granularity, so it is likely to take longer to run.
If the file in our hypothetical example were an LZO file, we would have the same problem, because the underlying compression format gives the reader no way to synchronize itself with the stream. A bzip2 file, however, does provide a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting. (We listed earlier which compression formats support splitting.)
The situation is slightly different for collections of files. ZIP is an archive format, so it can combine multiple files into a single ZIP archive. Each file is compressed separately, and the locations of all the files are stored in a central directory at the end of the ZIP file. This property means that a ZIP file supports splitting at file boundaries, with each split containing one or more of the files in the archive.
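As a concrete illustration, here is a minimal sketch using Hadoop's standard compression classes (the file names are made up). It infers the codec from a file's extension, the same way MapReduce does, and reports whether the format can be split:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Hypothetical input files; the codec is looked up from the extension.
        for (String name : new String[] {"logs.txt", "logs.txt.gz", "logs.txt.bz2"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable =
                    codec == null                                    // uncompressed: always splittable
                    || codec instanceof SplittableCompressionCodec;  // e.g. BZip2Codec
            System.out.println(name + " -> splittable: " + splittable);
        }
    }
}
```

The uncompressed and bzip2 files come out as splittable, the gzip file does not; this is essentially the check Hadoop's text input formats make before generating more than one split per file.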
Advantages and disadvantages of Hadoop compression algorithms
The compression ratio, compression speed, and decompression speed of the common compression algorithms in Hadoop are as follows:
| Compression algorithm | Original file size | Compressed file size | Compression speed | Decompression speed |
| --- | --- | --- | --- | --- |
| gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
| bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
| lzo-best | 8.3 GB | 2 GB | 4 MB/s | 60.6 MB/s |
| lzo | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |
Which compression format should we use in MapReduce?
Which compression format to use depends on the application. Do you care more about compression and decompression speed, or about saving space? In general, you should try different strategies and benchmark them on representative datasets to find the best approach. For large, unbounded files such as log files, the following options are available.
Store uncompressed files.
Use a compression format that supports splitting, such as bzip2.
Split the file into chunks in the application and compress each chunk separately, using any compression format (it does not matter whether the format supports splitting). Choose the chunk size so that each compressed chunk is approximately the size of an HDFS block.
Use a sequence file (SequenceFile), which supports both compression and splitting (a sketch follows this list).
For large files, do not apply a compression format that does not support splitting to the whole file, because this loses the locality advantage and makes MapReduce applications much less efficient.
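As a minimal sketch of the SequenceFile option, the code below uses Hadoop's SequenceFile.Writer option API (the output path and record contents are made up for illustration). Block compression groups many records together before compressing them, which usually gives a better ratio than per-record compression, and the resulting file can still be split across map tasks.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SequenceFileCompressionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/demo.seq"); // hypothetical output path

        // Write a block-compressed SequenceFile; the file remains splittable
        // for MapReduce regardless of the codec used inside it.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
            for (long i = 0; i < 100; i++) {
                writer.append(new LongWritable(i), new Text("record-" + i));
            }
        }
    }
}
```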
Splittable LZO compression in Hadoop
Using the LZO compression algorithm in Hadoop reduces data size and disk read/write time; storing compressed data in HDFS lets the cluster hold more data and extends its useful life. In addition, because MapReduce jobs are usually IO-bound, storing the data compressed means less IO and more efficient jobs.
However, using compression on Hadoop has drawbacks. First, some formats, such as gzip, cannot be split and processed in parallel. Second, other formats, such as bzip2, support splitting but decompress so slowly that the job's bottleneck shifts to the CPU.
A compression format that could be split for parallel processing and was also fast would be ideal. LZO provides exactly that.
An LZO-compressed file is composed of many small blocks (about 256 KB), so Hadoop jobs can split the file along block boundaries (in practice an index of the block offsets has to be built first, for example with the hadoop-lzo indexer). LZO was also designed with efficiency in mind: it decompresses about twice as fast as gzip, which saves a lot of disk reading. Its compression ratio is not as good as gzip's, with compressed files roughly 50% larger than the gzip equivalent, but that is still 20%-50% smaller than the uncompressed input, so job execution speed can improve substantially.
For Hadoop LZO configuration, see http://www.tech126.com/hadoop-lzo/
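The referenced article covers installing the hadoop-lzo native libraries and registering the codec in core-site.xml; the sketch below only shows the job-side half. The codec class name com.hadoop.compression.lzo.LzopCodec comes from the hadoop-lzo project and the property names are the Hadoop 2.x ones, so treat both as assumptions that depend on your installation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LzoOutputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lzo output demo");

        // Codec class from the separately installed hadoop-lzo package
        // (assumed name; adjust to whatever your installation provides).
        Class<? extends CompressionCodec> lzoCodec =
                Class.forName("com.hadoop.compression.lzo.LzopCodec")
                     .asSubclass(CompressionCodec.class);

        // Compress the job's final output with LZO.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, lzoCodec);

        // Compressing the intermediate map output is often worthwhile as well.
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
        job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                lzoCodec, CompressionCodec.class);

        // ... set mapper, reducer, and input/output paths, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```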
Native compression libraries
For performance, it is best to use native libraries for compression and decompression. In one test, for example, using the native gzip library cut decompression time by about 50% and compression time by about 10% compared with the built-in Java implementation. The table below shows which compression formats have Java and native implementations. Not every format has a native implementation (bzip2, for example), and some have only a native implementation (such as LZO).
| Compression format | Java implementation | Native implementation |
| --- | --- | --- |
| DEFLATE | Yes | Yes |
| gzip | Yes | Yes |
| bzip2 | Yes | No |
| LZO | No | Yes |
Hadoop ships with prebuilt 32-bit and 64-bit Linux native compression libraries in the lib/native directory. For other platforms you need to compile the libraries yourself; for details see the Hadoop wiki at http://wiki.apache.org/hadoop/nativehadoop.
Native libraries are picked up via the Java system property java.library.path. The hadoop script in the bin directory sets this property for you, but if you do not use that script you need to set it in your application.
By default, Hadoop looks for the native library for the platform it is running on and loads it automatically if found, so you can use native libraries without changing any configuration. In some cases, however, you may want to disable them, for example when debugging a compression problem. To do so, set the property hadoop.native.lib to false, which ensures the built-in Java equivalents are used (where available).
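A small sketch of both points, assuming an older Hadoop release where the property is still called hadoop.native.lib (later versions renamed it io.native.lib.available): it checks whether native code and native zlib were loaded, then forces the pure-Java implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.zlib.ZlibFactory;
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeLibCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Was a native Hadoop library found on java.library.path at all?
        System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());

        // Is native zlib available for this configuration (used by DEFLATE/gzip)?
        System.out.println("native zlib loaded:   " + ZlibFactory.isNativeZlibLoaded(conf));

        // Force the pure-Java codecs, e.g. while debugging a compression problem.
        conf.setBoolean("hadoop.native.lib", false);
        System.out.println("after disabling:      " + ZlibFactory.isNativeZlibLoaded(conf));
    }
}
```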