Hadoop support for compressed files and the advantages and disadvantages of algorithms

Source: Internet
Author: User
Tags: configuration settings


Hadoop handles compressed formats transparently: our MapReduce tasks do not need to know about the compression, because Hadoop automatically decompresses the files for us.

If the compressed file carries the appropriate extension for its format (such as .lzo, .gz, or .bz2), Hadoop selects the matching decompression codec based on that extension.

Compression format   Tool    Algorithm   File extension   Multiple files   Splittable
DEFLATE              N/A     DEFLATE     .deflate         No               No
Gzip                 gzip    DEFLATE     .gz              No               No
ZIP                  zip     DEFLATE     .zip             Yes              Yes, at file boundaries
Bzip2                bzip2   Bzip2       .bz2             No               Yes
LZO                  lzop    LZO         .lzo             No               Yes
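As an illustration of how Hadoop picks a codec from the file extension, here is a minimal sketch using CompressionCodecFactory from the org.apache.hadoop.io.compress package (the file name below is just a hypothetical example); the factory returns the codec registered for a file's extension, or null if the file should be read as-is:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecByExtension {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The factory knows about every codec registered via io.compression.codecs.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Hypothetical file name: the .gz extension maps to GzipCodec.
        CompressionCodec codec = factory.getCodec(new Path("input/part-00000.gz"));
        if (codec == null) {
            System.out.println("No codec matched; the file would be read uncompressed.");
        } else {
            System.out.println("Matched codec: " + codec.getClass().getName());
        }
    }
}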

  

If the compressed file does not have an extension, you need to specify the input format explicitly when submitting the MapReduce job, for example:

hadoop jar /usr/home/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-streaming-0.20.2-CDH3B4.jar \
    -file /usr/home/hadoop/hello/mapper.py \
    -mapper /usr/home/hadoop/hello/mapper.py \
    -file /usr/home/hadoop/hello/reducer.py \
    -reducer /usr/home/hadoop/hello/reducer.py \
    -input lzotest \
    -output result4 \
    -jobconf mapred.reduce.tasks=1 \
    -inputformat org.apache.hadoop.mapred.LzoTextInputFormat

The compression ratios and the compression and decompression speeds of the various algorithms under Hadoop are shown in the following table:

Compression algorithm   Original file size   Compressed file size   Compression speed   Decompression speed
Gzip                    8.3 GB               1.8 GB                 17.5 MB/s           58 MB/s
Bzip2                   8.3 GB               1.1 GB                 2.4 MB/s            9.5 MB/s
LZO-best                8.3 GB               2 GB                   4 MB/s              60.6 MB/s
LZO                     8.3 GB               2.9 GB                 49.3 MB/s           74.6 MB/s

  Summary of the advantages and disadvantages of Hadoop's compression algorithms

When considering how to compress data that will be processed by MapReduce, it is important to check whether the compression format supports splitting. Consider an uncompressed 1 GB file stored in HDFS with a block size of 64 MB: the file is stored as 16 blocks, and a MapReduce job that uses this file as input creates 16 input splits (also called "chunks"), each processed independently as the input to a separate map task.

Now suppose the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS stores it as 16 blocks. However, creating a split for each block is useless, because it is impossible to start reading at an arbitrary point in a gzip stream, so a map task cannot read its block independently of the others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE stores the data as a series of compressed blocks. The problem is that the start of each block is not marked in any way that would let a reader positioned at an arbitrary point in the stream advance to the beginning of the next block and synchronize itself with the stream. For this reason, gzip does not support splitting.

In this case, MapReduce does not split the gzip file, because it knows the input is gzip-compressed (from the file extension) and that gzip does not support splitting. This works, but at the expense of locality: a single map task processes all 16 HDFS blocks, most of which are not local to that map. At the same time, with fewer map tasks the job is split at a coarser granularity, so it can take longer to run.
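To make the mechanism concrete, the following is a simplified sketch of the splittability check that the old-API TextInputFormat performs, written here as a hypothetical subclass so it compiles on its own: a file whose extension matches a registered codec is declared non-splittable and handed to a single map task.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitCheckTextInputFormat extends TextInputFormat {
    private CompressionCodecFactory codecs;

    public void configure(JobConf conf) {
        super.configure(conf);
        codecs = new CompressionCodecFactory(conf);
    }

    protected boolean isSplitable(FileSystem fs, Path file) {
        // A compressed file (e.g. .gz) has a matching codec, so it is not split.
        return codecs.getCodec(file) == null;
    }
}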

In our hypothetical example, an LZO file would have the same problem, because the underlying compression format gives the reader no way to synchronize itself with the stream. A bzip2 file, however, does provide a synchronization marker between blocks (a 48-bit approximation of pi), so it does support splitting.

The situation is slightly different for collections of files. ZIP is an archive format, so it can combine multiple files into a single ZIP archive. Each file is compressed separately, and the locations of all the files are stored at the end of the ZIP file. This property means that a ZIP file supports splitting at file boundaries, with each split containing one or more files from the archive.

  Which compression format should we use in MapReduce?

Which compression format to use depends on the specifics of the application. Do you want the fastest compression or the best space savings? In general, you should try different strategies and benchmark them with representative datasets to find the best approach. For large files with no natural internal boundaries, such as log files, the following options are available.

Store the files uncompressed.

Use a compression format that supports splitting, such as bzip2.

Split the file into chunks in the application, and compress each chunk separately using any supported compression format (it does not matter whether the format supports splitting). In this case, choose the chunk size so that each compressed chunk is approximately the size of an HDFS block.

Use sequence files (SequenceFile), which support both compression and splitting; a sketch follows below.
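As a sketch of the sequence-file option, the code below writes a block-compressed SequenceFile with the old org.apache.hadoop.io API; the output path, key/value types, and DefaultCodec are illustrative assumptions, and a gzip or LZO codec could be substituted:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SequenceFileWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/hadoop/logs.seq");   // hypothetical output path

        // BLOCK compression compresses groups of records together,
        // keeps the file splittable, and usually compresses better than RECORD.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out,
                LongWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK,
                new DefaultCodec());
        try {
            writer.append(new LongWritable(1L), new Text("first log line"));
        } finally {
            writer.close();
        }
    }
}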

For large files, do not compress the entire file with a format that does not support splitting, because that sacrifices data locality and reduces the performance of the MapReduce application.

  Hadoop support for splittable LZO compression

Using the LZO compression algorithm in Hadoop reduces data size and the time spent reading and writing it to disk; storing compressed data in HDFS lets the cluster hold more data and extends its useful life. Moreover, because MapReduce jobs are usually IO-bound, storing compressed data means fewer IO operations and more efficient jobs.

But using compression in Hadoop has two awkward aspects. First, some compression formats cannot be split for parallel processing, such as gzip. Second, some formats that do support splitting decompress slowly, shifting the job bottleneck to the CPU, such as bzip2.

An ideal compression algorithm would be splittable for parallel processing and also very fast. LZO fits this description.

LZO-compressed files are made up of many small blocks (about 256 KB), so a Hadoop job can split its input along block boundaries. LZO was also designed with efficiency in mind: its decompression speed is about twice that of gzip, which saves a great deal of disk read and write time. Its compression ratio is worse than gzip's (the compressed file is roughly twice the size of its gzip equivalent), but that still saves 20%-50% of the storage space compared with the uncompressed file, so jobs can run significantly faster.

For documentation on configuring LZO under Hadoop, see http://www.tech126.com/hadoop-lzo/
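As a rough sketch of what that configuration involves, the LZO codec classes from the hadoop-lzo project (the com.hadoop.compression.lzo class names below are an assumption about that library) are registered under io.compression.codecs; this is normally done in core-site.xml, and the programmatic equivalent would look roughly like this:

import org.apache.hadoop.conf.Configuration;

public class LzoCodecConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Register the LZO codecs alongside the built-in ones
        // (normally done once in core-site.xml rather than in code).
        conf.set("io.compression.codecs",
                "org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.DefaultCodec,"
                + "org.apache.hadoop.io.compress.BZip2Codec,"
                + "com.hadoop.compression.lzo.LzoCodec,"
                + "com.hadoop.compression.lzo.LzopCodec");
        conf.set("io.compression.codec.lzo.class",
                "com.hadoop.compression.lzo.LzoCodec");
        System.out.println(conf.get("io.compression.codecs"));
    }
}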

  How to use compression in MapReduce

1. Compression of the input files

If the input files are compressed, they are decompressed automatically as MapReduce reads them, with the file extension determining which codec to use.

2. Compression of MapReduce job output

To compress the output of a MapReduce job, set the mapred.output.compress property to true in the job configuration, and set the mapred.output.compression.codec property to the class name of the compression codec you intend to use.
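For illustration, the old mapred API exposes helper methods that set these same two properties; a minimal sketch, assuming conf is the job's JobConf built in the driver and using org.apache.hadoop.mapred.FileOutputFormat with GzipCodec as an example codec:

// Equivalent to mapred.output.compress=true and
// mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);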

If the output is written as sequence files, you can set the mapred.output.compression.type property to control the compression type. It defaults to RECORD, which compresses individual records; changing it to BLOCK compresses groups of records together, which is recommended because it gives a better compression ratio.
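Again as a sketch, the corresponding helper on org.apache.hadoop.mapred.SequenceFileOutputFormat switches sequence-file output to BLOCK compression:

// Equivalent to mapred.output.compression.type=BLOCK
SequenceFileOutputFormat.setOutputCompressionType(conf, SequenceFile.CompressionType.BLOCK);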

3. Compression of map task output

Even if a MapReduce application reads and writes uncompressed data, it can benefit from compressing the intermediate output of the map phase. Because map output is written to disk and transferred across the network to the reducer nodes, using a fast codec such as LZO can improve performance simply because the amount of data transferred is greatly reduced. The following code shows the configuration calls that enable map output compression and set the compression format.

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);

  Native compression libraries

For performance, it is best to use a native library to compress and decompress. For example, in one test, using the native gzip library reduced decompression times by up to 50% and compression times by about 10% (compared with the built-in Java implementation). The following table shows which compression formats have Java and native implementations. Not all formats have native implementations (bzip2, for example), while others have only native implementations (such as LZO).

Compression format   Java implementation   Native implementation
DEFLATE              Yes                   Yes
Gzip                 Yes                   Yes
Bzip2                Yes                   No
LZO                  No                    Yes

Hadoop ships with prebuilt native compression libraries for 32-bit and 64-bit Linux, located in the lib/native directory. For other platforms you need to compile the libraries yourself; see the Hadoop wiki at wiki.apache.org/hadoop/NativeHadoop.

Native libraries are picked up via the Java system property java.library.path. The Hadoop scripts in the bin directory already set this property for you, but if you do not use those scripts you will need to set it in your application.

By default, Hadoop looks for the native library for the platform it is running on and loads it automatically if found. This means you can use native libraries without changing any configuration settings. In some cases, however, you may want to disable them, for example when debugging a compression-related problem. To do so, set the property hadoop.native.lib to false, which ensures the built-in Java implementations are used (where available).
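For example, a sketch of disabling the native libraries programmatically (the same 0.20-era property mentioned above; normally this would be set in the configuration files instead):

// Force the built-in Java codec implementations, e.g. while debugging a compression issue.
Configuration conf = new Configuration();
conf.setBoolean("hadoop.native.lib", false);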

