[Reading Hadoop source code] [4] - org.apache.hadoop.io.compress, Part 3: Using compression

Document directory
  • 1. Read the compressed input file directly
  • 2. Compress the intermediate results produced by a MapReduce job
  • 3. Compress the final output results
  • 4. A comparison of the same task on hadoop-0.19.1 with the three compression options
  • 5. For lzo, whose compression and decompression are very fast, see the URL below

Hadoop supports several compression formats, such as gzip, bzip2, and zlib. Gzip is one of the formats built into Hadoop; it is widely used by Linux developers and administrators, and it offers a good compression ratio with reasonable compression speed, so many people like to compress their files in this format.

Using gzip compression in MapReduce jobs is quite easy. I don't remember from which version onward, but Hadoop has built-in support for reading gzip-compressed input files and for writing compressed intermediate results and final output.

1. Read the compressed input file directly

When Hadoop reads an input file, it decides whether (and how) the file is compressed from the file name suffix. So when the input file is named *.gz, Hadoop assumes it is a gzip-compressed file and tries to read it with the gzip codec. The lookup table behind this is built in the constructor of CompressionCodecFactory:

 public CompressionCodecFactory(Configuration conf) {
   codecs = new TreeMap<String, CompressionCodec>();
   // getCodecClasses() reads the codec list configured under "io.compression.codecs"
   List<Class<? extends CompressionCodec>> codecClasses = getCodecClasses(conf);
   if (codecClasses == null) {
     // if nothing is configured in core-site.xml, these two defaults are registered
     addCodec(new GzipCodec());
     addCodec(new DefaultCodec());
   } else {
     Iterator<Class<? extends CompressionCodec>> itr = codecClasses.iterator();
     while (itr.hasNext()) {
       CompressionCodec codec = ReflectionUtils.newInstance(itr.next(), conf);
       addCodec(codec);
     }
   }
 }
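
For illustration, here is a minimal sketch, not taken from the article, of how an application can rely on exactly this suffix-based detection: CompressionCodecFactory.getCodec() returns the codec registered for the file's extension, or null for an uncompressed file. The class name and the example path are my own assumptions.

 // Minimal sketch (illustrative class name): open a file that may or may not be
 // compressed, letting CompressionCodecFactory pick the codec from the suffix.
 import java.io.InputStream;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IOUtils;
 import org.apache.hadoop.io.compress.CompressionCodec;
 import org.apache.hadoop.io.compress.CompressionCodecFactory;

 public class ReadMaybeCompressed {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     Path path = new Path(args[0]);            // e.g. /data/input/part-00000.gz (made-up path)
     FileSystem fs = path.getFileSystem(conf);

     CompressionCodecFactory factory = new CompressionCodecFactory(conf);
     CompressionCodec codec = factory.getCodec(path); // GzipCodec for *.gz, null otherwise

     InputStream in;
     if (codec == null) {
       in = fs.open(path);                             // plain file, read as-is
     } else {
       in = codec.createInputStream(fs.open(path));    // decompress transparently
     }
     IOUtils.copyBytes(in, System.out, conf, true);    // dump the decoded bytes to stdout
   }
 }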

 

If other compression codecs are to be used, they can be configured in core-site.xml:

 <property>
   <name>io.compression.codecs</name>
   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
 </property>

 

Or in code:

 Conf. set ("io. compression. codecs "," org. apache. hadoop. io. compress. defaultCodec, org. apache. hadoop. io. compress. gzipCodec, com. hadoop. compression. lzo. lzopCodec ");

 

Both the default InputFormat and OutputFormat already contain this codec detection code. If you implement a format yourself, you may need to add code like the following.

In a custom InputFormat:

 CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(job);
 final CompressionCodec codec = compressionCodecs.getCodec(file);
 CompressionInputStream in = codec.createInputStream(fileIn);
 ...

In a custom OutputFormat:

 Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job, GzipCodec.class);
 CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, job);
 Path file = FileOutputFormat.getTaskOutputPath(job, name + codec.getDefaultExtension());
 FileSystem fs = file.getFileSystem(job);
 FSDataOutputStream fileOut = fs.create(file, progress);
 CompressionOutputStream out = codec.createOutputStream(fileOut);
 ...
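
To show where a fragment like the one above typically lives, here is a minimal sketch of a complete getRecordWriter() for a custom OutputFormat using the old mapred API. The class name CompressedTextOutputFormat and the tab-separated record layout are my own illustrative assumptions, not something taken from the article:

 // Sketch of a custom OutputFormat that writes gzip-compressed text (old mapred API).
 import java.io.DataOutputStream;
 import java.io.IOException;

 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.compress.CompressionCodec;
 import org.apache.hadoop.io.compress.GzipCodec;
 import org.apache.hadoop.mapred.FileOutputFormat;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.RecordWriter;
 import org.apache.hadoop.mapred.Reporter;
 import org.apache.hadoop.util.Progressable;
 import org.apache.hadoop.util.ReflectionUtils;

 public class CompressedTextOutputFormat extends FileOutputFormat<Text, Text> {

   @Override
   public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
       String name, Progressable progress) throws IOException {
     // Resolve the configured codec, defaulting to gzip.
     Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job, GzipCodec.class);
     CompressionCodec codec = ReflectionUtils.newInstance(codecClass, job);

     // Append the codec's extension (e.g. ".gz") so downstream jobs can auto-detect it.
     Path file = FileOutputFormat.getTaskOutputPath(job, name + codec.getDefaultExtension());
     FileSystem fs = file.getFileSystem(job);
     FSDataOutputStream fileOut = fs.create(file, progress);

     // Wrap the raw HDFS stream in the codec's compressing stream.
     final DataOutputStream out = new DataOutputStream(codec.createOutputStream(fileOut));

     return new RecordWriter<Text, Text>() {
       public void write(Text key, Text value) throws IOException {
         out.write(key.getBytes(), 0, key.getLength());     // key, then a tab,
         out.writeByte('\t');
         out.write(value.getBytes(), 0, value.getLength()); // then the value and a newline
         out.writeByte('\n');
       }
       public void close(Reporter reporter) throws IOException {
         out.close();
       }
     };
   }
 }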

2. Compress the intermediate results produced by a MapReduce job

Because of the nature of the MapReduce algorithm itself, a job generates intermediate result files while it runs. When the data volume is large, these intermediate results are also considerable, and to some extent they affect job efficiency. Since the bottleneck of a computing task is usually disk read/write I/O, anything that reduces the disk I/O caused by intermediate files benefits job efficiency. So if you want to compress the intermediate results of MapReduce jobs, you can set the option in hadoop-site.xml under the Hadoop conf directory, set it in the program through the JobConf interface, or pass it with the -D option when submitting the job:

 <property>
   <name>mapred.compress.map.output</name>
   <value>true</value>
 </property>

You can also specify

 conf.setCompressMapOutput(true);
 conf.setMapOutputCompressorClass(GzipCodec.class);
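
The -D option mentioned above only takes effect when the driver passes its arguments through GenericOptionsParser, which ToolRunner does automatically. Below is a minimal driver skeleton along those lines; the class name MyDriver and the omitted mapper/reducer setup are illustrative assumptions, not part of the article. With it, a command line like hadoop jar myjob.jar MyDriver -Dmapred.compress.map.output=true <in> <out> would be honored.

 // Minimal Tool-based driver skeleton so that -D options from the command line
 // (parsed by GenericOptionsParser inside ToolRunner) end up in the job configuration.
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapred.FileInputFormat;
 import org.apache.hadoop.mapred.FileOutputFormat;
 import org.apache.hadoop.mapred.JobClient;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class MyDriver extends Configured implements Tool {
   public int run(String[] args) throws Exception {
     // getConf() already contains any -D options supplied on the command line,
     // e.g. -Dmapred.compress.map.output=true.
     JobConf job = new JobConf(getConf(), MyDriver.class);
     FileInputFormat.setInputPaths(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     // mapper/reducer/output types would be configured here in a real job
     JobClient.runJob(job);
     return 0;
   }

   public static void main(String[] args) throws Exception {
     System.exit(ToolRunner.run(new MyDriver(), args));
   }
 }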

With this set, the job compresses the intermediate results as it writes them to the slaves' local disks. On the reduce side, Hadoop recognizes from the .gz suffix that the intermediate results are compressed files and reads them with the corresponding format. Note that this intermediate-result compression may have a bug in version 0.19; see http://hi.chinaunix.net/?uid-9976001-action-viewspace-itemid-48083

3. Compress the final output results

Sometimes we need to keep the results of a job as history. If the results that accumulate every day are very large and we want to keep as much history as possible for later analysis, then over time they occupy a very large amount of HDFS storage; and because it is historical data with a low access frequency, a lot of storage space is wasted. Compressing the computed results is therefore a very good way to save space, and it is easy to do in a Hadoop job: just tell Hadoop "I want to compress the job's output results and save them to HDFS." Concretely, set the configuration option in conf:

 <property>
   <name>mapred.output.compress</name>
   <value>true</value>
 </property>

You can also specify

 conf.setBoolean("mapred.output.compress", true);
 conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
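
The same two settings can also be made through static helper methods on the old mapred API's FileOutputFormat, if I remember correctly; a small sketch (the wrapper class name is my own):

 // Equivalent settings via FileOutputFormat's static helpers (old mapred API).
 import org.apache.hadoop.io.compress.GzipCodec;
 import org.apache.hadoop.mapred.FileOutputFormat;
 import org.apache.hadoop.mapred.JobConf;

 public class OutputCompressionExample {
   public static void configure(JobConf conf) {
     FileOutputFormat.setCompressOutput(conf, true);                   // mapred.output.compress = true
     FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class); // mapred.output.compression.codec
   }
 }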

4. A comparison of the same task on hadoop-0.19.1 with the three compression options:

    • Read non-compressed files. The intermediate results are not compressed, and the output results are not compressed.

    • Read the compressed file. The intermediate results are not compressed, and the output results are not compressed.

    The value of HDFS bytes read - Map is significantly reduced.

    • Read non-compressed files. The intermediate results are compressed, and the output results are not compressed.

    The Local bytes read-Map/Reduce and Local bytes written-Map/Reduce values are significantly reduced.

    • Read non-compressed files. The intermediate results are not compressed, and the output results are compressed.

    The value of HDFS bytes written-Reduce is significantly reduced.
As you can see, reading gzip-compressed input and storing compressed intermediate and final results is very easy in Hadoop, because Hadoop itself provides the corresponding classes to parse and compress the data. One limitation of gzip support in Hadoop deserves special mention, though: because of the gzip compression algorithm itself, a gzip-compressed file cannot be split into blocks. That is, in a Hadoop MapReduce job one mapper must handle one whole .gz file; multiple mappers cannot concurrently process different chunks of the same gzip file. So to use gzip in a MapReduce task you have to split the data yourself before processing and let each mapper handle one piece, which actually goes against the nature of MapReduce. [What do these counters describe? See another blog post.]

5. For lzo, whose compression and decompression are very fast, see the following URL:

http://hi.chinaunix.net/?uid-9976001-action-viewspace-itemid-45151
