Hive Architecture Optimization, Part 7: Compression

Source: Internet
Author: User

Common approaches to compression include compressing intermediate results and compressing output results.

 

Compression comparison:

Algorithm   Compressed size (% of original)   Compression speed   Decompression speed
Gzip        13.4%                             21 MB/s             118 MB/s
LZO         20.5%                             135 MB/s            410 MB/s
Snappy      22.2%                             172 MB/s            409 MB/s
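To make the trade-off in the comparison concrete, the following Python sketch applies the three rows to a hypothetical 1024 MB input. The input size and the function name `estimate` are assumptions for illustration only. The ratios and speeds are copied from the comparison above, and the speeds are assumed to be measured against the uncompressed size.

```python
# Figures copied from the comparison: (compressed size as a fraction of the
# original, compression speed in MB/s, decompression speed in MB/s).
CODECS = {
    "Gzip": (0.134, 21, 118),
    "LZO": (0.205, 135, 410),
    "Snappy": (0.222, 172, 409),
}


def estimate(input_mb, codec):
    """Return (compressed_mb, compress_seconds, decompress_seconds).

    Assumes both speeds are measured against the uncompressed size.
    """
    ratio, comp_speed, decomp_speed = CODECS[codec]
    return (input_mb * ratio, input_mb / comp_speed, input_mb / decomp_speed)


if __name__ == "__main__":
    for name in CODECS:
        # 1024 MB is a hypothetical input size chosen for illustration.
        size, c_secs, d_secs = estimate(1024, name)
        print(f"{name:7s} {size:7.1f} MB  "
              f"compress {c_secs:5.1f} s  decompress {d_secs:4.1f} s")
```

Under these assumptions Gzip produces the smallest output but takes roughly eight times as long to compress as Snappy, which is why the faster codecs are attractive for data that is written once and read shortly afterwards.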

 

 

 

 

Snappy introduction:

Snappy: http://code.google.com/p/snappy/

Snappy's predecessor is Zippy. Although it is only a data compression library, Google uses it in many internal projects, including Bigtable, MapReduce, and RPC. Google states that the library and its algorithm were optimized for processing speed, at the expense of output size and compatibility with other similar tools. Snappy is specifically optimized for 64-bit x86 processors; the throughput figures quoted for it are per-second rates, in MB, for compression and decompression on a single core of an Intel Core i7 processor.

If some compression ratio is sacrificed, the compression speed can be higher still, although the resulting compressed files may be 20% to 100% larger than those produced by other libraries. For a given compression ratio, however, Snappy is remarkably fast compared with other compression libraries: "Plain text files compress 1.5 to 1.7 times faster than with other libraries and HTML 2 to 4 times faster, but there is no significant speedup for data that is already compressed, such as JPEG and PNG."
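The speed-versus-ratio trade-off, and the lack of benefit on already-compressed data, can be observed with any codec that offers tunable levels. Snappy is not part of the Python standard library, so this sketch uses `zlib` at levels 1 and 9 as a stand-in; it is not a Snappy benchmark. The sample inputs are synthetic, and the helper name `measure` is hypothetical.

```python
import random
import time
import zlib


def measure(data, level):
    """Compress data with zlib at the given level; return (ratio, seconds)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(data), elapsed


# Synthetic inputs: repetitive text, and pseudo-random bytes that behave
# like already-compressed data such as JPEG or PNG.
text = b"hive compresses intermediate and final results\n" * 20000
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(200_000))

if __name__ == "__main__":
    for label, data in (("text", text), ("noise", noise)):
        for level in (1, 9):
            ratio, secs = measure(data, level)
            print(f"{label:5s} level {level}: "
                  f"ratio {ratio:.3f}, {secs * 1000:.1f} ms")
```

The timings vary by machine, so only the relative behavior is meaningful: the text shrinks dramatically at either level, while the pseudo-random bytes stay at essentially their original size no matter how much effort is spent.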

 

Compression Technology Integration:

snappy-java-1.0.4.1.jar is bundled under: hadoop-2.0.0-cdh4.5.0\share\hadoop\sub-project\lib

core-site.xml

The compression codec classes to configure are separated by commas (,). Several implementation classes are configured here.

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
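Because the codec list is a single comma-separated string, a class name that is broken by a stray space or line wrap is easy to miss. The following Python sketch reads the list out of a configuration fragment and flags such entries. The fragment is a hypothetical stand-in for a real core-site.xml, and the helper names `codec_classes` and `malformed` are assumptions for illustration, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml fragment listing the five codec classes.
CORE_SITE = """
<configuration>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
</configuration>
"""


def codec_classes(xml_text):
    """Return the class names listed under io.compression.codecs."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "io.compression.codecs":
            value = prop.findtext("value", "")
            return [c.strip() for c in value.split(",") if c.strip()]
    return []


def malformed(classes):
    """Return entries containing whitespace, e.g. a name split by a line wrap."""
    return [c for c in classes if any(ch.isspace() for ch in c)]


if __name__ == "__main__":
    classes = codec_classes(CORE_SITE)
    print(len(classes), "codecs configured")
    print("malformed entries:", malformed(classes))
```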

 

mapred-site.xml

<!-- Codec used for the reduce stage output -->
<property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
<!-- Codec used for the map stage output -->
<property>
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- Whether the map stage output is compressed -->
<property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
</property>
<!-- Whether the reduce stage output is compressed -->
<property>
    <name>mapred.output.compress</name>
    <value>false</value>
</property>
<property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
</property>

Why is sometimes only the map output compressed?

Compressing the map output improves data transfer performance between the map and reduce stages.

With this configuration the reduce output is not compressed, so the data stored in HDFS is not compressed either.

Under normal circumstances, the reduce output should also be compressed.

The preceding configuration files apply to both Hadoop and Hive.

In a production environment, we recommend compressing both the map output and the reduce output.
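The effect of each choice can be estimated with simple arithmetic. The Python sketch below uses the Snappy ratio from the comparison earlier in this article (22.2% of the original size) and applies it to hypothetical map and reduce output sizes. The sizes, the function name `stage_sizes`, and the simplification that one ratio applies to both stages are all assumptions for illustration.

```python
SNAPPY_RATIO = 0.222  # compressed size / original size, from the comparison


def stage_sizes(map_output_gb, reduce_output_gb, compress_map, compress_reduce,
                ratio=SNAPPY_RATIO):
    """Return (shuffle_gb, hdfs_gb) for the given compression switches.

    shuffle_gb is the data moved between map and reduce; hdfs_gb is the
    final reduce output stored in HDFS.
    """
    shuffle = map_output_gb * (ratio if compress_map else 1.0)
    hdfs = reduce_output_gb * (ratio if compress_reduce else 1.0)
    return shuffle, hdfs


if __name__ == "__main__":
    # Hypothetical job: 100 GB of map output, 40 GB of reduce output.
    for label, flags in (("none", (False, False)),
                         ("map only", (True, False)),
                         ("map and reduce", (True, True))):
        shuffle, hdfs = stage_sizes(100, 40, *flags)
        print(f"{label:15s} shuffle {shuffle:6.1f} GB  HDFS {hdfs:5.1f} GB")
```

With map-only compression the shuffle shrinks but the HDFS footprint is unchanged, which matches the behavior of the configuration shown above.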

 

To make the settings apply only to Hive (either for a single session, or globally when configured in hive-site.xml):

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
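For the global case, each SET statement corresponds to one property element in hive-site.xml. The following Python sketch performs that conversion so the two forms can be compared side by side. The helper `set_to_property` is hypothetical and handles only simple `SET key=value;` statements.

```python
from xml.sax.saxutils import escape


def set_to_property(statement):
    """Convert one 'SET key=value;' statement into a <property> element."""
    body = statement.strip().rstrip(";").strip()
    if not body.upper().startswith("SET "):
        raise ValueError(f"not a SET statement: {statement!r}")
    key, _, value = body[4:].partition("=")
    return (
        "<property>\n"
        f"  <name>{escape(key.strip())}</name>\n"
        f"  <value>{escape(value.strip())}</value>\n"
        "</property>"
    )


# The three session-level statements from this section.
STATEMENTS = [
    "SET hive.exec.compress.output=true;",
    "SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;",
    "SET mapred.output.compression.type=BLOCK;",
]

if __name__ == "__main__":
    for stmt in STATEMENTS:
        print(set_to_property(stmt))
```

The printed elements would go inside the `<configuration>` element of hive-site.xml, whereas the SET form affects only the session in which it is run.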
