Compression knowledge in Hadoop you can't afford to miss

Source: Internet
Author: User
Tags: sqoop


With the advent of the big data era, data volumes keep growing, and processing that data is increasingly limited by network IO. To handle as much data as possible, we have to use compression. So is every compression format in Hadoop suitable for every situation? And how do they perform?

Compression can be done inside Sqoop, and it can also be done in Hive and Impala. So under what circumstances should we use compression? Usually when the data volume is very large: we compress to reduce the amount of data, so that later use of the data involves less transfer IO. Compression also plays a role in improving performance and storage efficiency.

First, data compression

Compression is supported for every file format and reduces disk space usage. But compression itself imposes some CPU overhead, so it requires a trade-off between CPU time and bandwidth/storage space. For example:

(1) Some algorithms take longer but save more space.

(2) Some algorithms are faster, but the space saved is limited.

How should we understand this? For example, compressing 1 TB of data down to 100 GB might take 10 minutes, while compressing it down to only 500 GB might take 1 minute. Which would you choose? We need to make a trade-off between CPU time and bandwidth; there is no absolute good or bad here, only what fits the needs of our use case.
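As a rough, machine-dependent way to feel this trade-off yourself, you can time two standard codecs against the same file; data.csv here is just a placeholder for any sample file:

# Compare CPU time against space saved on the same input (-k keeps the original).
time bzip2 -k data.csv   # slower, higher compression ratio
time gzip -k data.csv    # faster, larger output
ls -lh data.csv data.csv.bz2 data.csv.gz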

In addition, compression is good for performance: many Hadoop jobs are IO-bound, and compression lets each IO operation carry more data; it also improves the performance of network transfers.

Second, compression codecs

The implementation of a compression algorithm is called a codec, which is shorthand for compressor/decompressor. Many codecs are commonly used in Hadoop, each with different performance characteristics, but not all Hadoop tools are compatible with all codecs. The compression algorithms most commonly used in Hadoop are bzip2, gzip, LZO, and Snappy; LZO and Snappy require native libraries to be installed on the operating system.
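Because LZO and Snappy depend on native libraries, it is worth checking what a given node can actually load. Hadoop ships a built-in check for this:

# Reports whether the native zlib, snappy, lz4, and bzip2 libraries are
# available to this installation; -a checks all and fails if any are missing.
hadoop checknative -a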

Here we look at the performance of the different compression tools:

[Figure: performance comparison of common Hadoop compression codecs]

bzip2 and gzip are more CPU-intensive and achieve the highest compression ratios, but gzip output cannot be split for parallel processing. Snappy and LZO are roughly comparable to each other: faster than gzip and lighter on CPU, at the cost of a lower compression ratio. In general, if you want to strike a balance between CPU and IO, Snappy and LZO are the common choices. Here I focus on Snappy, because it offers good compression performance and, inside splittable container formats such as SequenceFile, Avro, or Parquet, the compressed data can be split, which matters a great deal for downstream processing.

Also note: for hot data, speed matters more. A 1-second compression that saves 40% beats a 10-second compression that saves 80%.

Third, using compression in Sqoop

Sqoop selects the codec with the --compression-codec flag.

Example:

--compression-codec org.apache.hadoop.io.compress.SnappyCodec
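
For context, here is a minimal sketch of the flag inside a complete import command. The connection string, credentials, table, and target directory are hypothetical placeholders; note that --compress is what enables compression in the first place, while --compression-codec picks the codec:

# Hypothetical import of a MySQL table into HDFS with Snappy compression.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username reporter \
  -P \
  --table orders \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --target-dir /data/orders_snappy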

Fourth, using compression in Impala and Hive

To use compression in Impala and Hive, we need to specify it in the CREATE TABLE syntax, and the properties and syntax to specify can differ from one compression format to another.

Note: Impala processes query data in memory, so both compression and decompression happen in memory.

Impala Example:

[Figure: Impala example of enabling compression when creating a table]
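Since the original screenshot did not survive, here is a hedged sketch of what such an example typically looks like. COMPRESSION_CODEC is Impala's query option for Parquet writes and orc.compress is Hive's ORC table property; the table and column names are hypothetical:

# Impala: COMPRESSION_CODEC is a session-level query option, so the SET and
# the CREATE TABLE must run in the same session (a script file keeps them together).
cat > snappy_demo.sql <<'EOF'
SET COMPRESSION_CODEC=snappy;
CREATE TABLE orders_parquet STORED AS PARQUET AS SELECT * FROM orders;
EOF
impala-shell -f snappy_demo.sql

# Hive: for ORC tables the codec is set as a table property instead.
hive -e "CREATE TABLE orders_orc (id INT, amount DOUBLE)
         STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');"

In both engines the compression applies when data is written; readers decompress transparently.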

Finally, I suggest keeping up with big-data-related knowledge and continually improving your knowledge structure. I like to follow the "Big Data cn" public account; its content has been very helpful to me, and I recommend it to everyone.


This article is from the "11872756" blog; please be sure to keep the source: http://11882756.blog.51cto.com/11872756/1891080
