Compression knowledge in Hadoop you can't afford to miss

Source: Internet
Author: User
Tags: sqoop


With the advent of the big data era, data volumes keep growing, and processing that data is increasingly limited by network IO. To handle as much data as possible, we have to use compression. So is every compression format in Hadoop suitable for every situation? And how do they perform?

Compression can be done inside Sqoop, and it can also be done in Hive and Impala. So under what circumstances should we use compression? Usually when the data volume is very large: we compress to reduce the amount of data, so that later use of the data involves less transfer IO. Compression also plays a role in improving performance and storage efficiency.

First, data compression

Compression is supported for every file format and reduces disk space usage. But compression itself imposes some CPU overhead, so it requires a trade-off between CPU time and bandwidth/storage space. For example:

(1) Some algorithms take longer but save more space.

(2) Some algorithms are faster, but the space saved is limited.

How should we understand this? For example, compressing 1 TB of data down to 100 GB might take 10 minutes, while compressing it down to only 500 GB might take 1 minute. Which would you choose? We need to make a trade-off between CPU time and bandwidth; there is no absolute good or bad here, only what fits the needs of our use case.
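As a rough, machine-dependent way to feel this trade-off yourself, you can time two standard codecs against the same file; data.csv here is just a placeholder for any sample file:

# Compare CPU time against space saved on the same input (-k keeps the original).
time bzip2 -k data.csv   # slower, higher compression ratio
time gzip -k data.csv    # faster, larger output
ls -lh data.csv data.csv.bz2 data.csv.gz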

In addition, compression is good for performance: many Hadoop jobs are IO-bound, and compression lets each IO operation carry more data; it also improves the performance of network transfers.

Second, compression codecs

The implementation of a compression algorithm is called a codec, which is shorthand for compressor/decompressor. Many codecs are commonly used in Hadoop, each with different performance characteristics, but not all Hadoop tools are compatible with all codecs. The compression algorithms most commonly used in Hadoop are bzip2, gzip, LZO, and Snappy; LZO and Snappy require native libraries to be installed on the operating system.
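Because LZO and Snappy depend on native libraries, it is worth checking what a given node can actually load. Hadoop ships a built-in check for this:

# Reports whether the native zlib, snappy, lz4, and bzip2 libraries are
# available to this installation; -a checks all and fails if any are missing.
hadoop checknative -a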

Here we look at the performance of the different compression tools:

[Figure: performance comparison of common Hadoop compression codecs]

bzip2 and gzip are more CPU-intensive and achieve the highest compression ratios, but gzip output cannot be split for parallel processing. Snappy and LZO are roughly comparable to each other: faster than gzip and lighter on CPU, at the cost of a lower compression ratio. In general, if you want to strike a balance between CPU and IO, Snappy and LZO are the common choices. Here I focus on Snappy, because it offers good compression performance and, inside splittable container formats such as SequenceFile, Avro, or Parquet, the compressed data can be split, which matters a great deal for downstream processing.

Also note: for hot data, speed matters more. A 1-second compression that saves 40% beats a 10-second compression that saves 80%.

Third, using compression in Sqoop

Sqoop selects the codec with the --compression-codec flag.

Example:

--compression-codec org.apache.hadoop.io.compress.SnappyCodec
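
For context, here is a minimal sketch of the flag inside a complete import command. The connection string, credentials, table, and target directory are hypothetical placeholders; note that --compress is what enables compression in the first place, while --compression-codec picks the codec:

# Hypothetical import of a MySQL table into HDFS with Snappy compression.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username reporter \
  -P \
  --table orders \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --target-dir /data/orders_snappy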

Fourth, using compression in Impala and Hive

To use compression in Impala and Hive, we need to specify it in the CREATE TABLE syntax, and the properties and syntax to specify can differ from one compression format to another.

Note: Impala processes query data in memory, so both compression and decompression happen in memory.

Impala Example:

[Figure: Impala example of enabling compression when creating a table]
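Since the original screenshot did not survive, here is a hedged sketch of what such an example typically looks like. COMPRESSION_CODEC is Impala's query option for Parquet writes and orc.compress is Hive's ORC table property; the table and column names are hypothetical:

# Impala: COMPRESSION_CODEC is a session-level query option, so the SET and
# the CREATE TABLE must run in the same session (a script file keeps them together).
cat > snappy_demo.sql <<'EOF'
SET COMPRESSION_CODEC=snappy;
CREATE TABLE orders_parquet STORED AS PARQUET AS SELECT * FROM orders;
EOF
impala-shell -f snappy_demo.sql

# Hive: for ORC tables the codec is set as a table property instead.
hive -e "CREATE TABLE orders_orc (id INT, amount DOUBLE)
         STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');"

In both engines the compression applies when data is written; readers decompress transparently.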

Finally, I suggest keeping up with big-data-related knowledge and continually improving your knowledge structure. I like to follow the "Big Data cn" public account; its content has been very helpful to me, and I recommend it to everyone.


This article is from the "11872756" blog; please be sure to keep the source: http://11882756.blog.51cto.com/11872756/1891080
