Learn about Hadoop file formats and compression. We have the largest and most up-to-date collection of Hadoop file format and compression information on alibabacloud.com
The four compression formats most widely used in Hadoop today are LZO, gzip, Snappy, and bzip2. Drawing on practical experience, the author describes the advantages, disadvantages, and application scenarios of these four formats, so that in practice we can, according to the actual situation, choose different
as the output of a MapReduce job and the input of another MapReduce job.
4 bzip2 Compression
Advantages: Supports splitting; has a high compression ratio, higher than that of gzip; is supported by Hadoop itself; the bzip2 command ships with Linux systems, so it is easy to use.
Disadvantages: Compression and decompression are slow; native libraries are not supported.
Application scenario: Suitable when speed requirements are modest but a high compression ratio is needed. It can be used as the output format of MapReduce jobs, or when the output data is large and, after processing, the data needs to be compre
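The trade-off described above (bzip2 compresses harder than gzip but runs slower) can be checked on your own data. The following is a minimal sketch, assuming Python's standard `zlib` module (DEFLATE, the algorithm behind gzip) and `bz2` module as stand-ins for the Hadoop codecs; the sample data and the `compare_codecs` helper are hypothetical, and the measured sizes and timings will vary with the input.

```python
import bz2
import time
import zlib


def compare_codecs(data: bytes) -> dict:
    """Compress `data` with DEFLATE and bzip2, recording size and elapsed time."""
    results = {}
    for name, compress, decompress in (
        ("deflate", zlib.compress, zlib.decompress),
        ("bzip2", bz2.compress, bz2.decompress),
    ):
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Sanity check: both codecs are lossless, so the round trip must match.
        assert decompress(packed) == data
        results[name] = {"bytes": len(packed), "seconds": elapsed}
    return results


# Hypothetical sample: log-like lines with a little variation per line.
sample = b"".join(
    b"2024-01-01 INFO request id=%d status=200\n" % i for i in range(20000)
)
report = compare_codecs(sample)
for name, stats in report.items():
    print(name, stats["bytes"], "bytes,", round(stats["seconds"], 4), "s")
```

Running this on a representative slice of real job output is a more reliable guide than any general ranking of the formats.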
,CompressionCodec.class);
4. The following uses hadoop-0.19.1 to compare one task under three compression methods:
Read non-compressed files. The intermediate results are not compressed, and the output results are not compressed.
Read the compressed file. The intermediate results are not compressed, and the output results are not compressed.
The value of HDFS
also generates more compression than GZip for some file types, but at some cost in compression and decompression speed. HBase does not support BZip2 compression.
Snappy usually performs better than LZO. You should run tests to see whether you detect a noticeable difference.
For MapReduce, if you need the c
, even the size of the 16bits x texture in the video memory is as high as 2 MB. To speed up rendering and reduce image aliasing, you can use mipmaps, which process a texture into a file composed of a series of pre-computed, optimized images. Of course, mipmaps require a certain amount of additional memory.
Our common image file formats are:
BMP: Windows standard image file
Hadoop's support for compressed files
Hadoop identifies compression formats transparently, so running our MapReduce tasks over compressed input requires no extra work: Hadoop automatically decompresses the compressed files for us.
If the compressed file
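The idea of picking a decompressor from the file name can be illustrated outside Hadoop. The following is a minimal sketch, assuming Python's standard `gzip` and `bz2` modules; the `OPENERS` table and the `open_transparently` helper are hypothetical names used only for illustration and are not Hadoop APIs.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical extension-to-opener table; unknown extensions fall back to open().
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_transparently(path, mode="rt"):
    """Open `path`, choosing a (de)compressor from the file extension."""
    extension = os.path.splitext(path)[1].lower()
    opener = OPENERS.get(extension, open)
    return opener(path, mode)


def roundtrip_demo():
    """Write and read the same text through each opener."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("data.txt", "data.txt.gz", "data.txt.bz2"):
            path = os.path.join(tmp, name)
            with open_transparently(path, "wt") as handle:
                handle.write("hello hadoop\n")
            with open_transparently(path, "rt") as handle:
                results[name] = handle.read()
    return results


results = roundtrip_demo()
```

The calling code never mentions a codec; only the file name decides how the bytes are decoded.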
1 Compression
Generally, data processed by computers contains some redundancy, and there is correlation within the data, especially between adjacent items. The data can therefore be stored in a special encoding, different from the original one, that makes it occupy less storage space. This process is generally called compression. The concept corresponding t
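The role of redundancy can be seen with a small experiment. The following is a minimal sketch, assuming Python's standard `zlib` module; the two inputs are hypothetical, one highly repetitive and one pseudo-random with almost no redundancy to exploit.

```python
import random
import zlib

# Seeded generator so the pseudo-random input is the same on every run.
rng = random.Random(0)

redundant = b"ab" * 5000                                  # strongly correlated bytes
noisy = bytes(rng.getrandbits(8) for _ in range(10000))   # little redundancy

redundant_size = len(zlib.compress(redundant))
noisy_size = len(zlib.compress(noisy))

print("original size:", len(redundant), "bytes each")
print("redundant input compresses to", redundant_size, "bytes")
print("noisy input compresses to", noisy_size, "bytes")
```

The repetitive input shrinks to a small fraction of its size, while the noisy input stays about as large as it started.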
File suffix                        Unzip command           Compress command
.zip (requires zip)                unzip File.zip          zip -r File.zip DirName
.rar (requires rar)                rar x File.rar          rar a File.rar
.tar (packaged, not compressed)    tar xvf File.tar        tar cvf File.tar DirName
.tar.gz, .tgz                      tar zxvf File.tar.gz    tar zcvf File.tar.gz DirName
.tar.bz2, .tar.bz                  tar jxvf File.t
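The tar variants listed above differ only in the compression applied on top of the archive. The following is a minimal sketch of the same idea using Python's standard `tarfile` module, where the modes `w`, `w:gz`, and `w:bz2` roughly correspond to `tar cvf`, `tar zcvf`, and `tar jcvf`; the helpers `make_archive` and `read_archive` and the sample file names are hypothetical.

```python
import io
import tarfile


def make_archive(files: dict, mode: str) -> bytes:
    """Pack {name: content} into an in-memory tar archive ('w', 'w:gz', 'w:bz2')."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def read_archive(archive: bytes) -> dict:
    """Read every regular file back; 'r:*' detects the compression automatically."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }


files = {"DirName/a.txt": b"first file\n", "DirName/b.txt": b"second file\n"}
archives = {mode: make_archive(files, mode) for mode in ("w", "w:gz", "w:bz2")}
```

Reading uses a single mode for all three archives, which mirrors how one `tar` binary handles every row of the table.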
calculate the checksum again after the data has been transmitted through an unreliable channel; in this way, you can see whether the data has been damaged. If the two computed checksums do not match, the data is considered damaged. However, this technique cannot repair the data; it can only detect errors. A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input of any size.
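The checksum comparison described above can be sketched in a few lines. This is a minimal illustration, assuming Python's standard `zlib.crc32` as the CRC-32 implementation; the `checksum` and `is_intact` helpers and the sample payloads are hypothetical and do not reflect how Hadoop itself stores or verifies checksums.

```python
import zlib


def checksum(data: bytes) -> int:
    """Return the CRC-32 of `data` as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def is_intact(data: bytes, expected: int) -> bool:
    """Recompute the checksum on the receiving side and compare."""
    return checksum(data) == expected


original = b"123456789"
sent_checksum = checksum(original)   # computed before transmission

corrupted = b"123456780"             # one byte damaged in transit

print(hex(sent_checksum))
print(is_intact(original, sent_checksum))
print(is_intact(corrupted, sent_checksum))
```

A mismatch only signals that something changed; the checksum carries no information about which bytes to repair.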
6. Compression and input splits
Compression format    Tool     Algorithm    File name extension    Multiple files    Splittable
DEFLATE               N/A      DEFLATE      .deflate               No                No
Gzip                  gzip     DEFLATE      .gz                    No                No
ZIP                   zip      DEFLATE      .zip                   Yes               Yes, at file boundaries
Bzip2                 bzip2    bzip2        .bz2                   No                Yes
LZO
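Whether a format can be split matters because it bounds how many map tasks can read one file in parallel. The following is a deliberately simplified sketch, assuming a 64 MB block size and the splittability values from the table above for .deflate, .gz, and .bz2; the `estimated_splits` function and the `SPLITTABLE_BY_EXTENSION` table are hypothetical, and real Hadoop split computation involves further settings that are ignored here.

```python
import math
import os

# Splittability per extension, taken from the table above.
# ZIP and LZO are left out because their rows are conditional or incomplete.
SPLITTABLE_BY_EXTENSION = {".deflate": False, ".gz": False, ".bz2": True}

# Assumed block size for illustration only.
BLOCK_SIZE = 64 * 1024 * 1024


def estimated_splits(filename: str, size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Estimate how many input splits one file yields under the assumptions above."""
    extension = os.path.splitext(filename)[1].lower()
    # Assumption: files with an unlisted extension are uncompressed and splittable.
    splittable = SPLITTABLE_BY_EXTENSION.get(extension, True)
    if not splittable:
        return 1  # the whole file must be read by a single task
    return max(1, math.ceil(size_bytes / block_size))


one_gib = 1024 ** 3
for name in ("logs.txt", "logs.gz", "logs.bz2"):
    print(name, estimated_splits(name, one_gib))
```

Under these assumptions a 1 GiB gzip file is handled by a single task, while the same data as plain text or bzip2 can be spread over 16.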
Summary of five commonly used picture formats and whether they use data compression. Disclaimer: when reprinting, please cite the source: http://blog.csdn.net/lg1259156776/ Description: This article focuses on the five most commonly used image formats: BMP, PNG, JPEG, JPEG 2000, and GIF. The first step in any image processing application is to be able to read these image files, although many develo
compression of common data compression algorithms. File compression has two main advantages: it reduces the space needed to store files, and it speeds up data transmission. In the context of Hadoop big data, these two points are especially important, so
When reprinting, please cite the source: Hadoop in-depth research (7): Compression
File compression has two main advantages: it reduces the storage space that files occupy, and it speeds up data transmission. In the context of Hadoop big data, these two points are p
This section roughly summarizes the compression and decompression methods for the various archive formats on Linux. I have not used some of these methods myself, so corrections are welcome, and I will update the text as needed. Thank you! What are the compression formats on Linux? .tar decompression: tar xvf F