1. Compression
Generally, the data processed by computers contains some redundancy, and there is correlation between data items, especially between adjacent ones. Data can therefore be stored using special encodings that differ from the original encoding, so that it occupies relatively less space; this process is generally called compression. The corresponding concept is decompression, the process of restoring compressed data from its special encoding back to the original data.
Compression is widely used in massive data processing. Compressing data files effectively reduces the space required to store them and speeds up data transfer over the network or to and from disk. In Hadoop, compression is applied to file storage, to the data exchanged between the map and reduce stages (the relevant options must be enabled), and to other scenarios.
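As a concrete illustration, the sketch below enables compression both for the intermediate map output and for the final job output. It is a minimal example and makes assumptions not stated in the text: it uses the Hadoop 2.x (MRv2) property names such as mapreduce.map.output.compress (older releases use mapred.compress.map.output), and the codec choices (BZip2Codec for intermediate data, GzipCodec for the output) are arbitrary examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionJobSetup {
    public static Job newJob() throws Exception {
        Configuration conf = new Configuration();
        // Compress the intermediate map output (data exchanged between map and reduce).
        // Property names assume Hadoop 2.x; adjust for older releases.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-job");
        // Compress the final job output written to HDFS
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}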
There are many ways to compress data, and data with different characteristics calls for different compression methods. For example, special data such as sound and images can use lossy compression, where the compression process is allowed to lose a certain amount of information in exchange for a much higher compression ratio. Moreover, music data has its own specialized encodings, so dedicated compression algorithms designed for those encodings can also be used.
2. Introduction to Hadoop Compression
As a general-purpose platform for processing massive amounts of data, Hadoop mainly considers compression and decompression speed and the splittability of compressed files when choosing a compression method.
All compression algorithms trade time against space: faster compression and decompression usually comes at the cost of a lower compression ratio. For example, when compressing data with the gzip command, you can choose between speed and space with different options: -1 favors speed, while -9 favors space and yields the highest compression ratio. Note also that for some algorithms the compression and decompression speeds differ significantly. gzip and zip are common general-purpose compression tools that are relatively balanced in time and space; bzip2 compresses more effectively than gzip and zip but is slower, and bzip2 decompresses faster than it compresses.
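The same speed/space trade-off can be observed from Java without Hadoop. The following minimal sketch uses the JDK's java.util.zip.Deflater (the same DEFLATE algorithm gzip uses, but not a Hadoop API) to compress the same data at level 1 and level 9; the exact sizes depend on the input data, this only illustrates the idea.

import java.util.zip.Deflater;

public class LevelDemo {
    // Compress the given bytes at the given level and return the compressed size
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[data.length + 64];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Highly redundant sample data so the effect of the level is visible
        byte[] data = new String(new char[100000]).replace('\0', 'a').getBytes();
        System.out.println("level 1 (speed): " + compressedSize(data, Deflater.BEST_SPEED) + " bytes");
        System.out.println("level 9 (space): " + compressedSize(data, Deflater.BEST_COMPRESSION) + " bytes");
    }
}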
When using MapReduce to process compressed files, you need to consider whether the compressed files are splittable. Suppose we need to process a 1 GB text file stored on HDFS. With an HDFS block size of 64 MB, the file is stored as 16 blocks, and the corresponding MapReduce job divides it into 16 input splits, each processed by an independent map task. However, if the file is a gzip-compressed file of the same size, the MapReduce job cannot divide it into 16 parts, because it is impossible to start decompressing a gzip stream from an arbitrary point in the middle. If the file is compressed in bzip2 format, on the other hand, the job can split the input at the boundaries between bzip2 blocks and decompress each part starting from the beginning of a block: bzip2 places a 48-bit synchronization marker between blocks, so bzip2 supports splitting.
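As a small sketch of how this difference surfaces in the API: in recent Hadoop versions, splittable codecs implement the org.apache.hadoop.io.compress.SplittableCompressionCodec interface, so splittability can be checked with an instanceof test. This assumes a Hadoop release that includes that interface (it is present in Hadoop 2.x).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SplittableCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodec gzip = ReflectionUtils.newInstance(GzipCodec.class, conf);
        CompressionCodec bzip2 = ReflectionUtils.newInstance(BZip2Codec.class, conf);
        // A codec that implements SplittableCompressionCodec can be split into input splits
        System.out.println("gzip splittable:  " + (gzip instanceof SplittableCompressionCodec));   // false
        System.out.println("bzip2 splittable: " + (bzip2 instanceof SplittableCompressionCodec));  // true
    }
}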
Table 3-2 lists the compression formats commonly used with Hadoop and their features.
Table 3-2 Compression formats supported by Hadoop
To support multiple compression and decompression algorithms, Hadoop introduces the notion of a codec (encoder/decoder). Similar to the Hadoop serialization framework, the codec framework uses the abstract factory design pattern. The codecs currently supported by Hadoop are listed in Table 3-3.
Table 3-3 Compression algorithms and their codecs
For a given compression format, both the compression and the decompression streams can be obtained from the same codec.
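The short sketch below illustrates this: a single codec instance, created through ReflectionUtils just as the factory does internally, supplies both the compressing output stream and the matching decompressing input stream. It is a minimal in-memory example and uses DefaultCodec (pure-Java zlib) only because it needs no native libraries.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate a codec through ReflectionUtils
        CompressionCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);

        // The codec supplies the compression side ...
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        CompressionOutputStream out = codec.createOutputStream(compressed);
        out.write("hello hadoop compression".getBytes("UTF-8"));
        out.close();

        // ... and the same codec supplies the matching decompression side
        CompressionInputStream in =
                codec.createInputStream(new ByteArrayInputStream(compressed.toByteArray()));
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        IOUtils.copyBytes(in, restored, conf);
        System.out.println(restored.toString("UTF-8"));
    }
}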
3. Hadoop Compression API Example
The following example shows how to use the Hadoop compression API: it compresses a file on HDFS with a chosen codec, decompresses a gzip file with an explicitly specified codec, and infers the codec from the file extension via CompressionCodecFactory.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecTest {
    public static void main(String[] args) throws Exception {
        compress("org.apache.hadoop.io.compress.BZip2Codec");
        // compress("org.apache.hadoop.io.compress.GzipCodec");
        // compress("org.apache.hadoop.io.compress.Lz4Codec");
        // compress("org.apache.hadoop.io.compress.SnappyCodec");
        // uncompress("text");
        // uncompress1("hdfs://master:9000/user/hadoop/text.gz");
    }

    // Compress a file
    public static void compress(String codecClassName) throws Exception {
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        // Both the input and the output are HDFS paths
        FSDataInputStream in = fs.open(new Path("/test.log"));
        FSDataOutputStream outputStream = fs.create(new Path("/test1.bz2"));
        System.out.println("compress start!");
        // Create a compressed output stream
        CompressionOutputStream out = codec.createOutputStream(outputStream);
        IOUtils.copyBytes(in, out, conf);
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
        System.out.println("compress ok!");
    }

    // Decompress a file with an explicitly specified codec
    public static void uncompress(String fileName) throws Exception {
        Class<?> codecClass = Class.forName("org.apache.hadoop.io.compress.GzipCodec");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
        FSDataInputStream inputStream = fs.open(new Path("/user/hadoop/text.gz"));
        // Decompress the file and write the result to the console
        InputStream in = codec.createInputStream(inputStream);
        IOUtils.copyBytes(in, System.out, conf);
        IOUtils.closeStream(in);
    }

    // Use the file extension to infer the codec needed to decompress the file
    public static void uncompress1(String uri) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.out.println("No codec found for " + uri);
            System.exit(1);
        }
        String outputUri = CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(out);
            IOUtils.closeStream(in);
        }
    }
}