Using compression in Spark programs


Data is suitable for compression when it is stored in large contiguous areas and the data within those areas is highly repetitive. Data blocks produced by serializing arrays or objects typically fit this description, so serialized data is a good candidate for compression: compressing it condenses the data and reduces its space cost.

1. Spark's choice of compression methods

Spark's compression employs two algorithms, Snappy and LZF, implemented underneath by two third-party libraries, and users can plug in other compression libraries to extend Spark. Snappy provides higher compression speed, while LZF provides a higher compression ratio; users can choose the compression method according to their specific requirements.
The compression formats and their codecs are as follows.
· LZF: org.apache.spark.io.LZFCompressionCodec
· Snappy: org.apache.spark.io.SnappyCompressionCodec
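A codec from this list can be selected through the spark.io.compression.codec property. The following is a minimal configuration sketch (the application name is a placeholder; check the exact property values against your Spark version):

```scala
import org.apache.spark.SparkConf

// Select the LZF codec for Spark's internal compression
// (shuffle outputs, broadcast variables, serialized RDD blocks).
val conf = new SparkConf()
  .setAppName("CompressionCodecDemo") // placeholder name
  .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec")
```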

A comparison of the compression algorithms is shown in Figure 4-9.
(1) ning-compress
Ning-compress is a library for compressing and decompressing data in LZF format, written by Tatu Saloranta. Users can download it from GitHub at https://github.com/ning/compress for study and research.

(2) Snappy-java
The Snappy algorithm, formerly known as Zippy, has been used by Google in many internal projects such as MapReduce and BigTable. Snappy itself is a compression/decompression library developed by Google in C++; Snappy-java is its Java port. The GitHub address is https://github.com/xerial/snappy-java.
Snappy's goal is to provide high compression speed with a reasonable compression ratio. Its ratio is therefore similar to LZF's and not very high; depending on the data set, the size reduction can reach 20%~100%. Interested readers can look at a compression benchmark for JVM-based languages, written by Tatu Saloranta, which compares Snappy-java with other compression tools such as LZO-java, LZF, QuickLZ, gzip, and bzip2; its address is https://github.com/ning/jvm-compressor-benchmark/wiki. At comparable compression ratios, Snappy is usually faster than similar libraries such as LZO, LZF, FastLZ, and QuickLZ. Its compression ratio is about 1.5~1.7x for plain text, 2~4x for HTML pages, and essentially 1x (no compression) for binary data such as images. Snappy is optimized for both 64-bit and 32-bit processors to achieve high efficiency. According to the official introduction, Snappy has been tested by Google on PB-scale big data with no stability problems, and Google's MapReduce, RPC systems, and many other frameworks use the Snappy compression algorithm.
Compression is a trade-off between time and space: spending more time on compression and decompression saves more space, and a smaller footprint means more data can be cached, saving I/O time and network transfer time. Different compression algorithms make this trade-off for different contexts, and the results can vary with the type of data being compressed. Refer to Figure 4-9 for the trade-offs among the different algorithms.
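This trade-off can be observed directly with the JDK's built-in Deflate codec (used here instead of Snappy or LZF only because it ships with the JDK; the absolute numbers differ from Spark's codecs). The sketch below compresses the same repetitive input at the fastest and the strongest settings:

```scala
import java.util.zip.Deflater

// Compress `data` at the given effort level and return the compressed size.
def compressedSize(data: Array[Byte], level: Int): Int = {
  val deflater = new Deflater(level)
  deflater.setInput(data)
  deflater.finish()
  val buf = new Array[Byte](data.length + 64)
  var total = 0
  while (!deflater.finished())
    total += deflater.deflate(buf) // returns bytes written this call
  deflater.end()
  total
}

// Highly repetitive data: the kind described above that compresses well.
val data = ("spark " * 10000).getBytes("UTF-8")

val fast  = compressedSize(data, Deflater.BEST_SPEED)       // less CPU time
val small = compressedSize(data, Deflater.BEST_COMPRESSION) // more CPU time, tighter output

println(s"original=${data.length} fastest=$fast strongest=$small")
```

On repetitive input like this, both settings shrink the data dramatically; the stronger level spends more CPU to squeeze out the smaller result.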

2. Using compression in the Spark program

Users can configure compression in the following two ways.
(1) Configuring in the spark-env.sh file
The user can set compression parameters in the pre-boot configuration file spark-env.sh, for example:

export SPARK_JAVA_OPTS="-Dspark.broadcast.compress=true"

(2) Configuring in the application
Below, sc is the SparkContext object and conf is the SparkConf object.

val conf = sc.getConf

1) Get the compression configuration.

conf.getBoolean("spark.broadcast.compress", true)

2) Set the compression configuration.

conf.set("spark.broadcast.compress", "true")

The other parameters are shown in Table 4-2.
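Putting the two steps together, a minimal configuration sketch might look like the following (the application name and master URL are placeholders; compression takes effect when the SparkContext is created):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable broadcast compression and select a codec before creating the context.
val conf = new SparkConf()
  .setAppName("CompressionExample") // placeholder application name
  .setMaster("local[*]")            // placeholder master URL
  .set("spark.broadcast.compress", "true")
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")

val sc = new SparkContext(conf)

// Broadcast variables are now compressed before being shipped to executors.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
println(lookup.value("a"))

sc.stop()
```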

In distributed computing, serialization and compression are two important techniques for improving Spark application performance. Through serialization, Spark transforms pointer-chained in-memory data into contiguous byte streams, enabling data communication between distributed processes and compression of in-memory data. Compression reduces the memory footprint of the data as well as the I/O and network data transfer overhead.

