Original post: http://www.cnblogs.com/panfeng412/archive/2012/12/24/applications-scenario-summary-of-compression-algorithms.html
GZIP, LZO, and zippy/snappy are commonly used compression algorithms. Each has its own characteristics, so their application scenarios differ. This post summarizes them in light of related engineering practice.
Comparison of compression algorithms
Here is a set of test data that Google released a few years ago (the data is somewhat dated, and someone has recently re-run the tests; it is shared here for reference):
| Algorithm | % remaining | Encoding | Decoding |
| --- | --- | --- | --- |
| GZIP | 13.4% | MB/s | 118 MB/s |
| LZO | 20.5% | 135 MB/s | 410 MB/s |
| Zippy/Snappy | 22.2% | 172 MB/s | 409 MB/s |
Note: from HBase: The Definitive Guide.
From these numbers:
1) GZIP has the highest compression ratio, but it is CPU-intensive, consuming more CPU than the other algorithms, and its compression and decompression speeds are also the slowest;
2) LZO's compression ratio is in the middle, lower than GZIP's, but its compression and decompression are significantly faster than GZIP's, with decompression especially fast;
3) Zippy/snappy has the lowest compression ratio, and its compression and decompression speeds are slightly faster than LZO's.
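The ratio-versus-speed tradeoff behind these points can be seen even inside a single codec. The JDK ships only the DEFLATE/gzip family (LZO and Snappy require third-party libraries), so the following sketch is an illustrative micro-comparison rather than a reproduction of the table above: it contrasts zlib's fastest and strongest compression levels on repetitive input.

```java
import java.util.zip.Deflater;

public class RatioVsSpeed {
    // Compress input at the given zlib level and return the compressed size.
    static int compressedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);
        }
        d.end();
        return total;
    }

    public static void main(String[] args) {
        // Repetitive text compresses well, much like log or bitmap data.
        byte[] input = "the quick brown fox jumps over the lazy dog; "
                .repeat(2000).getBytes();
        int fast = compressedSize(input, Deflater.BEST_SPEED);
        int best = compressedSize(input, Deflater.BEST_COMPRESSION);
        System.out.println("original:         " + input.length + " bytes");
        System.out.println("BEST_SPEED:       " + fast + " bytes");
        System.out.println("BEST_COMPRESSION: " + best + " bytes");
        // Higher levels spend more CPU time for a smaller output,
        // the same tradeoff gzip makes against LZO and snappy.
    }
}
```

The same experiment, swapping in an LZO or Snappy binding, is the kind of workload-specific comparison worth doing before picking a codec.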
Selection of compression algorithms in BigTable and HBase
BigTable uses the zippy algorithm, aiming for the fastest possible compression and decompression speed while keeping CPU consumption low.
HBase used the LZO algorithm before Snappy was released (Google open-sourced Snappy in 2011), with a goal similar to BigTable's. After its release, the Snappy algorithm is recommended (see HBase: The Definitive Guide); in particular, it is worth running a more detailed comparison of LZO and Snappy on your actual workload before making a choice.
Practical experience from a real project
The project uses stream-lib, Clearspring's open-source cardinality-estimation library, whose probabilistic algorithms solve deduplicated counting problems such as UV computation. Its characteristics:
1) A UV count can be completed within a bitmap of fixed size (different sizes correspond to different error rates), such as 8k or 64k;
2) Different bitmaps can be merged to obtain the UV of their union.
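As an illustration of point 2), merging bitmaps amounts to a bitwise OR. The sketch below uses the JDK's exact `BitSet` rather than stream-lib's probabilistic structures (whose API differs and which are far more compact), simply to show why the union of two bitmaps yields the UV of the combined traffic:

```java
import java.util.BitSet;

public class BitmapMergeSketch {
    // Record a visit: map each user id to one bit position.
    // (Exact counting; stream-lib's sketches are probabilistic.)
    static void visit(BitSet bitmap, int userId) {
        bitmap.set(userId);
    }

    public static void main(String[] args) {
        BitSet day1 = new BitSet();
        BitSet day2 = new BitSet();
        visit(day1, 1); visit(day1, 2); visit(day1, 3);
        visit(day2, 3); visit(day2, 4);

        System.out.println("day1 UV: " + day1.cardinality());   // 3
        System.out.println("day2 UV: " + day2.cardinality());   // 2

        // Merging is a bitwise OR: a user seen on both days counts once.
        BitSet merged = (BitSet) day1.clone();
        merged.or(day2);
        System.out.println("merged UV: " + merged.cardinality()); // 4
    }
}
```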
The more bitmaps the system maintains, the more storage space is consumed, whether in memory or in a storage system (MySQL, HBase, etc.). It is therefore necessary to choose an appropriate algorithm to compress the bitmaps. Two situations arise:
1) When the bitmaps live in memory, the chosen compression algorithm must compress and decompress as fast as possible without consuming too much CPU; LZO or snappy are suitable here, enabling rapid compression and decompression;
2) When the bitmaps are stored in the DB, saving storage space matters more, so the highest possible compression ratio is wanted; gzip is used here, which also reduces the network I/O overhead when dumping from memory to the DB.
Summary
The above is a comparison of the characteristics of the GZIP, LZO, and zippy/snappy compression algorithms, together with some practical usage patterns. Corrections and discussion are welcome.