History of lossless data compression algorithms

Introduction

There are two main families of compression algorithms: lossy and lossless. Lossy compression algorithms shrink files by discarding the small details that would require large amounts of data to store at full fidelity. Because some of the data is removed, the original file can never be restored exactly; in exchange, very high compression ratios can be achieved. Lossy compression is mainly used for images and audio, but it is not the subject of this article. Lossless compression also makes files smaller, but the corresponding decompression function can restore the original file exactly, without losing any data. Lossless data compression is used throughout computing, from saving space on your personal computer to sending data over the Web, communicating over Secure Shell, or viewing a PNG or GIF image.

The fundamental principle of lossless compression is that any non-random file contains duplicated information, which can be modeled with statistical techniques that estimate the probability of a character or phrase occurring. These statistical models can then be used to generate codes for specific characters or phrases, assigning the shortest codes to the most frequently occurring data. Such techniques include entropy encoding, run-length encoding, and dictionary compression. Using them, an 8-bit character or a string of such characters can be represented with just a few bits, and large amounts of repeated data can be removed.

History


Data compression only began to play a significant role in computing in the 1970s, when the Internet became more popular and the Lempel-Ziv algorithms were invented, but compression has a much longer history. Morse code, invented in 1838, is the earliest instance of data compression: the most common English letters, such as "e" and "t", are assigned the shortest Morse codes. Later, with the rise of the mainframe, Claude Shannon and Robert Fano invented Shannon-Fano coding. Their algorithm assigns codes to symbols based on the probability that each symbol appears: the more probable a symbol is, the shorter its code, so frequent symbols are represented more compactly.

Two years later, David Huffman was studying information theory at MIT and taking a course taught by Robert Fano. Fano gave the class a choice between writing a term paper and taking the final exam. Huffman chose the term paper, whose assigned topic was to find the optimal binary coding. After several months of work with nothing to show for it, Huffman decided to abandon the paper and study for the final exam. It was then that he hit upon a coding algorithm similar to, but more efficient than, Shannon-Fano coding. The main difference between Shannon-Fano coding and Huffman coding lies in how the probability tree is built: Shannon-Fano builds it from the top down and can yield a sub-optimal result, while Huffman builds it from the bottom up and always produces an optimal code.

Early implementations of Shannon-Fano coding and Huffman coding were done in hardware with hard-coded codes. It was not until the 1970s, with the emergence of the Internet and online storage, that software compression appeared in which Huffman codes were generated dynamically from the input data. Subsequently, in 1977, Abraham Lempel and Jacob Ziv published their groundbreaking LZ77 algorithm, the first algorithm to compress data using a dictionary; in particular, LZ77 uses a dynamic dictionary called a sliding window. In 1978, the pair published the LZ78 algorithm, which also uses a dictionary; unlike LZ77, LZ78 parses the input data to generate a static dictionary rather than building one dynamically.

Legal issues

LZ77 and LZ78 quickly became popular and spawned many derivative compression algorithms. Most of these have since faded into obscurity, and only a handful remain in wide use today, including deflate, LZMA, and LZX. The vast majority of common compression algorithms descend from LZ77. This is not because LZ77 is technically superior, but because development of the LZ78 branch was hindered by patents: in 1984, Sperry patented LZW, an LZ78 derivative, and later Sperry (by then part of Unisys) began suing software vendors, server administrators, and even end users who used the GIF format without a license.

At the same time, the UNIX compress utility used a slightly modified LZW algorithm called LZC, which was later abandoned because of the patent issues, and other UNIX developers also began moving away from LZW. This led the UNIX community to adopt the deflate-based gzip and the Burrows-Wheeler-transform-based bzip2. In the long run this was good for the UNIX community, because gzip and bzip2 almost always achieve better compression ratios than LZW. The patent problems surrounding LZW ended when its patent expired in 2003, but by then LZW had largely been replaced and remains in common use only for GIF compression. Since then, few LZW derivatives have become popular, and LZ77-based algorithms remain the mainstream.

Another lawsuit was filed in 1993, this time over the LZS algorithm. LZS was developed by Stac Electronics for disk compression software such as Stacker. Microsoft used the LZS algorithm in the disk compression software it shipped with MS-DOS 6.0, which claimed to double hard-disk capacity. When Stac Electronics discovered that its intellectual property had been used, it sued Microsoft. Microsoft was found guilty of patent infringement and ordered to pay Stac Electronics $120 million in damages, reduced by $13.6 million awarded in a counterclaim finding that Microsoft's infringement was not willful. Although the Stac-Microsoft suit was large, it did not impede the development of the Lempel-Ziv family the way the LZW patent disputes did; the only apparent consequence is that LZS has spawned few derivative algorithms.

Rise of deflate

After the Lempel-Ziv algorithms were published, and as storage needs continued to grow, some companies and other groups began to use data compression to meet those needs. However, data compression did not see widespread use until the late 1980s, when the Internet began to take off and a real demand for compression emerged: bandwidth was limited and expensive, and compression helped ease that bottleneck. Once the World Wide Web developed, people began sharing images and data in other formats that are far larger than text, and compression became extremely important. To meet these requirements, several new file formats incorporating compression were developed, including ZIP, GIF, and PNG.

Thom Henderson released the first commercially successful archive format, ARC, through his company, System Enhancement Associates. ARC was especially popular in the BBS community because it was the first program that could both bundle and compress files, and its source code was also available. The ARC format uses an LZW-derived algorithm to compress data. A man named Phil Katz noticed ARC's popularity and decided to rewrite its compression and decompression routines in assembly language, hoping to improve on it. He released his shareware PKARC program in 1987 and was soon sued by Henderson for copyright infringement. Katz was found guilty and forced to pay copyright fees and other licensing costs; he was judged to have infringed because PKARC was an obvious copy of ARC, down to identical typos in the source-code comments.

Because of the licensing issues, Phil Katz could no longer sell PKARC after 1988, so in 1989 he created a modified version, now known as the ZIP format. Since using LZW was considered patent infringement, Katz switched to his new implode algorithm. The format was revised again in 1993, when Katz released PKZIP version 2.0, which implemented the deflate algorithm along with features such as splitting an archive across multiple volumes. This version of ZIP is now found everywhere: despite its age, virtually all ZIP files follow the PKZIP 2.0 format.

The GIF format, whose full name is Graphics Interchange Format, was created by CompuServe in 1987. It allows images to be shared without distortion (although the format is limited to a maximum of 256 colors per frame) while keeping file sizes small enough to transmit over a modem. Like the ZIP format, however, GIF is based on the LZW algorithm. Despite the patent infringement, Unisys was unable to stop the spread of GIF images, and even now, more than twenty years later, GIF is still in use, especially for its animation capabilities.

Although GIF could not be stopped, CompuServe sought a patent-free format and introduced the Portable Network Graphics (PNG) format in 1994. Like ZIP, PNG uses the deflate algorithm for compression. Although Katz held a patent covering deflate, it was never enforced, so PNG and other deflate-based formats avoided patent infringement. While LZW dominated the early history of compression, it gradually faded from the mainstream because of Unisys's aggressive litigation, and the world switched to the faster and more efficient deflate algorithm. Deflate is currently the most widely used compression algorithm, something of a Swiss Army knife of compression.

Beyond the PNG and ZIP formats, deflate shows up frequently elsewhere in the computing world. For example, it is used in the gzip (.gz) file format, which can be thought of as an open-source counterpart to ZIP, and it is used to compress data in transit in HTTP, SSL, and other network technologies.

Unfortunately, Phil Katz died young and never saw his deflate algorithm come to dominate the computing world. He struggled with alcohol abuse for several years, and his life began to fall apart in the late 1990s, with several arrests for drunk driving and other offenses. Katz was found dead in a hotel room on April 14, 2000, at the age of 37. The cause of death was acute pancreatic bleeding brought on by alcoholism; he was found surrounded by empty bottles.

Some existing archive software

ZIP and other deflate-based formats dominated until the mid-1990s, when new and improved formats began to appear. In 1993, Eugene Roshal released his archiver WinRAR, which uses the RAR format. Recent versions of RAR combine the PPM and LZSS algorithms; earlier versions are less well documented. RAR became a de facto standard for sharing files on the Internet, particularly for distributing pirated media. In 1996, an open-source implementation of the Burrows-Wheeler transform called bzip2 was released and quickly became popular on UNIX platforms, competing strongly with the deflate-based gzip format. In 1999, another open-source project appeared: 7-Zip, with its .7z format. 7-Zip was arguably the first format able to challenge the dominance of ZIP and RAR, thanks to its high compression ratio, modularity, and openness; the format is not tied to a single compression algorithm but can choose among bzip2, LZMA, LZMA2, and PPMd. Finally, among newer archivers there is the PAQ* family. The first PAQ version, PAQ1, was released by Matt Mahoney in 2002. PAQ improves on the PPM approach with a technique called context mixing, which combines two or more statistical models to produce a better prediction of the next symbol than any single model could.

Compression Techniques

Many different techniques are used to compress data. Most cannot stand alone and must be combined with others to form a complete algorithm, and even those that can stand alone are usually more effective when combined with other techniques. Most of them fall under the heading of entropy encoding, but other techniques, such as run-length encoding and the Burrows-Wheeler transform, are also quite common.

Run-length encoding

Run-length encoding (RLE) is a very simple compression technique that replaces a run of repeated characters with the number of repetitions followed by the character; a single character is encoded as a run of length 1. RLE is well suited to highly redundant data, such as indexed images with many rows of identically colored pixels, and it can also be combined with other techniques such as the Burrows-Wheeler transform.

The following is a simple example of RLE:

Input:  aaabbccccdeeeeeeaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Output: 3a2b4c1d6e38a
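
To make this concrete, here is a minimal RLE sketch in Python. The function names are illustrative rather than from any library, and this textual scheme would need escaping if the data itself contained digits:

    import re

    def rle_encode(data: str) -> str:
        """Replace each run of identical characters with <count><char>."""
        if not data:
            return ""
        out = []
        run_char, run_len = data[0], 1
        for ch in data[1:]:
            if ch == run_char:
                run_len += 1
            else:
                out.append(f"{run_len}{run_char}")
                run_char, run_len = ch, 1
        out.append(f"{run_len}{run_char}")
        return "".join(out)

    def rle_decode(encoded: str) -> str:
        """Inverse of rle_encode: expand <count><char> pairs back into runs."""
        return "".join(ch * int(n) for n, ch in re.findall(r"(\d+)(\D)", encoded))

    print(rle_encode("aaabbccccd"))   # -> 3a2b4c1d
    print(rle_decode("3a2b4c1d"))     # -> aaabbccccd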

Burrows-Wheeler transform

The Burrows-Wheeler transform (BWT), invented in 1994, reversibly transforms a block of input data so that runs of identical characters are maximized. The BWT itself does not compress anything; it simply rearranges the data so that a run-length encoder or another compression algorithm can encode it more effectively.

The BWT algorithm is simple:

  1. Create a string array.
  2. Insert every rotation of the input string into the array.
  3. Sort the array alphabetically.
  4. Return the last column of the sorted array.

The BWT works best on long inputs containing many alternating repetitions of the same characters. Here is an example with an ideal input; note that "&" simply marks the end of the string:

Input:  AHAHAH&
Output: HHH&AAA

Because identical symbols end up next to each other, the transformed data is much easier for another algorithm to compress; for example, RLE turns the output into "3H&3A". Although this example produces an excellent result, real-world data is rarely this ideal.
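
A naive Python sketch of the four steps above follows; it appends the "&" end marker used in the example and is quadratic in time and memory, so it is illustrative only:

    def bwt(s: str, eof: str = "&") -> str:
        """Naive Burrows-Wheeler transform: build every rotation, sort, take the last column."""
        s += eof                                           # unique end-of-string marker
        rotations = [s[i:] + s[:i] for i in range(len(s))]
        rotations.sort()                                   # plain lexicographic sort
        return "".join(rotation[-1] for rotation in rotations)

    print(bwt("AHAHAH"))   # -> HHH&AAA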

Entropy Encoding

In data compression, entropy refers to the average minimum number of bits needed to represent a character or phrase. A basic entropy coder combines a statistical model with a coder: the input file is parsed to build a statistical model consisting of the probabilities of the characters, and the coder then uses that model to decide how many bits to assign to each character, so that the most common characters get the shortest codes and the least common ones get the longest.

Shannon-Fano Coding

This is one of the earliest compression techniques, invented by Claude Shannon and Robert Fano in 1949. At its heart is a binary tree representing the probabilities of the characters: the more frequently a character appears, the closer it sits to the top of the tree, and the rarer it is, the closer it sits to the bottom.

The code for a character is obtained by walking the Shannon-Fano tree down to that character, appending a 0 for each left branch taken and a 1 for each right branch. For example, if "A" is reached by two left branches followed by one right branch, its code is "001". Shannon-Fano coding does not always produce optimal codes, mainly because the binary tree is built from the top down. For this reason, Huffman coding, which yields an optimal code for any input, is used far more frequently.

The algorithm for generating a Shannon-Fano code is straightforward (a Python sketch follows the list):

  1. Parse the input, counting the frequency of each character.
  2. Compute the probability of each character from its frequency.
  3. Sort the characters in descending order of probability.
  4. Generate a leaf node for each character.
  5. Split the character list into two parts so that the total probability of the left half is as close as possible to that of the right half.
  6. Append "0" to the codes of the left half and "1" to the codes of the right half.
  7. Repeat steps 5 and 6 on each half until every character is a leaf node.
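
A possible Python sketch of this procedure, assuming frequencies are supplied as a plain dict or Counter; the split point in step 5 is chosen greedily where the two halves' weights are closest, which is one common way to implement it:

    from collections import Counter

    def shannon_fano(freqs):
        """Return a dict mapping each symbol to its Shannon-Fano code string."""
        symbols = sorted(freqs, key=freqs.get, reverse=True)   # most frequent first
        codes = {s: "" for s in symbols}

        def split(group):
            if len(group) < 2:
                return
            total = sum(freqs[s] for s in group)
            best_diff, cut = float("inf"), 1
            # find the split where the left half's weight is closest to half the total
            for i in range(1, len(group)):
                left_weight = sum(freqs[s] for s in group[:i])
                diff = abs(total - 2 * left_weight)
                if diff < best_diff:
                    best_diff, cut = diff, i
            left, right = group[:cut], group[cut:]
            for s in left:
                codes[s] += "0"
            for s in right:
                codes[s] += "1"
            split(left)
            split(right)

        split(symbols)
        return codes

    print(shannon_fano(Counter("this is an example")))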
Huffman coding

Huffman coding is another example of entropy coding. It is very similar to Shannon-Fano coding, except that it builds an optimal binary tree from the bottom up.

The first three steps of the Huffman algorithm are identical to Shannon-Fano (a Python sketch follows the list):

  1. Parse the input, counting the frequency of each character.
  2. Compute the probability of each character from its frequency.
  3. Sort the characters in descending order of probability.
  4. Generate a leaf node for each character, store its probability P in the node, and add all the nodes to a queue.
  5. While more than one node remains in the queue:
    • Remove the two nodes with the lowest probability from the queue.
    • Assign "0" to the first node and "1" to the second node.
    • Create a new node whose probability is the sum of the two removed nodes.
    • Make the first node the left child of the new node and the second node the right child.
    • Add the new node back to the queue.
  6. The last remaining node is the root of the binary tree.
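
A compact Python sketch using a min-heap as the queue; the tie-breaking counter is an implementation detail added here, not part of the algorithm as described:

    import heapq
    from collections import Counter

    def huffman_codes(text: str) -> dict:
        """Build Huffman codes with a min-heap; returns symbol -> bit string."""
        freqs = Counter(text)
        heap = [[weight, i, [sym, ""]] for i, (sym, weight) in enumerate(freqs.items())]
        heapq.heapify(heap)
        if len(heap) == 1:                       # degenerate case: single distinct symbol
            return {heap[0][2][0]: "0"}
        counter = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)             # the two least probable nodes
            hi = heapq.heappop(heap)
            for pair in lo[2:]:
                pair[1] = "0" + pair[1]          # prepend bit for the left subtree
            for pair in hi[2:]:
                pair[1] = "1" + pair[1]          # prepend bit for the right subtree
            heapq.heappush(heap, [lo[0] + hi[0], counter] + lo[2:] + hi[2:])
            counter += 1
        root = heapq.heappop(heap)
        return {sym: code for sym, code in root[2:]}

    codes = huffman_codes("this is an example of huffman coding")
    print(sorted(codes.items(), key=lambda kv: (len(kv[1]), kv[0])))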
Arithmetic Coding

Arithmetic coding was developed at IBM in 1979, when the company was investigating compression algorithms for use on its mainframes. If compression ratio is the only criterion, arithmetic coding is arguably the best entropy-coding technique, and it generally beats Huffman coding in this respect; however, it is considerably more complex than the other techniques.

Unlike the other techniques, arithmetic coding does not build the character probabilities into a tree. Instead, it transforms the input into a single rational number between 0 and 1. The number of distinct input characters is taken as the base, and each character is assigned a value between 0 and base - 1. The result is then converted to binary to obtain the final output. The process can be reversed by converting the base back and replacing each value with its corresponding character, recovering the original input.

A basic algorithm for computing an arithmetic code is as follows:

  1. Count the number of distinct characters in the input data; this number is the base B (for example, base 2 corresponds to binary).
  2. Assign a value between 0 and B - 1 to each character, in the order in which the characters first appear.
  3. Using the values from step 2, replace each character in the input with its corresponding digit (the encoding).
  4. Convert the result obtained in step 3 from base B to base 2.
  5. Record the total number of characters in the input, which is needed for decoding.

The following is an encoding example for the input "abcdaabd":

  1. There are 4 distinct characters, so base = 4 and length = 8.
  2. Values are assigned to the characters in order of appearance: a = 0, b = 1, c = 2, d = 3.
  3. Replacing each character with its digit gives "0.012300134": the leading "0." turns the digits into a fraction, and the trailing 4 records that the base is 4.
  4. Converting "0.012300134" from base 4 to base 2 gives "0.011011000001112", where the trailing 2 indicates base 2.
  5. The total number of input characters, 8, is recorded in the result.

Assuming each character takes 8 bits, the input would require 64 bits, while the arithmetic-coded result needs only about 15 bits, a compression ratio of roughly 24%, which is a remarkable saving. This example shows how arithmetic coding can compress even a fixed string well.
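
A toy Python sketch of the simplified base-conversion scheme described above; real arithmetic coders narrow a probability interval symbol by symbol, whereas this sketch only packs fixed-width digits, which is exact only when the base is rounded up to a power of two:

    import math

    def simple_base_encode(text: str):
        """Map symbols to base-B digits (in order of first appearance) and pack them into bits."""
        symbols = list(dict.fromkeys(text))            # distinct symbols, in order of appearance
        base = len(symbols)
        value = {s: i for i, s in enumerate(symbols)}
        digits = [value[ch] for ch in text]
        bits_per_digit = math.ceil(math.log2(base))    # exact only when base is a power of two
        bitstring = "".join(format(d, f"0{bits_per_digit}b") for d in digits)
        return symbols, digits, bitstring

    symbols, digits, bits = simple_base_encode("abcdaabd")
    print(symbols)           # ['a', 'b', 'c', 'd']
    print(digits)            # [0, 1, 2, 3, 0, 0, 1, 3]
    print(bits, len(bits))   # 0001101100000111 16  (the ~15-bit figure in the text drops leading zeros)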

Compression Algorithms

Sliding Window Algorithms

LZ77

LZ77, published in 1977, is a landmark compression algorithm: it introduced the concept of the "sliding window" for the first time and achieved a markedly better compression ratio than the more primitive algorithms that preceded it. LZ77 maintains a dictionary and represents data with triples of (offset, run length, deviating character). The offset says how far from the start of the input a given phrase begins, and the run length says how many characters of that phrase to copy. The deviating character signals that a new phrase was found; the new phrase equals the substring from offset to offset + length plus the deviating character. The dictionary changes dynamically as the file is parsed, based on the sliding window; for example, a 64 MB sliding window means the dictionary holds entries drawn from the last 64 MB of input data.

If an input is "abbadabba", the output may be like "ABB (, 'D') (, 'A')", as shown in:


Although the above replacement seems to be larger than the original data, when the input data is larger, the effect will be better.
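
For concreteness, here is a toy LZ77 encoder that emits (offset, length, next character) triples with offsets counted from the start of the window. It uses a greedy parse, so its triples differ slightly from the hand-worked example above, although they decode to the same string:

    def lz77_encode(data: str, window_size: int = 1 << 12):
        """Toy LZ77 encoder emitting (offset, length, next_char) triples. Illustrative only."""
        i, out = 0, []
        while i < len(data):
            start = max(0, i - window_size)
            window = data[start:i]
            best_len, best_off = 0, 0
            # find the longest match of the upcoming text inside the window
            for off in range(len(window)):
                length = 0
                while (i + length < len(data) - 1
                       and off + length < len(window)
                       and window[off + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, off
            next_char = data[i + best_len]
            out.append((start + best_off, best_len, next_char))
            i += best_len + 1
        return out

    print(lz77_encode("abbadabba"))
    # -> [(0, 0, 'a'), (0, 0, 'b'), (1, 1, 'a'), (0, 0, 'd'), (0, 3, 'a')]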

LZR

LZR, invented by Michael Rodeh in 1981, is a modified version of LZ77 intended as a linear-time alternative to it. However, its encoded pointers can point to any offset in the file, which requires a considerable amount of memory, and its compression ratio is also unimpressive (LZ77 is much better). LZR was not a successful LZ77 derivative.

Deflate

Deflate was invented by Phil Katz in 1993 and is the cornerstone of most modern compression tasks. It simply combines two algorithms: an LZ77 or LZSS preprocessing pass followed by Huffman coding, which together produce good compression quickly.

Deflate64

Deflate64 is a proprietary extension of deflate. It increases the dictionary size to 64 KB (hence the name), allowing greater distances in the sliding window, and improves both performance and compression ratio compared with deflate. However, Deflate64 is rarely used, because of its proprietary status and because the improvement over deflate is modest; open algorithms such as LZMA are widely used instead.

LZSS

LZSS, short for Lempel-Ziv-Storer-Szymanski, was published by James Storer and Thomas Szymanski in 1982. LZSS improves on LZ77 in that it can detect whether a substitution will actually reduce the file size; if it will not, the input is kept as literal text. LZSS also drops the "next character" and uses only (offset, length) pairs, where the offset is the number of bytes from the start of the input and the length is the number of characters to read from that position.

Here is a simple example: with the input "these theses", the result is "these (0,5)s", where the pair refers back to the five characters "these" at the start. This saves only one byte, but the savings grow considerably as the input gets larger.
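
A toy LZSS-style encoder along these lines; the minimum match length of 3 is an assumption standing in for the "does this substitution actually shrink the output" test, and offsets count from the start of the input as in the example:

    def lzss_encode(data: str, min_match: int = 3):
        """Toy LZSS encoder: emit literal characters, or (offset, length) pairs
        when a back-reference is long enough to save space. Illustrative only;
        real LZSS uses a bounded sliding window and per-token bit flags."""
        i, out = 0, []
        while i < len(data):
            best_len, best_off = 0, 0
            for off in range(i):
                length = 0
                while (i + length < len(data) and off + length < i
                       and data[off + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, off
            if best_len >= min_match:            # only substitute when it shrinks the output
                out.append((best_off, best_len))
                i += best_len
            else:
                out.append(data[i])              # keep the literal character
                i += 1
        return out

    print(lzss_encode("these theses"))
    # -> ['t', 'h', 'e', 's', 'e', ' ', (0, 5), 's']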


LZSS is still used in many popular archive formats, the best known of which is RAR, and it is also sometimes used for network data compression.

LZH

LZH, invented in 1987, stands for "Lempel-Ziv Huffman". It is a derivative of LZSS that uses Huffman coding to compress the pointers, giving a slightly better compression ratio. However, the improvement from Huffman coding the pointers is small and is often not worth the accompanying performance cost.

LZB

LZB, also invented in 1987, is likewise a derivative of LZSS. Like LZH, LZB tries to achieve better compression by encoding its pointers more efficiently, gradually increasing the size of the pointers as the sliding window grows. Its compression is indeed better than LZSS and LZH, but it is much slower than LZSS because of the additional encoding step.

ROLZ

ROLZ stands for "Reduced Offset Lempel-Ziv". Its goal is to improve LZ77's compression by restricting the offset size, which reduces the amount of data needed to encode each offset-length pair. This LZ77 derivative first appeared in 1991 in Ross Williams's LZRW4 algorithm; other implementations include BALZ, QUAD, and RZM. A highly optimized ROLZ can approach LZMA's compression ratio, but ROLZ has never become very popular.

LZP

LZP stands for "Lempel-Ziv + Prediction". It is a special case of ROLZ in which the offset is reduced to 1. Several derivative algorithms use different techniques either to speed up compression or to improve the compression ratio; LZW4, for example, uses an arithmetic encoder to obtain the best compression ratio at the cost of some speed.

LZRW1

Ross Williams invented this algorithm in 1991, introducing the concept of Reduced Offset Lempel-Ziv compression for the first time. LZRW1 achieves a high compression ratio while remaining fast and efficient. Williams later created several derivative algorithms that improve on LZRW1, such as LZRW1-A, 2, 3, 3-A, and 4.

LZJB

Jeff Bonwick released his Lempel-Ziv Jeff Bonwick algorithm in 1998 for use in the Z File System (ZFS) of the Solaris operating system. It is considered a derivative of the LZRW family, in particular LZRW1, and its goal is compression speed: because it runs inside an operating system, it must not let the compression algorithm turn disk operations into a bottleneck.

LZS

The Lempel-Ziv-Stac algorithm was developed by Stac Electronics in 1994 for disk compression software. It is a modified version of LZ77 that distinguishes literal symbols from offset-length pairs in its output and omits the deviating character. Functionally, LZS is similar to LZSS.

LZX

The LZX algorithm was invented by Jonathan Forbes and Tomi Poutanen in 1995 for the Amiga computer; the X in LZX has no particular meaning. Forbes sold the algorithm to Microsoft in 1996 and went to work there, where it was further optimized and used in Microsoft's cabinet (.CAB) format. Microsoft also uses LZX to compress Compiled HTML Help (CHM) files, Windows Imaging Format (WIM) files, and Xbox Live avatars.

LZO

LZO was invented by Markus Oberhumer in 1996. Its goal is fast compression and decompression: it allows adjustable compression levels, requires only 64 KB of additional memory at its highest level, and needs no extra memory for decompression beyond the input and output buffers. LZO is functionally similar to LZSS but is optimized for speed rather than compression ratio.

LZMA

The Lempel-Ziv Markov chain Algorithm (LZMA) was first published in 1998, together with the 7-Zip archiver. In most cases it performs better than bzip2, deflate, and other algorithms. LZMA chains several techniques: first, a modified version of LZ77 that parses the data at the bit level rather than the byte level; the output of that stage is then range coded (a form of arithmetic coding). Further techniques may be applied depending on the specific LZMA implementation. Thanks to its bitwise rather than bytewise operation, LZMA achieves a markedly better compression ratio than most other LZ derivatives.

LZMA2

LZMA2 is an incremental improvement on LZMA, first introduced in a 2009 update of the 7-Zip archiver. It improves multithreaded processing and handles incompressible data better, which yields slightly better compression.

Statistical Lempel-Ziv

Statistical Lempel-Ziv was proposed by Dr. Sam Kwong and Dr. Yu-Fan Ho in 2001. Its basic principle is to combine statistical analysis with an LZ77-derived algorithm so that the encodings stored in the dictionary can be further optimized.

Dictionary Algorithms

LZ78

LZ78 was invented by Lempel and Ziv in 1978. Instead of using a sliding window to generate the dictionary, the input data is either preprocessed to generate a dictionary or the dictionary is built up gradually as the file is parsed; LZ78 takes the latter approach. The dictionary size is usually limited to a few megabytes, or all codes are capped at a certain number of bits, such as 8, in order to reduce memory requirements. How a full dictionary is handled is precisely what distinguishes the various LZ78 derivatives.

While parsing the file, LZ78 adds each new character or string it encounters to the dictionary. For each symbol in the input, a dictionary entry of the form (dictionary index, unknown symbol) is produced. If a symbol is already in the dictionary, the dictionary is searched for substrings consisting of that symbol and the symbols that follow it; the index of the longest such match becomes the dictionary index, and the data at that index plus the unknown symbol becomes a new entry. If the current character is unknown, the dictionary index is set to 0, indicating a single-character pair. These pairs form a linked-list-like data structure.

An input such as "abbadabbaabaad" generates the output "(0,a)(0,b)(2,a)(0,d)(1,b)(3,a)(6,d)". You can see how the dictionary evolves in the sketch below:
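
    def lz78_encode(data: str):
        """Toy LZ78 encoder: emit (dictionary index, next symbol) pairs while
        growing the dictionary as the input is parsed. Illustrative only."""
        dictionary = {}          # phrase -> index (1-based; 0 means "no prefix")
        out, phrase = [], ""
        for ch in data:
            if phrase + ch in dictionary:
                phrase += ch                           # keep extending the current match
            else:
                out.append((dictionary.get(phrase, 0), ch))
                dictionary[phrase + ch] = len(dictionary) + 1
                phrase = ""
        if phrase:                                     # leftover match at end of input
            out.append((dictionary[phrase], ""))
        return out

    print(lz78_encode("abbadabbaabaad"))
    # -> [(0, 'a'), (0, 'b'), (2, 'a'), (0, 'd'), (1, 'b'), (3, 'a'), (6, 'd')]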


LZW

LZW, short for Lempel-Ziv-Welch, was invented by Terry Welch in 1984. It is the most widely used member of the LZ78 family, despite being heavily encumbered by patents. LZW improves on LZ78 in a way similar to LZSS: it removes redundancy from the output so that the pointers no longer carry an extra character. Before compression begins, every single character is placed in the dictionary, and various tricks further improve compression; for example, the last character of each new phrase is encoded as the first character of the next phrase. LZW is most commonly seen in the GIF image format, was used in the early ZIP format, and appears in several specialized applications. It is very fast, but its compression is mediocre compared with newer algorithms, some of which are both faster and better.
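
A toy LZW encoder along these lines, seeding the dictionary with the characters present in the input (a simplification; real LZW seeds all 256 byte values and bounds the code width):

    def lzw_encode(data: str):
        """Toy LZW encoder: the dictionary starts with single characters and
        the output is a list of dictionary codes with no explicit characters."""
        dictionary = {ch: i for i, ch in enumerate(sorted(set(data)))}
        out, phrase = [], ""
        for ch in data:
            if phrase + ch in dictionary:
                phrase += ch
            else:
                out.append(dictionary[phrase])
                dictionary[phrase + ch] = len(dictionary)   # learn the new phrase
                phrase = ch
        if phrase:
            out.append(dictionary[phrase])
        return out, dictionary

    codes, _ = lzw_encode("abababab")
    print(codes)   # -> [0, 1, 2, 4, 1]  (a=0, b=1; learned: ab=2, ba=3, aba=4, abab=5)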

LZC

LZC, short for Lempel-Ziv Compress, is the slightly modified version of LZW used in the UNIX compress utility. The main difference between LZC and LZW is that LZC monitors the compression ratio of its output: once the ratio crosses a critical threshold, the dictionary is discarded and rebuilt.

LZT

Lempel-Ziv-Tischer is a modified version of LZC in which, when the dictionary is full, the least recently used phrase is deleted and replaced by a new entry. There are some other minor improvements, but neither LZC nor LZT is in common use today.

LZMW

Invented by Victor Miller and Mark Wegman in 1984, LZMW is quite similar to LZT in that it also discards the least recently used phrases. However, instead of joining together similar entries in the dictionary, LZMW concatenates the last two encoded phrases and stores the result as a single new entry. As a result, the dictionary grows much faster and least-recently-used entries are evicted more often. LZMW compresses better than LZT, but it is another algorithm that is rarely seen today.

LZAP

LZAP was invented by James Storer in 1988 as a modification of LZMW. AP stands for "all prefixes": rather than storing a single new phrase per step, the dictionary stores every prefix combination. For example, if the previous phrase was "last" and the current phrase is "next", the dictionary stores "lastn", "lastne", "lastnex", and "lastnext".

LZWL

LZWL is a modified version of LZW, invented in 2006, that works with syllables rather than individual characters. It is designed for data sets with many frequently occurring syllables, such as XML, and is usually paired with a preprocessor that splits the input data into syllables.

LZJ

Matti Jakobsson published the LZJ algorithm in 1985; it is one of the few members of the LZ78 family that does not descend from LZW. LZJ fills its dictionary with pre-processed, encoded strings from the input, and when the dictionary is full it removes all entries that occurred only once.

Non-dictionary Algorithms

PPM

Prediction by partial matching (PPM) is a statistical modeling technique that uses a portion of the already-seen input to predict the upcoming symbols, thereby reducing the entropy of the output data. Unlike a dictionary algorithm, PPM predicts what the next symbol will be rather than looking up the next symbol for encoding. PPM is usually paired with an encoder, such as an arithmetic coder or an adaptive Huffman coder. PPM or one of its derivatives is implemented in many archive formats, including 7-Zip and RAR.

Bzip2

Bzip2 is an open-source implementation built around the Burrows-Wheeler transform. Its operating principle is simple, yet it strikes a good balance between compression ratio and speed, which has made it very popular in UNIX environments. The pipeline begins with a run-length encoder, followed by the Burrows-Wheeler transform, then a move-to-front transform that produces long runs of identical symbols for another run-length encoding pass; the final result is Huffman coded and wrapped with a header.
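
One piece of that pipeline, the move-to-front transform, is easy to sketch: each symbol is replaced by its index in a recency list and then moved to the front, so symbols that recur close together (as they do after a BWT) become runs of small numbers. Taking the alphabet from the data itself, as below, is a simplification:

    def move_to_front(data: str):
        """Move-to-front transform: output each symbol's index in a recency list,
        then move that symbol to the front. After a BWT this yields many zeros."""
        alphabet = sorted(set(data))                # simplified: alphabet taken from the data
        out = []
        for ch in data:
            idx = alphabet.index(ch)
            out.append(idx)
            alphabet.insert(0, alphabet.pop(idx))   # move the symbol to the front
        return out

    print(move_to_front("HHH&AAA"))   # -> [2, 0, 0, 1, 2, 0, 0]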

PAQ

PAQ was invented by Matt Mahoney in 2002 as an improvement on older PPM algorithms. The improvement is a technique called context mixing: instead of the single model used by PPM, multiple statistical models are combined intelligently to make a better prediction of the next symbol than any single model could. PAQ is one of the most promising approaches, with an excellent compression ratio and very active development; more than 20 derivatives have been created since its introduction, some of them repeatedly setting compression-ratio records. PAQ's biggest drawback is speed, a consequence of running multiple statistical models, but as hardware gets faster it may become a future standard. PAQ has been slow to appear in applications; a derivative called PAQ8O, which adds 64-bit support and large speed improvements, can be found in the Windows program PeaZip, while other PAQ implementations are command-line only.
