Managing Gigabytes - Text Compression

Text compression methods can be classified into two categories: symbolwise methods and dictionary methods. The two are described in turn below:

1) Symbolwise method
The usual way of encoding text gives every character the same number of bits, as ASCII does: each character is encoded in 8 bits.
To compress the data we have to represent characters with fewer bits on average. Obviously, we should spend only a few bits on high-probability characters and longer codes on low-probability characters, so that the text is compressed on average.

There are two problems here:
A) How to determine the probability of each character to be coded - this is the modelling problem.
A probability model is simply the probability distribution supplied to the coder. The encoder and the decoder must use the same model, otherwise decoding goes wrong.
There is a lower bound on the number of bits needed to code a symbol s, given by its information content I(s):
I(s) = -log2 Pr[s] (where Pr[s] is the probability that s occurs)
So the outcome of a fair coin toss needs at least -log2(1/2) = 1 bit to encode.
The average information content per symbol of the alphabet is called the entropy: H = sum(Pr[s] * I(s)).
Provided the symbols really do occur with the assumed probabilities, H is the lower bound on compression - a theoretical limit that in practice can only be approached, not beaten.
This is Claude Shannon's famous source coding theorem.
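
To make the two formulas concrete, here is a tiny Python sketch (my own illustration, using base-2 logarithms so the results are in bits):

import math

def self_information(p):
    # I(s) = -log2 Pr[s], the information content of a symbol of probability p
    return -math.log2(p)

def entropy(probs):
    # H = sum(Pr[s] * I(s)), the average information content per symbol
    return sum(p * self_information(p) for p in probs if p > 0)

print(self_information(0.5))        # a fair coin toss: exactly 1 bit
print(entropy([0.9, 0.05, 0.05]))   # a skewed alphabet: about 0.57 bits/symbol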

Back to models. They can be divided into zero-order models, which treat each character as an independent symbol, and order-m finite-context models, which condition each prediction on the m preceding symbols.
They can also be divided into static, semi-static, and adaptive models.
A static model is fixed in advance and takes no account of the text being coded; it suits applications where the texts are all of a similar kind.
A semi-static model first builds a model of the file to be compressed and transmits it to the decompressor. This requires two passes over the file, so it is clearly not ideal.
Adaptive models are therefore more popular. Adaptive modelling starts from a bland initial probability distribution and keeps adjusting it as more symbols are seen.
An adaptive model must also deal with the zero-frequency problem. There are many solutions; the simplest is to pretend that every character has already appeared once before coding starts.
The drawback of this kind of model is that it does not allow random access to the text: decoding must start from the beginning, because the decoder's context must be kept identical to the encoder's.
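
As an illustration (the class name and interface are my own, not from the book), here is a minimal adaptive zero-order model with the "every character starts with a count of one" fix; note that it produces exactly the probabilities used in the arithmetic coding example further down:

class AdaptiveZeroOrderModel:
    # Every symbol starts with a count of 1 (the simplest zero-frequency fix);
    # counts grow as symbols are seen. Encoder and decoder must perform the
    # same updates in the same order to stay in sync.
    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}
        self.total = len(alphabet)

    def prob(self, symbol):
        return self.counts[symbol] / self.total

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

model = AdaptiveZeroOrderModel("abc")
for ch in "bccb":
    print(ch, model.prob(ch))   # 1/3, 1/4, 2/5, 1/3
    model.update(ch)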

There is also plenty of research on how to build symbolwise probability models. Prediction by partial matching (PPM), for example, predicts each character from the characters that precede it, and dynamic Markov compression is based on finite state machines.
These models are not described in detail here; if you are interested, refer to the relevant literature.

B) Given the probabilities of the characters to be coded, how to assign suitable codes to them - this is the coding problem.

Huffman Coding
To introduce coding algorithms, let's first look at Huffman coding. Huffman devised this algorithm as a graduate student, in a term paper that let him skip the course's final exam. It is simple and effective, and once again it makes me admire and envy the American educational environment.
The basic idea of this kind of algorithm is to give fewer bits to characters with higher probability.
That raises a problem: if different characters have codes of different lengths, how do we know where the next character's code ends? It certainly cannot be done by prefixing each codeword with its length.
Huffman's way of avoiding ambiguity is to use a prefix code: no codeword is the prefix of any other.
So how are the codes generated, and how are they decoded?
The Huffman tree is built from the bottom up, by repeatedly merging the two least probable nodes, and decoding walks it from the root down; the construction is very easy to describe.
Once again I marvel at how clever and simple the idea is.
For a static probability distribution the Huffman algorithm is fine, but with an adaptive model it costs a great deal of memory or time,
because an adaptive model uses many different probability distributions at the same time, depending on the context of the text being coded, and that means maintaining many Huffman trees at once.
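
For the curious, here is a toy bottom-up construction in Python (a heap-based sketch of the idea, not the book's algorithm verbatim): repeatedly merge the two least probable nodes, prefixing 0 to the codewords on one side and 1 on the other:

import heapq

def huffman_code(freqs):
    # Each heap entry is (probability, tie-breaker, {symbol: codeword-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least probable subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))
# e.g. {'a': '0', 'b': '10', 'd': '110', 'c': '111'}: frequent symbols get
# short codes, and no codeword is a prefix of another.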

 

Arithmetic Coding

Arithmetic coding is therefore more popular for adaptive models.
Arithmetic coding is more complex, but its big advantage is that it can spend less than one bit on a high-probability character, whereas Huffman coding must use at least one bit per character.
Arithmetic coding also copes naturally with changing probability distributions, which is why it suits adaptive models.
For static models, however, Huffman coding is still much faster.

The basic idea of arithmetic coding is that the coder keeps a probability interval that shrinks as each character appears: a high-probability character shrinks the interval only a little, while a low-probability character shrinks it to a much smaller "next" interval.
The encoder outputs nothing until coding is finished; the code is then any value chosen from the final interval. The narrower the interval, the more bits are needed to pin that value down.
For example:
Encode bccb over an alphabet of only three characters, a, b, and c.
To deal with the zero-frequency problem, we start by assuming that a, b, and c have each occurred once.
Before encoding, the interval is [0, 1]. Each of a, b, and c has probability 1/3, so each occupies a third of the interval; for example, b occupies [0.333, 0.667].
Start encoding ......
The first character is b, so the interval shrinks to [0.333, 0.667].
The counts of a : b : c are now 1 : 2 : 1, so within this interval a = [0.333, 0.417], b = [0.417, 0.583], c = [0.583, 0.667].
The second character is c, so the interval shrinks to [0.583, 0.667].
The counts of a : b : c are now 1 : 2 : 2, so a = [0.583, 0.600], b = [0.600, 0.633], c = [0.633, 0.667].
The third character is c, so the interval shrinks to [0.633, 0.667].
The counts of a : b : c are now 1 : 2 : 3, so a = [0.633, 0.639], b = [0.639, 0.650], c = [0.650, 0.667].
The fourth character is b, so the final interval is [0.639, 0.650].
Any value in that interval, say 0.64, can therefore serve as the code for bccb.
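
The same walk-through written out as a small Python sketch (my own illustration; exact fractions are used so the printed intervals match the numbers above):

from fractions import Fraction

def narrow_interval(text, alphabet="abc"):
    # Adaptive counts, all starting at 1 (the zero-frequency fix); each
    # character narrows the current interval in proportion to its count.
    counts = {s: 1 for s in alphabet}
    low, high = Fraction(0), Fraction(1)
    for ch in text:
        total = sum(counts.values())
        width = high - low
        before = sum(counts[s] for s in alphabet if s < ch)   # cumulative count
        low, high = (low + width * Fraction(before, total),
                     low + width * Fraction(before + counts[ch], total))
        counts[ch] += 1
        print(ch, float(low), float(high))
    return low, high

narrow_interval("bccb")
# b 0.333... 0.667...
# c 0.583... 0.667...
# c 0.633... 0.667...
# b 0.639... 0.650   -> any value in here, e.g. 0.64, encodes bccb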

Decoding simply replays the encoding process.
Before decoding, the interval is [0, 1]; a, b, and c each have probability 1/3, so b occupies [0.333, 0.667].
0.64 falls in b's interval, so the first character is b.
The interval shrinks to [0.333, 0.667]; the counts of a : b : c are 1 : 2 : 1, so a = [0.333, 0.417], b = [0.417, 0.583], c = [0.583, 0.667].
0.64 falls in c's interval, so the second character is c.
...... and so on until decoding is complete.

The number of bits needed to pin down a value inside an interval is proportional to the negative logarithm of the interval's length, and the final interval's length is the product of the probabilities of the symbols that were encoded (which is obvious from the construction).
Since log(ab) = log(a) + log(b), the logarithm of the final interval length equals the sum of the logarithms of the individual symbol probabilities. So a symbol s with probability Pr[s] contributes -log Pr[s] bits to the output, which is exactly its information content.
The number of bits arithmetic coding outputs is therefore close to the optimum.
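
A quick check on the example above (my own arithmetic): the probabilities the adaptive model actually assigned to b, c, c, b were 1/3, 1/4, 2/5, and 1/3, and their product is indeed the final interval length:

import math
from fractions import Fraction

probs = [Fraction(1, 3), Fraction(1, 4), Fraction(2, 5), Fraction(1, 3)]
print(math.prod(probs))                    # 1/90, about 0.0111 = 0.650 - 0.639
print(sum(-math.log2(p) for p in probs))   # about 6.49 bits, the sum of -log2 Pr[s]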

2) Dictionary method
The dictionary method is easy to understand: replace pieces of the original text with codewords from a dictionary. If the codewords take fewer bits than the text they stand for, you get compression; and if the dictionary is kept private, you even get a form of encryption.
To achieve real compression the dictionary entries must be words or phrases; a dictionary of single characters is certainly not effective. A static dictionary cannot suit all texts either, because different domains use different words and phrases.
A semi-static scheme - generating a fresh dictionary for each text to be compressed and transmitting it - is again not good.
Hence the adaptive dictionary schemes; the real technical content lies in how to build the dictionary adaptively.
Of course there are clever people: Ziv and Lempel invented the LZ77 and LZ78 methods.
As usual with ideas from clever people, the methods are simple and easy to use. The basic principle is that a substring of the text is replaced by a pointer to an earlier occurrence of it.
To put it bluntly, the dictionary codebook is the text itself and the codewords are pointers. Personally I suspect that when they invented the method the authors were not thinking of dictionaries at all; calling it a dictionary-based method is probably a later generalization by other scholars.
Let's use LZ77 to understand this approach.
The output of LZ77 is a series of triples (a, b, c): a says how far back to trace, b is the length of the matched phrase, and c is the next input character.
Why include the third component? It is there so that characters that have never appeared before can still be introduced.
Take abaabab as an example. The output is:
(0, 0, a) (0, 0, b) (2, 1, a) (3, 2, b)
The method needs to find the longest match in the preceding window of text. Linear search is too slow for this, so it is usually implemented with a hash table or a binary search tree.
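
Here is a deliberately naive encoder (linear search, purely for illustration; as just noted, real implementations use hashing or search trees) that reproduces the triples for abaabab:

def lz77_encode(text, window=4096):
    # Emit (distance back, match length, next character) triples.
    out, i = [], 0
    while i < len(text):
        best_dist, best_len = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(text) - 1 and
                   text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_dist, best_len = i - j, length
        out.append((best_dist, best_len, text[i + best_len]))
        i += best_len + 1
    return out

print(lz77_encode("abaabab"))
# [(0, 0, 'a'), (0, 0, 'b'), (2, 1, 'a'), (3, 2, 'b')]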

The famous gzip is a variant of this algorithm.
Gzip uses a hash table to locate earlier occurrences of a string, with three consecutive characters as the hash key; a linked chain records the positions in the window where those three characters occur.
For efficiency the length of the chain is limited; there is in fact little point in remembering distant positions - remembering the nearby ones is reasonable.
An interesting detail of gzip is that the pointer offsets are themselves Huffman coded, so the common offsets get shorter codes. The match length values are Huffman coded as well.
If no earlier match is found, a raw character is emitted instead. Interestingly, raw characters and match length values share a single code.
That is, the second item may be either a match length or a raw character; since the code is a prefix code there is no ambiguity, and sharing the code helps compression too.
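
To illustrate the hash-chain idea (this is just a sketch of the data structure, not gzip's actual code; the three-character key and the chain limit mirror the description above):

from collections import defaultdict

def find_matches(text, max_chain=32):
    # Map each 3-character string to the chain of positions where it occurred;
    # only the most recent positions are considered worth searching.
    chains = defaultdict(list)
    for i in range(len(text) - 2):
        key = text[i:i + 3]
        candidates = chains[key][-max_chain:]   # nearby occurrences only
        if candidates:
            j = candidates[-1]                  # most recent occurrence
            length = 3
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            print(f"pos {i}: match of length {length} at distance {i - j}")
        chains[key].append(i)

find_matches("the cat sat on the mat")
# pos 9: match of length 3 at distance 4    ('at ' repeats)
# pos 15: match of length 4 at distance 15  ('the ' repeats)
# ...
# (A real encoder would then emit a pointer and skip over the matched text.)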
