A Beginner's Data Compression Tutorial - Chapter 2, Technical Preparation: Probability, Models, and Encoding


What is entropy?

Data compression has its roots in the information theory pioneered by Claude Shannon in the 1940s, and its basic question is how small a piece of information can be made. To this day the answer is governed by a theorem of information theory. The theorem borrows the term "entropy" from thermodynamics to denote the actual amount of information that a message must encode:

Suppose we encode a message containing n kinds of symbols using binary digits 0 and 1. If the symbol Fn occurs in the message with probability Pn, then the entropy of that symbol, i.e., the number of bits needed to represent one occurrence of it, is:

En = -log2(Pn)

The entropy of the entire message, i.e., the number of bits needed to represent the whole message, is the sum over all symbol occurrences: E = Σ En

For example, the following string contains only the three characters a, b, and c:

aabbaccbaa

The string length is 10; the characters a, b, and c appear 5, 3, and 2 times respectively, so their probabilities of occurrence are 0.5, 0.3, and 0.2, and their entropies are:

Ea = -log2(0.5) = 1
Eb = -log2(0.3) = 1.737
Ec = -log2(0.2) = 2.322

The entropy of the entire message, i.e., the number of bits needed to express the whole string, is:

E = Ea * 5 + Eb * 3 + Ec * 2 = 14.855 bits

Recall that with the ASCII code commonly used in computers, the string above would need a full 80 bits! Now we can see why information can be compressed without losing any of it: in short, use fewer bits to represent the symbols that appear more often. That is the basic principle of data compression.
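To make the arithmetic concrete, here is a small C++ sketch (the string and the output format are just for illustration) that counts the characters of the example string and reproduces the 14.855-bit figure:

#include <cmath>
#include <cstdio>
#include <map>
#include <string>

int main() {
    std::string s = "aabbaccbaa";
    std::map<char, int> count;
    for (char ch : s) ++count[ch];                // count each character
    double total = 0.0;
    for (const auto& kv : count) {
        double p = (double)kv.second / s.size();  // probability of this character
        double e = -std::log2(p);                 // entropy: bits per occurrence
        std::printf("E%c = -log2(%.1f) = %.3f\n", kv.first, p, e);
        total += e * kv.second;                   // weight by number of occurrences
    }
    std::printf("E = %.3f bits (plain ASCII: %zu bits)\n", total, s.size() * 8);
    return 0;
}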

Careful readers will immediately ask: how can fractional numbers of bits, like the 1.737 bits above, be represented with whole binary digits 0 and 1? It is indeed difficult, but not hopeless; once we find a way to represent those fractional bits accurately, we can push toward the limit of lossless compression. Don't worry, we will get there in Chapter 4.

Model

From the description above, we can see that to compress a message we must first analyze the probability of each symbol in it. Different compression programs use different methods to determine these probabilities; the more accurate the probability estimates, the better the compression. In a compression program, the module that processes the input, computes symbol probabilities, and decides which code or codes to output is called the model.

Does estimating the probability of each character really require a variety of models? Wasn't it easy to find the probability of each character in the string above? Yes, but only because that string is just 10 characters long, and it was only an example. Consider the files we actually need to compress: most are tens or even hundreds of kilobytes long, and files of several megabytes are hardly uncommon.

Certainly, we can scan all the characters in a file in advance and count how often each one occurs. In compression terminology this is called a "static statistical model". However, different files have different character distributions, so we must either spend a great deal of time gathering statistics over every file we want to compress, or save a separate probability table with each file for use during decompression. Worse still, not only does scanning a whole file take a long time, but saving the probability table also enlarges the compressed file. For these reasons the static statistical model is rarely used in practice.

Most real compression programs use an "adaptive model". An adaptive model can be thought of as a machine that learns: before any input arrives it knows nothing about the data and treats every character as equally probable. As characters are read and encoded, it counts and records their occurrences and applies those statistics to the encoding of later characters. In other words, an adaptive model compresses poorly at the start, but as compression proceeds it approaches the true character probabilities more and more closely and reaches the expected compression ratio. An adaptive model adjusts itself to the character distribution of each file, with no need to save a probability table.
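As a minimal sketch of the idea (not taken from any particular compressor), an order-0 adaptive model might look like the following: every byte value starts with a count of 1, so all probabilities begin equal, and each symbol that passes through updates the statistics used for the symbols after it. A real program would feed these probabilities to an entropy coder rather than print them.

#include <cstdio>

struct AdaptiveModel {
    int count[256];
    int total;
    AdaptiveModel() : total(256) {
        for (int i = 0; i < 256; ++i) count[i] = 1;  // no input yet: all equal
    }
    double probability(unsigned char c) const { return (double)count[c] / total; }
    void update(unsigned char c) { ++count[c]; ++total; }  // learn from this symbol
};

int main() {
    AdaptiveModel model;
    const char* text = "aabbaccbaa";
    for (const char* p = text; *p; ++p) {
        // the probability used for coding is the one known BEFORE this symbol
        std::printf("'%c': p = %.4f\n", *p, model.probability((unsigned char)*p));
        model.update((unsigned char)*p);
    }
    return 0;
}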

The models described so far can collectively be called "statistical models", because they derive character probabilities from counts of how often each character occurs. Another class of model is the "dictionary model". When, in daily life, we mention "ICBC", everyone knows it stands for the Industrial and Commercial Bank of China. There are many similar examples, but they all presuppose that speaker and listener share the same table of abbreviations. A dictionary model does not compute character probabilities directly; it uses a dictionary instead. As the input is read, the model finds the longest string in the dictionary that matches the input, then outputs that string's index in the dictionary. The longer the matches, the better the compression. Fundamentally, the dictionary model still rests on character probabilities; it merely replaces the counting of individual characters with the matching of whole strings. It can be shown that the compression achieved by a dictionary model still cannot exceed the entropy limit.

Of course, for an ordinary compression program the space needed to store a large dictionary is still intolerable, and no predefined dictionary can adapt to the varying data in different files. Fortunately, the dictionary model also has an "adaptive" variant: as the input streams in, we build the dictionary from the input itself and keep updating it to track the changing data.
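A well-known realization of this adaptive-dictionary idea is the LZ78 algorithm; the chapter above does not prescribe a particular algorithm, so the following LZ78-style sketch is only one illustration. It starts with an empty dictionary, extends each match as far as possible, and outputs (index of longest match, next character) pairs while growing the dictionary from the input itself:

#include <cstdio>
#include <map>
#include <string>

int main() {
    std::string input = "aabbaccbaa";
    std::map<std::string, int> dict;   // phrase -> dictionary index
    std::string phrase;                // longest match found so far
    int next_index = 1;
    for (char c : input) {
        std::string candidate = phrase + c;
        if (dict.count(candidate)) {
            phrase = candidate;        // keep extending the match
        } else {
            int index = dict.count(phrase) ? dict[phrase] : 0;  // 0 = empty match
            std::printf("(%d, %c)\n", index, c);
            dict[candidate] = next_index++;  // the dictionary adapts to the input
            phrase.clear();
        }
    }
    if (!phrase.empty())
        std::printf("(%d, -)\n", dict[phrase]);  // trailing match, no new character
    return 0;
}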

Let's look at the adaptive model from another angle. Claude Shannon once used a "party game" to estimate the true information content of English text. He revealed a message, character by character, to listeners who had not seen it, asking them each time to guess what the next character would be; from the numbers of guesses, Shannon then estimated the entropy of the whole message. In this experiment, each listener carries in mind a model that predicts the next character from the preceding ones, a model even more sophisticated than the adaptive models used in computers, because besides counting character occurrences, people can also bring their experience of the language to bear.

Encoding

Through the model we have determined how many bits each symbol should take. The problem now is to design an encoding scheme that represents each symbol in as close to that computed number of bits as possible.

The first problem to consider: if a is represented with three binary bits and b with four, then during decoding, faced with a continuous stream of bits, how do we know which three bits are an a and which four are a b? We therefore need an encoding method that lets the decoder easily separate each character's code from the stream. This is the technique called "prefix coding". Its guiding idea is that no character's code may be the prefix of another character's code; in other words, no character's code is obtained by taking another character's code and appending some number of 0s or 1s. Let's look at the simplest example of a prefix code:

Symbol   Code
a        0
b        10
c        110
d        1110
e        11110
With the code table above, you can easily pick the original information out of the following bit stream:

1110010101110110111100010 - dabbdceaab
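A short C++ sketch shows why decoding is unambiguous: since no code is a prefix of another, the decoder just accumulates bits until they match exactly one codeword, emits that symbol, and starts over.

#include <cstdio>
#include <string>

int main() {
    const char* codes[5] = { "0", "10", "110", "1110", "11110" };  // a..e
    const char* stream = "1110010101110110111100010";
    std::string bits;
    for (const char* p = stream; *p; ++p) {
        bits += *p;                        // accumulate bits read so far
        for (int i = 0; i < 5; ++i) {
            if (bits == codes[i]) {        // at most one codeword can match
                std::putchar('a' + i);
                bits.clear();
                break;
            }
        }
    }
    std::putchar('\n');                    // prints: dabbdceaab
    return 0;
}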

The next question: a prefix code like the one above can only assign a whole number of bits to each symbol, so it can only approximate the fractional bit counts computed by the model by rounding to integers. How can we output fractional numbers of bits? Scientists solved this problem with arithmetic coding, which we will discuss in detail in Chapter 4.

To sum up

Different models use different methods to compute character probabilities, and from them the characters' entropies; different encoding methods then try to approach those entropy values. Compression efficiency therefore depends first on whether the model can obtain accurate character probabilities, and second on whether the encoding can output each character's code in close to the expected number of bits. In short: compression = model + encoding. As shown below:

+-------+  symbol   +-------+  probability  +----------+  code   +--------+
| input | --------> | model | ------------> | encoding | ------> | output |
+-------+           +-------+               +----------+         +--------+

Resources

As we already know, a compression program usually does not process data whole bytes at a time; it reads, writes, and processes data in individual bits, so bit-manipulation routines are among the most common utility functions in compression programs. We provide two sets of functions for performing bit-level operations efficiently, one on files and one in memory. There are six files in all:

bitio.h      - declarations of the bit-level file I/O functions.
bitio.cpp    - implementation of the bit-level file I/O functions.
errhand.h, errhand.cpp - error-handling functions used by bitio.cpp.
wm_bitio.h   - declarations of the bit-level in-memory functions.
wm_bitio.cpp - implementation of the bit-level in-memory functions.

They are packaged together in the file bitio.zip.
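The real interfaces are declared in the headers above. Purely as a rough illustration of why such helpers exist (the names below are hypothetical, not the ones in bitio.h), a bit writer accumulates bits in a one-byte buffer and flushes the buffer to the file whenever it fills:

#include <cstdio>

struct BitWriter {                 // hypothetical sketch, not the bitio.h interface
    FILE* fp;
    unsigned char buffer;          // bits accumulated so far
    int nbits;                     // how many bits are in the buffer
    explicit BitWriter(FILE* f) : fp(f), buffer(0), nbits(0) {}
    void put_bit(int bit) {
        buffer = (unsigned char)((buffer << 1) | (bit & 1));
        if (++nbits == 8) {        // a full byte: write it out
            std::fputc(buffer, fp);
            buffer = 0;
            nbits = 0;
        }
    }
    void flush() {                 // pad the final partial byte with zeros
        while (nbits != 0) put_bit(0);
    }
};

int main() {
    BitWriter w(stdout);
    const char* stream = "1110010101110110111100010";  // 25 bits -> 4 bytes out
    for (const char* p = stream; *p; ++p) w.put_bit(*p - '0');
    w.flush();
    return 0;
}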
