By Xu Jianlin
In the field of real-time multimedia, real-time visual and perceptual presentation will enjoy ever broader room for development, and the core technology underlying real-time video transmission is the H.264 coding standard. In this article the author tries to work through the two core questions of encoding and decoding: first, the encoding and decoding process, and second, the structure of the code stream, in the hope of providing some help to those studying this area.
I left YOLO early this year and joined a small company in the streaming-media field, where I am responsible for developing a video group chat SDK. YOLO is a live-streaming app, so I often joke that I have moved from the technology downstream (SDK user) to the technology upstream (SDK provider). But it is certainly not that simple. After a long period of thinking and discussion, I finally confirmed: the field of real-time multimedia, or more broadly, real-time visual and perceptual presentation, will see great demand and great challenges for a long time to come, so this will be the general direction of my long-term technical accumulation.
After settling on a general direction, you need to accumulate tirelessly. I have always stressed the importance of fundamentals, so recently I took time to study the basics of H.264, reading "New Generation Video Compression Coding Standard: H.264/AVC (2nd edition)" to figure out two questions: what is the encoding and decoding process of H.264, and what is the structure of the H.264 code stream?
Before this I had only fragmentary reading notes and did not dare to publish a post. In this article I strive to describe the above two core questions clearly according to my own understanding; space is limited, so details are not expanded. Interested readers can read the original book, and of course the most authoritative source is the H.264 spec itself.

Video Codec Basics

Why does video need to be encoded?
Because the amount of raw video data is too large.
Take raw RGB24 video at 30 fps as an example (width * height * bytes per pixel * bits per byte * frames per second):
640 * 480 * 3 * 8 * 30 = 210.94 Mbps
1280 * 720 * 3 * 8 * 30 = 632.81 Mbps
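As a quick sanity check, here is a tiny Python helper reproducing these numbers (the function name is purely illustrative):

```python
def raw_bitrate_mbps(width, height, bytes_per_pixel=3, fps=30):
    """Raw (uncompressed) bitrate in Mbps: width x height x bytes per
    pixel x 8 bits per byte x frames per second, divided by 1024**2."""
    return width * height * bytes_per_pixel * 8 * fps / (1024 ** 2)

print(round(raw_bitrate_mbps(640, 480), 2))   # 210.94
print(round(raw_bitrate_mbps(1280, 720), 2))  # 632.81
```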
This bitrate obviously cannot be used directly. Even with the more space-efficient YUV formats, the bitrate is still unacceptably high, whether for network transmission or disk storage, so video must be compressed by encoding.

Why can video be encoded?
Because there is redundancy in the video data.
The first is data redundancy: there is strong correlation between neighboring pixels within an image and between neighboring frames of a video. For example, for a white wall in a picture, the pixel values across each area are very close; in everyday shot video, the content is basically the same objects moving across different positions.
Second is visual redundancy: according to certain characteristics of the human eye, such as the luminance threshold, the visual threshold, and differing sensitivity to luminance and chrominance, an appropriate amount of error can be introduced without being noticed.

What are the main technologies of video coding?
The goal of video coding is to compress the data as much as possible while preserving video quality, so the main technologies of video coding are about eliminating redundancy and improving the compression ratio. Of course, given packet-switched networks and real-time multimedia application scenarios, video coding must also consider network adaptation and error resilience.
Note: the following paragraphs involve a lot of key technical terminology. Readers without the relevant background may have no concept of these terms and can read the Wikipedia entries to understand their specific meanings. Keywords: predictive coding, intra-frame prediction, inter-frame prediction, motion compensation, motion estimation, motion vector, transform coding, discrete cosine transform, quantization parameter, entropy coding, Huffman coding, arithmetic coding.
Predictive coding and motion compensation: predictive coding is designed to eliminate the data redundancy of video. After encoding, what is transmitted is not the actual sampled value of each pixel in the image, but the difference between the predicted value and the actual value. Predictive coding is divided into intra-frame prediction and inter-frame prediction, which eliminate intra-frame redundancy and inter-frame redundancy respectively. To improve efficiency and effectiveness, predictive coding operates on pixel blocks, not individual pixels. Intra-frame prediction predicts a block from its neighboring blocks; inter-frame prediction first searches neighboring frames for a block similar to the current block, obtains the spatial position offset, and then predicts from it. The process of finding the similar block (that is, the offset) is called motion estimation, the offset itself is called the motion vector, and this way of describing the difference between adjacent frames is called motion compensation.
Note: what is described here as "prediction" actually means the same thing as "reference": find a reference object and compute the difference from it.
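To make motion estimation concrete, here is a minimal sketch of exhaustive ("full search") block matching using the sum of absolute differences (SAD), written with numpy; real encoders use much faster search strategies and sub-pixel refinement, and all names here are illustrative:

```python
import numpy as np

def full_search(cur, ref, bx, by, bsize=16, radius=8):
    """Find the best match in `ref` for the bsize x bsize block whose
    top-left corner is (bx, by) in `cur`, within a +/- radius window.
    Returns the motion vector and the residual block."""
    h, w = ref.shape
    block = cur[by:by + bsize, bx:bx + bsize].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + bsize > w or y + bsize > h:
                continue  # candidate window falls outside the frame
            cand = ref[y:y + bsize, x:x + bsize].astype(np.int32)
            sad = np.abs(block - cand).sum()  # sum of absolute differences
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    dx, dy = best_mv
    pred = ref[by + dy:by + dy + bsize, bx + dx:bx + dx + bsize].astype(np.int32)
    return best_mv, block - pred  # only the vector and residual get coded
```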
Transform coding and quantization: the vast majority of images share a common feature: flat areas and slowly changing areas occupy a large part of the image, while detailed areas and abruptly changing areas occupy only a small part; in other words, the DC and low-frequency regions dominate while the high-frequency regions are a small portion. Therefore, it is more advantageous to compress the image after transforming it from the spatial (and temporal) domain into the frequency domain. This transformation process is called transform coding, and the most common transform is the discrete cosine transform (DCT). The transform coefficients are then mapped to smaller values; this process is called quantization.
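As an illustration of transform plus quantization (note that H.264 actually uses a 4x4 integer approximation of the DCT rather than a floating-point 8x8 DCT; this is only a sketch, assuming numpy and scipy are available):

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    # 2-D DCT-II with orthonormal scaling: rows, then columns
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def quantize(coeffs, qstep):
    # uniform quantization: larger qstep -> coarser, smaller values
    return np.round(coeffs / qstep).astype(np.int32)

block = np.arange(64, dtype=np.float64).reshape(8, 8)  # a smooth ramp
q = quantize(dct2(block - 128), qstep=16)
# for a smooth block, most high-frequency coefficients quantize to zero
print(np.count_nonzero(q), "non-zero coefficients out of 64")
```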
Entropy coding: coding that compresses the bitrate by exploiting the statistical characteristics of the source is called entropy coding, also called statistical coding. Assigning short codes to frequent symbols and long codes to rare symbols reduces the total number of bits. The entropy coding commonly used in video coding includes variable-length coding (Variable Length Coding, VLC, of which Huffman coding is the classic example) and arithmetic coding (Binary Arithmetic Coding, BAC).
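Huffman and arithmetic coding are well covered elsewhere; as one concrete taste of variable-length coding in H.264 itself, here is the unsigned Exp-Golomb code ue(v) that H.264 uses for many header syntax elements (a minimal sketch):

```python
def ue(v):
    """Unsigned Exp-Golomb code ue(v): M leading zeros followed by
    the (M + 1)-bit binary representation of v + 1."""
    bits = bin(v + 1)[2:]               # binary of v+1 without '0b'
    return '0' * (len(bits) - 1) + bits

for v in range(6):
    print(v, ue(v))
# 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101, 5 -> 00110
```

Frequent small values get short codes, matching the statistical idea above.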
The coding framework of predictive coding, transform coding, and entropy coding was already finalized in the late 1970s, and it is still in use today, all the way up to the H.266 work; for some forty years it has basically been new wine in old bottles, though of course the details keep being optimized.

The structure of the H.264 code stream
Let us first understand the code stream structure and the reasons for its design; with the code stream structure understood, the encoding and decoding process has something concrete to lean on. In fact, the specification likewise defines the code stream structure first and then the structure of the decoder (it makes no specific provisions for the encoder's structure or implementation), for the same reason.

Syntactic element layering
In the code stream output by the encoder, the basic unit of data is the syntactic element (which can be understood as each basic field of the bitstream structure). Syntax describes how the syntactic elements are organized, and semantics gives the specific meaning of each syntactic element. All video coding standards regulate the encoder workflow by defining syntax and semantics.
In H.264, the syntactic elements are organized into five layers: sequence, picture, slice, macroblock (MB), and sub-block, as shown in the following illustration:
Layering is useful for saving bits: information shared by a lower layer can be carried once in the layer above, rather than being duplicated in every lower-level structure. However, in the hierarchical structure of H.264 there is no strong dependency between the layers of data organization, which helps improve robustness. Packet-switched networks are prone to errors, and with strong dependencies, once the head is lost the data behind it becomes unusable.
Compared with previous standards, H.264 cancels the sequence layer and the picture layer (as explicit layers with headers; conceptually they still exist), pulling most of the syntactic elements that originally belonged to the sequence header and picture header out into a Sequence Parameter Set (SPS) and a Picture Parameter Set (PPS), while the remaining syntactic elements are put into the slice layer. A parameter set is an independent data unit that does not depend on syntactic elements outside itself; it can be transmitted separately and protected with emphasis.
How the sequence layer and picture layer are eliminated, and the relationships between the remaining layers, are shown in the following figure:
As we can see from the figure, a picture is made up of multiple slices; slice data references the PPS, the PPS references the SPS, and the PPS and SPS can be transmitted separately and protected with emphasis.
So what is the structure of the three layers of slice, macroblock, and sub-block data? Take a look at the following figure:
skip_run: when a picture uses inter-frame predictive coding, "skipped" blocks are allowed in flat areas of the image; a skipped block carries no data of its own, and the decoder recovers it from the data of the already-reconstructed macroblocks around it;
mb_type: the macroblock type, such as a macroblock of an I frame or of a P frame (note: for frame types you can search the related wiki entries; keywords: I-frame, P-frame, B-frame, SP-frame, SI-frame);
mb_pred and sub_mb_pred: the prediction information for the predictive coding process, such as how the macroblock is partitioned and the IDs of the reference macroblocks;
residual data (residual): the difference between the predicted block and the actual block data in the predictive coding process.
The macroblock is the basic unit of decoding: the decoder decodes according to the prediction information and the residual data.

Functional layering
In addition to the layering of syntactic elements, the functionality is also divided into two layers: the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL). VCL data is the output of the encoding process and is organized into the five-layer structure described above. Before being transmitted or stored, the VCL data is encapsulated into NAL units; each NAL unit consists of a Raw Byte Sequence Payload (RBSP, that is, the VCL data) and a header describing the RBSP.
When transmitting over a packet-switched network, each NAL unit is placed individually and completely into one packet, so no delimiter is needed between NAL units. For disk storage, however, the NAL units are stored consecutively and a start code must be introduced to separate them. The start code is the three-byte sequence 0x000001; if alignment is needed, several zero bytes can be padded before the start code.
To prevent collisions between encoded data and the start code, the following "emulation prevention" rules (in fact, escaping) are defined (the decoder treats 0x000000 as the end of a NAL unit, 0x000001 as the start of a NAL unit, 0x000003 as an escape, and 0x000002 is unused):
If the encoder detects one of these sequences in the encoded data, it inserts 0x03 before the last byte; if the decoder detects 0x000003, it discards the final 0x03. With this escaping rule, the decoder can take the data from a 0x000001 to the next 0x000000 as one NAL data unit.
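A minimal sketch of this escaping in Python (function names are illustrative; a real implementation applies this only to the RBSP part of the NAL unit):

```python
def escape(rbsp: bytes) -> bytes:
    """Insert emulation prevention bytes: 0x00 0x00 0x0X (X <= 3) in
    the payload becomes 0x00 0x00 0x03 followed by the original byte."""
    out, zeros = bytearray(), 0
    for b in rbsp:
        if zeros >= 2 and b <= 3:
            out.append(3)  # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

def unescape(ebsp: bytes) -> bytes:
    """Remove emulation prevention bytes: drop the 0x03 in 0x000003."""
    out, zeros = bytearray(), 0
    for b in ebsp:
        if zeros >= 2 and b == 3:
            zeros = 0      # skip the escape byte itself
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)

payload = b'\x00\x00\x01\x00\x00\x00'
assert unescape(escape(payload)) == payload
```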
The structure of the NAL unit is shown in the following figure:
The NAL unit types (nal_unit_type) are defined as follows:
From the definition of nal_unit_type we can see that the basic unit of encoded data transmission is the slice, and the slice contains the macroblocks and sub-macroblocks. In fact, intra-frame prediction is also limited to within a slice: different slices do not reference each other. This is done to limit the spread of errors when an error occurs.
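The NAL unit header is a single byte; a minimal sketch of parsing it (the field layout comes from the H.264 spec, the helper name is illustrative):

```python
def parse_nal_header(first_byte: int) -> dict:
    """H.264 NAL unit header: forbidden_zero_bit (1 bit),
    nal_ref_idc (2 bits), nal_unit_type (5 bits)."""
    return {
        'forbidden_zero_bit': first_byte >> 7,
        'nal_ref_idc': (first_byte >> 5) & 0x3,  # reference importance
        'nal_unit_type': first_byte & 0x1F,      # 5 = IDR slice, 7 = SPS, 8 = PPS
    }

print(parse_nal_header(0x67))  # 0x67 is the typical first byte of an SPS
```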
Here we can summarize: the basic unit of code-stream transmission is the NAL unit, and the most critical data carried in NAL units are the parameter sets and the slice data; the basic unit of decoding is the macroblock, and the decoder decodes the original data from the prediction information and residual data; decoded macroblocks are assembled into slices, slices are stitched together into pictures, and a sequence of pictures constitutes a video.
Here I would like to dwell on this last layer of stitching relationships: how is everything seamlessly linked up?
If we were designing the scheme, we might number the pictures so they can be played in order; number the slices so they can be stitched into a picture; and number the macroblocks so they can be stitched into a slice by number.
In fact, the H.264 solution is as follows: each picture has a display order (Picture Order Count, POC) and a decoding order (frame_num); because inter-frame prediction includes bidirectional prediction, the decoding order may differ from the display order. Macroblocks are not numbered, because all the macroblocks of one slice are in one NAL unit and are arranged in order, so no extra number is needed. Slices are not numbered either, but the slice header carries the position of its first macroblock within the picture (first_mb_in_slice), so we know where the slice should be placed in the picture, which has the same effect as a number. As for the whole video, the global information of each picture, such as its width and height, is described by the relevant fields of the SPS and PPS.

Specific syntax and semantics
Originally I wanted to walk briefly through the syntactic elements of each layer (and actually did so), but it could not possibly cover the details, and a bare list of syntactic elements reads like padding, so I simply deleted it all. Interested readers are strongly recommended to read the original book, or the H.264 spec; and for those whose work involves video codecs, you must know the syntax and semantics by heart.
I feel the spec's way of describing the syntax is very ingenious: defining the data format in the form of decoder pseudo-code kills two birds with one stone. The syntax of H.264 is carefully designed, and the syntactic elements that make it up are both interdependent and independent: dependency reduces redundant information and improves coding efficiency, while independence makes communication more robust by limiting the spread of errors when they occur.

The encoding process of H.264
The specification does not constrain the structure or implementation of the encoder; as long as the code stream it produces conforms to the specification, the encoding process can be implemented very flexibly.
But its basic structure is the framework we mentioned at the beginning: predictive coding, transform coding, and entropy coding.
The basic structure of the encoder is shown in the following figure:
The part with the most complexity and room for extension is the predictive coding process, and within predictive coding the most important, and also the most computation-hungry, part is the search process of motion estimation.
In addition, regardless of the encoder's structure, the corresponding video coding control is the core problem of any encoder implementation. During encoding there is no way to directly control the size of the encoded data; it can only be controlled indirectly by adjusting the quantization parameter (QP) of the quantization step, and since there is no deterministic relationship between QP and encoded data size, the encoder's rate control cannot be very precise and basically relies on trial: either change the quality of subsequent macroblocks midway, or re-encode to change the quality of all macroblocks.
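A caricature of such trial-based rate control, just to show the feedback idea (encode_frame is an assumed callback returning encoded bytes; H.264 QP ranges over 0 to 51):

```python
def encode_with_rate_control(frames, target_bits_per_frame, encode_frame, qp=26):
    """Nudge QP after observing each frame's actual size, since QP
    only indirectly controls the encoded data size."""
    out = []
    for frame in frames:
        data = encode_frame(frame, qp)
        out.append(data)
        if len(data) * 8 > target_bits_per_frame and qp < 51:
            qp += 1  # too big: quantize more coarsely next frame
        elif len(data) * 8 < target_bits_per_frame and qp > 0:
            qp -= 1  # too small: spend the spare bits on quality
    return out
```

The decoding process of H.264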
The decoding process is the inverse of encoding: entropy decoding, transform decoding, and predictive decoding.
The specification does define the structure of the decoder, so we can summarize the decoding process in more detail: taking the macroblock as the unit, perform entropy decoding, inverse quantization, and inverse transform in turn to obtain the residual data; then, using the prediction information inside the macroblock, find the already-decoded reference block, and combine the reference block with the residual data to obtain the actual data of this block. After the macroblocks are decoded, they are assembled into slices, and the slices into pictures.
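Mirroring the encoding sketches above, a toy version of the per-block reconstruction step (in real H.264, entropy decoding comes first and the transform is the 4x4 integer transform; this sketch only shows dequantize, inverse transform, add prediction):

```python
import numpy as np
from scipy.fftpack import idct

def idct2(coeffs):
    # inverse 2-D DCT with orthonormal scaling, undoing dct2 above
    return idct(idct(coeffs, axis=1, norm='ortho'), axis=0, norm='ortho')

def decode_block(q_coeffs, qstep, prediction):
    """Reconstruct one block: dequantize, inverse-transform to get the
    residual, then add the prediction (intra neighbours or a
    motion-compensated reference block)."""
    residual = idct2(q_coeffs.astype(np.float64) * qstep)
    return np.clip(prediction + residual, 0, 255)
```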
The basic structure of the decoder is shown in the following figure:
Scalable coding in H.264
Scalable Video Coding (SVC) essentially decomposes the video information by importance and encodes each decomposed part according to its own statistical characteristics. Typically it encodes the video into one base layer and a set of enhancement layers. The base layer contains the basic information and can be decoded independently; the enhancement layers depend on the base layer and refine its information; the more enhancement layers, the higher the quality of the recovered video.
SVC usually comes in three kinds:
spatial scalability: video can be decoded at multiple resolutions;
temporal scalability: video can be decoded at multiple frame rates, at the same resolution (see the sketch below);
quality scalability: video can be decoded at multiple bitrates, at the same resolution and frame rate.
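To make temporal scalability concrete, here is a toy dyadic temporal-layer assignment (illustrative only; real SVC signals layer ids in the bitstream):

```python
def temporal_layer(frame_index: int, num_layers: int = 3) -> int:
    """Dyadic hierarchy: dropping the highest layer halves the frame
    rate, dropping the top two layers quarters it, and so on."""
    for layer in range(num_layers - 1):
        if frame_index % (2 ** (num_layers - 1 - layer)) == 0:
            return layer
    return num_layers - 1

frames = list(range(8))
print([temporal_layer(i) for i in frames])            # [0, 2, 1, 2, 0, 2, 1, 2]
print([i for i in frames if temporal_layer(i) <= 1])  # [0, 2, 4, 6]: half rate
```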
The implementation details of SVC are not expanded here; interested readers can consult the relevant materials.

Summary
In this article I tried to answer the two most important questions about H.264: what is the H.264 encoding and decoding process, and what is the structure of the H.264 code stream?
Limited by space, this article cannot describe every concept involved clearly; readers without the relevant background will need to consult a lot of professional material, while readers with the relevant background may not need such a summary article at all, so this article is most meaningful as a way for me to organize my own thoughts. Please understand.
Finally, in the current AI wave, video codecs will certainly be combined with AI. In the encoding and decoding process, I think AI can play a big role in at least the following links: in motion estimation, the choice of search strategy; in adaptive block partitioning, where AI could preprocess the image and analyze the distribution of image detail; and in coding control, choosing the coding strategy based on scene and content.