I. Video Information and Signal Features
1.1 Intuitiveness
Video information is obtained directly through the human visual system.
1.2 Certainty
Video information is concrete and not easily confused with other content.
1.3 Efficiency
The human visual system perceives the pixels of an image in parallel, so obtaining information from video is highly efficient.
1.4 Breadth
The visual system accounts for about 70% of the information a person receives from the outside world.
1.5 High Bandwidth of Video Signals
Video carries a large amount of changing information, so transmitting it requires a relatively large network bandwidth.
II. Requirements and Possibilities of Video Compression
2.1 Goals of Video Compression and Encoding
Because video carries a large amount of information and demands high transmission bandwidth, the video source must first be compressed before transmission in order to save bandwidth and storage space.
(1) The video must be compressed within a given bandwidth, and a sufficient compression ratio must be ensured.
(2) After compression, the restored video must retain acceptable quality.
(3) The video encoder should be simple to implement, low in cost, and highly reliable.
2.2 Why Video Can Be Compressed
(1) Temporal correlation
In a video sequence, two adjacent frames differ very little; this is temporal correlation.
(2) Spatial correlation
Within the same frame, adjacent pixels are strongly correlated; the closer two pixels are, the stronger the correlation.
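The two correlations above can be made concrete with a tiny numerical sketch (illustrative only, not part of any codec): the sample values and frames below are invented for the example.

```python
# Illustrative sketch: measuring the two correlations that make video
# compressible, on tiny synthetic one-dimensional "frames".

def mad(a, b):
    """Mean absolute difference between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Two "adjacent frames": the second is the first with a small change.
frame1 = [100, 102, 101, 103, 104, 105, 104, 106]
frame2 = [100, 102, 101, 104, 105, 105, 104, 106]

# Temporal correlation: adjacent frames differ very little,
# so the inter-frame residual is small.
temporal_residual = mad(frame1, frame2)

# Spatial correlation: neighboring pixels within one frame are close,
# so predicting each pixel from its left neighbor leaves a small residual.
spatial_residual = mad(frame1[1:], frame1[:-1])

print(temporal_residual)  # small: most samples repeat between frames
print(spatial_residual)   # small: neighbors are similar within a frame
```

Both residuals are far smaller than the raw sample values (around 100), which is exactly the redundancy that predictive coding removes.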
III. Video Encoding Technology
3.1 Basic Structure
The video encoding method depends on the source model used. Based on the source model, video encoding falls into two categories: waveform-based coding and content-based coding.
3.2 Waveform-Based Coding
If the source model "an image consists of many pixels" is used, its parameters are the luminance and chrominance amplitudes of each pixel, and encoding these parameters is waveform-based coding.
Waveform-based coding exploits the spatial correlation between pixels and the temporal correlation between frames, using predictive coding and transform coding to reduce the correlation in the video signal, significantly lowering the bit rate of the video sequence and thus achieving compression.
3.3 Content-Based Coding
If a source model consisting of several objects is used, its parameters describe the shape, texture, and motion of each object. Content-based coding encodes these parameters.
IV. Applications of H.264
4.1 The technical features of H.264 can be summarized as follows:
(1) Emphasis on practicality;
(2) Emphasis on adaptation to mobile and IP networks;
(3) Major improvements to the key components of the hybrid coder within its basic framework, such as multi-mode motion estimation, intra-frame prediction, multi-frame prediction, content-based variable-length coding, and the 4x4 two-dimensional integer transform.
(4) The implementation difficulty must be weighed against the superior performance of H.264. In general, the performance gains of H.264 come at the cost of increased complexity: its encoding complexity is estimated at about three times that of H.263, and its decoding complexity at about twice that of H.263.
4.2 H.264 applications fall into three profiles:
(1) Baseline profile (simple and widely applicable; supports intra- and inter-frame coding and context-adaptive variable-length entropy coding.)
Applications: real-time communication, such as video calls, videoconferencing, and wireless communications.
(2) Main profile (adopts several technical measures to improve image quality and increase the compression ratio; supports interlaced video and context-based adaptive arithmetic coding.)
Applications: digital broadcasting and digital video storage.
(3) Extended profile. Application fields: video streaming and video on demand over various networks.
V. Video Encoding Principles
5.1 Basic Concepts
(1) The video encoder compresses an image or a video sequence to generate a code stream.
The frame or field Fn input to the encoder is processed by the encoder in units of macroblocks.
If inter-frame prediction is used: the prediction P is obtained by motion-compensated prediction from previously encoded reference images. The prediction P is subtracted from the current frame Fn to give the residual Dn of the two images. Dn is transformed (T) and quantized (Q) to remove spatial redundancy, yielding the coefficients X. X is reordered (to make the data more compact) and entropy coded (together with the motion vectors and other side information) to produce the NAL data.
The encoder also contains a reconstruction path (a decoding process): the quantized coefficients X are inverse quantized and inverse transformed to obtain Dn'; Dn' is added to the prediction P to give uFn', which is filtered to obtain Fn'. Fn' is the image obtained after Fn has been encoded and decoded.
If intra-frame prediction is used: the prediction P is computed from already-encoded macroblocks in the current picture (luma: 4x4 or 16x16 prediction; chroma: 8x8 prediction). The prediction P is subtracted from the block being processed to give the residual Dn. Dn is transformed (T) and quantized (Q) to yield the coefficients X; X is reordered (to make the data more compact) and entropy coded to produce the NAL data.
During reconstruction, the quantized coefficients X are inverse quantized and inverse transformed to obtain Dn'; the sum of Dn' and the prediction P gives the decoded value of the current macroblock, which can then serve as a reference macroblock for intra-frame prediction.
Why the encoder needs a reconstruction path: reconstruction is in effect a decoding process, and the decoded image inevitably differs from the source image. By using the decoded image as its reference, the encoder's prediction stays consistent with the values available in the decoder, which improves prediction accuracy. (The decoder likewise uses decoded images as references, predicting each image from previously decoded ones.)
(2) The video decoder decodes a code stream and produces images or a video sequence of the same length as the source. If the decoded images are identical to the source images, the encoding/decoding process is lossless; otherwise it is lossy.
The decoder is implemented in the same way as the encoder's reconstruction path.
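The closed-loop principle described in 5.1 can be shown with a minimal sketch. This is not H.264: the transform and quantization steps are collapsed into a single scalar quantizer, and prediction is just "previous reconstructed sample", but the loop structure (predict from reconstructed data, code the quantized residual) is the same.

```python
# Minimal sketch of the encoder's reconstruction loop: the encoder
# predicts from its own *reconstructed* samples, so encoder and
# decoder stay in sync despite lossy quantization.

STEP = 4  # quantization step; the only source of loss in this sketch

def quantize(x):
    return round(x / STEP)

def dequantize(q):
    return q * STEP

def encode(samples):
    """DPCM-style loop: predict each sample from the previous
    reconstructed one, then code the quantized residual."""
    recon_prev = 0          # prediction memory (reconstructed, not source!)
    stream = []
    for s in samples:
        p = recon_prev      # prediction P
        dn = s - p          # residual Dn
        x = quantize(dn)    # "T + Q" -> coefficient X
        stream.append(x)    # (entropy coding omitted)
        recon_prev = p + dequantize(x)  # reconstruction path
    return stream

def decode(stream):
    recon_prev = 0
    out = []
    for x in stream:
        recon_prev = recon_prev + dequantize(x)
        out.append(recon_prev)
    return out

src = [10, 14, 13, 20, 22]
decoded = decode(encode(src))
# Because both sides predict from the same reconstructed values, the
# error never grows beyond the quantizer's own half-step.
assert all(abs(s - d) <= STEP / 2 for s, d in zip(src, decoded))
```

If the encoder predicted from the original samples instead, the decoder (which only has decoded samples) would drift away from it, and the prediction error would accumulate frame after frame.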
(3) field, frame, and image
Field: an image scanned in alternate lines (interlaced). The even lines form the top field and the odd lines form the bottom field.
Frame: The image scanned row by row.
Image: both fields and frames are referred to as images.
(4) macro blocks and slices:
Macroblock: a macroblock consists of a 16x16 luma block, an 8x8 Cb block, and an 8x8 Cr block.
Slice: an image can be divided into one or more slices. A slice consists of one or more macro blocks.
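A quick sketch of the macroblock layout just described (the 4:2:0 sampling implied by 16x16 luma plus 8x8 Cb and Cr; the QCIF example dimensions are mine, not from the text):

```python
# Sample counts for the macroblock structure: 16x16 luma samples
# plus one 8x8 Cb block and one 8x8 Cr block.

MB_LUMA = 16 * 16      # 256 luma samples
MB_CHROMA = 8 * 8      # 64 samples per chroma component

def samples_per_macroblock():
    return MB_LUMA + 2 * MB_CHROMA  # 256 + 64 + 64 = 384

def macroblocks_in_frame(width, height):
    """Number of macroblocks in a frame whose dimensions are
    multiples of 16."""
    return (width // 16) * (height // 16)

print(samples_per_macroblock())        # 384
print(macroblocks_in_frame(176, 144))  # QCIF: 11 * 9 = 99 macroblocks
```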
5.2 Encoded Data Format
5.2.1 H.264 supports the encoding and decoding of both progressive and interlaced video.
5.2.2 H.264 Encoding Format
The formulation of H.264 had two main objectives:
(1) A high video compression ratio
(2) Good network friendliness, so that it can adapt to various transmission networks
H.264 is therefore divided functionally into two layers: the Video Coding Layer (VCL) and the Network Abstraction Layer (NAL).
VCL data is the compressed and encoded video data sequence. VCL data can be transmitted or stored only after being encapsulated in NAL units. The NAL unit format is as follows:
NAL header | RBSP
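The NAL header shown above is a single byte with three fixed-width fields: forbidden_zero_bit (1 bit), nal_ref_idc (2 bits), and nal_unit_type (5 bits). A small sketch of parsing it:

```python
# Parse the one-byte NAL unit header that precedes the RBSP payload.

def parse_nal_header(byte):
    return {
        "forbidden_zero_bit": (byte >> 7) & 0x01,  # must be 0
        "nal_ref_idc": (byte >> 5) & 0x03,         # reference importance
        "nal_unit_type": byte & 0x1F,              # kind of payload
    }

# 0x67 = 0110 0111: nal_ref_idc 3, type 7 (sequence parameter set, SPS)
print(parse_nal_header(0x67))
# 0x65: type 5 is a coded slice of an IDR picture
print(parse_nal_header(0x65))
```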
5.2.3 H.264 Code Stream Structure
5.3 Reference Images
To improve prediction accuracy, H.264 can choose among up to 15 reference images and select the one that best matches.
Advantage: greatly improves prediction accuracy
Disadvantage: the complexity is greatly increased.
The reference images are managed through reference lists (list0 and list1).
A P frame has one reference list, list0.
A B frame has two reference lists, list0 and list1.
5.4 Intra-Frame Prediction
The prediction block P is formed from previously encoded and reconstructed neighboring blocks and is subtracted from the current block.
Luma prediction: 4x4 luma prediction and 16x16 luma prediction
Chroma prediction: 8x8 chroma prediction
5.4.1 4x4 Luma Prediction
There are 9 prediction modes for 4x4 luma prediction.
(A) The reconstructed pixels above and to the left of the current block are used for intra 4x4 prediction.
(B) The 8 prediction directions of intra 4x4 prediction
Mode 0 (vertical): the pixel values are extrapolated vertically from the pixels above.
Mode 1 (horizontal): the pixel values are extrapolated horizontally from the pixels on the left.
Mode 2 (DC): all pixel values are derived from the average of the pixels above and on the left.
Mode 3 (diagonal down-left): the pixel values are interpolated from pixels along a 45-degree direction.
Mode 4 (diagonal down-right): the pixel values are interpolated from pixels along a 45-degree direction.
Mode 5 (vertical-right): the pixel values are interpolated from pixels along a 26.6-degree direction.
Mode 6 (horizontal-down): the pixel values are interpolated from pixels along a 26.6-degree direction.
Mode 7 (vertical-left): the pixel values are interpolated from pixels along a 26.6-degree direction.
Mode 8 (horizontal-up): the pixel values are interpolated from pixels along a 26.6-degree direction.
Each of the 9 prediction modes produces a candidate prediction block, and the SAE (sum of absolute errors) measures the prediction error of each; the prediction block with the smallest SAE best matches the current block.
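The mode-selection idea above can be sketched for the three simplest modes (vertical, horizontal, DC); the diagonal modes 3-8 interpolate the same neighboring pixels directionally and are omitted here. The sample block values are invented for the example.

```python
# Simplified intra 4x4 prediction: build candidate prediction blocks
# from the reconstructed pixels above (top) and to the left (left),
# then pick the mode with the smallest SAE (sum of absolute errors).

def pred_vertical(top, left):            # mode 0
    return [list(top) for _ in range(4)]

def pred_horizontal(top, left):          # mode 1
    return [[left[r]] * 4 for r in range(4)]

def pred_dc(top, left):                  # mode 2: rounded mean of neighbors
    dc = (sum(top) + sum(left) + 4) // 8
    return [[dc] * 4 for _ in range(4)]

def sae(block, pred):
    return sum(abs(block[r][c] - pred[r][c])
               for r in range(4) for c in range(4))

def best_mode(block, top, left):
    modes = {0: pred_vertical, 1: pred_horizontal, 2: pred_dc}
    scores = {m: f(top, left) for m, f in modes.items()}
    return min(scores, key=lambda m: sae(block, scores[m]))

# A block whose rows repeat its left neighbors -> horizontal wins.
top = [50, 60, 70, 80]
left = [10, 20, 30, 40]
block = [[10] * 4, [20] * 4, [30] * 4, [40] * 4]
print(best_mode(block, top, left))  # 1 (horizontal, SAE = 0)
```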
5.4.2 16x16 Luma Prediction Modes - 4 prediction modes in total
Mode 0 (vertical): the pixel values are extrapolated from the pixels above.
Mode 1 (horizontal): the pixel values are extrapolated from the pixels on the left.
Mode 2 (DC): the pixel values are derived from the average of the pixels above and on the left.
Mode 3 (plane): a linear "plane" function produces the pixel values; suitable for areas where the luminance changes gently.
5.4.3 8x8 Chroma Block Prediction Modes
There are four prediction modes, similar to intra 16x16 prediction but numbered differently:
DC is mode 0, horizontal is mode 1, vertical is mode 2, and plane is mode 3.
5.5 Inter-Frame Prediction
H.264 inter-frame prediction uses previously encoded frames or fields with block-based motion compensation. H.264 allows more flexible block sizes (from 16x16 down to 4x4).
5.5.1 Basic Concepts
There is correlation between the scenes in adjacent frames of a moving image. The image is therefore divided into blocks or macroblocks, and for each block or macroblock a matching position is searched for in an adjacent frame; the relative offset between the two positions is the motion vector (MV).
The process of finding the motion vector is called motion estimation (ME).
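A toy full-search sketch of the motion estimation just defined: slide the current block over a search window in the reference frame and keep the offset (the MV) with the smallest SAD. This is integer-pel search only; H.264 additionally refines to sub-pixel accuracy. The tiny 8x8 frames are invented for the example.

```python
# Full-search block-matching motion estimation over a small window.

def sad(ref, cur, dy, dx, by, bx, n):
    """SAD between the n x n block of cur at (by, bx) and the block
    of ref displaced by (dy, dx)."""
    return sum(abs(cur[by + r][bx + c] - ref[by + dy + r][bx + dx + c])
               for r in range(n) for c in range(n))

def motion_search(ref, cur, by, bx, n, search):
    best = (None, float("inf"))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            if not (0 <= by + dy <= len(ref) - n and
                    0 <= bx + dx <= len(ref[0]) - n):
                continue  # displaced block would fall outside ref
            cost = sad(ref, cur, dy, dx, by, bx, n)
            if cost < best[1]:
                best = ((dy, dx), cost)
    return best

# Reference frame with a bright 2x2 patch; in the current frame the
# patch has moved one pixel to the right, so the block at (2, 3) in
# cur is found at displacement (0, -1) in ref.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for r in (2, 3):
    ref[r][2] = ref[r][3] = 200
    cur[r][3] = cur[r][4] = 200

mv, cost = motion_search(ref, cur, 2, 3, 2, 2)
print(mv, cost)  # (0, -1) with SAD 0
```

The exhaustive double loop is why larger search windows and smaller blocks raise the computational cost of motion estimation so sharply.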
5.5.2 Tree-Structured Motion Compensation
The luma of each macroblock (16x16) can be partitioned in four ways: one 16x16, two 16x8, two 8x16, or four 8x8 blocks. Each 8x8 sub-block can be further partitioned into one 8x8, two 4x8, two 8x4, or four 4x4 blocks. This kind of partitioned motion compensation is called tree-structured motion compensation.
With its flexible and fine-grained partitioning, tree-structured motion compensation greatly improves the accuracy of motion estimation.
The block size is variable, and the size can be chosen flexibly during motion estimation. For macroblock (MB) partitioning, H.264 uses four modes: 16x16, 16x8, 8x16, and 8x8. When the 8x8 mode is chosen, each 8x8 block can be further split into 8x4, 4x8, or 4x4 sub-macroblock partitions. This makes the partitioning of moving objects more precise, reduces the blocking error at the edges of moving objects, and reduces the amount of computation in the transform. When Intra_16x16 prediction is used for large smooth areas, H.264 applies a second 4x4 transform to the 16 DC coefficients of the luma 4x4 blocks, and a 2x2 transform to the DC coefficients of the four 4x4 chroma blocks.
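The partition tree above can be enumerated in a few lines, which also shows how many motion-compensated blocks a single macroblock can carry in the finest case:

```python
# Enumerate the tree-structured macroblock partitions and count how
# many blocks of each size tile a 16x16 macroblock.

MB_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]   # macroblock level
SUB_PARTITIONS = [(8, 8), (8, 4), (4, 8), (4, 4)]      # inside each 8x8

def blocks_in_16x16(h, w):
    """How many (h, w) blocks tile one 16x16 macroblock."""
    return (16 // h) * (16 // w)

def blocks_in_8x8(h, w):
    """How many (h, w) sub-blocks tile one 8x8 partition."""
    return (8 // h) * (8 // w)

for h, w in MB_PARTITIONS:
    print(f"{h}x{w}: {blocks_in_16x16(h, w)} per macroblock")
for h, w in SUB_PARTITIONS:
    print(f"  sub {h}x{w}: {blocks_in_8x8(h, w)} per 8x8 block")

# Finest case: four 8x8 partitions, each split into four 4x4 blocks,
# gives 16 motion-compensated blocks in one macroblock.
assert blocks_in_16x16(8, 8) * blocks_in_8x8(4, 4) == 16
```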
5.5.3 Motion Vectors
Each partition or sub-macroblock of an inter-coded macroblock is predicted from an equal-sized area of the reference image. The offset between the two (the MV) has quarter-pixel accuracy for the luma component and one-eighth-pixel accuracy for chroma.