Parallel algorithms can be divided into two categories:
- Function-level decomposition: parallelism according to functional modules
- Data-level decomposition: parallelism based on data partitioning
Function-level decomposition
Function partitioning in H.264 decoding: for example, on a quad-core system, each core performs one of the following tasks:
- Entropy decoding of frame n
- Inverse quantization and inverse transform of frame n-1
- Prediction of frame n-2
- Deblocking filter of frame n-3
This is a pipelined style of parallelism, but it has the following problems in H.264 decoding:
- The time spent in each functional stage differs and depends on the actual stream data, so the throughput is limited by the most time-consuming stage. If throughput were the only issue, it could be mitigated by buffering between pipeline stages (once a core finishes its task for a frame, it caches the result and, if the previous stage of the next frame is complete, immediately starts processing that frame). The real problem lies in the prediction stage: inter macroblocks require motion compensation, i.e. they depend on other frames, so the current frame must wait until the frames it references have finished decoding. This makes the pipelined parallel approach considerably more complex.
- The degree of parallelism is limited: it depends on how many parts the work can be divided into by function. The example above has four functional stages, so the work can be distributed across at most four cores.
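As a sketch of the throughput argument above: even with buffering between stages, a pipeline completes at most one frame per "tick" of its slowest stage. The stage times below are made-up illustrative values, not measurements:

```python
# Hypothetical per-frame stage times in ms for the four pipeline stages:
# entropy decoding, IQ/IT, prediction, deblocking (illustrative values only).
STAGE_TIMES_MS = [12.0, 6.0, 10.0, 8.0]

def pipeline_fps(stage_times_ms):
    """Steady-state throughput of a buffered pipeline: one frame completes
    per tick of the slowest stage, so throughput = 1000 / max(stage)."""
    return 1000.0 / max(stage_times_ms)

def serial_fps(stage_times_ms):
    """Throughput when all stages run sequentially on a single core."""
    return 1000.0 / sum(stage_times_ms)
```

With these sample numbers the pipeline is bounded by the 12 ms entropy stage, i.e. roughly 83 fps, regardless of how fast the other three stages are.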
Because of these shortcomings, this approach is generally not used in parallel H.264 decoder implementations.
Data-level decomposition
With data partitioning, each part is processed separately and the results are merged afterwards. Data can be partitioned at several levels in H.264; the key ones are analyzed below.
1. Frame-level Parallelism
H.264 frames are divided into I, P, and B frames. I and P frames are used as reference frames, while B frames are often non-reference frames. Parallel algorithms can process completely unrelated data in parallel, so B frames that are not used as references can be decoded in parallel. With this algorithm, one core is needed to parse the stream, determine each frame's type, and dispatch frames to the other cores.
Its disadvantages are as follows:
- Limited scalability. In general the number of B frames between P frames is small, so the degree of parallelism is not high.
- B frames can also be used as reference frames in H.264, in which case the frame-level parallel algorithm cannot be applied. One solution is to require that B frames not be used as references at encoding time (in fact, x264 does not use B frames as references when parallel encoding is enabled), but this increases the bitrate. Besides, we are discussing the decoder, which is completely separate from the encoder: a general-purpose H.264 decoder should fully support every feature of the standard.
For these reasons, this approach is generally not used in parallel decoder implementations.
2. Slice-level Parallelism
In H.264, as in other advanced video coding standards, each frame can be divided into one or more slices.
The purpose of slices is to enhance robustness against transmission errors: if an error occurs, the slices without errors are unaffected, limiting the degradation of the displayed video. The slices of a frame are independent of each other, i.e. entropy decoding, prediction, and the other decoding operations never cross slice boundaries. Thanks to this data independence, slices can be processed in parallel.
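Because slices are mutually independent until the (optional) cross-boundary deblocking, decoding them can be fanned out to a thread pool. The Python sketch below uses a placeholder `decode_slice`; the function name and slice representation are illustrative assumptions, not a real decoder API:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_slice(slice_data):
    # Placeholder for real slice decoding (entropy decoding, prediction,
    # reconstruction); here it just tags the input so the flow is visible.
    return f"decoded:{slice_data}"

def decode_frame(slices):
    # Slices within a frame are independent, so decode them concurrently
    # and merge the results back in slice order afterwards.
    with ThreadPoolExecutor(max_workers=max(1, len(slices))) as pool:
        return list(pool.map(decode_slice, slices))
```

Note that `pool.map` preserves slice order, which matters when the reconstructed slices are stitched back into one frame.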
Its disadvantages are as follows:
- The number of slices per frame is determined by the encoder and is generally small. Most videos on the internet use a single slice per frame, which renders the slice-level parallel algorithm ineffective.
- Although slices are independent of each other, the deblocking filter (optional at encoding time) may cross slice boundaries, and deblocking must follow the normal macroblock order of the video sequence, which reduces the speedup of the slice-level parallel algorithm.
- The main disadvantage of multi-slice encoding is an increased bitrate. As mentioned in the first point, a frame generally has only one slice; increasing the number of slices per frame increases the number of slice boundaries, and macroblock intra and inter prediction cannot cross those boundaries. In addition, more slices mean more slice headers and start codes. Measurements of the bitrate increase versus slice count for four different 1080p videos show that at 4 slices the bitrate grows by less than 5%, which is acceptable in high-bitrate applications such as Blu-ray playback. More slices, however, inflate the stream further; remember that the main purpose of video compression is compression, not parallel processing, and with many slices, skipping the deblocking filter degrades video quality.
Based on the above analysis, this approach is generally not used in parallel decoder implementations.
3. Macroblock-level Parallelism
First, macroblocks (MBs) are not completely independent. To decode them in parallel, we must analyze the dependencies between macroblocks and find the independent ones, so as to determine how to parallelize.
MB-level parallel algorithms can be divided into two kinds:
- 2D-Wave
- 3D-Wave
2D-Wave
2D-Wave algorithm principle
2D-Wave is an algorithm for parallel processing of the macroblocks within a frame. To implement it, we must consider the dependencies between macroblocks within a frame.
In intra prediction, inter prediction, deblocking, and other stages, a macroblock depends on specific adjacent macroblocks: its left, top-left, top, and top-right neighbours.
When a frame is processed in a single thread, macroblocks are processed from left to right and top to bottom. In this order, every macroblock's dependencies are already available by the time it is decoded. Parallel processing requires the data items to be mutually independent, and independent macroblocks appear at the following relative position (a knight-jump diagonal):
In parallel processing, the closest macroblock independent of a given macroblock lies at the relative position (2 to the right, 1 up), so the two can be processed at the same time. In the figure, solid cells are macroblocks that have been processed, hatched cells are being processed, and blank cells have not yet been processed.
This algorithm is called 2D-Wave.
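The neighbour dependencies and the (2 right, 1 up) independence can be checked with a small sketch. Assuming unit decode time per macroblock, MB(x, y) can start in slot x + 2y, exactly one slot after its latest dependency (function names are ours, for illustration):

```python
def mb_dependencies(x, y):
    """Intra-frame dependencies of MB(x, y): the left, top-left, top and
    top-right neighbours, when they exist (right-edge clamping against the
    frame width is omitted in this sketch)."""
    cand = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in cand if cx >= 0 and cy >= 0]

def earliest_slot(x, y):
    """Earliest 2D-Wave decode slot of MB(x, y) with unit MB decode time."""
    return x + 2 * y
```

MB(2, 0) and MB(0, 1) share slot 2, which is exactly the knight-jump offset described above.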
Unlike the frame-level and slice-level parallel algorithms, 2D-Wave has good scalability, because its maximum degree of parallelism, i.e. the number of macroblocks that can be processed at the same time, depends on the width and height of the frame:
$ParMB_{max,2D} = \min\left(\left\lceil mb\_width/2 \right\rceil, mb\_height\right)$
For example, a 1080p frame is $120 \times 68$ macroblocks, so its maximum degree of parallelism is $\lceil 120/2 \rceil = 60$.
Disadvantages of 2D-Wave
- At the beginning and end of a frame, the degree of parallelism is low; on a many-core system, many cores will sit idle during these phases.
- When CABAC entropy coding is used, the entropy decoding of each macroblock in a slice depends on the preceding macroblock, i.e. the macroblocks are chained together, so a macroblock can only be reconstructed in parallel after its (sequential) entropy decoding has completed.
The efficiency of 2D-Wave
We discussed the parallel approach of 2D-Wave above: the macroblock currently being reconstructed must stay two macroblocks behind the macroblock being reconstructed in the row above, which yields the following DAG (directed acyclic graph) for parallel decoding.
The figure shows the DAG for a frame of 5x5 macroblocks: the horizontal axis is time, the vertical axis the rows of the frame, and each node represents the decoding of one macroblock. Here we assume that every macroblock takes the same time to decode, and we ignore the time spent on communication and synchronization.
The depth of the DAG, i.e. the number of nodes on the longest path from start to end, is the time required to decode a frame in parallel, denoted $t_{\infty}$. The total number of nodes in the DAG is the time required to decode the frame without parallelism, denoted $t_{s}$. From these two idealized quantities we obtain the maximum speedup of the 2D-Wave parallel decoding algorithm:
$SpeedUp_{max,2D} = \frac{t_s}{t_{\infty}} = \frac{mb\_width \times mb\_height}{mb\_width + 2\times(mb\_height-1)}$
where the maximum degree of parallelism during decoding is
$ParMB_{max,2D} = \min\left(\left\lceil mb\_width/2 \right\rceil, mb\_height\right)$
From these formulas we can compute the following parameters for common video resolutions decoded with the 2D-Wave parallel algorithm:
| Resolution name | Resolution (pixels) | Resolution (MBs) | Total MBs | Max. speedup | Parallel MBs |
| --- | --- | --- | --- | --- | --- |
| UHD, 4320p | 7680x4320 | 480x270 | 129,600 | 127 | 240 |
| QHD, 2160p | 3840x2160 | 240x135 | 32,400 | 63.8 | 120 |
| FHD, 1080p | 1920x1080 | 120x68 | 8,160 | 32.1 | 60 |
| HD, 720p | 1280x720 | 80x45 | 3,600 | 21.4 | 40 |
| SD, 576p | 720x576 | 45x36 | 1,620 | 14.1 | 23 |
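The table values follow directly from the two formulas above; a quick check (macroblocks are 16x16 pixels, dimensions rounded up; function names are ours):

```python
import math

def mb_grid(width_px, height_px):
    """Frame size in macroblocks (16x16 pixels, rounded up)."""
    return math.ceil(width_px / 16), math.ceil(height_px / 16)

def max_parallel_mbs(mb_w, mb_h):
    """ParMB_max,2D = min(ceil(mb_width / 2), mb_height)."""
    return min(math.ceil(mb_w / 2), mb_h)

def max_speedup(mb_w, mb_h):
    """Total MBs over the 2D-Wave critical-path length."""
    return (mb_w * mb_h) / (mb_w + 2 * (mb_h - 1))
```

For 1080p this gives a 120x68 grid, a parallelism of 60, and a speedup of about 32.1, matching the table row above.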
The curve of the degree of parallelism over time is as follows:
The horizontal axis is time in macroblock-decoding units, with 0 being the decoding slot of macroblock (0,0); the vertical axis is the degree of parallelism. For a single frame, the degree of parallelism first rises and then falls, and the average degree of parallelism is roughly half the maximum.
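The rise-then-fall shape can be reproduced by counting how many macroblocks occupy each decode slot under the same ideal unit-time assumption (a sketch, with MB(x, y) in slot x + 2y):

```python
def parallelism_profile(mb_w, mb_h):
    """MBs decoded in each time slot under ideal 2D-Wave, where MB(x, y)
    runs in slot x + 2*y and every MB takes one time unit."""
    profile = [0] * (mb_w + 2 * (mb_h - 1))
    for y in range(mb_h):
        for x in range(mb_w):
            profile[x + 2 * y] += 1
    return profile
```

For a 120x68 (1080p) grid the profile peaks at 60 and averages 8160/254, i.e. about 32, roughly half the peak, as stated above.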
Of course, the above analysis is based on the idealized assumption that every macroblock takes the same time to decode. In practice, different macroblock types and other factors make decoding times vary, which requires a synchronization mechanism to regulate the decoding order of the macroblocks; this brings corresponding overhead, so the theoretical maximum speedup cannot be achieved.
To analyze the speedup of 2D-Wave in actual decoding, we can run a simulation experiment in a real decoder. For example, in the FFmpeg decoder, while decoding a frame we record the time required to decode each of its macroblocks. Combining these timings with the above analysis (a macroblock must wait for the macroblocks it depends on before it can start), we can assemble the DAG of an actual frame decode and thus obtain the speedup of 2D-Wave in a real application.
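The simulation just described amounts to a longest-path computation over the 2D-Wave DAG using the measured per-MB times. A minimal sketch, where `times[y][x]` holds the measured decode time of MB(x, y):

```python
def simulated_speedup(times):
    """Serial decode time divided by the critical-path time of the 2D-Wave
    DAG: each MB starts once its left, top-left, top and top-right
    dependencies have all finished."""
    mb_h, mb_w = len(times), len(times[0])
    finish = [[0.0] * mb_w for _ in range(mb_h)]
    for y in range(mb_h):
        for x in range(mb_w):
            deps = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            start = max((finish[dy][dx] for dx, dy in deps
                         if 0 <= dx < mb_w and 0 <= dy < mb_h), default=0.0)
            finish[y][x] = start + times[y][x]
    serial = sum(sum(row) for row in times)
    return serial / max(max(row) for row in finish)
```

With uniform unit times on a 5x3 grid this returns 15/9, matching the $t_s/t_{\infty}$ formula; with real per-MB measurements it yields the practical speedups reported below.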
The speedup for different videos is shown in the table below:

| Input video | Blue_sky | Pedestrian_area | Riverbed | Rush_hour |
| --- | --- | --- | --- | --- |
| Max. speedup | 19.2 | 21.9 | 24.0 | 22.2 |
The videos above are 1080p, whose theoretical speedup is 32.1; in practice the speedup is about 33% lower on average.
3D-Wave
2D-Wave can already obtain a sufficiently high speedup, but in a many-core environment with, say, 100 cores, 2D-Wave is no longer enough. Below we discuss the 3D-Wave algorithm, which is well suited to multi-core and even many-core decoding environments.
The 3D-Wave algorithm is based on the observation that there is usually no extremely fast motion between successive frames, so motion vectors (MVs) are generally small. This means that when decoding a macroblock of the current frame, we do not need to wait for the previous frame to finish decoding entirely: as soon as the reference area that the current macroblock depends on has been reconstructed, the current macroblock can start decoding. Within a frame this algorithm keeps the same constraints as 2D-Wave, so 3D-Wave can be seen as an upgraded version of 2D-Wave.
The 3D-Wave algorithm has two implementation modes.
1. Static 3D-Wave
The maximum MV length is defined in the H.264 standard (see Table A-1 in the annex; note that it differs from the maximum search range). For 1080p video the maximum MV length is 512 pixels. The static 3D-Wave algorithm uniformly takes this maximum MV as the dependency range of the macroblock currently being decoded: only when all macroblocks of the reference frame inside the area covered by this maximum MV have been reconstructed can reconstruction of the current macroblock begin.
To analyze the static 3D-Wave algorithm we proceed as with 2D-Wave, assuming that every macroblock takes the same time to decode, and that the following conditions hold during decoding:
- B frames can be used as reference frames. This means that any frame can serve as a reference frame.
- The reference frame is always the immediately preceding frame in the video sequence. This delays the start of the current macroblock the most: decoding may begin only once the relevant area of the previous frame has been decoded, whereas referencing an earlier frame would allow decoding to start sooner.
- Only the first frame is an I frame, and intra macroblocks do not appear in any other frame. This means that, except in the first frame, every macroblock carries an inter-frame dependency: its decoding cannot begin until the relevant macroblocks of its reference frame have been decoded.
This scenario is the worst case for the static 3D-Wave algorithm. Based on these assumptions, we now calculate the degree of parallelism of static 3D-Wave.
Take a maximum MV of 16 pixels. If the macroblock currently being decoded is macroblock (0,0) of the second frame, then in the first frame the area covered by the maximum MV contains macroblocks (0,0), (0,1), (1,0), and (1,1). In addition, because sub-pixel reconstruction requires interpolation, two further macroblocks, (2,0) and (2,1), are needed. So macroblock (0,0) of the second frame depends on macroblocks (0,0), (0,1), (1,0), (1,1), (2,0), and (2,1) of the first frame; from this we know that decoding of (0,0) in the second frame can start once (2,1) of the first frame has finished decoding.
With a maximum MV of 16, macroblock (0,0) of the second frame starts decoding 6 time units after macroblock (0,0) of the first frame. By the same analysis, for maximum MVs of 32, 64, and 128 the inter-frame delay is 9, 15, and 27 units respectively. The general formula, with $n$ the maximum MV length in pixels, is
$latency = 3 \times \left\lceil n/16 \right\rceil + 3$
As long as this inter-frame delay is respected, subsequent macroblocks can be decoded in the 2D-Wave manner, reaching the same maximum degree of parallelism within each frame.
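As a check, the stated delays 6, 9, 15, and 27 for maximum MVs of 16, 32, 64, and 128 pixels all match the formula (the function name is ours):

```python
import math

def static_3dwave_latency(max_mv_px):
    """Inter-frame delay (in MB decode slots) between MB(0,0) of a frame
    and MB(0,0) of the next frame, for a maximum MV length in pixels:
    3 * ceil(max_mv / 16) + 3."""
    return 3 * math.ceil(max_mv_px / 16) + 3
```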
Based on the above analysis, the figure shows the degree of parallelism of 1080p video at different maximum MV lengths.
2. Dynamic 3D-Wave
Unlike static 3D-Wave with its maximum MV length, the dynamic 3D-Wave algorithm uses the actual MVs of each macroblock, so a macroblock depends only on the reference-frame area that its actual MVs reach. As a result, the parallelism of this algorithm cannot be obtained by simple analysis.
The parallelism analysis of dynamic 3D-Wave must be based on an actual stream: by tracking the position of every macroblock and its MVs in the stream, we obtain the dependencies between macroblocks. Assuming that every macroblock takes the same time to decode, we can then simulate the degree of parallelism of dynamic 3D-Wave from these dependencies.
The simulation method is as follows:
While decoding in a decoder such as FFmpeg, each macroblock is assigned a decoding timestamp. The first macroblock of the first frame is assigned the value 1, i.e. the decoding timestamp of macroblock (0,0) of the first frame is 1. For each subsequently decoded macroblock, we determine the macroblocks it depends on using the dependency analysis above; the latest decoding timestamp among those dependencies, plus 1, becomes the current macroblock's decoding timestamp. Once every macroblock has a timestamp, we count how many macroblocks are decoded at each timestamp, which gives the degree of parallelism.
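This timestamp assignment can be sketched as a single pass over the macroblocks in decode order: each MB's timestamp is one more than the largest timestamp among its dependencies (intra-frame neighbours plus the reference-frame area reached by its actual MVs). The MB identifiers and dependency map below are illustrative:

```python
def decode_timestamps(deps_in_order):
    """deps_in_order maps each MB id to the ids it depends on, listed in
    decode order so that dependencies are timestamped before dependents.
    Returns (timestamp per MB, number of MBs decoded per timestamp)."""
    ts = {}
    for mb, deps in deps_in_order.items():
        ts[mb] = 1 + max((ts[d] for d in deps), default=0)
    parallelism = {}
    for t in ts.values():
        parallelism[t] = parallelism.get(t, 0) + 1
    return ts, parallelism
```

For example, two MBs that both depend only on a shared predecessor get the same timestamp and hence count toward the same parallelism slot.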
The following figure shows the degree of parallelism in the decoding process for each video.
From it we can see:
- A video such as Blue_sky can reach a parallelism of up to 7000, far higher than the roughly 2000 of static 3D-Wave (max MV = 16).
- The degree of parallelism depends on the specific video. Fast-moving objects, as in Pedestrian_area, produce larger MVs, which reduces the degree of parallelism.
- Since each test video is only 400 frames long, the parallelism curve appears as a peak. With a sufficiently long video, the curve would flatten out at a high degree of parallelism.
In actual decoding, the dynamic 3D-Wave algorithm also runs into many practical constraints, such as communication and synchronization overhead, even more than 2D-Wave, so its efficiency drops. It is reasonable to expect these constraints to reduce efficiency by about 50%, more than the 33% observed for 2D-Wave.
Other Data-level Parallelism
1. GOP-level Parallelism
Each GOP (group of pictures) is independent of the others, so video can be processed in parallel GOP by GOP, but this level of parallelism is subject to the following limitations:
- Each GOP must maintain its own reference picture list (the same issue can arise in 3D-Wave above). The reference picture list holds uncompressed YUV images, so as the degree of parallelism grows, the memory required to maintain these lists becomes considerable.
- How GOPs are composed is decided by the encoder before encoding, so GOP structures vary widely; an entire video sequence may even consist of a single GOP, which greatly limits the parallelism of the GOP-level algorithm.
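The memory point above is easy to quantify: a decoded 4:2:0 YUV frame costs 1.5 bytes per pixel, and H.264 allows up to 16 reference frames, so per-GOP reference lists add up quickly. A rough calculator (function name is ours):

```python
def ref_memory_mib(width_px, height_px, ref_frames, parallel_gops):
    """Approximate memory (MiB) for uncompressed 4:2:0 YUV reference
    frames (1.5 bytes per pixel per frame) across parallel GOP decoders."""
    bytes_per_frame = width_px * height_px * 3 // 2
    return bytes_per_frame * ref_frames * parallel_gops / (1024 * 1024)
```

For example, 1080p with 16 reference frames costs about 47.5 MiB per GOP decoder, so 8 parallel GOPs need roughly 380 MiB just for reference pictures.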
2. Sub-block level Parallelism
This parallel algorithm is based on macroblock-level 3D-Wave, except that it divides each macroblock into smaller sub-blocks; otherwise it is no different. Both use the position and MVs of the currently decoded MB/sub-block to determine the region it depends on, and decoding of the current MB/sub-block can begin once that region has been decoded. However, a sub-block is too small a unit of work for a decoding task: on a 3.3 GHz Intel Sandy Bridge processor, decoding one macroblock takes only about 2 μs on average, and spawning a new decoding task also costs time, so sub-block-level parallelism is not really necessary.
For a more detailed analysis of H.264 parallel decoding algorithms, refer to Ben Juurlink et al., "Scalable Parallel Programming Applied to H.264/AVC Decoding".