In the book *Scalable Parallel Programming Applied to H.264/AVC Decoding*, the authors implement the 2D-Wave parallel decoding algorithm on a dual-chip, 18-core Cell BE system.
Cell BE architecture
First, let's look at the Cell BE. Cell BE is short for Cell Broadband Engine, a microprocessor architecture jointly developed by Sony, Toshiba, and IBM that has been used in the PlayStation 3. The Cell BE architecture is shown below.
A Cell microprocessor has 9 cores: 1 PPE (PowerPC Processor Element) and 8 SPEs (Synergistic Processing Elements). The PPE's main job is to run the operating system and perform various control tasks; the SPEs' main job is computation, but an SPE can only directly process data in its LS (local store).
As a multi-core processor, the Cell differs from common multi-core processors in one respect: the per-core storage is not a cache but a local store (LS). When running in parallel, a common multi-core processor must keep the caches of its processing units coherent, so the more cores there are, the greater the extra overhead of cache synchronization. The Cell processor instead uses an LS and hands control over its contents to the developer: whatever data the developer needs is read from, or written to, memory or I/O explicitly via DMA. This architecture eliminates the overhead of cache coherence and lets developers synchronize the data in the LS themselves, but it demands a deeper understanding of data synchronization and data consistency from the developer.
The LS of an SPE is only 256 KB, so in many cases the LS holds only a fraction of the data, and data must be streamed in and out continuously via DMA to keep the SPE from stalling for lack of data. Some stalls are inevitable, of course, but the program should run as smoothly as possible.
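To make this pattern concrete, here is a minimal sketch of the read-compute-write cycle on an SPE using the Cell SDK's MFC intrinsics from spu_mfcio.h; the buffer size, tag number, and effective address are illustrative rather than taken from the book's implementation:

```c
#include <spu_mfcio.h>

#define TAG 1
/* LS buffer; DMA requires alignment (128-byte gives best performance) and
   each single transfer is limited to 16 KB */
static char buf[16384] __attribute__((aligned(128)));

void process_chunk(unsigned long long ea, unsigned int size)
{
    /* DMA read: external memory -> local store */
    mfc_get(buf, ea, size, TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();      /* block until the transfer completes */

    /* ... operate on buf, which now lives entirely in the local store ... */

    /* DMA write: local store -> external memory */
    mfc_put(buf, ea, size, TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();
}
```

In practice the blocking waits are overlapped with computation via double buffering, which is exactly what the pipeline optimization later in this article does.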
2D-Wave implementation
Here are two ways to implement 2D-Wave:
- Task Pool (TP). One core maintains the macroblock pool (the pool of decoding tasks), and the other threads take decodable macroblocks from the pool and decode them.
- Ring-Line (RL). Each core is responsible for decoding entire rows of macroblocks.
It is important to note that the "decoding" referred to in these schemes means the part after entropy decoding, i.e., macroblock reconstruction. Entropy decoding cannot exploit data-level parallelism, because much of the bitstream consists of variable-length codewords: decoding a codeword can only begin after the previous codeword has been decoded, so entropy decoding must proceed sequentially. In the Cell BE system, entropy decoding is done separately by the PPE core; the 2D-Wave implementations described here rely on the parallel operation of the SPEs.
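To see why entropy decoding is inherently sequential, consider the Exp-Golomb codes (ue(v)) that H.264 uses for many syntax elements: a codeword's length is only discovered while decoding it, so the position of the next codeword cannot be known in advance. A minimal sketch, where read_bit() stands for a hypothetical bit-stream reader:

```c
extern int read_bit(void);   /* hypothetical bit-stream reader, returns 0 or 1 */

/* Decode one unsigned Exp-Golomb codeword, ue(v): count the leading zeros,
   then read that many info bits. The total length (2*zeros + 1 bits) is
   known only after the leading zeros have been consumed, which is what
   forces decoding to proceed codeword by codeword. */
unsigned decode_ue(void)
{
    int zeros = 0;
    while (read_bit() == 0)
        zeros++;
    unsigned info = 0;
    for (int i = 0; i < zeros; i++)
        info = (info << 1) | read_bit();
    return (1u << zeros) - 1 + info;
}
```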
In the earlier analysis of the H.264 parallel decoding algorithms, we derived the parallelism and speedup of the 2D-Wave algorithm under ideal conditions. In reality, however, each macroblock takes a different time to decode, and there is communication and synchronization overhead as well. An efficient implementation must therefore minimize these costs to approach the ideal efficiency.
The main problems we face can be summed up as two:
- Load balancing. If decoding tasks are scheduled in a fixed pattern, say one anti-diagonal at a time, the time spent on each diagonal is determined by its slowest macroblock, and some cores end up suspended, waiting. We therefore need to assign macroblock decoding tasks to cores according to the actual conditions, minimizing the time cores spend waiting.
- Communication and synchronization overhead. The operations run on different SPEs and work on the data in their LSs, but the LSs of different SPEs are independent of each other, so the data in the cores must be synchronized through communication. We can treat communication and synchronization as the same thing here, since the communication is done precisely to synchronize the decoding tasks. Our goal is to minimize this overhead.
Task Pool (TP) structure
The structure of the Task Pool implementation is shown below.
- M, the main core (PPE). The main core maintains a macroblock dependency table to track the dependencies between macroblocks. Once all the macroblocks that a given macroblock depends on have been decoded, that macroblock becomes ready and joins the task queue as a decoding task. Whenever a worker core goes idle, a decoding task is removed from the task queue and assigned to it.
- P, the worker cores (SPEs). Their job is to decode macroblocks; each time a worker finishes a macroblock, it enters a waiting state until the main core assigns it a new decoding task.
- Dependency table, the macroblock dependency table. The table has one entry per macroblock of a frame, and each entry records how many macroblocks that macroblock still depends on. Each time one of its dependencies is decoded, the count is decremented by 1; when the count reaches 0, the macroblock is ready to be decoded and is added by the main core to the decoding task queue.
- Task queue, the queue of macroblock decoding tasks. It is a FIFO: a macroblock joins the queue when it becomes ready, and whenever a worker core finishes a macroblock, a new decoding task is dequeued for it.
In the macroblock dependency table, each cell represents one macroblock, and the number in it is that macroblock's dependency count. The table records the following (an initialization sketch in code follows the list):
- The macroblock in the upper-left corner is the first macroblock of the frame and does not depend on any other macroblock of the current frame.
- Macroblocks in the first row depend only on the macroblock to their left.
- Macroblocks in the first column depend on their upper and upper-right macroblocks, but the upper-right macroblock is always decoded after the upper one, so we can record the dependency as the upper-right macroblock alone: once it is decoded, decoding can begin.
- Macroblocks in the last column depend on their upper-left, upper, and left macroblocks, but the left macroblock is always decoded after the upper ones, so we can record the dependency as the left macroblock alone: once it is decoded, decoding can begin.
- All other macroblocks depend on their upper-left, upper, upper-right, and left macroblocks. The upper-right and left macroblocks are always decoded after the upper-left and upper ones, but there is no fixed order between the two of them, so both count as dependencies of the current macroblock.
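Putting these rules together, the dependency table can be initialized as in the sketch below, a minimal illustration of the init_dependency_matrix() call used in the main-core pseudo-code that follows; the frame dimensions are illustrative:

```c
#define MB_HEIGHT 68    /* e.g. 1080p is 120x68 macroblocks; illustrative only */
#define MB_WIDTH  120

int dep_count[MB_HEIGHT][MB_WIDTH];

/* Fill in the per-macroblock dependency counts for one frame,
   following the rules listed above. */
void init_dependency_matrix(void)
{
    for (int x = 0; x < MB_HEIGHT; x++) {
        for (int y = 0; y < MB_WIDTH; y++) {
            if (x == 0 && y == 0)
                dep_count[x][y] = 0;   /* first MB of the frame: no dependency */
            else if (x == 0)
                dep_count[x][y] = 1;   /* first row: left neighbor only */
            else if (y == 0)
                dep_count[x][y] = 1;   /* first column: upper-right only */
            else if (y == MB_WIDTH - 1)
                dep_count[x][y] = 1;   /* last column: left neighbor only */
            else
                dep_count[x][y] = 2;   /* left and upper-right neighbors */
        }
    }
}
```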
Main core part of TP
Main core pseudo-code:
```c
/* Note: here x is the vertical (row) index and y the horizontal (column) index */
for (each frame) {
    /* Initialize the dependency table; add the first macroblock to the ready queue */
    init_dependency_matrix();
    decoded_mbs = 0;
    enqueue(ready_q, (0, 0));
    /* At the beginning all cores are idle; add them to the core queue */
    for (i = 0; i < nspes; i++)
        enqueue(core_q, i);
    while (decoded_mbs < height * width) {
        /*
         * A non-empty ready queue means there are macroblocks to decode;
         * a non-empty core queue means there are idle cores. While both
         * hold, assign ready macroblocks to idle cores for decoding.
         */
        while (!(is_empty(ready_q) || is_empty(core_q))) {
            (x, y) = dequeue(ready_q);
            i = dequeue(core_q);
            send_mb_coordinates_to_spe(x, y, i);
        }
        /*
         * Check whether each core has returned a message; a message means its
         * decoding has finished, so the core becomes idle again and rejoins
         * the core queue to wait for the next task. Read the coordinates of
         * the decoded macroblock from the message and update the dependency
         * table; any entry whose count drops to 0 is ready to decode and
         * joins the ready queue.
         */
        for (i = 0; i < nspes; i++) {
            if (mailbox[i]) {
                decoded_mbs++;
                enqueue(core_q, i);
                (x, y) = read_mailbox(i);
                /* Release the right neighbor's "left" dependency */
                if (y < width - 1)
                    if (--dep_count[x][y+1] == 0)
                        enqueue(ready_q, (x, y+1));
                /* Release the lower-left neighbor's "upper-right" dependency */
                if (x < height - 1 && y > 0)
                    if (--dep_count[x+1][y-1] == 0)
                        enqueue(ready_q, (x+1, y-1));
            }
        }
    }
}
```

Worker core part of TP

The worker core, i.e. the SPE, can only operate directly on data in its LS, so it must read the data it needs in from external memory via DMA and write the macroblock back to external memory once decoding is complete. The process is as follows.
Five of these steps require DMA transfers:
- Fetch the information of the macroblock to be decoded, i.e., the macroblock data produced by entropy decoding.
- Fetch the data of the neighboring macroblocks required by the current macroblock.
- For an inter macroblock, fetch the corresponding macroblock data from the reference picture. Locating it in the reference picture requires the current macroblock's reference picture and motion vector, so step 1 must be completed before this step can start.
- If the loop (deblocking) filter is used, the unfiltered boundary must be preserved, since intra prediction of subsequently decoded macroblocks uses unfiltered pixels.
- Write the decoded macroblock back.
SPE core pseudo-code:
```c
while (!finished) {
    /* Wait for the main core to assign a new decoding task */
    (x, y) = wait_for_next_mb();
    /* Fetch the entropy-decoded data of the macroblock from external memory via DMA */
    h264mb = fetch_mb_data(x, y);
    /* Fetch the neighboring-macroblock data required by the current macroblock via DMA */
    working_buf = fetch_intra_data(x, y);
    /* For an inter macroblock, fetch its reference pixel data from external memory via DMA */
    ref_data_buf = fetch_reference_data(x, y);
    /* Decode (reconstruct) the macroblock */
    decode_mb(x, y);
    /* Write the decoded macroblock back to external memory via DMA */
    write_mb(x, y);
    /* Notify the main core that decoding has finished */
    notify_master(x, y);
}
```
Ring-Line (RL) structure

Each core (SPE) in the Ring-Line scheme is responsible for decoding entire rows of macroblocks, as shown below.

Suppose there are three cores in total. In the RL scheme each core is responsible for fixed rows: core 0 handles the first row, core 1 the second, core 2 the third, then core 0 the fourth, and so on, round and round.

One benefit of this implementation is that it eliminates the communication overhead needed for synchronization between macroblocks of the same row. Because all macroblocks within a row are handled by one core in left-to-right order, the horizontal macroblock dependency is implicitly satisfied; it is drawn as dashed arrows in the diagram. The solid arrows are the dependencies between rows: since macroblocks of adjacent rows are decoded on different cores, these require communication between the cores to synchronize.
Each core's responsibility is simply to decode the rows of macroblocks assigned to it. Before decoding each macroblock it must wait for the notification that the corresponding macroblock in the row above has been decoded, and after decoding each macroblock it notifies the core handling the row below.
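Schematically, each core's work on one of its rows looks like the sketch below; wait_for_above() and notify_below() are hypothetical stand-ins for the Cell's inter-core signaling (mailboxes or signal-notification registers), and boundary handling at the ends of rows is omitted:

```c
/* Decode row x on this core. The left-neighbor dependency is satisfied
   implicitly by the left-to-right order; only the row above must be waited
   on, and waiting on MB(x-1, y+1) covers all of the upper dependencies. */
for (int y = 0; y < width; y++) {
    if (x > 0)
        wait_for_above(x - 1, y + 1);
    decode_mb(x, y);
    if (x < height - 1)
        notify_below(x, y);   /* releases MB(x+1, y-1) on the next core */
}
```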
Data transfer of RL

The Ring-Line scheme faces the same DMA-transfer problem as the Task Pool: since it is implemented on the Cell BE, data communication between external memory and the local store is unavoidable. The transfer process is as follows:
- After core A decodes a macroblock, it notifies core B that a macroblock can be decoded. That is, the data owned by core A that core B needs for decoding is transferred from core A to core B via DMA, namely the bottom four 4x4 blocks of the macroblock above the current one, which are used for intra macroblock reconstruction, for obtaining the inter macroblock's MVP, and for deblocking.
- Because a whole row of macroblocks is decoded within a single core, the left-neighbor data required for decoding a macroblock is already in the same core's LS and needs no DMA transfer.
- Fetch the information of the macroblock to be decoded, i.e., the macroblock data produced by entropy decoding.
- For an inter macroblock, fetch the corresponding macroblock data from the reference picture. Locating it in the reference picture requires the current macroblock's reference picture and motion vector, so step 3 must be completed before this step can start.
- After core B finishes decoding a macroblock, it must notify core C to start decoding, i.e., transfer the data core C needs over to it.
- Write the decoded macroblock back. When deblocking is enabled, filtering the right edge of the current macroblock depends on the macroblock to its right, so the current macroblock can only be filtered and written back to external memory after its right neighbor has been decoded.
Data transfer optimization of RL

From the process above, decoding each macroblock requires quite a few DMA operations. DMA transfers, however, are not as fast as local memory reads and writes, so their cost is not negligible: the SPE core would stall on DMA operations for long stretches. We therefore need to reduce, or rather hide, the overhead of the DMA transfers.
Based on the description above, decoding a macroblock goes through the following steps.

Macroblock decoding has to wait for three DMA transfers to complete. If decoding follows this process as-is, the SPU (the compute part of the SPE) sits completely idle during the DMA transfers. That is where we can optimize decoding efficiency: by modifying the process so that macroblock decoding runs in parallel with DMA transfers, the time spent waiting for DMA completion is hidden. A few things to keep in mind in the implementation:
- Before the current macroblock can start decoding, the data it needs must already be in place, so decoding the current macroblock cannot run in parallel with the DMA transfers fetching the current macroblock's own data; it can only run in parallel with DMA transfers fetching another macroblock's data. The Ring-Line scheme can do this because each core is responsible for a whole row, so the macroblock decoded before the current one is always its left neighbor: the decoding order is fixed and upcoming macroblocks can be prefetched. The Task Pool cannot adopt this technique because its macroblock decoding order is not fixed.
- As mentioned earlier, only after the entropy-decoded information of a macroblock (its work unit) has been fetched can the reference-frame data be fetched based on that information.
- Also as mentioned earlier, when deblocking is enabled, the right edge of the previous macroblock can only be filtered, and the previous macroblock only written back, after the current macroblock has finished decoding.
Based on this analysis, we arrive at the following four-stage pipeline.

In each iteration, an SPE performs the following actions:
- Start the DMA fetching the work unit of MB(x+2); wait for the DMA fetching the work unit of MB(x+1) to complete.
- Start the DMA fetching the reference data of MB(x+1); wait for the DMA fetching the reference data of MB(x) to complete.
- Decode MB(x).
- After MB(x) is decoded, filter the boundary between MB(x) and MB(x-1), then write MB(x-1) back.
With this pipeline optimization, DMA transfers proceed while macroblocks are being decoded, which largely resolves the stalls caused by waiting on DMA.
SPE Pseudo-code:
DMA transfers actually have size and alignment requirements. On top of the analysis above, the following code satisfies those requirements by making the intra_data and write-back DMA transfers move not one macroblock but two adjacent macroblocks at a time.
```c
/* Note: here x is the vertical (row) index and y the horizontal (column) index */
for (each frame) {
    /* Core spe_id decodes rows spe_id, spe_id + nspes, ... (round robin) */
    for (x = spe_id; x < height; x += nspes) {
        for (y = 0; y < width; y++) {
            /* Stage 1: start fetching the work unit of MB(x, y+2);
               wait for the work unit of MB(x, y+1) to arrive */
            start_fetch_work_unit(x, y + 2);
            wait_for_work_unit(x, y + 1);
            /* Stage 2: start fetching the reference data of MB(x, y+1);
               wait for the reference data of MB(x, y) to arrive */
            start_fetch_reference_data(x, y + 1);
            wait_for_reference_data(x, y);
            /* Wait for the intra data sent by the core handling row x-1;
               to satisfy the DMA size/alignment requirements, intra_data
               is transferred two adjacent macroblocks at a time */
            wait_for_intra_data(x - 1, y + 1);
            /* Stage 3: decode (reconstruct) MB(x, y) */
            decode_mb(x, y);
            /* Send the intra data needed by row x+1 on to the next core */
            send_intra_data(x, y);
            /* Stage 4: filter the boundary between MB(x, y) and MB(x, y-1),
               then write back; the write-back likewise covers two adjacent
               macroblocks at a time */
            deblock_and_write_back(x, y - 1);
        }
    }
}
```

Performance analysis of TP and RL under ideal conditions

"Ideal conditions" here means that communication and synchronization overhead is not taken into account (differences in per-macroblock decoding time still are).
- In the TP implementation, two conditions must hold for a core to begin decoding a macroblock: the core is idle, waiting to perform its next decoding task, and some macroblock is ready to be decoded.
- In the RL implementation, three conditions must hold: the core is idle, waiting to perform its next decoding task, a macroblock is ready to be decoded, and that ready macroblock belongs to a row assigned to the idle core.
Since RL imposes one more condition than TP, a TP core is theoretically more likely to have a decoding task to perform than an RL core, which means TP should achieve shorter decoding times and higher performance than RL. The following example illustrates this conclusion.
(Analyze this together with the dependency DAG.) Above are the timing diagrams of TP and RL decoding, respectively; the example in the figure is decoded with two cores. In the TP scheme, after MB(1,1) has been decoded, MB(0,3) cannot start yet, because MB(0,2), which it depends on, has not been decoded; but MB(2,0) is already in the ready state, so the idle core can start decoding MB(2,0). In the RL scheme, because each core is only responsible for fixed rows, more time is wasted waiting than in TP.
Looking at the timing diagrams as a whole, TP's scheduling makes macroblock decoding flow more smoothly and keeps the core load better balanced. In theory, then, TP outperforms RL when the overhead of communication and synchronization is not considered.
Performance analysis of TP and RL in practice
The "actual situation" here refers to the cell be schema described in this article. In the actual implementation, the cost of communication and synchronization must be calculated, Cell be because the multi-core non-shared memory system, in the implementation of the need for a large number of DMA for data transmission, so synchronization and communication costs are not negligible.
As discussed in the RL scheme above, the DMA transport overhead in the cell be system is described and optimized for the RL scheme, but the TP scheme cannot be optimized because the macro block decoding order is not fixed. In the book, we give the average processing time graph of each macro block when TP and RL are actually decoding BlueSky.
It can be seen that the decoding of the macro block of the TP scheme takes much more time than RL, and as the core increases, TP takes longer to grow. In the TP scheme, the cost of DMA transmission and synchronization accounted for a considerable proportion, and the RL scheme was optimized, so it accounted for a small proportion.
For a more detailed analysis of the H.264 parallel decoding algorithm, refer to: Ben Juurlink, Mauricio Alvarez-Mesa, Chi Ching Chi, Arnaldo Azevedo, Cor Meenderinck, Alex Ramirez, *Scalable Parallel Programming Applied to H.264/AVC Decoding*.