Implementation of a CABAC Hardware Accelerator in an H.264 Decoder

H.264 defines two entropy coding schemes. One is context-based adaptive variable-length coding (CAVLC), developed from variable-length coding. The other is context-based adaptive binary arithmetic coding (CABAC), developed from arithmetic coding. Compared with CAVLC, CABAC saves about 7% of the bitstream but increases the computational load by about 10%.
When decoding high-definition streams, a software implementation of entropy decoding as complex as CABAC cannot keep up in real time, so a hardware accelerator is needed.

The CABAC decoding algorithm

In the input bitstream of an H.264 decoder, the basic unit of data is the syntax element. The bitstream is a sequence of syntax elements, each consisting of several bits and carrying a specific meaning. The bitstream defined by H.264 organizes syntax elements into a hierarchy that describes, in turn, the sequence, picture, slice, macroblock, and sub-macroblock levels. CABAC is mainly responsible for decoding the syntax elements below the slice level. The overall CABAC decoding process has three steps: initialization, binary arithmetic decoding with renormalization, and de-binarization.

Initialization. This is performed at the start of each slice and covers both the context variables and the decoding engine.

Binary arithmetic decoding and renormalization. This is the core of CABAC decoding. Each invocation decodes 1 bit, and the decoding of every syntax element calls this process. H.264 defines three modes of binary arithmetic decoding: regular decoding, bypass decoding, and terminate decoding. Depending on the syntax element being decoded, one or more of the three modes are used.

De-binarization. CABAC defines four binarization methods: unary, truncated unary, k-th order Exp-Golomb, and fixed-length. A syntax element may use one of these methods or a combination of two. The syntax elements mb_type and sub_mb_type are the exception: their de-binarization does not use these four methods and is done by table lookup.
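As a rough illustration of the de-binarization step, the sketch below turns already-decoded bins back into values for the four binarization methods. It is a behavioural model in Python, not hardware. The function names and the `bins` iterator are assumptions made for this example, and the least-significant-bit-first order used for the fixed-length case is also an assumption that should be checked against the H.264 standard.

```python
from typing import Iterator


def decode_unary(bins: Iterator[int]) -> int:
    """Unary: count 1-bins until the terminating 0-bin."""
    value = 0
    while next(bins) == 1:
        value += 1
    return value


def decode_truncated_unary(bins: Iterator[int], c_max: int) -> int:
    """Truncated unary: like unary, but the value c_max has no terminating 0."""
    value = 0
    while value < c_max and next(bins) == 1:
        value += 1
    return value


def decode_exp_golomb(bins: Iterator[int], k: int) -> int:
    """k-th order Exp-Golomb: each leading 1-bin adds 2**k and increments k,
    then the remaining k bins form a suffix, most significant bit first."""
    value = 0
    while next(bins) == 1:
        value += 1 << k
        k += 1
    while k > 0:
        k -= 1
        value += next(bins) << k
    return value


def decode_fixed_length(bins: Iterator[int], length: int) -> int:
    """Fixed-length: `length` bins, assumed least significant bit first."""
    value = 0
    for i in range(length):
        value |= next(bins) << i
    return value


if __name__ == "__main__":
    print(decode_unary(iter([1, 1, 1, 0])))
    print(decode_truncated_unary(iter([1, 1, 1]), c_max=3))
    print(decode_exp_golomb(iter([1, 1, 0, 0, 0]), k=0))
    print(decode_fixed_length(iter([1, 1, 0]), length=3))
```

A syntax element that combines two methods would, under the same assumptions, call one function for the prefix and a second one for the suffix.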
  
CABAC hardware accelerator architecture design

Hardware/software partitioning of the H.264 decoder

The H.264 decoding process uses a combined software/hardware scheme. The whole decoder consists of a 32-bit CPU, a DSP-style computing unit, and a hardware accelerator. CABAC entropy decoding consists mainly of decision and branch operations, and its data interface and throughput requirements are modest, so it is handled by software working together with a hardware accelerator. The CABAC decoding module described in this article is that hardware accelerator.

Overall structure of the CABAC hardware accelerator

The accelerator is organized in two levels. The top level is cabac_top. The bottom level has seven modules: cabac_center_control_unit, context, neighbor_mb_information, context_init, ac_next_state_lps, ac_next_state_mps, and rangelps.

The cabac_center_control_unit module initializes the context model variables, parses syntax elements, updates the context model variables, and passes the residual data to the IQ & IDCT module. The context module uses a dual-port RAM to store the 459 context model variables, so that one context model variable can be read while another, at a different address, is written. The neighbor_mb_information module is an SRAM that stores macroblock information.
When parsing the syntax elements of the current macroblock, the CABAC decoder must refer to information from the macroblocks above and to the left. This SRAM therefore holds the information for the macroblock row above and for the previous macroblock, and it is updated each time a macroblock has been parsed. The context_init module is an on-chip ROM used to initialize the context variables. The three lookup-table modules ac_next_state_lps, ac_next_state_mps, and rangelps are implemented in combinational logic and serve the table lookups in binary arithmetic decoding.

The goal of the design is for the complete H.264 decoder chip to decode HD pictures (1920 × 1088) in real time. Assuming the chip runs at its target clock frequency and pictures are displayed at 25 fps, the average time available to process one macroblock is 823 clock cycles. Because entropy decoding in H.264 is serial and offers little parallelism, the CABAC hardware accelerator needs to decode 1 bit within three clock cycles.

Assume a video compression ratio of 20 and YUV 4:2:0 sampling. With 8-bit samples, each pixel takes 8 bit × 1.5 = 12 bit. Applying the factor of 1.2 that the estimate uses for CABAC decoding, the data CABAC must parse for one frame is (1920 × 1088 × 12 bit / 20) × 1.2, about 1.43 Mbit (counting 1 Mbit as 2^20 bits). At a chip operating frequency of 166 MHz and three clock cycles per bit, the decoding rate is about 55.3 Mbit/s. In this design CABAC accounts for 90% of that, about 49.8 Mbit/s. The decoding speed is therefore 49.8/1.43, about 34.7 fps: 34.7 frames can be parsed in 1 s, and parsing one 1920 × 1088 frame takes about 28.8 ms.

To reach this goal, the design must optimize the core binary arithmetic decoding. A property of the renormalization algorithm helps here: the number of shift iterations can be determined in advance from the input codirange and codioffset and from the codirangelps read from the table. The binary arithmetic decoding and renormalization steps can therefore be merged and completed within one clock cycle. For reasons of space, only regular decoding, one of the three modes, is used below to describe the hardware for binary arithmetic decoding and renormalization. For bypass decoding and terminate decoding, see the H.264 standard.

Binary arithmetic decoding and renormalization in regular mode mainly involve comparison, subtraction, table lookup, and shift operations. To reduce computational complexity, CABAC in H.264 first defines a 64 × 4 two-dimensional table, rangetablps[64][4], which stores pre-computed multiplication results. The table is indexed by pstateidx and qcodirangeidx.
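The real-time budget estimated earlier in this section can be re-derived with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (1920 × 1088, 25 fps, 166 MHz, three clocks per bit, compression ratio 20, the factor 1.2, and the 90% share). The constant names are assumptions made for this example. In consistent decimal units the frame rate comes out near 33 fps rather than the quoted 34.7 fps, because the quoted figure divides a decimal Mbit/s rate by a 2^20-based Mbit count. At 166 MHz the per-macroblock budget comes out near 814 cycles rather than 823. The frame rate still exceeds 25 fps.

```python
WIDTH, HEIGHT = 1920, 1088   # HD picture size used in the text
FPS_TARGET = 25              # required playback rate
CLOCK_HZ = 166e6             # operating frequency quoted in the text
CYCLES_PER_BIT = 3           # design target: 1 bit per three clock cycles
BITS_PER_PIXEL = 8 * 1.5     # 8-bit samples with YUV 4:2:0 sampling
COMPRESSION_RATIO = 20       # assumed compression ratio from the text
CABAC_FACTOR = 1.2           # factor applied in the text's estimate
CABAC_SHARE = 0.9            # share attributed to CABAC in the text

# A macroblock is 16 x 16 pixels.
mbs_per_frame = (WIDTH // 16) * (HEIGHT // 16)
cycles_per_mb = CLOCK_HZ / (mbs_per_frame * FPS_TARGET)

# Bits CABAC has to parse for one frame, and the rate it can sustain.
bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL / COMPRESSION_RATIO * CABAC_FACTOR
bit_rate = CLOCK_HZ / CYCLES_PER_BIT * CABAC_SHARE

frames_per_second = bit_rate / bits_per_frame
ms_per_frame = 1000 / frames_per_second

print(f"macroblocks per frame : {mbs_per_frame}")
print(f"cycles per macroblock : {cycles_per_mb:.1f}")
print(f"bits per frame        : {bits_per_frame:.0f}")
print(f"frames per second     : {frames_per_second:.1f}")
print(f"time per frame (ms)   : {ms_per_frame:.1f}")
```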
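qcodirangeidx is obtained by quantizing the variable codirange as (codirange >> 6) & 3. The Verilog HDL implementation is as follows:

    assign qcodirangeidx = (codirange >> 6) & 2'b11;

    // Combinational lookup of rangetablps, selected by {pstateidx, qcodirangeidx}.
    // The two entries shown follow the rangeTabLPS table of the H.264 standard.
    always @(pstateidx or qcodirangeidx)
    begin
        case ({pstateidx, qcodirangeidx})
            0: codirangelps = 128;   // pstateidx = 0, qcodirangeidx = 0
            ... ...
            255: codirangelps = 2;   // pstateidx = 63, qcodirangeidx = 3
        endcase
    end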
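To make the table addressing concrete, the following Python sketch reproduces the quantization and the concatenated case selector {pstateidx, qcodirangeidx}. The function name `lut_index` is an assumption made for this example, and the range check assumes that codirange is a renormalized 9-bit value.

```python
def qcodirangeidx(codirange: int) -> int:
    """Quantize the 9-bit range to a 2-bit column index (bits 7..6)."""
    return (codirange >> 6) & 3


def lut_index(pstateidx: int, codirange: int) -> int:
    """Case selector of the lookup: the 8-bit concatenation {pstateidx, qcodirangeidx}."""
    assert 0 <= pstateidx <= 63, "pstateidx is a 6-bit state index"
    assert 0x100 <= codirange <= 0x1FF, "range is assumed to be renormalized"
    return (pstateidx << 2) | qcodirangeidx(codirange)


if __name__ == "__main__":
    for r in (0x100, 0x140, 0x180, 0x1C0):
        print(hex(r), qcodirangeidx(r), lut_index(5, r))
```

Each quarter of the renormalized range selects one of the four table columns, so the 64 × 4 table needs exactly the 256 case entries numbered 0 to 255.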
Once the probability model and the multiplication model have been established, CABAC must keep the following variables during the recursive computation: the lower bound of the current interval, codioffset; the size of the current interval, codirange; the value of the current MPS (most probable symbol), valmps; and the probability state index of the LPS (least probable symbol), pstateidx. transidxlps[pstateidx] and transidxmps[pstateidx] are two tables with 64 entries each, and pstateidx takes values from 0 to 63. Next comes the renormalization check: codirange must be renormalized whenever it is less than 0x0100. In this way the binary arithmetic decoding and renormalization steps can be completed within one clock cycle. The Verilog HDL description is as follows:

    always @(posedge clk or negedge rst)
        if (!rst)
            ......
        else
        begin
            if (codioffset >= codirange - codirangelps)
            begin
                // LPS path: the decoded bit is the opposite of the MPS
                binval <= ~valmps;
                codioffset <= codioffset - (codirange - codirangelps);
                codirange <= codirangelps;
                if (pstateidx == 0)
                    valmps <= 1 - valmps;
                pstateidx <= transidxlps[pstateidx];
            end
            else
            begin
                // MPS path: the interval shrinks to the MPS sub-range
                binval <= valmps;
                codirange <= codirange - codirangelps;
                pstateidx <= transidxmps[pstateidx];
            end
            // Renormalization, written as a loop only to describe the behaviour.
            // Note: this statement cannot be synthesized. The hardware replaces
            // it with a single shift whose amount is determined in advance.
            while (codirange < 9'h100)
            begin
                codirange <= codirange << 1;
                codioffset <= (codioffset << 1) | read_bits(1);
            end
        end
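The claim that decoding and renormalization can be merged rests on the shift amount being known as soon as the new range is known. The Python sketch below is a behavioural model, not the author's RTL. It compares a bit-by-bit renormalization loop with a single shift by the number of leading zeros of the 9-bit range. `BitReader`, `decode_core`, `decode_bin_loop`, and `decode_bin_merged` are hypothetical names introduced for this example, and the context state update (pstateidx, valmps) is left out to keep the focus on the range and offset.

```python
class BitReader:
    """Hypothetical bitstream source that hands out bits MSB-first from a list."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.pos = 0

    def read_bits(self, n: int) -> int:
        value = 0
        for _ in range(n):
            value = (value << 1) | self.bits[self.pos]
            self.pos += 1
        return value


def decode_core(codirange, codioffset, codirangelps, valmps):
    """Interval subdivision shared by both variants (no renormalization yet)."""
    range_mps = codirange - codirangelps
    if codioffset >= range_mps:            # LPS path
        return 1 - valmps, codirangelps, codioffset - range_mps
    return valmps, range_mps, codioffset   # MPS path


def decode_bin_loop(codirange, codioffset, codirangelps, valmps, reader):
    """Reference behaviour: renormalize one bit at a time."""
    binval, codirange, codioffset = decode_core(codirange, codioffset, codirangelps, valmps)
    while codirange < 0x100:
        codirange <<= 1
        codioffset = (codioffset << 1) | reader.read_bits(1)
    return binval, codirange, codioffset


def decode_bin_merged(codirange, codioffset, codirangelps, valmps, reader):
    """Merged form: one shift whose amount follows from the new range alone."""
    binval, codirange, codioffset = decode_core(codirange, codioffset, codirangelps, valmps)
    shift = 9 - codirange.bit_length()     # leading zeros of the 9-bit range
    if shift > 0:
        codirange <<= shift
        codioffset = (codioffset << shift) | reader.read_bits(shift)
    return binval, codirange, codioffset


if __name__ == "__main__":
    stream = [1, 0, 1, 1, 0, 0, 1, 0]
    print(decode_bin_loop(0x1FE, 505, 6, 0, BitReader(stream)))
    print(decode_bin_merged(0x1FE, 505, 6, 0, BitReader(stream)))
```

In hardware the same idea would map to a leading-zero detector feeding a shifter, which is one plausible way to remove the loop that cannot be synthesized.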
CABAC acceleration strategies

State machine design

The state machine for binary arithmetic decoding is the core of this design, and its efficiency directly determines the decoding speed of the CABAC hardware accelerator. While the CABAC module has not been started, the state machine stays in the initial state. When a new slice begins, the decoding engine is initialized. On receiving a decoding request from the CPU, the state machine first enters the pre-decoding state and reads the context model variable, then moves to the binary arithmetic decoding state on the next clock to decode 1 bit. During CABAC decoding, the decoding mode is selected according to the type of the syntax element and the position of the current bit. Decoding 1 bit takes two steps: reading the context model variable, then decoding and updating the context model variable. The design uses a two-stage pipeline: while the current bit is being decoded, the context model variable for the next bit can be read, which speeds up decoding.

Dual-buffer design for bitstream reading

During decoding, a dual-buffer scheme is used to improve transfer efficiency. While the bus writes data into one buffer, the decoder can read and decode data from the other, so transfer and decoding proceed at the same time.

Design results and performance

After the design was completed, it was simulated with the standard test streams provided by JVT. The results show that, on average, 1 bit is decoded every two clock cycles. Synthesis with DC (Design Compiler) using the SMIC 0.18 µm CMOS standard-cell library gives a hardware accelerator area of 38 mm² (excluding the area occupied by off-chip SRAM) and an operating frequency of up to 166 MHz, which meets the target. To show the benefit of the hardware accelerator, the biari_decode_symbol function of JM 7.4, which performs binary arithmetic decoding and renormalization, was taken for comparison. Compiled with Visual C++ 6.0, the function takes 109 assembly instructions, so decoding 1 bit in software needs at least 100 clock cycles. The same steps take at most three clock cycles in this design, which is where the acceleration comes from.
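The dual-buffer bitstream input described in this section can be modelled roughly as a ping-pong buffer. The Python sketch below is only an illustration of the idea. The class name `PingPongBuffer`, its methods, and the block size are assumptions made for this example and do not come from the design itself.

```python
class PingPongBuffer:
    """Hypothetical model: the bus fills one buffer while the decoder drains the other."""

    def __init__(self, size: int):
        self.size = size
        self.bufs = [[], []]   # an empty list marks a buffer that is free
        self.fill = 0          # buffer the bus writes next
        self.drain = 0         # buffer the decoder reads from
        self.pos = 0           # read position inside the drain buffer

    def bus_write(self, block) -> bool:
        """Bus side: load one full block; returns False if both buffers are busy."""
        if self.bufs[self.fill]:
            return False
        assert len(block) == self.size
        self.bufs[self.fill] = list(block)
        self.fill ^= 1
        return True

    def read_bit(self):
        """Decoder side: next bit in stream order, or None if it has to stall."""
        buf = self.bufs[self.drain]
        if not buf:
            return None
        bit = buf[self.pos]
        self.pos += 1
        if self.pos == self.size:
            self.bufs[self.drain] = []   # hand the drained buffer back to the bus
            self.drain ^= 1
            self.pos = 0
        return bit


if __name__ == "__main__":
    pp = PingPongBuffer(size=2)
    pp.bus_write([1, 0])
    pp.bus_write([1, 1])
    print([pp.read_bit() for _ in range(4)])
```

The decoder stalls only when both buffers are empty, and the bus waits only when both are full, which is the overlap of transfer and decoding that the text describes.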
  
Conclusion: by applying a series of optimizations and balancing the decoding speed against the other modules of the decoding system, the design described here achieves fast CABAC entropy decoding. It can decode high-definition streams in real time and is well suited to use in video decoder chips.
