[Zz] Video Decoding Optimization

Source: Internet
Author: User
Tags: prefetch

The following summarizes some experience with video decoding optimization:
1. Implementing MPEG-4 video decoding on an embedded system
There are two feasible approaches:
(1) Adopt FFmpeg (the core of MPlayer is FFmpeg) and use its MPEG-4 decoder.

1) Write optimized implementations of IDCT and VLD -> inline & assembler
2) Group the data by macroblock (MB) according to the ARM9 cache and cache-line size, so that several MBs can be carried through VLD ---> IDCT ---> MC ---> ... in one pass -> coupling (see the sketch after this list)
3) Optimize memory access in the key code segments (MC) -> inline & assembler
4) Do not use FFmpeg's built-in img_convert() for YUV-to-RGB conversion -> inline & assembler
5) Optimize the decoder library for the ARM instruction set -> architecture-level optimization
Configuring FFmpeg with cpu=armv4l would give you better performance.
If you have IPP, enabling it yields a huge improvement.
IPP = Intel Integrated Performance Primitives, Intel's high-performance primitives library (XScale only).
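What the "coupling" in point 2 means in code, as a minimal sketch: instead of running VLD over the whole frame, then IDCT, then MC (three separate passes over memory), each macroblock is carried through all three stages while its coefficients are still cache-resident. All helper names and types here are hypothetical, not FFmpeg's actual API:

    #include <stdint.h>

    typedef struct bitstream bitstream_t;   /* hypothetical bit reader   */
    typedef struct frame     frame_t;       /* hypothetical frame buffer */

    typedef struct {
        int16_t coeffs[6][64];   /* 4 luma + 2 chroma 8x8 blocks */
        int     mv_x, mv_y;      /* motion vector */
    } mb_t;

    extern void vld_macroblock(bitstream_t *bs, mb_t *mb);
    extern void idct_macroblock(mb_t *mb);
    extern void mc_macroblock(frame_t *cur, const frame_t *ref,
                              const mb_t *mb, int mbx, int mby);

    /* Carry each MB through VLD -> IDCT -> MC while its data is still
     * in the cache, instead of running each stage over the whole frame. */
    void decode_frame_coupled(bitstream_t *bs, frame_t *cur, const frame_t *ref,
                              int mb_cols, int mb_rows)
    {
        mb_t mb;   /* ~800 bytes: small enough to stay in an ARM9 D-cache */
        for (int y = 0; y < mb_rows; y++)
            for (int x = 0; x < mb_cols; x++) {
                vld_macroblock(bs, &mb);            /* entropy decode one MB */
                idct_macroblock(&mb);               /* inverse transform it  */
                mc_macroblock(cur, ref, &mb, x, y); /* compensate + rebuild  */
            }
    }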
 
(2) Use Xvid. FFmpeg contains too many decoders; if all you need is MPEG-4 decoding, why drag in such a complex library? BTW, the XviD version best suited to embedded systems is 0.9.2, because version 1.1.0 contains many features that embedded systems do not need and that are not easy to implement there. Besides, if you want to develop your own encoding algorithms, you cannot always rely on others; the time is better spent on implementation and optimization. So based on the actual situation, I consider Xvid 0.9.2 better than 1.1.0. In fact, on a platform of this class, optimizing the Xvid algorithm to achieve real-time CIF decoding is quite easy; it takes a month at most.

2. Impact of the decoding flow on decoding performance
Video decoding optimization generally involves a large amount of code, and since the source is usually obtained from elsewhere, it is hard to read, let alone optimize. We have recently been optimizing RealVideo, and a few tips emerged:
1) Before reading the code, familiarize yourself with the decoding flow and grasp the key stages of video decoding: entropy decoding, inverse quantization, inverse transform, interpolation, reconstruction, filtering, reference-frame insertion, and so on. With these stages in hand you can quickly partition the code.
2) Analyze the decoding flow to determine the minimum set of buffers required for decoding and the bit width of each buffer.
3) Trace the buffer flow through the code against the known process and check for redundant memory copies; try to reduce the number of buffers. Experience shows that the speedup from cutting buffers far exceeds that from optimizing local algorithms. -> coupling
4) Check whether the program's structural order is reasonable; an unreasonable structure increases buffering. After studying the decoding order for the past two days, I found that interpolating first and then applying the inverse transform is much more efficient than inverse-transforming first and interpolating afterwards. The reason is that values after interpolation are 8 bits wide, whereas values after the inverse transform are usually 9 bits wide, so buffering the interpolated values before reconstruction takes half the space of buffering the inverse-transformed values, and reconstruction then touches far less memory. As far as I know, most highly efficient decoders interpolate first, inverse-transform second, and reconstruct immediately after the transform, which reduces memory traffic, avoids memory-access jitter, and ultimately cuts cache misses (a sketch follows). -> modify decoding-flow coupling
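A minimal sketch of the bit-width argument in tip 4 (function and buffer names are illustrative, not from any particular decoder). The prediction produced by interpolation fits in 8 bits; the residual produced by the inverse transform needs 9 bits and so must live in int16_t. Reconstructing immediately means only 8-bit data ever needs frame-wide buffering: for CIF luma (352x288) that is 101376 bytes instead of 202752.

    #include <stdint.h>

    static inline uint8_t clamp255(int v)
    {
        return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }

    void reconstruct_block(uint8_t *dst,            /* 8-bit output pixels   */
                           const uint8_t *pred,     /* 8-bit, interpolation  */
                           const int16_t *residual) /* 9-bit, inverse xform  */
    {
        for (int i = 0; i < 64; i++)                /* one 8x8 block */
            dst[i] = clamp255(pred[i] + residual[i]);
    }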

3. Impact of the cache mechanism on decoding
First read http://www.hongen.com/pc/diy/know/mantan/cache0.htm.
Write-through and write-back behave differently. Applying different write policies to different memory blocks in different scenarios (if your system lets you configure this) is far more efficient than using a single policy everywhere. Specifically: set memory blocks that are accessed repeatedly to write-back, and set memory that is written once and only used much later to write-through. This greatly improves cache efficiency.
The first point is easy to understand; the second needs some thought. A write-through operation updates the cache and main memory at the same time when the cache holds data for the address; when it does not, the write goes straight to main memory, bypassing the cache. If the data at that address will only be used after a long time, it will certainly not be in the cache by then (it will have been evicted), so writing it directly to main memory loses nothing.
Conversely, with write-back, when the cache holds the address you update the line and set its dirty bit; the line then sits in the cache for a long time before it is used or evicted and flushed to main memory, hogging cache space all the while. When the cache does not hold the address, the situation is even worse: the corresponding main-memory data (a whole cache line) must first be loaded into the cache, then updated and marked dirty, waiting to be written back. The line fill from main memory also occupies the bus, causing considerable overhead. Why must the main-memory data be loaded first? Because the cache writes back to main memory in cache-line units, and the update may not cover a whole line; to keep the data consistent, the line must first be loaded into the cache, updated there, and then flushed back.
In much video decoding, writing out a frame is a one-shot action; the frame is only read back later as a reference. The frame buffer can therefore be set to write-through. The frame may serve as the next reference frame, but a reference frame does not need repeated access, only a single read pass, so bypassing the cache on the write costs no bandwidth. Experiments show this method can improve MPEG-4 SP decoding efficiency by 20-30%. A kernel-side sketch of such a mapping follows.
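How the frame buffer gets such a mapping is entirely platform-specific. As one hedged illustration, on ARM Linux the cache policy is chosen when the driver maps the pages; the sketch below assumes a hypothetical driver, and uses write-combining, which many ARM kernels expose in place of true write-through and which has a similar effect for a write-once buffer (writes bypass the write-back cache):

    #include <linux/fs.h>
    #include <linux/mm.h>

    static unsigned long frame_buf_phys;  /* hypothetical: physical address
                                             of the decoder's frame buffer */

    static int fb_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* mark the pages write-combining instead of cached write-back */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
        return remap_pfn_range(vma, vma->vm_start,
                               frame_buf_phys >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
    }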
A related cache operation is prefetch, which loads data from main memory into the cache while the CPU, instead of waiting, goes on executing the following instructions. If the next instruction is itself a bus operation, however, it must wait until the prefetch completes. So when using this instruction, insert the prefetch as far ahead of the use as possible: if the instructions between the prefetch and the use take more cycles than a cache miss costs, the prefetch and the subsequent instructions run in parallel, the wait disappears, and the cache-miss penalty is effectively cancelled. Of course, if too many instructions are inserted and the cache is small, the prefetched data may be evicted again before it is used, so you have to evaluate the distance yourself (see the sketch below).
-> cache optimization
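A C-level sketch of this trade-off, assuming a 32-byte cache line; the prefetch distance of two lines is an assumption to be tuned, not a rule. __builtin_prefetch is a GCC builtin that emits PLD on cores that have it (ARMv5TE and later) and nothing on those that do not; the filter itself is a toy half-pel average, and src must hold n + 1 valid bytes:

    #include <stdint.h>

    void average_row(uint8_t *dst, const uint8_t *src, int n)
    {
        for (int i = 0; i < n; i++) {
            /* once per cache line, prefetch two lines (64 bytes) ahead so
             * the line fill overlaps the work on the current line */
            if ((i & 31) == 0)
                __builtin_prefetch(src + i + 64);
            dst[i] = (uint8_t)((src[i] + src[i + 1] + 1) >> 1);
        }
    }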

4. Summary
IDCT is the first of the key video decoding steps to consider. It is generally implemented with a fast algorithm such as the Chen-Wang algorithm, and the difference between a C implementation and an assembly one is considerable.
Performing the IDCT on an 8x8 block looks like this:

    for (i = 0; i < 8; i++)
        idct_row(block + 8 * i);
    for (i = 0; i < 8; i++)
        idct_col(block + i);
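Fast IDCT implementations also gain a lot from short-circuiting empty rows, since quantization leaves most AC coefficients zero. A hedged sketch of that wrapper (not MPlayer's exact code; the <<3 scaling matches an IDCT normalized by 8 and is implementation-specific):

    extern void idct_row_full(short *row);   /* the real Chen-Wang butterfly */

    static void idct_row(short *row)
    {
        /* if all AC coefficients are zero, the 1-D IDCT of the row is
         * just the scaled DC value: skip the whole butterfly */
        if (!(row[1] | row[2] | row[3] | row[4] | row[5] | row[6] | row[7])) {
            short dc = (short)(row[0] << 3);
            for (int i = 0; i < 8; i++)
                row[i] = dc;
            return;
        }
        idct_row_full(row);
    }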
Written in assembly, this can reduce memory bandwidth, improve access efficiency, and avoid unnecessary memory reads and writes. MPlayer has put a lot of effort into this; for ARMv4 (the S3C2440 belongs to the armv4l architecture), the code is in the dsputil_arm_s.S file. Unfortunately, it contains the PLD instruction, a cache-prefetch instruction the 2440 does not support: PLD is an enhanced-DSP instruction available only from ARMv5TE on (the E stands for enhanced DSP), so code built for the 2440 must comment this instruction out, or it will not compile. One way to keep C sources building on both architectures is sketched below.
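A sketch of such a guard (the macro name is ours, not MPlayer's); GCC predefines __ARM_ARCH_5TE__ when targeting ARMv5TE, the first architecture with PLD:

    #if defined(__ARM_ARCH_5TE__)
    #  define PREFETCH(addr) __asm__ volatile("pld [%0]" : : "r"(addr))
    #else
    #  define PREFETCH(addr) ((void)0)   /* no PLD on ARMv4: make it a no-op */
    #endif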
The video bitstream goes through variable-length decoding (VLD, variable length decode) to obtain the DCT data. This part of the work is generally accelerated through lookup tables: all the code tables are stored in advance, and the code that reads the bitstream is usually written as macros, achieving the same effect as assembly. A sketch of the table-driven approach follows.
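A hedged sketch of table-driven VLD with a macro-based bit reader. All names are hypothetical, not FFmpeg's bitstream API; the table is built in advance from the spec's code table, and an 8-bit prefix indexes it directly (codes longer than 8 bits would need a second-level table, omitted here):

    #include <stdint.h>

    typedef struct {
        uint32_t       cache;   /* next bits of the stream, MSB-aligned */
        int            avail;   /* number of valid bits in cache        */
        const uint8_t *ptr;     /* read position in the buffer          */
    } bitreader_t;

    extern void refill(bitreader_t *br);          /* hypothetical refill */

    #define SHOW_BITS(br, n) ((br)->cache >> (32 - (n)))
    #define SKIP_BITS(br, n) do { (br)->cache <<= (n); \
                                  (br)->avail -= (n); \
                                  refill(br); } while (0)

    typedef struct { int16_t value; uint8_t len; } vlc_entry_t;
    extern const vlc_entry_t vlc_table[256];      /* precomputed offline */

    static inline int decode_symbol(bitreader_t *br)
    {
        const vlc_entry_t *e = &vlc_table[SHOW_BITS(br, 8)];
        SKIP_BITS(br, e->len);   /* consume only the code's true length */
        return e->value;
    }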
After IDCT come two more key steps: motion compensation and color-space conversion. Motion compensation is likewise accelerated through assembly, and its code also lives in dsputil_arm_s.S. It is worth mentioning that a SIMD instruction set, where available, speeds it up greatly.
Color-space conversion is the most important step after the decoded output. Embedded systems usually use RGB565, which represents the color of one pixel in 16 bits. For an 8x8 block, the YUV (4:2:0 format) samples are laid out as follows:

    YYYYYYYY
    YYYYYYYY
    YYYYYYYY
    YYYYYYYY
    YYYYYYYY
    YYYYYYYY
    YYYYYYYY
    YYYYYYYY
    UUUU
    UUUU
    UUUU
    UUUU
    VVVV
    VVVV
    VVVV
    VVVV
Note that each sample is 8 bits. The pixel value is obtained by evaluating the conversion equations; in practice a lookup table is used to accelerate the computation, with one table each for Y, U, and V. A 320x240 video has 76800 pixels per frame; if each pixel saves 10 cycles in this conversion, the CPU time saved is considerable. A sketch of the table-driven conversion follows.
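A hedged sketch using the common BT.601 fixed-point coefficients; production code often folds the clamps and the RGB565 packing into the tables as well:

    #include <stdint.h>

    /* one table per Y/U/V term of the fixed-point equations,
     * filled once at start-up */
    static int32_t tab_y[256], tab_rv[256], tab_gu[256], tab_gv[256], tab_bu[256];

    void init_yuv_tables(void)
    {
        for (int i = 0; i < 256; i++) {
            tab_y[i]  =  298 * (i - 16);
            tab_rv[i] =  409 * (i - 128);
            tab_gu[i] = -100 * (i - 128);
            tab_gv[i] = -208 * (i - 128);
            tab_bu[i] =  516 * (i - 128);
        }
    }

    static inline int clamp8(int v) { return v < 0 ? 0 : v > 255 ? 255 : v; }

    static inline uint16_t yuv_to_rgb565(uint8_t y, uint8_t u, uint8_t v)
    {
        int luma = tab_y[y];
        int r = clamp8((luma + tab_rv[v] + 128) >> 8);
        int g = clamp8((luma + tab_gu[u] + tab_gv[v] + 128) >> 8);
        int b = clamp8((luma + tab_bu[u] + 128) >> 8);
        return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
    }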
After color-space conversion, the picture is copied into the framebuffer memory, and this copy takes time too. Two points deserve attention. First, some people have implemented writing the converted output directly into the framebuffer, eliminating the final copy; the idea is really good, but it takes some skill to implement. Second, the copy itself can also be accelerated. On ARMv5 and later architectures the path is CPU ---- cache ---- memory, where the bus between cache and memory is 32 bits wide but the bus between CPU and cache is 64 bits wide, so 64-bit accesses can be had for the cost of 32-bit ones. Exploiting this can theoretically double the copy speed. On PCs our applications usually have fast-memory-copy functions implemented with special instructions such as SIMD; on ARMv5 they exploit this bus width instead. On the S3C2440 none of this is available :( it is a v4 architecture. A sketch of a 64-bit copy loop follows.
In general:
(1) Algorithm-level optimization offers little room; FFmpeg/MPlayer already implement it quite well, unless you are writing a new decoder.
(2) At the code level, acceleration comes mainly from inlining (macros, inline functions) and hand-written assembly for the key code. On the ARM platform there is still some potential to tap.
(3) At the hardware level, the CPU architecture determines the instruction set, cache organization, and cache size. For example, whether enhanced-DSP and SIMD instructions are available, whether the cache is configurable, and the cache-line size all affect code-level and algorithm-level optimization.
(4) System-level optimization comes last because it builds on the entire system; it can only be achieved with a deep understanding of the whole system, hardware and software alike.
Throughout the optimization, the essence is to remove redundant computation as much as possible and make maximal use of the system's hardware resources. For a RISC-architecture CPU, the inherent disadvantage is that relatively large memory bandwidth is required (because its instructions operate on registers, operands must first be loaded from memory into registers before computation), so too much of the CPU's time goes into memory reads and writes.
Take the following code as an example; it is from the decoding output stage, converting YUV to RGB:

    0000111c:
    111c: e92d4ff0  stmdb  sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
    1120: e1a0a000  mov    sl, r0
    1124: e5900038  ldr    r0, [r0, #56]
    1128: e1a0c001  mov    ip, r1
    112c: e3500004  cmp    r0, #4          ; 0x4
    1130: e24dd034  sub    sp, sp, #52     ; 0x34
    1134: e1a00002  mov    r0, r2
    1138: e1a01003  mov    r1, r3
    113c: 0a00055d  beq    157c
    1140: e59d2058  ldr    r2, [sp, #88]
    1144: e3520000  cmp    r2, #0          ; 0x0
    1148: d1a00002  movle  r0, r2
    114c: da00055b  ble    1574
    1150: e59d3060  ldr    r3, [sp, #96]
    1154: e58d1030  str    r1, [sp, #48]
    1158: e5933000  ldr    r3, [r3]
    115c: e59f2434  ldr    r2, [pc, #1076] ; 1598 <.text+0x1598>
    1160: e0213193  mla    r1, r3, r1, r3
    1164: e58d3018  str    r3, [sp, #24]
    1168: e59d305c  ldr    r3, [sp, #92]
    116c: e58d1000  str    r1, [sp]
    1170: e5933000  ldr    r3, [r3]
    1174: e79a1002  ldr    r1, [sl, r2]
    1178: e58d301c  str    r3, [sp, #28]
    117c: e5903008  ldr    r3, [r0, #8]
    1180: e59c4008  ldr    r4, [ip, #8]
    1184: e590e000  ldr    lr, [r0]
    1188: e59c2000  ldr    r2, [ip]
    118c: e5900004  ldr    r0, [r0, #4]
    1190: e59cc004  ldr    ip, [ip, #4]
    1194: e58d3014  str    r3, [sp, #20]
    1198: e1a011c1  mov    r1, r1, asr #3  ; h_size
    119c: e3a03000  mov    r3, #0          ; 0x0
    11a0: e58d4010  str    r4, [sp, #16]
    11a4: e58d2004  str    r2, [sp, #4]
    11a8: e58de020  str    lr, [sp, #32]
    11ac: e58d000c  str    r0, [sp, #12]
    11b0: e58dc008  str    ip, [sp, #8]
    11b4: e58d1028  str    r1, [sp, #40]
    11b8: e58d3024  str    r3, [sp, #36]
    11bc: e1a08003  mov    r8, r3
    ...
We can see that there are far too many LDR (load, read from memory) and STR (store, write to memory) instructions, and excessive loads and stores also hurt the efficiency of the cache between the CPU and memory, causing cache thrashing: on a cache miss, the cache controller labors to move content from memory into the cache, only for the entry to be evicted again before it is ever used; with bad luck the cache "thrashes" constantly. In the decoding process, each module fights its own battle and each claims a fairly large share of memory bandwidth. How do we cut this waste? The key code must be adapted to the hardware architecture, and the code handling the same data stream must be coupled together. Modular code gives excellent readability and extensibility, but you cannot have both: code coupled together is obscure. FFmpeg/MPlayer strikes a good trade-off in this respect.
Sections 1, 2, and 3 above are taken from the Internet. To better understand their content, some knowledge of video encoding and decoding is required.

BTW: http://blog.csdn.net/weixianlin/archive/2008/05/01/2358035.aspx
