I found some NEON-optimized YUV420-to-RGB24 conversion code on the Internet and tested it on a 1 GHz Cortex-A8 CPU with one QCIF (176x144) frame. I also compared it with a popular C implementation found on the Internet. The NEON version turned out to be more than 700 times faster: over 1000 iterations it took 112 ms, while the C version took 88645 ms. The relevant code follows:
Assembly Code
        AREA    |.text|, CODE, READONLY         ; name this block of code
        EXPORT  imgyuv2rgb24_neon

; void imgyuv2rgb24_neon(u8 *pu8rgbbuffer, u8 *pu8srcyuv, l32 l32width, l32 l32height)
imgyuv2rgb24_neon
        ; push {r4, r5, r6, r7, r8, r9, r10, lr}
        stmfd   sp!, {r4-r10, lr}

        add     r4, r2, r2
        add     r4, r4, r2                      ; r4: dststride = 3 * l32width
        mul     r5, r4, r3
        sub     r5, r5, r4
        add     r0, r0, r5                      ; r0: pu8dst = pu8dst + l32dststride * (l32height - 1)
        mul     r5, r2, r3
        add     r6, r1, r5                      ; r6: pu8srcu = pu8srcyuv + l32width * l32height
        add     r7, r6, r5, lsr #2              ; r7: pu8srcv = pu8srcu + ((l32width * l32height) >> 2)

        ; lsr r8, r2, #3
        mov     r8, r2, lsr #3                  ; r8: column loop count (8 pixels per pass); r2: YUV image width
        ; lsr lr, r3, #1
        mov     lr, r3, lsr #1                  ; lr: row loop count (2 rows per pass); r3: YUV image height
        add     r3, r1, r2                      ; r1: pu8src1; r3: pu8src2; r2: l32width
        sub     r5, r0, r4                      ; r5: pu8dst2 = pu8dst - l32dststride

        mov     r9, #16
        vdup.8  d8, r9                          ; d8: 16
        mov     r10, #128
        vdup.8  d9, r10                         ; d9: 128
        mov     r9, #75
        vdup.16 q5, r9                          ; q5: 75
        mov     r10, #102
        vdup.16 q6, r10                         ; q6: 102
        mov     r9, #25
        vdup.16 q7, r9                          ; q7: 25
        mov     r10, #52
        vdup.16 q8, r10                         ; q8: 52
        mov     r9, #129
        vdup.16 q9, r9                          ; q9: 129

loop_row
loop_col
        subs    r8, r8, #1
        vld1.u8 d0, [r1]!                       ; Y, line 1
        vld1.u8 d2, [r3]!                       ; Y, line 2
        vld1.32 {d4[0]}, [r6]!                  ; U
        vld1.32 {d4[1]}, [r7]!                  ; V
        vsubl.u8 q0, d0, d8                     ; Y line 1 - 16
        vsubl.u8 q1, d2, d8                     ; Y line 2 - 16
        vsubl.u8 q2, d4, d9
        vmov    q3, q2
        vzip.s16 q2, q3                         ; q2: U - 128, q3: V - 128 (each sample duplicated)

        ; multiplication part
        vmul.s16 q10, q3, q8
        vmla.s16 q10, q2, q7                    ; q10: sum of the U and V terms of the G component
        vmul.s16 q11, q2, q9                    ; q11: U term needed for B
        vmul.s16 q12, q3, q6                    ; q12: V term needed for R

        ; Y products
        vmul.s16 q0, q0, q5                     ; q0: 8 products for the first row of Y
        vmul.s16 q1, q1, q5                     ; q1: 8 products for the second row of Y

        ; G component of the two rows
        vqsub.s16 q13, q0, q10
        vqsub.s16 q14, q1, q10
        vqrshrun.s16 d27, q13, #6               ; G of the first row
        vqrshrun.s16 d30, q14, #6               ; G of the second row

        ; B component of the two rows
        vqadd.s16 q10, q0, q11
        vqadd.s16 q11, q1, q11
        vqrshrun.s16 d26, q10, #6               ; B of the first row
        vqrshrun.s16 d29, q11, #6               ; B of the second row

        ; R component of the two rows
        vqadd.s16 q11, q0, q12
        vqadd.s16 q12, q1, q12
        vqrshrun.s16 d28, q11, #6               ; R of the first row
        vqrshrun.s16 d31, q12, #6               ; R of the second row

        ; interleave into RGB format and store to the target buffer
        vst3.8  {d26, d27, d28}, [r0]!
        vst3.8  {d29, d30, d31}, [r5]!
        bgt     loop_col

        subs    lr, lr, #1
        sub     r0, r5, r4, lsl #1
        sub     r5, r0, r4
        add     r1, r1, r2
        add     r3, r3, r2
        ; lsr r8, r2, #3
        mov     r8, r2, lsr #3
        bgt     loop_row

        ; pop {r4, r5, r6, r7, r8, r9, r10, lr}
        ldmfd   sp!, {r4-r10, lr}
        bx      lr

        END
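To make the arithmetic in the NEON routine easier to follow, here is a minimal scalar sketch in C of what each lane appears to compute. The constants 75, 129, 25, 52 and 102 are the usual BT.601 video-range coefficients (1.164, 2.018, 0.391, 0.813, 1.596) scaled by 64, which is why the results are shifted right by 6. The helper names (`sat_s16`, `round_shift6_u8`, `yuv_to_bgr_fixed`, `yuv420_to_bgr24_model`) are illustrative and not part of the original code. The sketch assumes a planar layout (Y plane, then U, then V) and an even width and height; the assembly itself additionally needs the width to be a multiple of 8 because it handles 8 pixels per pass.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the signed 16-bit range, as VQADD.S16 / VQSUB.S16 do. */
static int32_t sat_s16(int32_t v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* Rounding shift right by 6, saturated to 0..255, as VQRSHRUN.S16 #6 does. */
static uint8_t round_shift6_u8(int32_t v)
{
    v += 32;                       /* rounding constant: half of 64 */
    if (v < 0) return 0;
    v >>= 6;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Convert one pixel. out[] is written in B, G, R order to match the
 * register order of the VST3.8 stores (d26 = B, d27 = G, d28 = R). */
static void yuv_to_bgr_fixed(uint8_t y, uint8_t u, uint8_t v, uint8_t out[3])
{
    int32_t yy = (y - 16) * 75;    /* q5 = 75 */
    int32_t uu = u - 128;
    int32_t vv = v - 128;

    out[0] = round_shift6_u8(sat_s16(yy + 129 * uu));             /* B: q9 = 129 */
    out[1] = round_shift6_u8(sat_s16(yy - (52 * vv + 25 * uu)));  /* G: q8 = 52, q7 = 25 */
    out[2] = round_shift6_u8(sat_s16(yy + 102 * vv));             /* R: q6 = 102 */
}

/* Whole-frame model. Like the assembly, it writes the first source row
 * to the LAST destination row, so the output is stored bottom-up. */
static void yuv420_to_bgr24_model(uint8_t *dst, const uint8_t *src,
                                  int width, int height)
{
    assert(width % 2 == 0 && height % 2 == 0);

    const uint8_t *y_plane = src;
    const uint8_t *u_plane = src + width * height;
    const uint8_t *v_plane = u_plane + (width * height) / 4;
    const int chroma_width = width / 2;

    for (int row = 0; row < height; ++row) {
        uint8_t *out = dst + (height - 1 - row) * width * 3;
        for (int col = 0; col < width; ++col) {
            /* Each U/V sample is shared by a 2x2 block of Y samples. */
            int c = (row / 2) * chroma_width + col / 2;
            yuv_to_bgr_fixed(y_plane[row * width + col],
                             u_plane[c], v_plane[c], out + 3 * col);
        }
    }
}
```

As a quick check of the model, calling `yuv_to_bgr_fixed(81, 90, 240, px)` with the BT.601 code for pure red leaves `px` holding B = 0, G = 0, R = 255. Note that the bytes land in memory as B, G, R and the rows are stored bottom-up, which matches the layout of a Windows 24-bit bitmap.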
C code
void yuv420p_to_rgb24(unsigned char *yuv420[3], unsigned char *rgb24, int width, int height)
{
    // int begin = GetTickCount();
    int R, G, B, Y, U, V;
    int x, y;
    int nwidth = width >> 1; // chroma plane width
    for (y = 0; y
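The comparison quoted at the start of this post times each converter over many conversions of a single QCIF frame. A minimal harness for that kind of measurement might look like the sketch below. Everything in it is an assumption made for illustration: the `converter_fn` signature, the `time_converter` helper, and the stand-in `counting_convert` are hypothetical names, and the original measurement method is not described in the source. It uses `clock()`, which reports CPU time rather than wall-clock time.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

enum { QCIF_WIDTH = 176, QCIF_HEIGHT = 144 };

/* Hypothetical common signature: destination, planar YUV420 source, width, height. */
typedef void (*converter_fn)(uint8_t *dst, const uint8_t *src, int width, int height);

/* Planar YUV420: a full-size Y plane plus quarter-size U and V planes. */
static size_t yuv420_frame_bytes(int width, int height)
{
    return (size_t)width * height * 3 / 2;
}

/* RGB24: three bytes per pixel. */
static size_t rgb24_frame_bytes(int width, int height)
{
    return (size_t)width * height * 3;
}

/* Run `convert` `iterations` times on one mid-gray frame.
 * Returns the elapsed CPU time in milliseconds, or -1.0 if allocation fails. */
static double time_converter(converter_fn convert, int width, int height, int iterations)
{
    uint8_t *src = malloc(yuv420_frame_bytes(width, height));
    uint8_t *dst = malloc(rgb24_frame_bytes(width, height));
    double ms = -1.0;

    if (src != NULL && dst != NULL) {
        memset(src, 128, yuv420_frame_bytes(width, height));
        clock_t start = clock();
        for (int i = 0; i < iterations; ++i)
            convert(dst, src, width, height);
        ms = (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC;
    }
    free(src);
    free(dst);
    return ms;
}

/* Stand-in converter that only counts how often it is called. */
static int g_calls;

static void counting_convert(uint8_t *dst, const uint8_t *src, int width, int height)
{
    (void)src; (void)width; (void)height;
    dst[0] = 0;
    ++g_calls;
}
```

Calling `time_converter(counting_convert, QCIF_WIDTH, QCIF_HEIGHT, 1000)` exercises the loop; to compare real converters, pass each one in turn, wrapped to match `converter_fn` if its own signature differs.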