Efficiency of NEON Instructions for YUV420 to RGB24 Conversion

I found a NEON-optimized YUV420 to RGB24 conversion routine on the Internet and tested it on a Cortex-A8 CPU clocked at 1 GHz, using one frame of QCIF (176x144) data. I also compared it with a popular C implementation found online. The NEON version is more than 700 times faster: over 1000 iterations, the NEON version takes 112 ms and the C version takes 88645 ms, a ratio of about 790. The related code is as follows:

Assembly Code

    AREA    |.text|, CODE, READONLY     ; name this block of code
    EXPORT  imgyuv2rgb24_neon           ; symbol name taken from the prototype below

; void imgyuv2rgb24_neon(u8 *pu8rgbbuffer, u8 *pu8srcyuv, l32 l32width, l32 l32height)
imgyuv2rgb24_neon
    ; push {r4, r5, r6, r7, r8, r9, r10, lr}
    stmfd   sp!, {r4-r10, lr}

    add     r4, r2, r2
    add     r4, r4, r2              ; r4: dststride = 3 * l32width
    mul     r5, r4, r3
    sub     r5, r5, r4
    add     r0, r0, r5              ; r0: pu8dst = pu8dst + l32dststride * (l32height - 1)
    mul     r5, r2, r3
    add     r6, r1, r5              ; r6: pu8srcu = pu8srcyuv + l32width * l32height
    add     r7, r6, r5, lsr #2      ; r7: pu8srcv = pu8srcu + ((l32width * l32height) >> 2)

    ; lsr r8, r2, #3
    mov     r8, r2, lsr #3          ; r8: column loop count (r2: YUV image width)
    ; lsr lr, r3, #1
    mov     lr, r3, lsr #1          ; lr: row loop count (r3: YUV image height)
    add     r3, r1, r2              ; r1: pu8src1, r3: pu8src2, r2: l32width
    sub     r5, r0, r4              ; r5: pu8dst2 = pu8dst - l32dststride

    mov     r9, #16
    vdup.8  d8, r9                  ; d8: 16
    mov     r10, #128
    vdup.8  d9, r10                 ; d9: 128
    mov     r9, #75
    vdup.16 q5, r9                  ; q5: 75
    mov     r10, #102
    vdup.16 q6, r10                 ; q6: 102
    mov     r9, #25
    vdup.16 q7, r9                  ; q7: 25
    mov     r10, #52
    vdup.16 q8, r10                 ; q8: 52
    mov     r9, #129
    vdup.16 q9, r9                  ; q9: 129

loop_row
loop_col
    subs    r8, r8, #1
    vld1.u8 d0, [r1]!               ; 8 Y samples of line 1
    vld1.u8 d2, [r3]!               ; 8 Y samples of line 2
    vld1.32 {d4[0]}, [r6]!          ; 4 U samples
    vld1.32 {d4[1]}, [r7]!          ; 4 V samples
    vsubl.u8 q0, d0, d8             ; Y line 1 - 16
    vsubl.u8 q1, d2, d8             ; Y line 2 - 16
    vsubl.u8 q2, d4, d9             ; U - 128 (low half), V - 128 (high half)
    vmov    q3, q2
    vzip.s16 q2, q3                 ; q2: U - 128, q3: V - 128, each sample duplicated

    ; multiplication part
    vmul.s16 q10, q3, q8
    vmla.s16 q10, q2, q7            ; U and V sum used by the G component
    vmul.s16 q11, q2, q9            ; U product used by the B component
    vmul.s16 q12, q3, q6            ; V product used by the R component

    ; Y products
    vmul.s16 q0, q0, q5             ; q0: 8 products for Y line 1
    vmul.s16 q1, q1, q5             ; q1: 8 products for Y line 2

    ; G component of the two rows
    vqsub.s16 q13, q0, q10
    vqsub.s16 q14, q1, q10
    vqrshrun.s16 d27, q13, #6       ; G of the first row
    vqrshrun.s16 d30, q14, #6       ; G of the second row

    ; B component of the two rows
    vqadd.s16 q10, q0, q11
    vqadd.s16 q11, q1, q11
    vqrshrun.s16 d26, q10, #6       ; B of the first row
    vqrshrun.s16 d29, q11, #6       ; B of the second row

    ; R component of the two rows
    vqadd.s16 q11, q0, q12
    vqadd.s16 q12, q1, q12
    vqrshrun.s16 d28, q11, #6       ; R of the first row
    vqrshrun.s16 d31, q12, #6       ; R of the second row

    ; interleave into RGB format and store to the target buffer
    vst3.8  {d26, d27, d28}, [r0]!
    vst3.8  {d29, d30, d31}, [r5]!
    bgt     loop_col

    subs    lr, lr, #1
    sub     r0, r5, r4, lsl #1
    sub     r5, r0, r4
    add     r1, r1, r2
    add     r3, r3, r2
    ; lsr r8, r2, #3
    mov     r8, r2, lsr #3
    bgt     loop_row

    ; pop {r4, r5, r6, r7, r8, r9, r10, lr}
    ldmfd   sp!, {r4-r10, lr}
    bx      lr

    END
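The arithmetic in the inner loop can be checked against a scalar model. The sketch below mirrors the constants loaded into d8, d9 and q5 to q9 (16, 128, 75, 102, 25, 52, 129) and the rounding shift by 6 performed by vqrshrun. Dividing the multipliers by 64 gives roughly 1.17, 1.59, 0.39, 0.81 and 2.02, which is close to the usual BT.601 video-range coefficients. The function names are hypothetical and are not part of the original code. Note that vst3.8 {d26, d27, d28} writes the bytes in B, G, R order, so the model does the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add the rounding term, shift right by 6 and clamp to 0..255.
 * This models vqrshrun.s16 ..., #6 for one lane. */
static uint8_t round_shift6_sat_u8(int acc)
{
    int v = acc + 32;               /* 32 is half of 1 << 6 */
    if (v < 0)
        return 0;                   /* negative results saturate to 0 */
    v >>= 6;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical helper: converts one pixel with the fixed-point
 * coefficients of the assembly and stores it as B, G, R.
 * The 16-bit saturation of vqadd/vqsub is not modelled; for 8-bit
 * inputs it only triggers on values that clamp to 255 anyway. */
static void yuv_to_bgr_fixed(uint8_t y, uint8_t u, uint8_t v, uint8_t bgr[3])
{
    int yy = 75 * ((int)y - 16);    /* q5 = 75 */
    int uu = (int)u - 128;
    int vv = (int)v - 128;

    assert(bgr != NULL);

    bgr[0] = round_shift6_sat_u8(yy + 129 * uu);             /* B: q9 = 129 */
    bgr[1] = round_shift6_sat_u8(yy - (52 * vv + 25 * uu));  /* G: q8 = 52, q7 = 25 */
    bgr[2] = round_shift6_sat_u8(yy + 102 * vv);             /* R: q6 = 102 */
}
```

Comparing the output of such a model with the NEON routine on a few frames is a simple way to confirm that both paths agree before timing them.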

C Code

void yuv420p_to_rgb24(unsigned char *yuv420[3], unsigned char *rgb24, int width, int height)
{
    // int begin = GetTickCount();
    int R, G, B, y, U, V;
    int X, Y;
    int nwidth = width >> 1;    // color signal (chroma) width
    for (y = 0; y
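The C listing above breaks off at the outer loop, and the rest of its body is not available. As a rough illustration of what a scalar converter with this signature does, here is a minimal sketch. It is not the original function: the name yuv420p_to_rgb24_sketch, the top-down R, G, B output order, and the reuse of the integer coefficients from the assembly are all assumptions. It treats yuv420[0], yuv420[1] and yuv420[2] as the Y, U and V planes, with each chroma sample shared by a 2x2 block of pixels.

```c
#include <assert.h>
#include <stddef.h>

/* Round, shift right by 6 and clamp to 0..255. */
static unsigned char scale_clamp_u8(int acc)
{
    acc += 32;
    if (acc < 0)
        return 0;
    acc >>= 6;
    return (unsigned char)(acc > 255 ? 255 : acc);
}

/* Hypothetical scalar converter (assumed body, see the text above). */
void yuv420p_to_rgb24_sketch(unsigned char *yuv420[3], unsigned char *rgb24,
                             int width, int height)
{
    int x, y;
    int nwidth = width >> 1;    /* chroma plane width */

    /* 4:2:0 subsampling needs even dimensions. */
    assert(width % 2 == 0 && height % 2 == 0);

    for (y = 0; y < height; y++) {
        const unsigned char *yrow = yuv420[0] + (size_t)y * width;
        const unsigned char *urow = yuv420[1] + (size_t)(y >> 1) * nwidth;
        const unsigned char *vrow = yuv420[2] + (size_t)(y >> 1) * nwidth;
        unsigned char *out = rgb24 + (size_t)y * width * 3;

        for (x = 0; x < width; x++) {
            int c = 75 * (yrow[x] - 16);
            int d = urow[x >> 1] - 128;     /* U shared by two columns */
            int e = vrow[x >> 1] - 128;     /* V shared by two columns */

            out[3 * x + 0] = scale_clamp_u8(c + 102 * e);             /* R */
            out[3 * x + 1] = scale_clamp_u8(c - (52 * e + 25 * d));   /* G */
            out[3 * x + 2] = scale_clamp_u8(c + 129 * d);             /* B */
        }
    }
}
```

For a QCIF frame, the Y plane holds 176 x 144 = 25344 bytes, each chroma plane holds 88 x 72 = 6336 bytes, and the RGB24 output buffer needs 76032 bytes.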
