Arm command Optimization-coding for neon-Part 3: Matrix Multiplication

Last Update:2018-12-05 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

We have seen how to load and store data with neon,
And how to handle the leftovers resulting from
Vector processing. Let us move on to doing some useful data processing-multiplying matrices.

Matrices

In this post, we will look at how to efficiently multiply four-by-four matrices together, an operation frequently used in the world of 3D
Graphics. We will assume that the matrices are stored in memory in column-Major
Order-this is the format used by OpenGL-es.

Algorithm

We start by examining the matrix mutiply operation in detail, by expanding the calculation, and identifying sub-operations that can be implemented using neon instructions.

Notice that in the divisor, we multiply each column of the First matrix (in red) by a corresponding single value in the second matrix (blue) then add together the results for each element to give a column of results. this operation is repeated for each of
Four columns in the result matrix.

If each column is a vector in a neon register, we can useVector-by-scalar multiplicationInstruction to calculate efficiently each result column. The sub-operation highlighted in the divisor can be implemented
Using this instruction. We must also add the results together for each element of the column, which we do using the accumulating version of the same struction.

As we are operating on the columns of the first matrix, and producing a column of results, reading and writing elements to and from memory is a linear operation, and requires no interleaving load or store instructions.

Codefloating point

First, we will look at an implementation that multiplies single precision Floating Point matrices.

Begin by loading the matrices from memory into neon registers. the matrices we are multiplying use column-major order, so columns of the matrix are stored linearly in memory. A column can be loaded into neon registers usingVLD1,
And written back to memory usingVST1.

    vld1.32 {d16-d19}, [r1]!            @ load first eight elements of matrix 0    vld1.32 {d20-d23}, [r1]!            @ load second eight elements of matrix 0    vld1.32 {d0-d3}, [r2]!              @ load first eight elements of matrix 1    vld1.32 {d4-d7}, [r2]!              @ load second eight elements of matrix 1

As neon has 32 64-bit registers, we can load all of the elements from both input matrices into registers, and still have registers left over for use as accumulators. here, D16 to d23 hold 16 elements from the first matrix, and D0 to D7 hold 16 elements from
The second.

An aside: D and Q registers

Most neon instructions can use the register bank in two ways:

As 32Double-word registers, 64-bits in size, named D0 to d31.
As 16Quad-word registers128-bits in size, named q0 to q15.

These registers are aliased so thatThe data in a Q register is the same as that in its two corresponding D registers. For example, q0 is aliased to D0 and D1, and the same data is accessible through either register type. In C
Terms, this is very similar to a union.

For the floating point matrix multiplication example, we will use Q registers frequently, as we are handling columns of four 32-bit floating point numbers, which fit into a single 128-bit Q register.

Back to the code

We can calculate a column of results using just four neon multiply instructions:

    vmul.f32    q12, q8, d0[0]      @ multiply col element 0 by matrix col 0    vmla.f32    q12, q9, d0[1]      @ multiply-acc col element 1 by matrix col 1    vmla.f32    q12, q10, d1[0] @ multiply-acc col element 2 by matrix col 2    vmla.f32    q12, q11, d1[1] @ multiply-acc col element 3 by matrix col 3

Here, the first instruction implements the operation highlighted in the distri-x0, x1, x2 and X3 (in register Q8) are each multiplied by y0 (element 0 in D0 ), and stored in q12.

Subsequent instructions operate on the other columns of the first matrix, multiplying by corresponding elements of the first column of the Second matrix. results are accumulated into q12 to give the first column of values for the result matrix.

Notice that the scalar used in the multiply instructions refers to d registers; although q0 [3] shocould be the same value as D1 [1], and using it wowould perhaps make more sense here, the GNU receiver er I'm using does not accept that format. I have to specify a scalar
From a d register. Your guest er may be better.

If we only needed to calculate a matrix-by-vector multiplication (another common operation in 3D graphics,) the operation wocould now be complete, and the result vector can be stored to memory. however, to complete the matrix-by-matrix multiplication, we must
Execute three more iterations, using values Y4 to YF in registers Q1 to Q3.

If we create a macro for the instructions above, we can simplify our code significantly.

    .macro mul_col_f32 res_q, col0_d, col1_d    vmul.f32    \res_q, q8, \col0_d[0]      @ multiply col element 0 by matrix col 0    vmla.f32    \res_q, q9, \col0_d[1]      @ multiply-acc col element 1 by matrix col 1    vmla.f32    \res_q, q10, \col1_d[0] @ multiply-acc col element 2 by matrix col 2    vmla.f32    \res_q, q11, \col1_d[1] @ multiply-acc col element 3 by matrix col 3    .endm

The implementation of a four-by-four floating point matrix multiply now looks like this.

    vld1.32 {d16-d19}, [r1]!            @ load first eight elements of matrix 0    vld1.32 {d20-d23}, [r1]!            @ load second eight elements of matrix 0    vld1.32 {d0-d3}, [r2]!              @ load first eight elements of matrix 1    vld1.32 {d4-d7}, [r2]!              @ load second eight elements of matrix 1    mul_col_f32 q12, d0, d1         @ matrix 0 * matrix 1 col 0    mul_col_f32 q13, d2, d3         @ matrix 0 * matrix 1 col 1    mul_col_f32 q14, d4, d5         @ matrix 0 * matrix 1 col 2    mul_col_f32 q15, d6, d7         @ matrix 0 * matrix 1 col 3    vst1.32 {d24-d27}, [r0]!            @ store first eight elements of result    vst1.32 {d28-d31}, [r0]!            @ store second eight elements of result

Fixed Point

Using Fixed Point Arithmetic for calculations is often faster than floating point-it requires less memory bandwidth to read and write values that use fewer bits, and multiplication of integer values is generally quicker than the same operations applied
Floating Point Numbers.

However, when using fixed point arithmetic, you must choose the representation carefully to avoid overflow or saturation, whilst preserving the degree of precision in the results that your application requires.

Implementing a matrix multiply using fixed point values is very similar to floating point. In this example, we will useq1.14
Fixed-Point format, but the operations required are similar for other formats, and may only require a change to the final shift applied to the accumulator. Here is the macro:

    .macro mul_col_s16 res_d, col_d    vmull.s16   q12, d16, \col_d[0] @ multiply col element 0 by matrix col 0    vmlal.s16   q12, d17, \col_d[1] @ multiply-acc col element 1 by matrix col 1    vmlal.s16   q12, d18, \col_d[2] @ multiply-acc col element 2 by matrix col 2    vmlal.s16   q12, d19, \col_d[3] @ multiply-acc col element 3 by matrix col 3    vqrshrn.s32 \res_d, q12, #14            @ shift right and narrow accumulator into                                            @  Q1.14 fixed point format, with saturation    .endm

Comparing it to the macro used in the floating point version, you will see that the major differences are:

Values are now 16-bit rather than 32-bit, so we can use d registers to hold four inputs.
The result of multiplying two 16-bit numbers is a 32-bit number. We useVMULLAndVMLAL,
Because they store their results in Q registers, preserving all of the bits of the result using double-size elements.
The final result must be 16-bits, but the accumulators are 32-bit. We obtain a 16-bit result usingVQRSHRN, A vector, saturating,
Rounding, narrowing shift right. This adds the correct rounding value to each element, shifts it right, and saturates the result to the new, narrower element size.

The function from 32-bits to 16-bits per element also has an effect on the memory accesses; the data can be loaded and stored using fewer struinctions. the code for a fixed point matrix multiply looks like this:

    vld1.16 {d16-d19}, [r1]     @ load sixteen elements of matrix 0    vld1.16 {d0-d3}, [r2]       @ load sixteen elements of matrix 1    mul_col_s16 d4, d0                      @ matrix 0 * matrix 1 col 0    mul_col_s16 d5, d1                      @ matrix 0 * matrix 1 col 1    mul_col_s16 d6, d2                      @ matrix 0 * matrix 1 col 2    mul_col_s16 d7, d3                      @ matrix 0 * matrix 1 col 3    vst1.16 {d4-d7}, [r0]       @ store sixteen elements of result

Scheduling

We will deal with the details of scheduling in a future post, but for now, it is worth seeing the effect of improved instruction scheduling on this code.

In the macro, adjacent multiply instructions write to the same register, so the neon pipeline must wait for each multiply to complete before it can start the next instruction.

If we take the instructions out of the macro and rearrange them, we can separate those that are dependent using other instructions that are not dependent. these instructions can be issued whilst the others complete in the background. in this case, we rearrange
The code to space out accesses to the accumulator registers.

    vmul.f32    q12, q8, d0[0]              @ rslt col0  = (mat0 col0) * (mat1 col0 elt0)    vmul.f32    q13, q8, d2[0]              @ rslt col1  = (mat0 col0) * (mat1 col1 elt0)    vmul.f32    q14, q8, d4[0]              @ rslt col2  = (mat0 col0) * (mat1 col2 elt0)    vmul.f32    q15, q8, d6[0]              @ rslt col3  = (mat0 col0) * (mat1 col3 elt0)    vmla.f32    q12, q9, d0[1]              @ rslt col0 += (mat0 col1) * (mat1 col0 elt1)    vmla.f32    q13, q9, d2[1]              @ rslt col1 += (mat0 col1) * (mat1 col1 elt1)    ...    ...

Using this version, matrix multiply performance almost doubles on a Cortex-A8 based
System.

You can find details for instruction timings and latencies from the Technical
Reference manual for your cortex core. With potential performance improvements like those described above, it is well worth spending some time familiarizing yourself with it.

Source

The code for the two functions described above can be found here:
Matrix_asm_sched.s(4.54 K)
Number of downloads: 1442

From:

Http://blogs.arm.com/software-enablement/241-coding-for-neon-part-3-matrix-multiplication/

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Arm command Optimization-coding for neon-Part 3: Matrix Multiplication

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Arm command Optimization-coding for neon-Part 3: Matrix Multiplication

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support