1 Coalesced Access
The most favorable access pattern occurs when all threads in a warp execute the same instruction and access contiguous locations in global memory. The hardware detects that these threads access contiguous global-memory locations and combines their requests into a single coalesced access.
Coalesced access improves DRAM bandwidth utilization, allowing data transfers to approach the peak global memory bandwidth.
A bit of background: two- and three-dimensional arrays are mapped linearly into memory, and multidimensional thread indices map onto the same linear order.
The two-dimensional threads are arranged in a linear order as shown in the following figure:
In other words, the matrix is linearized into an equivalent one-dimensional array in memory and stored in row-major order.
Examples are as follows:
(1) Two access patterns in GPU matrix multiplication without shared memory
Coalesced access pattern
In the first iteration, each thread accesses element 0; these elements are adjacent in global memory, so the hardware coalesces the accesses:
Non-coalesced access pattern
This example leads to the following conclusion:
for global memory access, a kernel in which each thread iterates through a row is much less efficient than one in which each thread traverses a column.
(2) In matrix multiplication using shared memory, the data is loaded in a coalesced manner
Coalesced loading:
ds_A[ty][tx] = A[row*n + t*tile_width + tx];
ds_B[ty][tx] = B[(t*tile_width + ty)*k + col];
Each thread loads one element of the A tile and one element of the B tile.
Here t identifies the position of the current tile of A (the left matrix). Each row of the tile is loaded by tile_width threads with consecutive threadIdx.x values, so their accesses fall on adjacent elements.
2 Instruction Mix
On current devices, the instruction processing bandwidth of each SM is limited. Every instruction consumes instruction processing bandwidth, including floating-point arithmetic instructions, load instructions, and branch instructions. Eliminating redundant instructions relieves the pressure on instruction processing bandwidth and improves the overall execution performance of the kernel.
The following two lines of code serve as an example:
for (int k = 0; k < tile_width; ++k)
    Cvalue += ds_A[ty][k] * ds_B[k][tx];
Each iteration of this loop executes several kinds of instructions:
(1) the loop introduces 1 extra instruction to update the counter k
(2) 1 conditional branch instruction at the end of each iteration
(3) using k to compute the ds_A and ds_B indices introduces 2 address arithmetic instructions
(4) the floating-point multiply and add account for 2 instructions
The floating-point multiply and add instructions therefore make up only 1/3 of the total. Because instruction processing bandwidth is limited, this instruction mix limits achievable performance to at most 1/3 of peak.
To improve the instruction mix, loop unrolling can be applied.
Modify the code above to read as follows:
Cvalue += ds_A[ty][0] * ds_B[0][tx]  + ds_A[ty][1] * ds_B[1][tx]
        + ds_A[ty][2] * ds_B[2][tx]  + ds_A[ty][3] * ds_B[3][tx]
        + ds_A[ty][4] * ds_B[4][tx]  + ds_A[ty][5] * ds_B[5][tx]
        + ds_A[ty][6] * ds_B[6][tx]  + ds_A[ty][7] * ds_B[7][tx]
        + ds_A[ty][8] * ds_B[8][tx]  + ds_A[ty][9] * ds_B[9][tx]
        + ds_A[ty][10] * ds_B[10][tx] + ds_A[ty][11] * ds_B[11][tx]
        + ds_A[ty][12] * ds_B[12][tx] + ds_A[ty][13] * ds_B[13][tx]
        + ds_A[ty][14] * ds_B[14][tx] + ds_A[ty][15] * ds_B[15][tx];
Code Analysis:
A single long multiply-add expression replaces the loop.
Branch instructions and loop-counter updates are eliminated.
The indices are constants, so the compiler can fold them into the offset field of the load instruction's addressing mode, eliminating the address arithmetic instructions.
As a result, this very long expression executes at close to peak performance.