CUDA Performance Tuning (1): Coalesced Access & Loop Unrolling

Source: Internet
Author: User

1 Coalesced Access

The most favorable access pattern occurs when all threads in a warp execute the same instruction and access contiguous locations in global memory. The hardware detects that these threads of the same warp are accessing contiguous global-memory locations and combines their requests into a single coalesced access.

Coalesced access raises the bandwidth utilization of DRAM, allowing data transfers to approach the peak global memory bandwidth.

Some background first: the linear mapping of two- and three-dimensional arrays, and how multidimensional thread indices map onto a linear order.

Two-dimensional threads are arranged in linear order (figure omitted): the matrix is equivalently linearized into a one-dimensional array in memory, stored in row-major order.

Examples are as follows:

(1) Two access patterns in GPU matrix multiplication without shared memory


Coalesced access pattern

In the first iteration, each thread accesses its element 0; these elements are adjacent in global memory, so the hardware coalesces the accesses (figure omitted):



Non-coalesced access pattern


From this example, the following conclusion is drawn:

For global memory access, a kernel in which each thread iterates along a row of a row-major matrix is far less efficient than one in which each thread traverses a column, because only in the latter do the threads of a warp touch adjacent addresses at each step.
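A minimal sketch of the two patterns (kernel names and the reduction-style bodies are my own simplification, chosen so the access pattern is easy to see in isolation):

```cuda
// Coalesced: thread t traverses column t. In iteration i, threads
// t, t+1, ... read B[i*width + t], B[i*width + t + 1], ... --
// adjacent addresses, so the warp's loads merge into few transactions.
__global__ void sumColumns(const float *B, float *out, int height, int width) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;
    float s = 0.0f;
    for (int i = 0; i < height; ++i)
        s += B[i * width + col];   // stride 1 across the warp
    out[col] = s;
}

// Non-coalesced: thread t traverses row t. In iteration i, threads
// read A[t*width + i], A[(t+1)*width + i], ... -- addresses `width`
// floats apart, so each lands in a separate memory transaction.
__global__ void sumRows(const float *A, float *out, int height, int width) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    float s = 0.0f;
    for (int i = 0; i < width; ++i)
        s += A[row * width + i];   // stride `width` across the warp
    out[row] = s;
}
```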

(2) In matrix multiplication using shared memory, the data is loaded in a coalesced manner

Loading in coalesced mode:

ds_A[ty][tx] = A[row*n + t*TILE_WIDTH + tx];
ds_B[ty][tx] = B[(t*TILE_WIDTH + ty)*k + col];


Each thread loads one element of Md (the left matrix) and one element of Nd. The loop index t identifies the current tile. Within each load, only threadIdx.x varies across a warp, and it enters the global index with stride 1, so each tile row of TILE_WIDTH elements is fetched as a single coalesced access.
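For context, here is a sketch of the surrounding tiled kernel; the names (TILE_WIDTH, ds_A, ds_B, Cvalue) match the snippets in this article, but the full structure is the classic tiled matrix multiply, not necessarily the article's exact code. For simplicity it assumes the matrix dimensions are multiples of TILE_WIDTH.

```cuda
#define TILE_WIDTH 16

// C (m x k) = A (m x n) * B (n x k), all stored row-major.
__global__ void matMulTiled(const float *A, const float *B, float *C,
                            int m, int n, int k) {
    __shared__ float ds_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_B[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;
    float Cvalue = 0.0f;

    for (int t = 0; t < n / TILE_WIDTH; ++t) {
        // Both loads are coalesced: across a warp only tx varies,
        // and tx appears with stride 1 in each global index.
        ds_A[ty][tx] = A[row * n + t * TILE_WIDTH + tx];
        ds_B[ty][tx] = B[(t * TILE_WIDTH + ty) * k + col];
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            Cvalue += ds_A[ty][i] * ds_B[i][tx];
        __syncthreads();
    }
    C[row * k + col] = Cvalue;
}
```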

2 Instruction Mix

On current devices, the instruction-processing bandwidth of each SM is limited, and every instruction consumes part of it: floating-point arithmetic, loads, and branches alike. Eliminating redundant instructions reduces the pressure on this bandwidth and improves the overall performance of the kernel.

The following two lines of code serve as an example:

  for (int i = 0; i < TILE_WIDTH; ++i)
      Cvalue += ds_A[ty][i] * ds_B[i][tx];

These two lines comprise several kinds of instructions:

(1) The loop introduces an extra instruction to update the counter i: 1 per iteration

(2) A conditional branch executes at the end of each iteration: 1 per iteration

(3) Using i to compute the ds_A and ds_B indices introduces address-arithmetic instructions: 2 per iteration

(4) Floating-point multiply and add instructions: 2 per iteration

The floating-point multiply and add instructions thus account for only 1/3 of the total (2 of 6 per iteration). Because instruction-processing bandwidth is limited, this instruction mix caps achievable performance at 1/3 of the peak floating-point throughput.

To improve this instruction mix, loop unrolling is applied.

Modify the code above to read as follows:

Cvalue += ds_A[ty][0]  * ds_B[0][tx]  + ds_A[ty][1]  * ds_B[1][tx]
        + ds_A[ty][2]  * ds_B[2][tx]  + ds_A[ty][3]  * ds_B[3][tx]
        + ds_A[ty][4]  * ds_B[4][tx]  + ds_A[ty][5]  * ds_B[5][tx]
        + ds_A[ty][6]  * ds_B[6][tx]  + ds_A[ty][7]  * ds_B[7][tx]
        + ds_A[ty][8]  * ds_B[8][tx]  + ds_A[ty][9]  * ds_B[9][tx]
        + ds_A[ty][10] * ds_B[10][tx] + ds_A[ty][11] * ds_B[11][tx]
        + ds_A[ty][12] * ds_B[12][tx] + ds_A[ty][13] * ds_B[13][tx]
        + ds_A[ty][14] * ds_B[14][tx] + ds_A[ty][15] * ds_B[15][tx];

Code Analysis:

A single long multiply-add expression

Branch instructions and loop-counter updates are eliminated

The indices are constants, so the compiler can fold them into the offset field of the load instructions' addressing mode, eliminating the address-arithmetic instructions.

As a result, this very long expression executes at close to peak performance.
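Writing out all 16 terms by hand is error-prone; nvcc also accepts the `#pragma unroll` directive, which asks the compiler to perform the same transformation. A sketch, reusing the names from the snippets above (the trip count must be a compile-time constant for full unrolling):

```cuda
// Request full unrolling from the compiler instead of writing the
// 16 terms by hand; nvcc replicates the body and folds the constant
// indices into the load instructions' addressing offsets.
#pragma unroll
for (int i = 0; i < TILE_WIDTH; ++i)
    Cvalue += ds_A[ty][i] * ds_B[i][tx];
```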




