[Z] Bank conflicts in CUDA

Source: Internet
Author: User

In fact, I had not heard of bank conflicts until the past two days, when I had to look into the problem of optimizing matrix transpose. Bank conflicts kept coming up and there was no way around them, so in order to understand them I had to go searching for information. To tell the truth, I still do not fully understand them, but I would say things have become a bit clearer. Here I have sorted out what I found online and put it together; I will fix the mistakes once I fully understand it one day.

Each SM of Tesla has 16 KB of shared memory, which is used for communication between threads in the same thread block. To allow concurrent access by the threads of a half-warp within one kernel cycle, the shared memory is organized into 16 banks, each with a 32-bit width. Each bank can therefore store 256 integers or single-precision floating-point numbers; equivalently, shared memory can be viewed as a matrix of 256 rows and 16 columns (one column per bank). If several threads in a half-warp access data belonging to the same bank, a bank conflict is generated and memory access efficiency drops; in the most serious case, access becomes even slower than global memory. However, if the threads of a half-warp access the same address, a broadcast is generated and the speed does not decrease. In the absence of bank conflicts, access to shared memory is as fast as register access. The shared memory of different blocks is independent. ------ Fengchen's CUDA getting started tutorial

So each bank has 1 KB of storage (16 KB divided across 16 banks).
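As a minimal sketch of that mapping (assuming the 16-bank, 4-byte-word layout described above; bankOf is just an illustrative helper, not a CUDA API):

// Illustrative only: which bank the i-th 32-bit word of shared memory falls into,
// assuming 16 banks of 4-byte words as described above (newer GPUs use 32 banks).
__host__ __device__ int bankOf(int wordIndex)
{
    return wordIndex % 16;   // word i lives in bank i % 16
}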

Shared memory is divided into banks in units of 4 bytes. Therefore, suppose we have the following data:
__shared__ int data[128];
Then data[0] belongs to bank 0, data[1] to bank 1, data[2] to bank 2, ..., data[15] to bank 15, and data[16] wraps around to bank 0 again. Since a warp executes in half-warp units, threads belonging to different half-warps do not cause bank conflicts with each other.
Therefore, if the program accesses shared memory in the following way:
int number = data[base + tid];
then there is no bank conflict and the highest efficiency is achieved. However, if it is accessed like this:
int number = data[base + 4 * tid];
then thread 0 and thread 4 access the same bank, thread 1 and thread 5 likewise, and so on, which causes a bank conflict. In this example, the 16 threads of a half-warp fall into groups of four threads that access the same bank, so the speed of accessing shared memory drops to 1/4.
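As a minimal kernel sketch contrasting the two patterns (the kernel name and the modulo guard on the indices are my own illustrative assumptions, not from the original text; the modulo does not change the bank pattern because 128 is a multiple of 16):

// Illustrative sketch: stride-1 access is conflict-free; stride-4 access makes
// threads 0/4/8/12 of a half-warp hit the same bank (a 4-way conflict).
__global__ void strideDemo(const int *in, int *out, int base)
{
    __shared__ int data[128];
    int tid = threadIdx.x;

    data[tid] = in[tid];                      // stride-1 store: no bank conflict
    __syncthreads();

    int a = data[(base + tid) % 128];         // stride-1 load: no bank conflict
    int b = data[(base + 4 * tid) % 128];     // stride-4 load: 4-way bank conflict
    out[tid] = a + b;
}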
An important exception: when multiple threads access the same shared memory address, the hardware broadcasts the 32-bit word at that address to all of the reading threads, so no bank conflict occurs. For example:
int number = data[3];
This does not cause a bank conflict, because all threads read from the same address.
In many cases, bank conflicts in shared memory can be resolved by changing the way the data is laid out. For example, the following program:
data[tid] = global_data[tid];
...
int number = data[16 * tid];

may cause a serious bank conflict (16 * tid is always a multiple of 16, so every thread of a half-warp hits bank 0). To avoid this problem, you can slightly modify the data layout and the access pattern:
int row = tid / 16;
int column = tid % 16;
data[row * 17 + column] = global_data[tid];
...
int number = data[17 * tid];
This does not cause a bank conflict, because 17 * tid modulo 16 equals tid modulo 16, so each thread of a half-warp lands in a different bank (the array just has to be declared slightly larger to hold the padding).
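Since the matrix transpose mentioned at the beginning is the classic case for this padding trick, here is a minimal transpose-tile sketch of the same idea (the 16x16 tile size and kernel name are my assumptions; the kernel would be launched with a (TILE, TILE) thread block and a grid covering the matrix):

#define TILE 16

// Padding the tile to TILE + 1 columns shifts each row by one bank, so the
// column-wise reads during the transposed write-back are conflict-free.
__global__ void transposePadded(float *odata, const float *idata, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];    // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;      // transposed block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
}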

In short, data in shared memory is distributed across banks: the i-th 32-bit element is stored in bank i % 16. When a block accesses shared memory, as long as each group of 16 adjacent threads accesses the banks in a one-to-one correspondence, no bank conflict is generated. Otherwise a bank conflict occurs and the access time multiplies; the factor is determined by the maximum number of threads simultaneously accessing one bank. The one extreme exception is when all 16 threads read the same address at the same time: only one access cycle is needed, because a broadcast is produced.

Below are some tips for avoiding bank conflicts or improving the access speed of global memory:

1. Perform row-wise operations as much as possible; if column-wise operations are needed, transpose the matrix first.

2. When partitioning a problem into sub-problems, make the width of the data handled by each block an integer multiple of 16, so that shared memory can be loaded in the form s_data[tid] = i_data[tid].

3. Use aligned data types, such as the int2 and float4 vector types defined by CUDA, which already carry alignment specifiers.

4. When the matrix width to be processed is not an integer multiple of 16, pad it up to an integer multiple of 16, or use cudaMallocPitch() instead of cudaMalloc() (see the pitched-allocation sketch after this list).

5. Use broadcasts. For example, s_odata[tid] = (tid % 16 < 8) ? s_idata[tid] : s_idata[15]; generates an eight-way bank conflict, because the upper eight threads of the half-warp all read s_idata[15]. Instead, split it into two conflict-free statements: s_odata[tid] = s_idata[15]; (a pure broadcast), followed by s_odata[tid] = (tid % 16 < 8) ? s_idata[tid] : s_odata[tid];.
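As a minimal sketch of the pitched allocation mentioned in tip 4 (the width, height, and variable names are illustrative assumptions; cudaMallocPitch returns the padded row size through its pitch argument):

// Illustrative only: let the driver pad each row instead of padding the width
// to a multiple of 16 by hand.
size_t width = 100, height = 64;             // width is not a multiple of 16
float *d_matrix = NULL;
size_t pitchBytes = 0;                       // actual (padded) row size in bytes
cudaMallocPitch((void **)&d_matrix, &pitchBytes, width * sizeof(float), height);

// Element (row, col) is then addressed through the pitch:
// float *rowPtr = (float *)((char *)d_matrix + row * pitchBytes);
// float value = rowPtr[col];

cudaFree(d_matrix);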
