9. CUDA shared memory use ------ GPU Revolution
Preface: I will graduate next year, and in the second half of the year I will be planning the rest of my life. The past six months have been one decision after another. Perhaps because I have a strong sense of crisis, I have always felt that I have not done well enough and still need to accumulate and learn. Perhaps it is already something to have come, step by step, from the valley all the way to Hong Kong; but I have always held an ideal in mind: insist on doing one thing, and do it steadfastly. Having failed and having been lost, I learned to persevere, and then came to experience calm. On the road of life, keep learning all the way; with a grateful heart, helping others is helping yourself, and the road ahead will only grow wider ......

Text: This article follows Article 8, "CUDA memory use: global 2 ------ GPU Revolution", which described the alignment problem when accessing global memory: only aligned accesses allow global memory to be read efficiently. This section describes how to use shared memory. First we cover the two ways of declaring shared memory; then we explain shared-memory bank conflicts, which determine how efficiently shared memory can be accessed.

Common usage of shared memory:

1. Use a fixed-size array:
/************************************************************************/
/* Example                                                              */
/************************************************************************/
__global__ void shared_memory_1(float *result, int num, float *table_1)
{
    __shared__ float sh_data[THREAD_SIZE];
    int idx = threadIdx.x;
    float ret = 0.0f;
    sh_data[idx] = table_1[idx];
    for (int i = 0; i < num; ++i)
    {
        ret += sh_data[idx % BANK_CONFLICT];
    }
    result[idx] = ret;
}
Here, sh_data is a fixed-size array.

2. Use a dynamically allocated array:
extern __shared__ char array[];
__global__ void shared_memory_1(float *result, int num, float *table_1, int shared_size)
{
    // let sh_data point to the first address of the dynamically allocated shared memory
    float *sh_data = (float *)array;
    // sh_data2 starts shared_size floats later, so sh_data holds shared_size elements
    float *sh_data2 = (float *)&sh_data[shared_size];
    int idx = threadIdx.x;
    float ret = 0.0f;
    sh_data[idx] = table_1[idx];
    for (int i = 0; i < num; ++i)
    {
        ret += sh_data[idx % BANK_CONFLICT];
    }
    result[idx] = ret;
}
Here the space is allocated dynamically. The declaration extern __shared__ char array[]; names the start of the dynamically sized shared memory region (its size is supplied at kernel launch time), so array is in effect the first address of that space. Then float *sh_data = (float *)array; makes sh_data point to that first address, and float *sh_data2 = (float *)&sh_data[shared_size]; places sh_data2 exactly shared_size floats further on; in other words, sh_data behaves as a dynamically allocated array of shared_size floats, with sh_data2 free to use the space after it.
3. Now for bank conflicts. On these devices each half-warp has 16 threads, and shared memory is divided into 16 banks. How should the 16 threads be distributed so that each goes to its own bank to fetch its shared memory? If everyone tries to withdraw money from the same bank window, they have to queue up, and that queue is a bank conflict. The following code can be used to measure the impact of bank conflicts on performance:
/************************************************************************/
/* Example                                                              */
/************************************************************************/
__global__ void shared_memory_1(float *result, int num, float *table_1)
{
    __shared__ float sh_data[THREAD_SIZE];
    int idx = threadIdx.x;
    float ret = 0.0f;
    sh_data[idx] = table_1[idx];
    for (int i = 0; i < num; ++i)
    {
        ret += sh_data[idx % BANK_CONFLICT];
    }
    result[idx] = ret;
}
// 1, 2, 3, 4, 5, 6, 7 ... 16
#define BANK_CONFLICT 16
BANK_CONFLICT here can be set to any value from 1 to 16; modify it to see the effect of bank conflicts on performance. When BANK_CONFLICT is 2, eight threads access the same bank at the same time: idx % 2 takes only the two values 0 and 1, so all 16 threads land on bank 0 and bank 1. The same reasoning applies for the other values; you can test the overall performance yourself.
Of course, a 16-way access to one bank is not always bad. When all 16 threads read the same word of the same bank, the hardware broadcasts that word to all 16 threads at once instead of serializing the accesses. Used deliberately, this shared-memory broadcast mechanism lets us exploit such access patterns at no cost.
You should paste and test the complete code below yourself:
/**********************************************************************
* shared_memory_test.cu
* This is an example CUDA program.
* Author: zhao.kaiyong(AT)gmail.com
* http://blog.csdn.net/openhero
* http://www.comp.hkbu.edu.hk/~kyzhao/
**********************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>
// 1, 2, 3, 4, 5, 6, 7 ... 16
#define BANK_CONFLICT 16
#define THREAD_SIZE 16
/************************************************************************/
/* Static                                                               */
/************************************************************************/
__global__ void shared_memory_static(float *result, int num, float *table_1)
{
    __shared__ float sh_data[THREAD_SIZE];
    int idx = threadIdx.x;
    float ret = 0.0f;
    sh_data[idx] = table_1[idx];
    for (int i = 0; i < num; ++i)
    {
        ret += sh_data[idx % BANK_CONFLICT];
    }
    result[idx] = ret;
}
/************************************************************************/
/* Dynamic                                                              */
/************************************************************************/
extern __shared__ char array[];
__global__ void shared_memory_dynamic(float *result, int num, float *table_1, int shared_size)
{
    // let sh_data point to the first address of the dynamically allocated shared memory
    float *sh_data = (float *)array;
    // sh_data2 starts shared_size floats later, so sh_data holds shared_size elements
    float *sh_data2 = (float *)&sh_data[shared_size];
    int idx = threadIdx.x;
    float ret = 0.0f;
    sh_data[idx] = table_1[idx];
    for (int i = 0; i < num; ++i)
    {
        ret += sh_data[idx % BANK_CONFLICT];
    }
    result[idx] = ret;
}
/************************************************************************/
/* Bank conflict                                                        */
/************************************************************************/
__global__ void shared_memory_bankconflict(float *result, int num, float *table_1)
{
    __shared__ float sh_data[THREAD_SIZE];
    int idx = threadIdx.x;
    float ret = 0.0f;
    sh_data[idx] = table_1[idx];
    for (int i = 0; i < num; ++i)
    {
        ret += sh_data[idx % BANK_CONFLICT];
    }
    result[idx] = ret;
}
/************************************************************************/
/* HelloCUDA                                                            */
/************************************************************************/
int main(int argc, char *argv[])
{
    if (cutCheckCmdLineFlag(argc, (const char **)argv, "device"))
    {
        cutilDeviceInit(argc, argv);
    }
    else
    {
        int id = cutGetMaxGflopsDeviceId();
        cudaSetDevice(id);
    }
    float *device_result = NULL;
    float host_result[THREAD_SIZE] = {0};
    CUDA_SAFE_CALL(cudaMalloc((void **)&device_result, sizeof(float) * THREAD_SIZE));
    float *device_table_1 = NULL;
    float host_table1[THREAD_SIZE] = {0};
    for (int i = 0; i < THREAD_SIZE; ++i)
    {
        host_table1[i] = rand() % RAND_MAX;
    }
    CUDA_SAFE_CALL(cudaMalloc((void **)&device_table_1, sizeof(float) * THREAD_SIZE));
    CUDA_SAFE_CALL(cudaMemcpy(device_table_1, host_table1, sizeof(float) * THREAD_SIZE, cudaMemcpyHostToDevice));
    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));
    shared_memory_static<<<1, THREAD_SIZE>>>(device_result, 1000, device_table_1);
    // shared_memory_dynamic<<<1, THREAD_SIZE, 16 * sizeof(float)>>>(device_result, 1000, device_table_1, 16);
    // shared_memory_bankconflict<<<1, THREAD_SIZE>>>(device_result, 1000, device_table_1);
    CUT_CHECK_ERROR("Kernel execution failed\n");
    CUDA_SAFE_CALL(cudaMemcpy(host_result, device_result, sizeof(float) * THREAD_SIZE, cudaMemcpyDeviceToHost));
    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));
    for (int i = 0; i < THREAD_SIZE; ++i)
    {
        printf("%f ", host_result[i]);
    }
    CUDA_SAFE_CALL(cudaFree(device_result));
    CUDA_SAFE_CALL(cudaFree(device_table_1));
    cutilExit(argc, argv);
}
This is just a simple demo; test it yourself. In the next section we will introduce more features of shared memory and explain some of its hidden behavior; the sections after that will cover the use of constant and texture memory.

My posts have always contained more text than code. In fact, the learning process is what matters most: the practical code should be written by yourself, and the only thing that can really be passed on is the ideas. What matters more is the exchange of ideas and the spread of knowledge; the best thing to spread is ideas. Code and methods are just tools, and skill with the tools only comes from your own practice.