9. CUDA shared memory usage ------ GPU revolution


Preface: I will graduate next year, and in the second half of this year I am planning out the life ahead of me; it may come down to one decision after another. Perhaps because I have a strong sense of crisis, I have always felt that I have not done well enough and still need to accumulate and learn. Perhaps making it, step by step, from a small valley all the way to Hong Kong is something to be satisfied with; but I have always held an ideal in mind: insist on doing one thing and do it steadfastly. Having failed and been lost, I learned to persevere, and after persevering came calm. On the road of life, keep learning all the way; with a grateful heart, helping others is helping yourself, and the road ahead grows wider......

Text: This article continues from Article 8, "CUDA memory usage: global memory, part 2 ------ GPU revolution", which described the alignment problem when accessing global memory: only aligned accesses make global memory reads efficient. This section describes how to use shared memory. First we cover the two ways of declaring shared memory; then we explain shared memory bank conflicts, the question that decides how efficiently shared memory is accessed.

Common usages of shared memory:

1. Use a fixed-size array:

/**************************************************************************/
/* Example */
/**************************************************************************/

__global__ void shared_memory_1(float *result, int num, float *table_1)
{
    __shared__ float sh_data[thread_size];

    int idx = threadIdx.x;
    float ret = 0.0f;

    sh_data[idx] = table_1[idx];
    __syncthreads(); // make sure all of sh_data is written before any thread reads it

    for (int i = 0; i < num; i++)
    {
        ret += sh_data[idx % bank_conflict];
    }

    result[idx] = ret;
}

Here, sh_data is a fixed-size array: its size, thread_size, must be known at compile time.

2. Use a dynamically allocated array:

extern __shared__ char array[];

__global__ void shared_memory_1(float *result, int num, float *table_1, int shared_size)
{
    float *sh_data = (float *)array; // sh_data points to the first address of the dynamically allocated shared memory
    float *sh_data2 = (float *)&sh_data[shared_size]; // shared_size is the size of sh_data, so sh_data2 starts right after it

    int idx = threadIdx.x;
    float ret = 0.0f;

    sh_data[idx] = table_1[idx];
    __syncthreads(); // make sure all of sh_data is written before any thread reads it

    for (int i = 0; i < num; i++)
    {
        ret += sh_data[idx % bank_conflict];
    }

    result[idx] = ret;
}

Here the space is allocated dynamically. The declaration extern __shared__ char array[]; names the first variable in shared memory, so it actually points to the start of the shared memory space. The assignment float *sh_data = (float *)array; then makes sh_data point to array, which is the first address of shared memory.

float *sh_data2 = (float *)&sh_data[shared_size]; makes sh_data2 point shared_size elements past the start of sh_data; that is, sh_data is a dynamically allocated array of shared_size floats, and sh_data2 begins where it ends.
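What the snippet leaves implicit is where this dynamic shared memory comes from: its total size in bytes is passed as the third argument of the kernel launch configuration. Below is a minimal sketch of such a launch, assuming one block of 16 threads, shared_size = 16, and device buffers device_result and device_table_1 allocated as in the full program further down; these values are illustrative, not from the original.

// Reserve room for both sh_data and sh_data2: 2 * shared_size floats.
// The third <<< >>> argument is the dynamic shared memory size in bytes.
int shared_size = 16;
size_t shared_bytes = 2 * shared_size * sizeof(float);
shared_memory_1<<<1, 16, shared_bytes>>>(device_result, 1000, device_table_1, shared_size);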

3. Bank conflicts. We know that each half-warp has 16 threads and that shared memory is divided into 16 banks. The question is how those 16 threads are spread across the 16 banks when they fetch data from shared memory. Like customers withdrawing money: if everyone goes to the same bank, they have to queue up, and that queuing is a bank conflict. The following code (the same kernel as above) can be used to verify the impact of bank conflicts on performance:

/**************************************************************************/
/* Example */
/**************************************************************************/

__global__ void shared_memory_1(float *result, int num, float *table_1)
{
    __shared__ float sh_data[thread_size];

    int idx = threadIdx.x;
    float ret = 0.0f;

    sh_data[idx] = table_1[idx];
    __syncthreads(); // make sure all of sh_data is written before any thread reads it

    for (int i = 0; i < num; i++)
    {
        ret += sh_data[idx % bank_conflict];
    }

    result[idx] = ret;
}

// 1, 2, 3, 4, 5, 6, 7 ... 16
#define bank_conflict 16

bank_conflict can be defined as any value from 1 to 16; modify it and observe the impact of bank conflicts on performance. When bank_conflict is 2, idx % 2 takes only the two values 0 and 1, so the 16 threads all go to bank 0 and bank 1, with eight threads accessing the same bank at the same time; the other values work analogously, and you can test the overall performance of each.
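On this hardware, the 32-bit word at index i of a shared array lives in bank i % 16, so the access pattern for any bank_conflict value can be worked out on paper. The small host-side helper below is hypothetical (not part of the original program) and just prints which word, and therefore which bank, each thread of a half-warp touches:

#include <stdio.h>

// Compute capability 1.x: 16 banks, successive 32-bit words in successive banks.
// Print the shared memory word and bank hit by each thread of a half-warp
// for a given bank_conflict setting.
void print_banks(int bank_conflict)
{
    for (int idx = 0; idx < 16; idx++)
    {
        int word = idx % bank_conflict; // the index the kernel reads
        printf("thread %2d -> sh_data[%2d] -> bank %2d\n", idx, word, word % 16);
    }
}

With bank_conflict = 2 this prints only banks 0 and 1, eight threads each; with bank_conflict = 16, every thread gets a bank of its own.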

Of course, the same-bank case can also be turned to our advantage: when all 16 threads of a half-warp read the very same word of the same bank (for example, with bank_conflict defined as 1, every thread reads sh_data[0]), the hardware broadcasts that word to all 16 threads at once instead of serializing the accesses. Making deliberate use of this shared memory broadcast mechanism is a reasonable optimization.
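A minimal sketch of the broadcast pattern, as a hypothetical kernel assuming a single block of 16 threads: every thread reads the same shared memory word, which the hardware serves with one broadcast rather than 16 serialized bank accesses.

__global__ void broadcast_read(float *result, float *table_1)
{
    __shared__ float sh_data[16];
    int idx = threadIdx.x;

    sh_data[idx] = table_1[idx];
    __syncthreads(); // sh_data[0] must be written before anyone reads it

    result[idx] = sh_data[0]; // same word for every thread: one broadcast, no conflict
}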

The complete code is below; paste it and test it yourself:

/**********************************************************************
* shared_memory_test.cu
* This is an example CUDA program.
* Author: Zhao.Kaiyong (AT) gmail.com
* http://blog.csdn.net/openhero
* http://www.comp.hkbu.edu.hk/~kyzhao/
**********************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

// 1, 2, 3, 4, 5, 6, 7 ... 16
#define bank_conflict 16
#define thread_size 16

/**************************************************************************/
/* Static */
/**************************************************************************/

__global__ void shared_memory_static(float *result, int num, float *table_1)
{
    __shared__ float sh_data[thread_size];

    int idx = threadIdx.x;
    float ret = 0.0f;

    sh_data[idx] = table_1[idx];
    __syncthreads(); // make sure all of sh_data is written before any thread reads it

    for (int i = 0; i < num; i++)
    {
        ret += sh_data[idx % bank_conflict];
    }

    result[idx] = ret;
}

/**************************************************************************/
/* Dynamic */
/**************************************************************************/

extern __shared__ char array[];

__global__ void shared_memory_dynamic(float *result, int num, float *table_1, int shared_size)
{
    float *sh_data = (float *)array; // sh_data points to the first address of the dynamically allocated shared memory
    float *sh_data2 = (float *)&sh_data[shared_size]; // sh_data2 starts right after the shared_size floats of sh_data

    int idx = threadIdx.x;
    float ret = 0.0f;

    sh_data[idx] = table_1[idx];
    __syncthreads(); // make sure all of sh_data is written before any thread reads it

    for (int i = 0; i < num; i++)
    {
        ret += sh_data[idx % bank_conflict];
    }

    result[idx] = ret;
}

/**************************************************************************/
/* Bank conflict */
/**************************************************************************/

__global__ void shared_memory_bankconflict(float *result, int num, float *table_1)
{
    __shared__ float sh_data[thread_size];

    int idx = threadIdx.x;
    float ret = 0.0f;

    sh_data[idx] = table_1[idx];
    __syncthreads(); // make sure all of sh_data is written before any thread reads it

    for (int i = 0; i < num; i++)
    {
        ret += sh_data[idx % bank_conflict];
    }

    result[idx] = ret;
}

/**************************************************************************/
/* HelloCUDA */
/**************************************************************************/

int main(int argc, char *argv[])
{
    if (cutCheckCmdLineFlag(argc, (const char **)argv, "device"))
    {
        cutilDeviceInit(argc, argv);
    }
    else
    {
        int id = cutGetMaxGflopsDeviceId();
        cudaSetDevice(id);
    }

    float *device_result = NULL;
    float host_result[thread_size] = {0};
    CUDA_SAFE_CALL(cudaMalloc((void **)&device_result, sizeof(float) * thread_size));

    float *device_table_1 = NULL;
    float host_table1[thread_size] = {0};
    for (int i = 0; i < thread_size; i++)
    {
        host_table1[i] = rand() % RAND_MAX;
    }
    CUDA_SAFE_CALL(cudaMalloc((void **)&device_table_1, sizeof(float) * thread_size));
    CUDA_SAFE_CALL(cudaMemcpy(device_table_1, host_table1, sizeof(float) * thread_size, cudaMemcpyHostToDevice));

    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    CUT_SAFE_CALL(cutStartTimer(timer));

    shared_memory_static<<<1, thread_size>>>(device_result, 1000, device_table_1);
    // shared_memory_dynamic<<<1, thread_size, 2 * 16 * sizeof(float)>>>(device_result, 1000, device_table_1, 16);
    // shared_memory_bankconflict<<<1, thread_size>>>(device_result, 1000, device_table_1);

    CUT_CHECK_ERROR("Kernel execution failed\n");

    CUDA_SAFE_CALL(cudaMemcpy(host_result, device_result, sizeof(float) * thread_size, cudaMemcpyDeviceToHost));
    CUT_SAFE_CALL(cutStopTimer(timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
    CUT_SAFE_CALL(cutDeleteTimer(timer));

    for (int i = 0; i < thread_size; i++)
    {
        printf("%f ", host_result[i]);
    }

    CUDA_SAFE_CALL(cudaFree(device_result));
    CUDA_SAFE_CALL(cudaFree(device_table_1));

    cutilExit(argc, argv);
    return 0;
}
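To build it, a command line of that era might look like the following; the exact include and library paths for the SDK's cutil are assumptions that depend on where your CUDA SDK is installed:

// hypothetical build line, assuming the CUDA SDK 2.x directory layout
// nvcc shared_memory_test.cu -I$CUDA_SDK/common/inc -L$CUDA_SDK/lib -lcutil -o shared_memory_test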

This is just a simple demo for you to test. The next section will introduce more features of shared memory and dig further into some of its hidden characteristics;

The sections after that will introduce the use of constant memory and texture memory;

These articles have always carried more text than code, and that is deliberate: in learning, the process is what matters, and the hands-on code should be written by yourself. What can truly be passed on are ideas, and what matters most is the exchange of ideas and the spread of knowledge; ideas are the best thing to disseminate. Code and methods are just tools, and skill with a tool comes only from your own practice.
