Apple's OpenCL: a talk about local memory


In OpenCL, variables qualified with __local (or local) are stored in the shared memory of a compute unit. On NVIDIA GPUs, a compute unit (CU) maps to a physical SM (streaming multiprocessor); on AMD/ATI GPUs, it maps to a physical SIMD engine. In either case, each compute unit has a shared memory visible to all threads (called work items in OpenCL) currently running on it. You can therefore use __local shared memory to communicate among, and synchronize, the work items within a single compute unit.
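To make the scope of __local concrete, here is a minimal sketch (a hypothetical kernel, not part of the demo below) in which one work item publishes a value through local memory and the rest of its work group reads it after a barrier. Note that barrier(CLK_LOCAL_MEM_FENCE) synchronizes only the work items of one work group, which is exactly the scope of __local memory:

__kernel void broadcast_in_group(__global unsigned *out)
{
    __local unsigned shared_val;        // one copy per work group
    size_t lid = get_local_id(0);

    if (lid == 0)
        shared_val = 42u;               // a single work item writes local memory

    barrier(CLK_LOCAL_MEM_FENCE);       // all work items of this group wait here

    out[get_global_id(0)] = shared_val; // every work item in the group sees 42
}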

It must be stressed that work items in different compute units can communicate only through global memory, because the compute units share no local memory with one another. The second kernel at the end of this article demonstrates exactly that.


Below I will demonstrate that, in Apple's OpenCL implementation, if there are two work groups (each work group is handed to one compute unit for execution), the two work groups end up on two different compute units. I ran this on a Mac mini whose GPU is a GeForce 9400M, which has only two SMs.
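You can check the number of compute units on your own device. Here is a small hedged sketch using the standard clGetDeviceInfo query, reusing the err and device_id variables from the host code further down:

cl_uint numComputeUnits;
err = clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,
                      sizeof(numComputeUnits), &numComputeUnits, NULL);
if (err == CL_SUCCESS)
    printf("Compute units: %u\n", numComputeUnits);  // expected to print 2 on a GeForce 9400M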

Here is the kernel code:

__kernel void solve_sum(
    __global volatile unsigned buffer[512],
    __global unsigned dest[512]
)
{
    __local volatile int flag = 0;      // NOTE: one copy of this flag per work group!

    size_t gid = get_global_id(0);

    if (0 <= gid && gid < 32)
    {
        // Wait for the other half to raise the flag, then acknowledge it
        while (flag != 1);
        flag = 0;

        buffer[gid] = 0x1UL;
        // write_mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
    else if (32 <= gid && gid < 64)
    {
        // Raise the flag, then wait for the acknowledgement
        flag = 1;

        while (flag != 0);
        unsigned ret = buffer[31 + 32 - gid];
        dest[gid - 32] = ret;
    }
}

The kernel above is launched with this configuration: two work groups of 32 work items each, so the two work groups can land on different SMs. When you execute this code, an endless loop occurs; the program is killed automatically after 2 to 3 seconds, so don't worry. The reason is that the two SMs each hold their own copy of the shared variable flag. Say threads 0 to 31 enter SM0: all threads of SM0 then share SM0's flag, while threads 32 to 63 enter SM1 and share SM1's flag. Any attempt to use this (actually, these two) shared variable for communication between the two SMs obviously cannot succeed. Although only one flag is written in the code, at run time there are two copies.


Here is the host code:

#import <Foundation/Foundation.h>
#include <OpenCL/opencl.h>

static unsigned __attribute__((aligned(16))) buffer[512] = { 0 };   // original data set given to device
static unsigned __attribute__((aligned(16))) dest[512] = { 0 };

int opencl_execution(void)
{
    int err;                        // error code returned from API calls

    size_t local;                   // local domain size for our calculation

    cl_platform_id platform_id;     // added by zenny_chen
    cl_device_id device_id;         // compute device ID
    cl_context context;             // compute context
    cl_command_queue commands;      // compute command queue
    cl_program program;             // compute program
    cl_kernel kernel;               // compute kernel

    cl_mem memOrg, memDst;          // device memory used for the input and output arrays

    // Create a platform
    err = clGetPlatformIDs(1, &platform_id, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to create a platform!\n");
        return EXIT_FAILURE;
    }

    // Connect to a compute device
    err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to create a device group!\n");
        return EXIT_FAILURE;
    }

    // Create a compute context
    context = clCreateContext((cl_context_properties[]){ (cl_context_properties)CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 },
                              1, &device_id, NULL, NULL, &err);
    if (!context)
    {
        printf("Error: Failed to create a compute context!\n");
        return EXIT_FAILURE;
    }

    // Create a command queue
    commands = clCreateCommandQueue(context, device_id, 0, &err);
    if (!commands)
    {
        printf("Error: Failed to create a command queue!\n");
        return EXIT_FAILURE;
    }

    // Fetch the kernel source
    NSString *filepath = [[NSBundle mainBundle] pathForResource:@"kernel" ofType:@"cl"];
    if (filepath == NULL)
    {
        puts("Source not found!");
        return EXIT_FAILURE;
    }

    const char *kernelSource = (const char *)[[NSString stringWithContentsOfFile:filepath encoding:NSUTF8StringEncoding error:nil] UTF8String];

    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1, (const char **)&kernelSource, NULL, &err);
    if (!program)
    {
        printf("Error: Failed to create compute program!\n");
        return EXIT_FAILURE;
    }

    // Build the program executable
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        size_t len;
        char buildLog[2048];        // renamed so it does not shadow the global buffer

        printf("Error: Failed to build program executable!\n");
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, sizeof(buildLog), buildLog, &len);
        printf("%s\n", buildLog);
        exit(1);
    }

    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "solve_sum", &err);
    if (!kernel || err != CL_SUCCESS)
    {
        printf("Error: Failed to create compute kernel!\n");
        exit(1);
    }

    // Create the input and output arrays in device memory for our calculation
    memOrg = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(int) * 512, NULL, NULL);
    memDst = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int) * 512, NULL, NULL);

    if (memOrg == NULL || memDst == NULL)
    {
        printf("Error: Failed to allocate device memory!\n");
        exit(1);
    }

    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(commands, memOrg, CL_TRUE, 0, sizeof(int) * 512, buffer, 0, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to write to source array!\n");
        exit(1);
    }

    // Set the arguments to our compute kernel
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &memOrg);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memDst);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to set kernel arguments! %d\n", err);
        exit(1);
    }

    // Get the maximum work-group size for executing the kernel on the device
    err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to retrieve kernel work group info! %d\n", err);
        exit(1);
    }
    else
        printf("The number of work items in a work group is: %lu\n", local);

    // Execute the kernel over the entire range of our 1D input data set:
    // 64 global work items, split into work groups of 32
    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, (size_t[]){ 64 }, (size_t[]){ 32 }, 0, NULL, NULL);
    if (err)
    {
        printf("Error: Failed to execute kernel!\n");
        return EXIT_FAILURE;
    }

    // Wait for the command queue to be serviced before reading back results
    clFinish(commands);

    // Read back the results from the device to verify the output
    err = clEnqueueReadBuffer(commands, memDst, CL_TRUE, 0, sizeof(int) * 512, dest, 0, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to read output array! %d\n", err);
        exit(1);
    }

    // Validate our results
    printf("The result is: 0x%.8x\n", dest[0]);

    // Shutdown and cleanup
    clReleaseMemObject(memOrg);
    clReleaseMemObject(memDst);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commands);
    clReleaseContext(context);

    return 0;
}

int main(int argc, const char *argv[])
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    opencl_execution();
    [pool drain];
    return 0;
}

Note the clEnqueueNDRangeKernel call near the end of the host code:

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, (size_t[]){ 64 }, (size_t[]){ 32 }, 0, NULL, NULL);

Here we set the number of global work items to 64 and the work-group size to 32, so the range is naturally split into two work groups. If we change the 32 to 64, there is only one work group; all communication then goes through a single shared variable inside one SM, it works flawlessly, and the program terminates normally.
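For reference, the single-work-group variant differs only in the local size argument:

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL,
                             (size_t[]){ 64 },   // 64 global work items
                             (size_t[]){ 64 },   // one work group of 64
                             0, NULL, NULL);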

If, on the other hand, you want to keep the original two work groups, you must communicate through a global variable:

__kernel void solve_sum(
    __global volatile unsigned buffer[512],
    __global unsigned dest[512]
)
{
    size_t gid = get_global_id(0);

    // The flag now lives in global memory (buffer[256]),
    // which both compute units can see
    if (0 <= gid && gid < 32)
    {
        while (buffer[256] != 1);
        buffer[256] = 0;

        buffer[gid] = 0x1UL;
        // write_mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
    else if (32 <= gid && gid < 64)
    {
        buffer[256] = 1;

        while (buffer[256] != 0);
        unsigned ret = buffer[31 + 32 - gid];
        dest[gid - 32] = ret;
    }
}

One final note: volatile must be added to every variable used for such communication. Otherwise the OpenCL kernel compiler may optimize the second and later accesses to the global variable into reads from a register, so the current thread never observes another thread's update of that variable.
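As an illustration of the pitfall, here is a hedged sketch (hypothetical kernels, not part of the demo) of the same spin-wait with and without volatile:

// May hang: the compiler is allowed to load buffer[256] once
// and spin forever on the cached register value.
__kernel void spin_wait(__global unsigned buffer[512])
{
    while (buffer[256] != 1);
}

// OK: volatile forces a fresh load from global memory on every iteration.
__kernel void spin_wait_ok(__global volatile unsigned buffer[512])
{
    while (buffer[256] != 1);
}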
