Apple's OpenCL: a talk about local memory


In OpenCL, variables qualified with __local (or local) are stored in the shared memory of a compute unit. On NVIDIA GPUs, a compute unit (CU) maps to a physical SM (streaming multiprocessor); on AMD/ATI GPUs, it maps to a physical SIMD engine. In either case, each compute unit has a shared memory visible to all threads (called work items in OpenCL) currently running on it. You can therefore use __local shared memory to communicate among, and synchronize, the work items within a single compute unit.
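To make the scope of __local concrete, here is a minimal sketch (a hypothetical kernel, not part of the demo below) in which one work item publishes a value through local memory and the rest of its work group reads it after a barrier. Note that barrier(CLK_LOCAL_MEM_FENCE) synchronizes only the work items of one work group, which is exactly the scope of __local memory:

__kernel void broadcast_in_group(__global unsigned *out)
{
    __local unsigned shared_val;        // one copy per work group
    size_t lid = get_local_id(0);

    if (lid == 0)
        shared_val = 42u;               // a single work item writes local memory

    barrier(CLK_LOCAL_MEM_FENCE);       // all work items of this group wait here

    out[get_global_id(0)] = shared_val; // every work item in the group sees 42
}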

It must be stressed that work items in different compute units can communicate only through global memory, because the compute units share no local memory with one another. The second kernel at the end of this article demonstrates exactly that.


Below I will demonstrate that, in Apple's OpenCL implementation, if there are two work groups (each work group is handed to one compute unit for execution), the two work groups end up on two different compute units. I ran this on a Mac mini whose GPU is a GeForce 9400M, which has only two SMs.
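You can check the number of compute units on your own device. Here is a small hedged sketch using the standard clGetDeviceInfo query, reusing the err and device_id variables from the host code further down:

cl_uint numComputeUnits;
err = clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,
                      sizeof(numComputeUnits), &numComputeUnits, NULL);
if (err == CL_SUCCESS)
    printf("Compute units: %u\n", numComputeUnits);  // expected to print 2 on a GeForce 9400M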

Here is the kernel code:

__kernel void solve_sum(
    __global volatile unsigned buffer[512],
    __global unsigned dest[512]
)
{
    __local volatile int flag = 0;      // NOTE: one copy of this flag per work group!

    size_t gid = get_global_id(0);

    if (0 <= gid && gid < 32)
    {
        // Wait for the other half to raise the flag, then acknowledge it
        while (flag != 1);
        flag = 0;

        buffer[gid] = 0x1UL;
        // write_mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
    else if (32 <= gid && gid < 64)
    {
        // Raise the flag, then wait for the acknowledgement
        flag = 1;

        while (flag != 0);
        unsigned ret = buffer[31 + 32 - gid];
        dest[gid - 32] = ret;
    }
}

The kernel above is launched with this configuration: two work groups of 32 work items each, so the two work groups can land on different SMs. When you execute this code, an endless loop occurs; the program is killed automatically after 2 to 3 seconds, so don't worry. The reason is that the two SMs each hold their own copy of the shared variable flag. Say threads 0 to 31 enter SM0: all threads of SM0 then share SM0's flag, while threads 32 to 63 enter SM1 and share SM1's flag. Any attempt to use this (actually, these two) shared variable for communication between the two SMs obviously cannot succeed. Although only one flag is written in the code, at run time there are two copies.


Here is the host code:

#import <Foundation/Foundation.h>
#include <OpenCL/opencl.h>

static unsigned __attribute__((aligned(16))) buffer[512] = { 0 };   // original data set given to device
static unsigned __attribute__((aligned(16))) dest[512] = { 0 };

int opencl_execution(void)
{
    int err;                        // error code returned from API calls

    size_t local;                   // local domain size for our calculation

    cl_platform_id platform_id;     // added by zenny_chen
    cl_device_id device_id;         // compute device ID
    cl_context context;             // compute context
    cl_command_queue commands;      // compute command queue
    cl_program program;             // compute program
    cl_kernel kernel;               // compute kernel

    cl_mem memOrg, memDst;          // device memory used for the input and output arrays

    // Create a platform
    err = clGetPlatformIDs(1, &platform_id, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to create a platform!\n");
        return EXIT_FAILURE;
    }

    // Connect to a compute device
    err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to create a device group!\n");
        return EXIT_FAILURE;
    }

    // Create a compute context
    context = clCreateContext((cl_context_properties[]){ (cl_context_properties)CL_CONTEXT_PLATFORM, (cl_context_properties)platform_id, 0 },
                              1, &device_id, NULL, NULL, &err);
    if (!context)
    {
        printf("Error: Failed to create a compute context!\n");
        return EXIT_FAILURE;
    }

    // Create a command queue
    commands = clCreateCommandQueue(context, device_id, 0, &err);
    if (!commands)
    {
        printf("Error: Failed to create a command queue!\n");
        return EXIT_FAILURE;
    }

    // Fetch the kernel source
    NSString *filepath = [[NSBundle mainBundle] pathForResource:@"kernel" ofType:@"cl"];
    if (filepath == NULL)
    {
        puts("Source not found!");
        return EXIT_FAILURE;
    }

    const char *kernelSource = (const char *)[[NSString stringWithContentsOfFile:filepath encoding:NSUTF8StringEncoding error:nil] UTF8String];

    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1, (const char **)&kernelSource, NULL, &err);
    if (!program)
    {
        printf("Error: Failed to create compute program!\n");
        return EXIT_FAILURE;
    }

    // Build the program executable
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        size_t len;
        char buildLog[2048];        // renamed so it does not shadow the global buffer

        printf("Error: Failed to build program executable!\n");
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, sizeof(buildLog), buildLog, &len);
        printf("%s\n", buildLog);
        exit(1);
    }

    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "solve_sum", &err);
    if (!kernel || err != CL_SUCCESS)
    {
        printf("Error: Failed to create compute kernel!\n");
        exit(1);
    }

    // Create the input and output arrays in device memory for our calculation
    memOrg = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(int) * 512, NULL, NULL);
    memDst = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int) * 512, NULL, NULL);

    if (memOrg == NULL || memDst == NULL)
    {
        printf("Error: Failed to allocate device memory!\n");
        exit(1);
    }

    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(commands, memOrg, CL_TRUE, 0, sizeof(int) * 512, buffer, 0, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to write to source array!\n");
        exit(1);
    }

    // Set the arguments to our compute kernel
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &memOrg);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &memDst);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to set kernel arguments! %d\n", err);
        exit(1);
    }

    // Get the maximum work-group size for executing the kernel on the device
    err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE, sizeof(local), &local, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to retrieve kernel work group info! %d\n", err);
        exit(1);
    }
    else
        printf("The number of work items in a work group is: %lu\n", local);

    // Execute the kernel over the entire range of our 1D input data set:
    // 64 global work items, split into work groups of 32
    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, (size_t[]){ 64 }, (size_t[]){ 32 }, 0, NULL, NULL);
    if (err)
    {
        printf("Error: Failed to execute kernel!\n");
        return EXIT_FAILURE;
    }

    // Wait for the command queue to be serviced before reading back results
    clFinish(commands);

    // Read back the results from the device to verify the output
    err = clEnqueueReadBuffer(commands, memDst, CL_TRUE, 0, sizeof(int) * 512, dest, 0, NULL, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failed to read output array! %d\n", err);
        exit(1);
    }

    // Validate our results
    printf("The result is: 0x%.8x\n", dest[0]);

    // Shutdown and cleanup
    clReleaseMemObject(memOrg);
    clReleaseMemObject(memDst);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commands);
    clReleaseContext(context);

    return 0;
}

int main(int argc, const char *argv[])
{
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    opencl_execution();
    [pool drain];
    return 0;
}

Note the clEnqueueNDRangeKernel call near the end of the host code:

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, (size_t[]){ 64 }, (size_t[]){ 32 }, 0, NULL, NULL);

Here we set the number of global work items to 64 and the work-group size to 32, so the range is naturally split into two work groups. If we change the 32 to 64, there is only one work group; all communication then goes through a single shared variable inside one SM, it works flawlessly, and the program terminates normally.
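For reference, the single-work-group variant differs only in the local size argument:

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL,
                             (size_t[]){ 64 },   // 64 global work items
                             (size_t[]){ 64 },   // one work group of 64
                             0, NULL, NULL);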

If, on the other hand, you want to keep the original two work groups, you must communicate through a global variable:

__kernel void solve_sum(
    __global volatile unsigned buffer[512],
    __global unsigned dest[512]
)
{
    size_t gid = get_global_id(0);

    // The flag now lives in global memory (buffer[256]),
    // which both compute units can see
    if (0 <= gid && gid < 32)
    {
        while (buffer[256] != 1);
        buffer[256] = 0;

        buffer[gid] = 0x1UL;
        // write_mem_fence(CLK_GLOBAL_MEM_FENCE);
    }
    else if (32 <= gid && gid < 64)
    {
        buffer[256] = 1;

        while (buffer[256] != 0);
        unsigned ret = buffer[31 + 32 - gid];
        dest[gid - 32] = ret;
    }
}

One final note: volatile must be added to every variable used for such communication. Otherwise the OpenCL kernel compiler may optimize the second and later accesses to the global variable into reads from a register, so the current thread never observes another thread's update of that variable.
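As an illustration of the pitfall, here is a hedged sketch (hypothetical kernels, not part of the demo) of the same spin-wait with and without volatile:

// May hang: the compiler is allowed to load buffer[256] once
// and spin forever on the cached register value.
__kernel void spin_wait(__global unsigned buffer[512])
{
    while (buffer[256] != 1);
}

// OK: volatile forces a fresh load from global memory on every iteration.
__kernel void spin_wait_ok(__global volatile unsigned buffer[512])
{
    while (buffer[256] != 1);
}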
