OpenCL multi-thread synchronization with source code

Source: Internet
Author: User

This article describes OpenCL multithread synchronization, with source code. First, a quick look at OpenCL: Open Computing Language is a programming language drafted and designed by Apple for large-scale parallel computing.

Today we will introduce OpenCL multithread synchronization techniques. The following example shows how to synchronize threads within a work-group and how to synchronize between work-groups.

We have previously introduced the address-space qualifiers of variables. Variables qualified with __global are stored in video (global) memory, which has a large capacity but slow access, and is visible to all work items. Variables qualified with __local are stored in shared (local) memory, which is much faster than global memory and can be accessed by all work items in the same work-group; in addition, each work-group has its own independent shared memory. Variables qualified with __private, or defined without a qualifier (the default), are private: they are stored in registers, which are very fast to access (a read or write basically takes a single shader clock cycle), but they are private to each work item, and each work item has only a limited number of registers available.

If we want to synchronize threads within a work-group, we can use shared (local) memory variables to achieve this goal. For communication between work-groups, we need variables in global memory.

The kernel code for summation is as follows:

 
 
    __kernel void solve_sum(
                        __global int input[4096],
                        __global int output[9]
                        )
    {
        __local int localBuffer[512];

        size_t item_id = get_local_id(0);
        size_t gid = get_global_id(0);

        localBuffer[item_id] = input[gid];

        barrier(CLK_LOCAL_MEM_FENCE);

        if(item_id == 0)
        {
            int s = 0;
            for(int i = 0; i < 512; i++)
                s += localBuffer[i];
            output[get_group_id(0)] = s;
            output[8] = get_num_groups(0);
        }
    }

In the above code there are 4096 work items in total, split into 8 work-groups, so each work-group has 512 work items. The algorithm is very simple: first, each work item stores its input element into the shared array. After all work items in a work-group have completed this step, work item 0 sums the data in the shared buffer and writes the result into the output buffer at its work-group's index.

In the above code, get_local_id obtains the index of the current work item within its work-group; in this code it ranges from 0 to 511. Therefore we could change localBuffer[item_id] = input[gid]; to localBuffer[gid & 511] = input[gid]; since the work-group size (512) is a power of two, the two statements are semantically equivalent.

The thread synchronization function is described as follows:

 
 
    void barrier (cl_mem_fence_flags flags)

This built-in function corresponds to a processor instruction and is used to synchronize all work items in a work-group. If we regard each work item as a thread: when one thread reaches the barrier, the processor blocks it until all threads in the work-group have reached the barrier, and only then do the threads continue executing.

The flags parameter indicates whether the memory fence applies to local or global memory. We only need the local fence here, because synchronization between work-groups is not required.

Each work-group writes its computed result to the output buffer. Because only eight 32-bit values are output, finishing the sum on the CPU becomes a piece of cake.

The code for the entire project is provided below: OpenCL_Basic.zip (17 K)

The above code transmits each work-group's result to the host. So can we let the GPU add these eight results together itself? The answer is yes. However, this requires the atomic-operation extension introduced with OpenCL 1.0; in OpenCL 1.1, these int32-based atomic operations were formally promoted from an extension into the language core. We can query OpenCL for whether cl_khr_global_int32_base_atomics is supported. If it is, we can use the following method:

 
 
    __kernel void solve_sum(
                         __global int input[4096],
                         __global int output[9]
                         )
    {
        __local int localBuffer[512];
        size_t item_id = get_local_id(0);
        size_t gid = get_global_id(0);
        localBuffer[item_id] = input[gid];
        barrier(CLK_LOCAL_MEM_FENCE);
        if(item_id == 0)
        {
            int s = 0;
            for(int i = 0; i < 512; i++)
                s += localBuffer[i];
            output[get_group_id(0)] = s;
            int index = atom_inc(&output[8]);
            if(index == 7)
            {
                mem_fence(CLK_GLOBAL_MEM_FENCE);
                s = 0;
                for(index = 0; index < 8; index++)
                    s += output[index];
                output[8] = s;
            }
        }
    }

In the above code, we use the atomic increment operation:

 
 
    int atom_inc (__global int *p)

This function first reads the content of the address pointed to by p, increments that content by 1, and then writes it back to the address. The return value is the old value (the content before the update). The entire read-modify-write sequence cannot be interrupted, so it is an atomic operation.

In the above code, we use index to receive the return value. If index is 7, the current thread is thread 0 of the last work-group to write its result. Therefore we let this thread accumulate the eight results and write the total back to the output buffer.

If two threads execute atom_inc on the same address at the same time, the GPU arbitrates and allows only one of them to perform the operation; the other threads that want to execute the operation are blocked by the processor until it completes, and then they proceed one at a time.

Note that because the output buffer is now also used as a counter variable in global memory, it no longer acts as a write-only parameter as in the first version, but as a readable and writable one, so its initial data (with the counter zeroed) must be transmitted to the GPU device before the kernel runs.

The corresponding project and code are attached: OpenCL_Basic.zip (17 K)

Next we will talk about some advanced topics about Local Memory.

Actually, local memory in OpenCL corresponds to shared memory in CUDA. When accessing shared memory, if multiple threads write to the same shared-memory bank, a bank conflict occurs.

What is a shared-memory bank? A bank is a 32-bit word of shared memory (this is how current mainstream low-end GPUs work; higher-end parts may use 64-bit or wider banks). For example, a 128-byte region of shared memory spans 128 B / 4 B = 32 banks.

If two threads (work items) write to the same bank, these write operations change from parallel to serialized: the bus controller serializes the threads, selecting one thread to write first, then the next, and so on. Multiple threads thus go from operating in parallel to operating serially, which incurs a severe performance penalty.

Therefore, we should try to ensure that each thread writes only its own shared-memory bank, and avoid multiple threads writing the same bank. In the preceding sample code, the data elements read and written are 32-bit, exactly the size of one bank, and each work item in a work-group writes shared memory using its own id as the index, so the banks written by the work items are independent of one another and no bank conflict occurs here.

Summary: that covers OpenCL multithread synchronization with source code. I hope this article helps you!

Post address: http://www.cocoachina.com/bbs/read.php?tid-37608.html, welcome to the discussion
