Original http://www.olcf.ornl.gov/training_articles/opencl-vector-addition/
This article is simply a translation of the original OpenCL article.
The example in the original article would not run in my environment, so I have made some changes.
Working through this example gives a better understanding of the OpenCL programming model.
1. Introduction
This example demonstrates the addition of two vectors, which can be considered the "Hello World" of OpenCL. To keep the program easy to understand, it does not include any error handling.
// vecAdd.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/opencl.h>

// OpenCL kernel. Each work item takes care of one element of c.
// The kernel works on int data so that it matches the int arrays used on the host.
const char *kernelSource = "\n" \
"__kernel void vecAdd(__global int *a,\n" \
"                     __global int *b,\n" \
"                     __global int *c,\n" \
"                     const unsigned int n)\n" \
"{\n" \
"    // Get our global thread ID\n" \
"    int id = get_global_id(0);\n" \
"\n" \
"    // Make sure we do not go out of bounds\n" \
"    if (id < n)\n" \
"        c[id] = a[id] + b[id];\n" \
"}\n" \
"\n";

int main(int argc, char *argv[])
{
    // Vector length
    int n = 8;

    // Host input vectors
    int *h_a;
    int *h_b;
    // Host output vector
    int *h_c;

    // Device input buffers
    cl_mem d_a;
    cl_mem d_b;
    // Device output buffer
    cl_mem d_c;

    cl_platform_id cpPlatform;   // OpenCL platform
    cl_device_id device_id;      // device ID
    cl_context context;          // context
    cl_command_queue queue;      // command queue
    cl_program program;          // program
    cl_kernel kernel;            // kernel

    // Number of bytes per vector
    size_t bytes = n * sizeof(int);

    // Allocate host memory for each vector
    h_a = (int *)malloc(bytes);
    h_b = (int *)malloc(bytes);
    h_c = (int *)malloc(bytes);

    // Initialize the vectors
    int i;
    for (i = 0; i < n; i++)
    {
        h_a[i] = i;
        h_b[i] = i;
    }

    size_t globalSize, localSize;
    cl_int err;

    // Number of work items in each work group
    localSize = 2;

    // Total number of work items
    globalSize = (size_t)ceil(n / (float)localSize) * localSize;
    printf("%d\n", (int)globalSize);

    // Obtain the platform ID
    err = clGetPlatformIDs(1, &cpPlatform, NULL);

    // Obtain the device ID; the device belongs to the platform
    err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);

    // Create a context for the device
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue on the device within the context
    queue = clCreateCommandQueue(context, device_id, 0, &err);

    // Create the compute program from the OpenCL source
    program = clCreateProgramWithSource(context, 1,
                  (const char **)&kernelSource, NULL, &err);

    // Build the program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create the kernel from the program built above
    kernel = clCreateKernel(program, "vecAdd", &err);

    // Allocate the device buffers
    d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    // Write the input vectors to the device buffers
    err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                               bytes, h_a, 0, NULL, NULL);
    err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                                bytes, h_b, 0, NULL, NULL);

    // Set the kernel arguments
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    err |= clSetKernelArg(kernel, 3, sizeof(int), &n);

    // Execute the kernel over the entire range of the data set
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                                 0, NULL, NULL);

    // Wait for the command queue to be serviced before reading back results
    clFinish(queue);

    // Read the results from the device buffer
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                        bytes, h_c, 0, NULL, NULL);

    // Print the results
    for (i = 0; i < n; i++)
        printf("%d ", h_c[i]);

    // Release OpenCL resources
    clReleaseMemObject(d_a);
    clReleaseMemObject(d_b);
    clReleaseMemObject(d_c);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    system("pause");   // keeps the console window open on Windows
    return 0;
}
2. Basic explanation
2.1 Kernel:
The kernel is the core of the OpenCL code. The entire kernel must be read in as a C string. For a small program the easiest way is to wrap every line of the kernel in quotation marks and end it with a newline escape. In a real program, the kernel should be placed in a separate file.
// OpenCL kernel. Each work item takes care of one element of c.
const char *kernelSource = "\n" \
"__kernel void vecAdd(__global int *a,\n" \
"                     __global int *b,\n" \
"                     __global int *c,\n" \
"                     const unsigned int n)\n" \
"{\n" \
"    // Get our global thread ID\n" \
"    int id = get_global_id(0);\n" \
"\n" \
"    // Make sure we do not go out of bounds\n" \
"    if (id < n)\n" \
"        c[id] = a[id] + b[id];\n" \
"}\n" \
"\n";
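For the separate-file approach mentioned above, the following is a minimal sketch of reading a kernel file into a C string using only the standard C library. The helper name load_kernel_source and any file name passed to it are hypothetical and do not come from the original article.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at `path` into a
   NUL-terminated heap buffer. Returns NULL on failure; the caller
   must free() the returned string. */
char *load_kernel_source(const char *path)
{
    FILE *fp;
    long size;
    char *source;
    size_t got;

    assert(path != NULL);

    fp = fopen(path, "rb");
    if (fp == NULL)
        return NULL;

    /* Find the file length by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0) {
        fclose(fp);
        return NULL;
    }
    size = ftell(fp);
    if (size < 0) {
        fclose(fp);
        return NULL;
    }
    rewind(fp);

    /* One extra byte for the terminating NUL. */
    source = (char *)malloc((size_t)size + 1);
    if (source == NULL) {
        fclose(fp);
        return NULL;
    }

    got = fread(source, 1, (size_t)size, fp);
    fclose(fp);
    source[got] = '\0';
    return source;
}
```

The returned string could then be passed to clCreateProgramWithSource in place of kernelSource, and released with free() once it is no longer needed.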
Let's look at what this simple kernel consists of:
__kernel void vecAdd(__global int *a, __global int *b,
                     __global int *c, const unsigned int n)
__kernel indicates that this is an OpenCL kernel, and __global indicates that the pointer refers to the global device memory space; the rest is ordinary C function syntax. A kernel must have a return type of void.
int id = get_global_id(0);
This obtains the global ID of the work item in dimension 0.
if (id < n)
    c[id] = a[id] + b[id];
There must be a whole number of work groups; in other words, the number of work items in each work group must evenly divide the total number of work items. Because the work-group size is chosen to tune performance, it will not necessarily divide the number of threads actually needed. It is therefore common to launch more threads than required and ignore the extras. Once we have checked that we are inside the problem domain, we can access and manipulate the device memory.
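The effect of the bounds check can be illustrated on the host, without any OpenCL device. The sketch below is only an emulation written for this explanation: vec_add_work_item stands in for the kernel body, with its id parameter playing the role of get_global_id(0), and emulate_padded_launch is a hypothetical helper that runs every work item of a padded range one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Plain C stand-in for the kernel body. Work items whose id falls
   outside the problem domain (id >= n) do nothing. */
static void vec_add_work_item(size_t id, const int *a, const int *b,
                              int *c, size_t n)
{
    if (id < n)
        c[id] = a[id] + b[id];
}

/* Hypothetical host-side emulation of a launch with global_size work
   items, where global_size may be larger than n. Returns the number
   of work items that were launched but had nothing to do. */
size_t emulate_padded_launch(const int *a, const int *b, int *c,
                             size_t n, size_t global_size)
{
    size_t id;
    size_t idle = 0;

    assert(global_size >= n);

    for (id = 0; id < global_size; id++) {
        vec_add_work_item(id, a, b, c, n);
        if (id >= n)
            idle++;
    }
    return idle;
}
```

With n = 5 and a work-group size of 2, the launch is padded to 6 work items; the sixth one fails the check and never touches memory beyond the end of the vectors.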
2.2 Memory:
// Host input vectors
int *h_a;
int *h_b;
// Host output vector
int *h_c;
// Device input buffers
cl_mem d_a;
cl_mem d_b;
// Device output buffer
cl_mem d_c;
The CPU and GPU have separate memory spaces, so two sets of references to the memory must be maintained: one set of host array pointers and one set of device memory handles. Here the prefixes h_ and d_ are used to distinguish them.
2.3 Thread mapping:
// Number of work items in each work group
localSize = 2;
// Total number of work items
globalSize = (size_t)ceil(n / (float)localSize) * localSize;
To map the problem onto the underlying hardware, a local size and a global size must be specified. The local size defines the number of work items in a work group; on an NVIDIA GPU this is equivalent to the number of threads in a thread block. The global size defines the total number of work items launched. globalSize must be divisible by localSize, so we compute the smallest integer that covers the problem domain and is a multiple of localSize.
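The same rounding can also be done with integer arithmetic only. This is a sketch of an alternative, not the method used in the article: the helper name round_up_global_size is hypothetical. It avoids the conversion to float, which cannot represent every integer above 2^24 exactly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the smallest multiple of local_size
   that is greater than or equal to n. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    /* Adding local_size - 1 before the integer division rounds up. */
    return ((n + local_size - 1) / local_size) * local_size;
}
```

For the values in this article (n = 8 and a local size of 2) the result is 8, so no padding is needed; for n = 7 it would also be 8, leaving one idle work item.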
2.4 Environment configuration:
// Bind to the platform
err = clGetPlatformIDs(1, &cpPlatform, NULL);
Each hardware vendor provides its own platform, which must be bound before use. Here clGetPlatformIDs() sets cpPlatform to the first platform available on the system. For example, if the system contains an AMD CPU and an NVIDIA GPU, each with a suitable OpenCL driver installed, two platforms are available. (Note: to use a vendor's platform you must install the corresponding driver. In this example I installed AMD (ATI) APP SDK v2.5 and Intel's intel_ocl_sdk_1.5_runtime_setup, so there are two platforms. Because my ATI GPU is not supported by APP SDK v2.5, I request a CPU device rather than a GPU device when obtaining the device ID. If this configuration is wrong, the later steps may fail.)
// Obtain the device ID; the device belongs to the platform
err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);
The platform can be queried to find out which devices it contains. In this example, the enumerated value CL_DEVICE_TYPE_CPU is used to query for CPU devices on the platform.
// Create a context for the device
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
Before an OpenCL device can be used, a context must be set up to manage command queues, memory, and kernel activity. A context can contain more than one device.
// Create a command queue on the device within the context
queue = clCreateCommandQueue(context, device_id, 0, &err);
The command queue is used to send commands from the host to the specified device. Memory transfers and kernel launches can both be placed in the command queue, and they are executed on the device when it is ready.
2.5 Compile the kernel:
// Create the compute program from the OpenCL source
program = clCreateProgramWithSource(context, 1,
              (const char **)&kernelSource, NULL, &err);
// Build the program executable
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
// Create the kernel from the program built above
kernel = clCreateKernel(program, "vecAdd", &err);
To ensure portability across a wide range of devices, the default way to run a kernel is to compile its source just in time (JIT) for the devices in the given context. First a program is created, which is a collection of kernels, and then the individual kernels are created from that program.
2.6 Prepare data:
// Allocate the device buffers
d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
// Write the input vectors to the device buffers
err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                           bytes, h_a, 0, NULL, NULL);
err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                            bytes, h_b, 0, NULL, NULL);
// Set the kernel arguments
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
err |= clSetKernelArg(kernel, 3, sizeof(int), &n);
Before launching the kernel, we must create the device buffers, copy the host data into those newly created buffers, and set the kernel arguments.
2.7 Launch the kernel:
// Execute the kernel over the entire range of the data set
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                             0, NULL, NULL);
Once all of the memory is in place on the device, the kernel can be enqueued for launch.
2.8 Retrieve results:
// Wait for the command queue to be serviced before reading back results
clFinish(queue);
// Read the results from the device buffer
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                    bytes, h_c, 0, NULL, NULL);
We can block until the command queue has finished executing and then read the results on the device back to the host.
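Once the results are back on the host, they can be checked against a plain CPU computation. The following is a small sketch of such a check; the helper name count_mismatches is hypothetical and is not part of the original program.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the elements for which the device
   result c[i] differs from the host-side sum a[i] + b[i]. */
size_t count_mismatches(const int *a, const int *b, const int *c, size_t n)
{
    size_t i;
    size_t bad = 0;

    assert(a != NULL && b != NULL && c != NULL);

    for (i = 0; i < n; i++) {
        if (c[i] != a[i] + b[i])
            bad++;
    }
    return bad;
}
```

In the program above, a call such as count_mismatches(h_a, h_b, h_c, n) would be expected to return 0, since both inputs hold the values 0 through 7 and each output element should be twice its index.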
3. Runtime Environment
3.1 OpenCL:
AMD APP SDK v2.5
Intel_ocl_sdk_1.5_runtime
3.2 Visual Studio 2010 Express