Learning OpenCL Development from Scratch (2): A Simple Example and a Simple Performance Analysis


Reposts are welcome; please credit the source.



1. Hello OpenCL

Let us write a simple example program to demonstrate the basic usage of OpenCL:

1. Download an OpenCL SDK from the developer website of NVIDIA, AMD, Intel, or any other OpenCL member. Vendors support different OpenCL versions and different extensions, but the standard OpenCL interfaces of a given version behave the same in every SDK, so if you use only the standard OpenCL specification it does not matter which SDK you choose. Some vendors bundle the OpenCL SDK into a larger SDK, as NVIDIA does with its CUDA development kit. In that case we only need the header files in the CL folder, plus OpenCL.lib and OpenCL.dll.


Next comes the code. In this example we add two one-dimensional arrays element by element (the most easily understood problem in parallel computing). The code consists mainly of the following parts:


2. Obtain all the OpenCL platforms implemented on the machine:

// Get the number of platforms
cl_uint num = 0;
cl_int err = clGetPlatformIDs(0, 0, &num);

// Get all platforms
std::vector<cl_platform_id> platforms(num);
err = clGetPlatformIDs(num, &platforms[0], &num);


First, you need to understand what "platform" means in OpenCL. Different hardware vendors in the OpenCL organization support the OpenCL standard, and each of them implements OpenCL independently (usually in its driver). If your machine contains hardware from several OpenCL vendors, it will contain several different OpenCL implementations: with an Intel CPU you may have an Intel implementation, and with an NVIDIA graphics card an NVIDIA implementation. Note that even without an AMD graphics card, installing the AMD OpenCL development kit may also put an AMD implementation on your machine.

Each of these implementations is a platform. In other words, the SDKs from different vendors may look the same, but the platforms found on a given machine can differ. The SDK is the code layer; the platform is the implementation layer in the driver. OpenCL is the same at the code layer across vendors, but one machine can hold several different implementation layers (I struggled with this point for a long time).

Different vendors provide the same code-level SDK, but at the driver level their implementations are completely different; that is, the platforms differ. For example, NVIDIA's platform supports only NVIDIA graphics cards as compute devices (perhaps they consider the CPU too weak to serve as a compute device), whereas AMD's platform supports not only AMD's own devices but also Intel CPUs.

Therefore, the program needs to query all the platforms available on the machine and then select a suitable one. (You usually want the platform whose compute device is the most capable. For example, if the client machine has an NVIDIA card and an NVIDIA platform is present, select that one.)

The clGetPlatformInfo function returns further information about a platform (name, OpenCL version, vendor, and so on).
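The selection rule described above (prefer the platform with the most capable compute device) can be sketched in plain C++. This is only an illustration under stated assumptions: PlatformSummary and choose_platform are hypothetical names, the fields are assumed to have been filled in beforehand from clGetPlatformInfo and clGetDeviceInfo, and "most capable" is approximated as "has a GPU, then most compute units", which is a crude measure because compute units are not directly comparable across vendors.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of one platform, assumed to be filled in from
// clGetPlatformInfo and clGetDeviceInfo before the choice is made.
struct PlatformSummary {
    std::string name;        // e.g. the CL_PLATFORM_NAME string
    bool has_gpu;            // true if any device is of type CL_DEVICE_TYPE_GPU
    unsigned compute_units;  // largest CL_DEVICE_MAX_COMPUTE_UNITS among its devices
};

// Returns the index of the preferred platform, or -1 if the list is empty.
// Assumed rule: a platform with a GPU beats one without; ties are broken
// by the larger compute-unit count.
int choose_platform(const std::vector<PlatformSummary>& platforms) {
    int best = -1;
    for (std::size_t i = 0; i < platforms.size(); ++i) {
        if (best < 0) {
            best = static_cast<int>(i);
            continue;
        }
        const PlatformSummary& cur = platforms[i];
        const PlatformSummary& top = platforms[best];
        const bool better =
            (cur.has_gpu && !top.has_gpu) ||
            (cur.has_gpu == top.has_gpu && cur.compute_units > top.compute_units);
        if (better) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The returned index would then be used in place of the hard-coded platforms[0] in the steps that follow.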


3. Query device information (this step is optional in the program, but it helps you judge the computing power of a platform)

// Get the number of devices
err = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, 0, &num);
std::vector<cl_device_id> did(num);

// Get all devices
err = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, num, &did[0], &num);

// Get device info
clGetDeviceInfo(...)

The code above obtains all the devices supported by a given platform ("device" here and below means compute device, because on a PC the host device is always your CPU).

This helps you decide which platform will compute better.


4. Select a platform and create the context (device context)

// Set the context properties for the chosen platform
cl_context_properties prop[] = { CL_CONTEXT_PLATFORM, reinterpret_cast<cl_context_properties>(platforms[0]), 0 };

cl_context context = clCreateContextFromType(prop, CL_DEVICE_TYPE_ALL, NULL, NULL, &err);

The code above first sets the context properties from the selected platform and then creates the context with them. Once the context has been created, your OpenCL working environment is set up. CL_DEVICE_TYPE_ALL means that every device supported by this platform is attached to the context as a compute device.


5. Create a command queue for each device. The command queue is the messenger that delivers commands to a device.

// i is the index of the device in did
cqueue[i] = clCreateCommandQueue(context, did[i], 0, 0);


6. Now for the code that actually runs on the device: prepare the kernel function

First, prepare your kernel code. If you have shader programming experience, this will feel familiar. The function that runs on each compute item (a work-item in OpenCL terms) must be supplied as a string. The usual approach is to write it in a separate file (the extension is arbitrary) and read that file in binary mode when the program needs it.

For example, the kernel code for the array addition in this example:

__kernel void adder(__global const float* a, __global const float* b, __global float* result)
{
    int idx = get_global_id(0);
    result[idx] = a[idx] + b[idx];
}

The qualifiers and functions will be analyzed later. The purpose of this code is to obtain the index idx of the current work-item, add the idx-th elements of the two arrays, and store the sum in the result buffer. This code will run on the device with as much parallelism as possible.
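To make the role of idx concrete, here is a sequential C++ stand-in for what the device does with this kernel. It is only an illustration: adder_work_item and run_adder_on_cpu are hypothetical names, the loop variable plays the role of get_global_id(0), and a real device runs the work-items in parallel rather than one after another. The sketch assumes b has at least as many elements as a.

```cpp
#include <cstddef>
#include <vector>

// Body of the kernel for a single work-item. The idx parameter stands in
// for the value that get_global_id(0) returns on the device.
void adder_work_item(std::size_t idx, const float* a, const float* b, float* result) {
    result[idx] = a[idx] + b[idx];
}

// Sequential stand-in for a one-dimensional range of a.size() work-items.
// Assumes b.size() >= a.size().
std::vector<float> run_adder_on_cpu(const std::vector<float>& a,
                                    const std::vector<float>& b) {
    std::vector<float> result(a.size());
    for (std::size_t idx = 0; idx < a.size(); ++idx) {
        adder_work_item(idx, a.data(), b.data(), result.data());
    }
    return result;
}
```

Because each work-item touches only its own index, the iterations are independent of one another, which is what lets the device execute them in any order or all at once.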


Save the kernel code above in a file named kernal1.cl.


Then read it into a string in the program (you would usually write a utility function for this step):

std::ifstream in(_T("kernal1.cl"), std::ios_base::binary);
if (!in.good()) {
    return 0;
}

// Get the file length
in.seekg(0, std::ios_base::end);
size_t length = in.tellg();
in.seekg(0, std::ios_base::beg);

// Read the program source and null-terminate it
std::vector<char> data(length + 1);
in.read(&data[0], length);
data[length] = 0;

// Source string used to create and build the program
const char* source = &data[0];


In this way, our kernel code is loaded into the char* source.
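The utility function mentioned above could look something like the following. This is a minimal sketch, not the author's helper: read_text_file is a hypothetical name, and it uses a std::string instead of the vector<char> shown in the walkthrough.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: reads the whole file at `path` into `out`.
// Returns false if the file cannot be opened.
bool read_text_file(const std::string& path, std::string& out) {
    std::ifstream in(path.c_str(), std::ios_base::binary);
    if (!in.good()) {
        return false;
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the entire stream into the buffer
    out = buffer.str();
    return true;
}
```

After a successful call, text.c_str() yields a null-terminated const char* that can serve as the source pointer in the next step, provided the std::string outlives that use.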


7. From kernel code to program

In OpenCL, a program represents all the kernel functions used by the application, together with the functions they call. It is an abstract representation of the code that runs on the device. We need to convert the char* source above into a program:


cl_program program = clCreateProgramWithSource(context, 1, &source, 0, 0);

clBuildProgram(program, 0, 0, 0, 0, 0);


The code above creates a program from the source string and builds it (as mentioned earlier, OpenCL compiles its device code at run time).
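Most of the calls in this walkthrough report a cl_int status, either as the return value or through the last argument, and the build step is where a failure is most likely (for instance, a typo in the kernel source). The sketch below maps a few status codes to their names so they can be printed. cl_error_name is a hypothetical helper, it covers only a small subset of codes, and the numeric values are the ones defined in the Khronos cl.h header.

```cpp
#include <string>

// Hypothetical helper: names for a small subset of OpenCL status codes.
// The numeric values match the definitions in the Khronos cl.h header.
std::string cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -46: return "CL_INVALID_KERNEL_NAME";
        default:  return "unlisted OpenCL error " + std::to_string(code);
    }
}
```

When clBuildProgram returns CL_BUILD_PROGRAM_FAILURE, the compiler messages can be retrieved with clGetProgramBuildInfo and the CL_PROGRAM_BUILD_LOG parameter.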


8. Get the kernel function

A kernel is the abstraction of the code and arguments executed on a single work-item, the smallest unit of execution in OpenCL (you can think of it as the main function of a CPU program).

We first need to extract the kernel function we want to run from the program we just built.

cl_kernel adder = clCreateKernel(program, "adder", 0);

9. Prepare the arguments of the kernel function.

The kernel function takes three arguments: two input array buffers and one output array buffer, which must be created one by one.

First, the two input buffers:

std::vector<float> a(data_size), b(data_size);
for (int i = 0; i < data_size; i++) {
    a[i] = i;
    b[i] = i;
}

a and b are the two input arrays we want to add (note that they live on the CPU side, that is, in your main memory).

Data processed by OpenCL must be stored on the device (for example, in video memory) so that the device can access it quickly. So we first have to move memory: copy the input data from host memory to device memory. The code is as follows:

cl_mem cl_a = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * data_size, &a[0], NULL);
cl_mem cl_b = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * data_size, &b[0], NULL);

The code above creates read-only device buffers initialized from the host memory pointers.

Next, allocate the buffer that will hold the result on the device.

cl_mem cl_res = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float) * data_size, NULL, NULL);

This is allocated directly on the device.


Finally, set the kernel arguments.

clSetKernelArg(adder, 0, sizeof(cl_mem), &cl_a);
clSetKernelArg(adder, 1, sizeof(cl_mem), &cl_b);
clSetKernelArg(adder, 2, sizeof(cl_mem), &cl_res);

10. Run the kernel function.

size_t work_size = data_size;  // one work-item per array element
err = clEnqueueNDRangeKernel(cqueue[0], adder, 1, 0, &work_size, 0, 0, 0, 0);


Note that OpenCL kernel functions execute asynchronously, so the CPU can work at the same time as the GPU (asynchrony brings in synchronization and status queries between devices, which is a complicated topic that I will discuss later).

So the call above returns immediately. clEnqueueNDRangeKernel pushes a kernel function into the command queue of a device for execution, and the device executes the commands in its queue in a certain order. Whether the kernel starts executing right after this call therefore depends on whether other commands are ahead of it in the queue.


11. Copy the result back to the CPU

The execution result is written directly to device storage. We usually need to copy it back to CPU memory so that the rest of the code can use it. Use the following code:

std::vector<float> res(data_size);

err = clEnqueueReadBuffer(cqueue[0], cl_res, CL_TRUE, 0, sizeof(float) * data_size, &res[0], 0, 0, 0);

clEnqueueReadBuffer puts a command into the command queue that copies a buffer back to the host. The CL_TRUE argument makes the command synchronous (blocking): it blocks the CPU, so when the call returns, every command queued on the device up to and including this one has completed.

At this point, res holds the result of running the kernel function on the device through OpenCL. We can compare it with the result of a pure CPU computation; the two should agree.
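The comparison with a pure CPU computation can be sketched as follows. results_match is a hypothetical helper, and the tolerance parameter is an assumption added here: for this plain addition the results should normally be identical, but a small tolerance becomes useful once the kernel does more arithmetic.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns true if every element of device_result is
// within `tolerance` of a[i] + b[i] computed on the CPU.
// Returns false if the three arrays differ in length.
bool results_match(const std::vector<float>& a,
                   const std::vector<float>& b,
                   const std::vector<float>& device_result,
                   float tolerance) {
    if (a.size() != b.size() || a.size() != device_result.size()) {
        return false;
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float expected = a[i] + b[i];
        if (std::fabs(device_result[i] - expected) > tolerance) {
            return false;
        }
    }
    return true;
}
```

With the variables from the walkthrough, the check would be a call such as results_match(a, b, res, 0.0f).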


12. Clean up

// Release
clReleaseKernel(adder);
clReleaseProgram(program);
clReleaseMemObject(cl_a);
clReleaseMemObject(cl_b);
clReleaseMemObject(cl_res);

for (size_t i = 0; i < num; i++) {
    clReleaseCommandQueue(cqueue[i]);
}
clReleaseContext(context);


2. Performance Analysis

The program above is a very simple introduction to OpenCL. Using it, I later ran a number of performance tests to find out how executing a computation through OpenCL differs from executing it on the CPU in the ordinary way, and how large the performance difference is.

I wrote several versions of the kernel function with increasing computational complexity and ran them on different platforms and on the CPU. The measurements are as follows:


Complexity levels 1, 2, and 3 correspond to simply enlarging the array length, adding a power operation, and increasing the exponent of the power operation.

1. All times below are in milliseconds.

2. The first column is the conventional CPU computation; the second and third columns are computed through the AMD and NVIDIA platforms, respectively.

3. Because no AMD graphics card is installed in the test machine, the device used by the AMD platform is actually the CPU. Columns 1, 2, and 3 can therefore be read as: pure CPU; OpenCL with the CPU as the compute device; and OpenCL with the GPU as the compute device.

4. Because OpenCL involves copying memory between the host and the device, the two numbers joined by a plus sign in columns 2 and 3 are the memory copy time and the actual computing time, respectively.

Computing complexity | CPU computing (Intel E6600 dual core) | AMD platform + CPU device (Intel E6600 dual core) | NVIDIA platform + GPU device (GeForce GT 440)
1                    | 78                                    | 63 + 60                                           | 63 + 120
2                    | 1600                                  | 63 + 500                                          | 63 + 130
3                    | 9600                                  | 63 + 1300                                         | 63 + 130


From the above table, we can draw some conclusions:

1. Pure CPU computation time grows sharply as the computational complexity increases, while the computation time on the GPU through OpenCL stays roughly constant. In the first test the GPU takes longer than the CPU, but by the third test the GPU time has barely increased, and the CPU time (9600 ms) is more than 70 times the GPU computing time (130 ms).

2. The OpenCL implementations on the different platforms take essentially the same time for memory copying. This time is independent of the computational complexity and depends only on the amount of memory copied. In our example it is 63 ms in every case.

3. Comparing columns 1 and 2 shows that performance improves substantially under OpenCL even when the CPU is the compute device. Both columns are computed on the CPU, but the OpenCL implementation may be optimized at a higher level, using advanced CPU instructions or exploiting more of the CPU's parallelism.

4. OpenCL is genuinely compatible with a wide range of hardware, unlike CUDA, which matters a great deal when developing commercial products. On mainstream machines you can almost always find a usable OpenCL platform, and it will deliver better performance than plain CPU computation.


This simple performance analysis shows that heterogeneous computing with OpenCL can greatly improve on conventional CPU performance, and that the gain may grow as the computational complexity grows. Speedups of many times over are therefore possible in some computing fields, and using the GPU as the device gives the greatest performance.

At the same time, note that heterogeneous computing usually involves a considerable memory copy time, which depends on the bandwidth between main memory and device memory. This time cannot be ignored: if a computation runs on the CPU in less time than the copy between the devices takes, accelerating it with OpenCL makes no sense. In other words, pay attention to the computational complexity; for a low-complexity computation, heterogeneous computing actually increases the total time. GPU computation has a "startup time" that is independent of the computational complexity (in this example it is around ... ms; it is pointless to move a computation to the GPU when it takes less than that on the CPU).
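The break-even reasoning above can be written down as a small calculation. This is a sketch under a simplifying assumption: the total OpenCL time is taken to be the copy time plus the computing time, exactly as the table reports them, and Timing, end_to_end_speedup, and worth_offloading are hypothetical names.

```cpp
// Hypothetical record of one measurement, in milliseconds.
struct Timing {
    double cpu_ms;     // pure CPU computation
    double copy_ms;    // host/device memory copy
    double device_ms;  // computation on the OpenCL device
};

// Speedup over the CPU when the copy time is included in the OpenCL total.
double end_to_end_speedup(const Timing& t) {
    return t.cpu_ms / (t.copy_ms + t.device_ms);
}

// Offloading pays off only if the CPU time exceeds copy plus device time.
bool worth_offloading(const Timing& t) {
    return t.cpu_ms > t.copy_ms + t.device_ms;
}
```

Applied to the GPU column of the table, complexity 1 (78 ms against 63 + 120 ms) is not worth offloading, while complexity 3 (9600 ms against 63 + 130 ms) gives an end-to-end speedup of a little under 50, compared with the factor of more than 70 obtained when only the computing time is counted.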

