I was out shopping with friends in Shanghai when a few notes about memory and execution suddenly came to mind, so I went home and wrote them down.
Please cite the source when reprinting: http://blog.csdn.net/leonwei/article/details/8909897
This post further illustrates some features of the OpenCL API.
1. Creating a buffer
Operations that touch memory and video memory are always complicated, and this function is no exception ...
cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size,
                      void *host_ptr,
                      cl_int *errcode_ret)
This function creates (allocates) a buffer and returns it. Everything created here is global memory; CL automatically manages copying it into more private memory as needed during execution. A buffer here is the memory that kernel computations use, i.e., that the device accesses. (The host is the C/C++ control program, which must run on a CPU; the device is whatever processor with compute power executes the kernels. Sometimes your CPU serves as both host and device; sometimes the GPU is the device.) The line between host and device memory is deliberately blurred here: depending on the flags, the buffer may live on the host or on the device, but only memory allocated through this call can be used in kernel execution.
The important parameter is flags, which can be:
1 CL_MEM_READ_WRITE: allocates memory on the device that kernels can read and write; this is the default.
2 CL_MEM_WRITE_ONLY: allocates memory on the device that kernels can only write.
3 CL_MEM_READ_ONLY: allocates memory on the device that kernels can only read.
4 CL_MEM_USE_HOST_PTR: uses an already-allocated piece of host memory directly for the device. Note: although this reuses memory that already exists on the host, its contents will not necessarily match the actual values after a kernel computation. That is, the memory you copy back with clEnqueueReadBuffer and the original memory are different things; think of it as OpenCL borrowing this memory as a cl_mem without guaranteeing synchronization, only the initial values being the same. (They can be synchronized by mapping the memory, etc.)
5 CL_MEM_ALLOC_HOST_PTR: allocates new memory on the host for the device to use.
6 CL_MEM_COPY_HOST_PTR: allocates memory on the device for the device to use, initialized from an existing piece of host memory.
7 CL_MEM_HOST_WRITE_ONLY: this memory can only be written by the host.
8 CL_MEM_HOST_READ_ONLY: this memory can only be read by the host.
9 CL_MEM_HOST_NO_ACCESS: this memory can be neither read nor written by the host.
A word about these flags: they look complex and chaotic because OpenCL is a framework spanning hardware platforms, so it has to be abstract and unified enough to cover every case.
First, the distinction among 4, 5, and 6: all of them involve memory on the host. 4 uses existing host memory directly, 5 allocates new host memory, and 6 allocates the memory on the device but gives it the same initial value as an existing piece of host memory (so 4 and 5 allocate on the host).
Then look at 1/2/3 versus 7/8/9: 1, 2, and 3 describe access by kernel functions, i.e., by the device, while 7, 8, and 9 describe access by the host (operations such as enqueued reads and writes).
In general, using memory that lives on the host is less efficient than using device memory, and creating read-only memory is more efficient than creating writable memory (we all know a GPU has many kinds of memory; the fastest is the constant region, which is where read-only device memory can go).
Any of these ways of allocating memory will make your program work, but they test your skill at optimizing for different situations.
The size parameter: the amount of memory to allocate.
The host_ptr parameter: used only in cases 4 and 6; otherwise it is NULL.
Of course, all of this memory must eventually be released with clReleaseMemObject.
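A minimal sketch of the call (assuming CL/cl.h is included, `context` is a valid cl_context, and `host_data` is a hypothetical array of N floats):

// Flags 3 + 6: a read-only device buffer initialized from host data.
cl_int err;
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            N * sizeof(float),
                            host_data, // host_ptr: required for cases 4 and 6
                            &err);
if (err != CL_SUCCESS) { /* handle the error */ }
/* ... use the buffer ... */
clReleaseMemObject(buf); // decrements the reference count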
Memory destructor callbacks:
With flags such as CL_MEM_USE_HOST_PTR, the cl_mem's storage actually lives in host memory, so we must dispose of that host memory carefully: if you free it while the cl_mem is still using it, there will be problems. And clReleaseMemObject does not necessarily delete the cl_mem immediately; it merely decrements a reference count. So we need a callback that tells us when the host memory can safely be cleaned up: that is clSetMemObjectDestructorCallback.
The CL specification specifically notes that you had better not call time-consuming system or CL APIs inside this callback.
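A sketch of registering such a callback (callback signature per the OpenCL 1.1 API; `my_host_ptr` is a hypothetical pointer allocated with malloc):

// Fires when the cl_mem is actually destroyed; a safe point to free host memory.
void CL_CALLBACK on_mem_destroyed(cl_mem memobj, void *user_data)
{
    free(user_data); // user_data carries the host pointer we passed in
}

cl_mem buf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR,
                            size, my_host_ptr, &err);
clSetMemObjectDestructorCallback(buf, on_mem_destroyed, my_host_ptr);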
2. Memory operations
1 Reading from a cl_mem back into host memory (even if the cl_mem was implemented directly on top of host memory, you still have to read it back to see its contents; cl_mem can be seen as a higher-level wrapper):
clEnqueueReadBuffer
2 Writing the contents of host memory into a cl_mem:
clEnqueueWriteBuffer
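A minimal sketch of both calls (assuming a valid command queue `queue`, a buffer `buf` of N floats, and host arrays `host_src` and `host_dst`):

// Blocking write: copy host_src into the buffer, returning once it is done.
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, N * sizeof(float),
                     host_src, 0, NULL, NULL);
// Blocking read: copy the buffer's contents back into host_dst.
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, N * sizeof(float),
                    host_dst, 0, NULL, NULL);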
3 Mapping between a cl_mem and host memory:
clEnqueueMapBuffer
This function is very special. Recall from the buffer-creation section above that CL_MEM_USE_HOST_PTR hands an existing piece of host memory (p1) to the device as a buffer. The values computed into the resulting cl_mem (p2) will change, but after p2 changes, p1 ordinarily does not: even though they use one physical space, cl_mem is a higher-level wrapper and the host memory is a separate thing. To synchronize p1 with the latest values of p2, you call this map function.
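A minimal sketch of a blocking map for reading (assuming `buf` was created with CL_MEM_USE_HOST_PTR over N floats):

// Map the buffer into host address space; with CL_MEM_USE_HOST_PTR the
// returned pointer refers to the original host memory (p1), now carrying
// the buffer's latest contents (p2).
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                       0, N * sizeof(float),
                                       0, NULL, NULL, &err);
/* ... read p[0..N-1] on the host ... */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL); // release the mapping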
Comparing the everyday performance of map versus copy-back:
Then I wondered: what is the difference between this and using clEnqueueReadBuffer to read from p2 back into p1? The map method ought to be faster, because p1 and p2 are after all one physical address: map should need no conversion, whereas read performs an extra copy. And it ought to be faster still when the CPU is the device. But is that the truth? In the spirit of getting to the bottom of things, I actually ran a test.
My result is this: if the CPU is the host and the GPU is the device, copy-back is faster; but if the CPU is both host and device (no crossing of hardware), map is faster. And overall, the CPU+GPU configuration is faster.
This experiment completely overturned some of my initial expectations. The data suggests: 1. Hardware differences aside, map is much faster than copy-back, as the two sets of CPU-as-device numbers show. 2. At least in my experiment, copying data between main memory and video memory was faster than main memory copying to itself, so in the CPU+GPU architecture, because copy-back uses a main-memory-to-video-memory copy while map only touches main memory, copy-back came out faster. I still have doubts here, though; my analysis may be wrong, or there may be other factors I have not considered. I need to go back and review pinned memory and RAM data-transfer details.
So in heterogeneous computing, performance is tied very closely to your hardware architecture and composition; the best approach is to run comparative experiments.
4 Copying directly between cl_mem objects:
clEnqueueCopyBuffer
These functions execute in the device's command queue like kernels; the read, write, and map variants also take a blocking parameter that specifies whether the call returns only after execution finishes.
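A sketch of a device-side copy (assuming buffers `src` and `dst` at least `bytes` bytes long):

// Copy `bytes` bytes from src to dst entirely on the device,
// without bouncing the data through host memory.
clEnqueueCopyBuffer(queue, src, dst,
                    0, 0,   // src and dst offsets
                    bytes,
                    0, NULL, NULL);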
3. Program
3.1 Compile, build, link
There are two ways to create a program from source text:
Direct build: clBuildProgram. Or compile first and then link dynamically as the situation requires, i.e., split the process into two steps:
clCompileProgram + clLinkProgram
But the second way is not safe: it was added in CL 1.2, and not every platform supports 1.2 (NVIDIA seems to stop at 1.1).
OpenCL in effect compiles one copy of the source code into different machine languages, such as CPU or GPU instructions, according to the hardware.
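A minimal sketch of the direct-build path (assuming `src` holds the kernel source text and `device` is the target cl_device_id):

cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    // On failure, fetch the compiler's build log for this device.
    char log[4096];
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          sizeof(log), log, NULL);
    printf("build log:\n%s\n", log);
}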
4. Kernel execution
This is the essence.
1. Setting the kernel's arguments
clSetKernelArg
2. Executing the kernel
clEnqueueNDRangeKernel
First, here is a piece of kernel code to make the parameters below easier to explain; a little spatial imagination helps here.
Kernel code:
__kernel void Adder(__global const float* A, __global const float* B, __global float* result)
{
    int idx = get_global_id(0); // get this work item's index in dimension 0
    result[idx] = A[idx] + B[idx];
}
Parameter description (a complete call sketch follows this list):
command_queue: the command queue of the device that executes the command.
kernel: the kernel object to be executed.
work_dim: we know CL executes on many independent compute units; you can imagine them arranged as a line, a two-dimensional square, a cube, or even higher dimensions. This parameter gives the dimensionality of the execution, between 1 and CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS.
global_work_size: the number of units in each dimension, so the total number of units is global_work_size[0] * global_work_size[1] * ...
global_work_offset: the starting value of the id that get_global_id() (in the code above) returns in each dimension; the default is 0. For example, for a one-dimensional task with a work size of 255, CL automatically virtualizes 255 compute units, each summing the numbers at one of positions 0-254. If you set the offset to 3, CL counts from 3: the units at 3-254 compute their results, and units 0-2 essentially do not participate.
local_work_size: the units described earlier can be combined into groups (units within the same group can communicate with each other). This parameter is the size of each group CL splits off; when it is NULL, CL automatically finds a suitable size for you. Here I tried different group sizes for the array-addition example to compare efficiency.
There is actually not much to see: intuitively, for this application fewer groups were faster, but it is not a strictly linear relation, and the relation is roughly the same on CPU and GPU. So in real development, which dimensionality and which group size should you choose? My answer: run more experiments, or, to be lazy, pass NULL and let CL decide for you (in fact, many CL functions have this adaptive NULL option).
For dimensions, offsets, and work sizes, a figure from the original material gives a more vivid explanation.
The following parameters are related to synchronization.
event_wait_list and num_events_in_wait_list: this command does not execute until all of these events have finished.
event: returns the event associated with this command.
Clearly, with these parameters, events can control the order of execution between commands.
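Putting it together, a sketch of launching the Adder kernel above (assuming `kernel` was obtained from it with clCreateKernel, and bufA, bufB, bufResult are buffers of 255 floats):

// Bind the three __global pointer arguments by index.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufResult);

size_t global_size = 255; // one work item per array element
cl_event done;
clEnqueueNDRangeKernel(queue, kernel,
                       1,        // work_dim: a one-dimensional range
                       NULL,     // global_work_offset: ids start at 0
                       &global_size,
                       NULL,     // local_work_size: let CL pick the group size
                       0, NULL,  // no events to wait on
                       &done);   // the event associated with this command
clWaitForEvents(1, &done); // block the host until the kernel finishes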
5. Command execution order and synchronization
Command execution is asynchronous by default, which favors parallelism and efficiency. In parallel problems we sometimes need to synchronize, or wait for some asynchronous operation to finish. There are two ways:
Enqueued read/write operations can be made blocking (i.e., they return on the host only when execution has finished), or they can be tracked with an event for synchronization; operations such as clEnqueueNDRangeKernel all get an associated event.
Events
Every clEnqueue* style operation returns an associated event, and users can create their own custom events with clCreateUserEvent; release them with clReleaseEvent.
Operations on events:
The simplest is synchronizing different commands with events, via the wait lists above.
Setting an event's status:
clSetUserEventStatus sets the state of a user-defined event. The state can be set only once, and only to CL_COMPLETE or a negative value: CL_COMPLETE means the event has finished, so commands waiting on it can execute; a negative value indicates an error, and all pending commands waiting on it are cancelled. Events actually also have the states CL_RUNNING, CL_SUBMITTED, and CL_QUEUED; they just cannot be set here.
Waiting for events:
clWaitForEvents lets the host wait for certain events to finish; for an asynchronous operation such as clEnqueueNDRangeKernel, you can wait on its event to know it has completed.
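A sketch of the user-event mechanism (hypothetical flow: a device command waits on an event that the host completes later):

// Create an event the host controls; an enqueued command can wait on it.
cl_event gate = clCreateUserEvent(context, &err);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       1, &gate, NULL); // will not start until `gate` completes
/* ... prepare data on the host ... */
clSetUserEventStatus(gate, CL_COMPLETE); // release the waiting command
clReleaseEvent(gate);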
Querying event information: clGetEventInfo, clGetEventProfilingInfo. Setting a callback: clSetEventCallback.
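A sketch of an event callback (per the OpenCL 1.1 API; the callback fires on the host once the command reaches the given state, here reusing the `done` event from the launch sketch above):

// Invoked asynchronously when the associated command completes.
void CL_CALLBACK on_done(cl_event e, cl_int status, void *user_data)
{
    printf("command finished with status %d\n", status);
}

clSetEventCallback(done, CL_COMPLETE, on_done, NULL);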
Events on different devices:
An operation like clEnqueueNDRangeKernel can only wait on events from the same queue (i.e., the same device); synchronizing on events from a different queue can only be done with calls such as clWaitForEvents.
Marker
A marker can be seen as an empty instruction inserted into the queue purely for synchronization. Like other commands it can be given a list of events to wait on; the operation is clEnqueueMarkerWithWaitList.
Barrier
A barrier is very similar to a marker, but, as the names suggest, the biggest difference is that a marker automatically completes after the events it waits on finish, letting subsequent instructions execute, while a barrier blocks there until its associated events are explicitly set to the complete state.
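A sketch of both (assuming events `e1` and `e2` from earlier commands; both calls are OpenCL 1.2 APIs):

cl_event evts[2] = { e1, e2 };
cl_event marker_done;
// Marker: its own event completes once e1 and e2 complete.
clEnqueueMarkerWithWaitList(queue, 2, evts, &marker_done);
// Barrier: commands enqueued after this cannot run until e1 and e2 complete.
clEnqueueBarrierWithWaitList(queue, 2, evts, NULL);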