Transfer Paths with OpenCL Memory Objects

Source: Internet
Author: User

For applications, selecting the appropriate memory object transfer path can significantly improve program performance.

 

The following describes the available buffer transfer paths and their bandwidth characteristics:

 

1. clEnqueueWriteBuffer() and clEnqueueReadBuffer()

 

If the application has already allocated memory through malloc or mmap, CL_MEM_USE_HOST_PTR is an ideal choice.

There are two ways to use this path:

 

First:

A. pinnedBuffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR)
B. deviceBuffer = clCreateBuffer()
C. void *pinnedMemory = clEnqueueMapBuffer(pinnedBuffer)
D. clEnqueueRead/WriteBuffer(deviceBuffer, pinnedMemory)
E. clEnqueueUnmapMemObject(pinnedBuffer, pinnedMemory)

 

The pinning cost is incurred in step A; step D has no pinning cost. Typically, the application executes steps A, B, C, and E once, then repeatedly reads or modifies the data in pinnedMemory and re-executes step D.

 

 

Second:

clEnqueueRead/WriteBuffer is used directly on the user's memory buffer. Before the copy (host -> device), the runtime must first pin (page-lock) the memory pages, then perform the transfer. This path achieves roughly 2/3 of peak interconnect bandwidth.

2. Use clEnqueueCopyBuffer() on a pre-pinned host buffer

 

As in 1, clEnqueueCopyBuffer performs the transfer on a pre-pinned buffer at peak interconnect bandwidth:

 

A. pinnedBuffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR)
B. deviceBuffer = clCreateBuffer()
C. void *memory = clEnqueueMapBuffer(pinnedBuffer)
D. Application writes or modifies memory.
E. clEnqueueUnmapMemObject(pinnedBuffer, memory)
F. clEnqueueCopyBuffer(pinnedBuffer, deviceBuffer)
Or, in the reverse direction (device to host):
G. clEnqueueCopyBuffer(deviceBuffer, pinnedBuffer)
H. void *memory = clEnqueueMapBuffer(pinnedBuffer)
I. Application reads memory.
J. clEnqueueUnmapMemObject(pinnedBuffer, memory)

 

Since pinned memory resides in host memory, the clEnqueueMapBuffer() and clEnqueueUnmapMemObject() calls do not cause any data transfer. The CPU can operate on these pinned buffers at host memory bandwidth.


3. Execute clEnqueueMapBuffer() and clEnqueueUnmapMemObject() on the device buffer

 

For a buffer already allocated through malloc or mmap, the transfer cost includes, in addition to the interconnect transfer, a memcpy that copies the buffer into the mapped device buffer.

A. Data transfer from host to device buffer.

1. ptr = clEnqueueMapBuffer(..., buf, ..., CL_MAP_WRITE, ...)

Because the buffer is mapped write-only, no data is transferred from the device to the host, and the mapping cost is relatively low. A pointer to the pinned host buffer is returned.

 

 

2. The application fills the host buffer via memset(ptr), memcpy(ptr, srcPtr), fread(ptr), or direct CPU writes. These operations read and write at full host memory speed.

3. clEnqueueUnmapMemObject(..., buf, ptr, ...)

The pre-pinned buffer is transferred to the GPU device at peak interconnect speed.


B. Data transfer from device buffer to host.

1. ptr = clEnqueueMapBuffer(..., buf, ..., CL_MAP_READ, ...)

This command causes the device to transfer data to the host; the data moves into a pre-pinned temporary buffer at peak interconnect bandwidth. A pointer to the pinned memory is returned.

2. When the application reads or processes the data, or executes memcpy(dstPtr, ptr), fwrite(ptr), or similar functions, the buffer resides in host memory, so the operation runs at host memory bandwidth.

3. clEnqueueUnmapMemObject(..., buf, ptr, ...)

Since the buffer was mapped read-only, no actual data transfer occurs, so the cost of the unmap operation is very low.

4. The host directly accesses the device zero copy buffer.

This access allows data transfer and GPU computation to be executed simultaneously (overlapped), which is useful when writing or updating sparse data.

A. A zero copy buffer on the device is created using the following command:

buf = clCreateBuffer(..., CL_MEM_USE_PERSISTENT_MEM_AMD, ...)

The CPU can access this buffer directly through the uncached write-combined (WC) path. A double-buffering scheme is commonly used: the GPU processes data in one buffer while the CPU simultaneously fills another.

A zero copy device buffer can also be used for sparse updates, such as assembling sub-rows of a larger matrix into a smaller, contiguous block for GPU processing. Because of the WC path, it is a good design choice to align writes to the cache line size and to pick the write block size as large as possible.

 

B. Transfer from the host to the device.

1. ptr = clEnqueueMapBuffer(..., buf, ..., CL_MAP_WRITE, ...)

This operation is low cost because the zero copy device buffer is directly mapped into the host address space.

2. The application transfers data via memset(ptr), memcpy(ptr, srcPtr), or direct CPU writes.

The CPU writes directly across the interconnect into the zero copy device buffer. Depending on the chipset, the bandwidth can be of the same order of magnitude as the interconnect bandwidth, although it is typically lower than peak.

3. clEnqueueUnmapMemObject(..., buf, ptr, ...)

As with the preceding map, this operation is low cost because the buffer continues to reside on the device.
C. If the buffer content must be read back later, use clEnqueueReadBuffer(..., buf, ...) or clEnqueueCopyBuffer(..., buf, zero copy host buffer, ...).

This bypasses slow host reads through the uncached path.

5. GPU direct access to host zero copy memory

 

This option allows the GPU to read or write host memory directly. A GPU kernel can import data from the host without an explicit transfer, and write results directly back to host memory. An ideal use is to perform small I/Os straight from the kernel, or to fold the transfer latency into the kernel execution time.

A. The application creates a zero copy host buffer:
buf = clCreateBuffer(..., CL_MEM_ALLOC_HOST_PTR, ...)
B. Next, the application modifies or reads the zero copy host buffer.

1. ptr = clEnqueueMapBuffer(..., buf, ..., CL_MAP_READ | CL_MAP_WRITE, ...)

This operation is very low cost because it maps a buffer that already resides in host memory.
2. The application modifies the data through memset(ptr), memcpy (in either direction), or sparse or dense CPU reads and writes. Since the application is modifying a host buffer, these operations run at host memory bandwidth.
3. clEnqueueUnmapMemObject(..., buf, ptr, ...)

As with the preceding map, this operation is very low cost because the buffer continues to reside in host memory.
C. The application runs clEnqueueNDRangeKernel() using buffers of this type as input or output. GPU kernel reads and writes go across the interconnect to host memory, and the data transfer becomes part of kernel execution.

The achievable bandwidth depends on the platform and chipset, but can be of the same order of magnitude as the peak interconnect bandwidth.

For discrete graphics cards, it is important to note that the resulting GPU kernel bandwidth is an order of magnitude lower than that of a kernel accessing a regular device buffer located on the device.

D. Following kernel execution, the application can access data in the host buffer in the same manner as described above.
