Transmission Optimization of OpenCL Memory Objects


First, let's go over the terms and definitions used when optimizing these transfers:

 

1. Deferred allocation (lazy allocation)

The runtime does not actually allocate space for a memory object until the first time data is transferred through it. This avoids wasting resources, but it makes the first use of the object take longer. For details, see the previous post.
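Because of this, the first transfer through a buffer is not representative when benchmarking. Below is a minimal sketch of how to see the effect; it assumes a context ctx, a queue created with CL_QUEUE_PROFILING_ENABLE, and a host array src of nbytes bytes (all names are illustrative), and depending on the runtime the extra cost may show up in the profiled command time or in the host-side call instead.

#include <CL/cl.h>
#include <stdio.h>

/* Profile the first and second write to the same buffer; the first one
   typically includes the runtime's deferred allocation of device memory. */
static void show_deferred_allocation(cl_context ctx, cl_command_queue queue,
                                     const void *src, size_t nbytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, nbytes, NULL, &err);

    for (int i = 0; i < 2; ++i) {
        cl_event ev;
        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, nbytes, src, 0, NULL, &ev);

        cl_ulong t0, t1;
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof t1, &t1, NULL);
        printf("write %d: %.3f ms\n", i, (t1 - t0) * 1e-6);
        clReleaseEvent(ev);
    }
    clReleaseMemObject(buf);
}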

 

 

2. Peak interconnect bandwidth

The host and device exchange data over the PCIe bus. PCIe 2.0 (x16) offers about 8 GB/s in each direction. In practice, a program that reaches 3 GB/s is doing well; on my laptop I measured only about 1.2 GB/s.
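To see what your own machine reaches, you can time a large blocking write and divide by its size. This is only a rough sketch: ctx and queue are assumed to exist already, the 64 MB block size is an arbitrary choice, and clock_gettime is POSIX-specific.

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Rough host->device bandwidth estimate using blocking writes. */
static void measure_h2d_bandwidth(cl_context ctx, cl_command_queue queue)
{
    size_t nbytes = (size_t)64 << 20;                 /* 64 MB test block */
    void  *src    = malloc(nbytes);
    memset(src, 0, nbytes);

    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, nbytes, NULL, &err);

    /* Untimed warm-up transfer so deferred allocation is not measured. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, nbytes, src, 0, NULL, NULL);

    struct timespec a, b;
    clock_gettime(CLOCK_MONOTONIC, &a);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, nbytes, src, 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    double sec = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
    printf("host -> device: %.2f GB/s\n", nbytes / sec / 1e9);

    clReleaseMemObject(buf);
    free(src);
}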

 

 

3. Pinning (page-locking host memory)

Before host memory can be used as the source of a transfer to the GPU, it must first be pinned, i.e. page-locked (so it cannot be swapped out to disk). Pinning has a certain performance cost, and the cost grows with the amount of host memory being pinned. Allocating host memory as pre-pinned memory up front reduces this overhead.
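One common way to get a pre-pinned staging area is to allocate a buffer with CL_MEM_ALLOC_HOST_PTR, map it once, and reuse that mapped region as the host side of later transfers. The sketch below assumes ctx, queue and a destination device buffer devbuf already exist, and relies on the AMD runtime placing such buffers in pinned host memory; treat it as illustrative rather than guaranteed behavior on every platform.

#include <CL/cl.h>
#include <string.h>

/* Stage a host->device copy through a pre-pinned CL_MEM_ALLOC_HOST_PTR
   buffer so the transfer does not pay a pinning cost on every call. */
static void copy_via_prepinned(cl_context ctx, cl_command_queue queue,
                               cl_mem devbuf, const void *src, size_t nbytes)
{
    cl_int err;
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                                   nbytes, NULL, &err);

    /* Map once to get a CPU pointer into the pinned allocation. */
    void *staging = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                       0, nbytes, 0, NULL, NULL, &err);

    memcpy(staging, src, nbytes);                        /* fill on the CPU  */
    clEnqueueWriteBuffer(queue, devbuf, CL_TRUE, 0,      /* pinned -> device */
                         nbytes, staging, 0, NULL, NULL);

    clEnqueueUnmapMemObject(queue, pinned, staging, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(pinned);
}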

4. WC (write combining)

Write combining is a CPU feature for writes to the same region of memory: adjacent writes are merged into a single cache line and issued as one write request, so writes go out in batches. (The GPU performs similar address-merging internally.)

 

 

5. Uncached access

Some memory regions are configured for uncached access. CPU access to them is slow, but they are useful for transferring data to device memory; the device-visible host memory mentioned in a previous post is one example.

 

 

6. USWC (uncached speculative write combining)

When the GPU accesses uncached host memory there are no cache-coherency issues, so access is faster. Thanks to write combining, CPU writes to this memory are fast, but CPU reads are slow. On an APU, this provides a fast path for the CPU to write and the GPU to read.

 

Now let's look at how buffers are allocated and used:

 

1. Normal buffer

A buffer created with the CL_MEM_READ_ONLY / CL_MEM_WRITE_ONLY / CL_MEM_READ_WRITE flags resides in device memory. The GPU accesses it at high bandwidth; on some high-end graphics cards this exceeds 100 GB/s. The host can only reach this memory at peak interconnect (PCIe) bandwidth.
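As a minimal illustration (ctx, queue and the data layout are assumptions, not part of the original text), an ordinary device-resident buffer is created and filled from the host like this:

#include <CL/cl.h>

/* Create an ordinary device-resident buffer and fill it from the host.
   The GPU then reads it at device-memory bandwidth; the host only ever
   reaches it across the PCIe interconnect. */
static cl_mem make_device_buffer(cl_context ctx, cl_command_queue queue,
                                 const float *host_data, size_t count)
{
    cl_int err;
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                    count * sizeof(float), NULL, &err);

    /* Host -> device copy over the interconnect. */
    clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, count * sizeof(float),
                         host_data, 0, NULL, NULL);

    /* dev_buf can now be passed to a kernel with clSetKernelArg(). */
    return dev_buf;
}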

 

2. Zero-copy buffer

This kind of buffer involves no actual copy operation (unless one is explicitly requested, e.g. with clEnqueueCopyBuffer). Depending on the flags used to create it, the buffer may reside in host memory or in device memory.

 

If the device and the operating system support zero copy, the following buffer types can be used:

 

• CL_MEM_ALLOC_HOST_PTR buffer
- The zero-copy buffer resides in host memory.
- The host can access it at full bandwidth.
- The device accesses it at interconnect bandwidth.
- The buffer is allocated in pre-pinned host memory.

• CL_MEM_USE_PERSISTENT_MEM_AMD buffer
- The zero-copy buffer resides in GPU device memory.
- The GPU can access it at full bandwidth.
- The host accesses it at interconnect bandwidth (for example, streamed host-to-device write bandwidth; read bandwidth is low because the memory is uncached).
- Data is transferred between host and device at interconnect bandwidth.

 

Note: the maximum size of such a buffer is platform-dependent. For example, on one platform a single buffer cannot exceed 64 MB, and the combined size of all such buffers is also capped.

 

Zero copy works well on an APU: the CPU can write at high speed and the GPU can read at high speed. However, because the memory is uncached, CPU reads are slow. The typical sequence is:

1. buffer = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)
2. address = clEnqueueMapBuffer(buffer)
3. memset(address) or memcpy(address) (if possible, using multiple CPU cores)
4. clEnqueueUnmapMemObject(buffer)
5. clEnqueueNDRangeKernel(buffer)

For transfers of small amounts of data, the zero-copy latency (map, unmap, and so on) is usually lower than the latency of the corresponding DMA-engine transfer.
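Spelled out in host code, the five steps listed above look roughly like the sketch below. It assumes ctx, queue and a kernel that takes the buffer as its first argument already exist, and that the data is an array of floats; those details are illustrative.

#include <CL/cl.h>
#include <string.h>

/* The zero-copy write path: create, map, fill on the CPU, unmap,
   then launch a kernel that reads the buffer. */
static void zero_copy_write_path(cl_context ctx, cl_command_queue queue,
                                 cl_kernel kernel, const float *src, size_t count)
{
    size_t nbytes = count * sizeof(float);
    cl_int err;

    cl_mem zbuf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                 nbytes, NULL, &err);                /* step 1 */

    void *p = clEnqueueMapBuffer(queue, zbuf, CL_TRUE, CL_MAP_WRITE,
                                 0, nbytes, 0, NULL, NULL, &err);    /* step 2 */

    memcpy(p, src, nbytes);                                          /* step 3 */

    clEnqueueUnmapMemObject(queue, zbuf, p, 0, NULL, NULL);          /* step 4 */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &zbuf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &count, NULL,
                           0, NULL, NULL);                           /* step 5 */
    clFinish(queue);
    clReleaseMemObject(zbuf);
}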

 

3. Pre-pinned buffer

A pre-pinned buffer is created with the CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR flag. The buffer is placed in pre-pinned memory when it is created, and clEnqueueCopyBuffer then moves data between host and device at interconnect bandwidth, with no pinning or unpinning overhead.

Note: CL_MEM_USE_HOST_PTR turns an existing host buffer into pinned memory. However, to keep the transfer fast, the host buffer must be 256-byte aligned. As long as the buffer is used only for data transfer, a CL_MEM_USE_HOST_PTR memory object stays in pre-pinned memory, but it must not then be used as a kernel argument. If the buffer is also used inside a kernel, the runtime creates a cached copy of it on the device, and subsequent copy operations no longer take the fast path (in order to preserve cache consistency).
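A hedged sketch of that transfer-only usage follows; ctx, queue and the destination device buffer devbuf are assumed to exist, the size must be a multiple of 256 for aligned_alloc (C11; posix_memalign is an alternative), and the buffer is deliberately never passed to a kernel so it keeps the fast path.

#include <CL/cl.h>
#include <stdlib.h>
#include <string.h>

/* Wrap an existing, 256-byte-aligned host allocation as a pre-pinned
   buffer and use it only as a transfer source, so the copy to the
   device buffer runs at interconnect bandwidth. */
static void copy_via_use_host_ptr(cl_context ctx, cl_command_queue queue,
                                  cl_mem devbuf, const void *src, size_t nbytes)
{
    void *host = aligned_alloc(256, nbytes);   /* nbytes: multiple of 256 */
    memcpy(host, src, nbytes);

    cl_int err;
    cl_mem pre = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY,
                                nbytes, host, &err);

    /* Buffer-to-buffer copy: pre-pinned host memory -> device memory. */
    clEnqueueCopyBuffer(queue, pre, devbuf, 0, 0, nbytes, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(pre);
    free(host);
}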

 

The following calls support pre-pinned memory. Note that an offset into the pre-pinned region can be used:
• clEnqueueRead/WriteBuffer
• clEnqueueRead/WriteImage
• clEnqueueRead/WriteBufferRect (Windows only)
