For an application, choosing the right transfer path for a memory object can significantly improve performance.
Let us first look at some buffer bandwidth examples:
1. clEnqueueWriteBuffer() and clEnqueueReadBuffer()
If the application has already allocated memory through malloc or mmap, CL_MEM_USE_HOST_PTR is an ideal choice.
There are two ways to use this path:
The first:
a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *pinnedMemory = clEnqueueMapBuffer( pinnedBuffer )
d. clEnqueueRead/WriteBuffer( deviceBuffer, pinnedMemory )
e. clEnqueueUnmapMemObject( pinnedBuffer, pinnedMemory )
The pinning cost is incurred at step a; step d incurs no pinning cost. Typically, an application performs steps a, b, c, and e once, then repeatedly reads or modifies the data in pinnedMemory and re-issues step d.
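A minimal sketch of this first variant, assuming an already created cl_context ctx and cl_command_queue queue (these names, nbytes, and the helper function name are illustrative placeholders; error checking is omitted):

#include <CL/cl.h>
#include <string.h>

/* Path 1, first variant: pre-pinned staging buffer plus clEnqueueWriteBuffer. */
void upload_via_pinned_staging(cl_context ctx, cl_command_queue queue,
                               const void *src, size_t nbytes)
{
    cl_int err;

    /* a. Pinned host-side staging buffer; the pinning cost is paid here. */
    cl_mem pinnedBuffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nbytes, NULL, &err);

    /* b. Regular device buffer. */
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);

    /* c. Map the pinned buffer once to obtain a host pointer. */
    void *pinnedMemory = clEnqueueMapBuffer(queue, pinnedBuffer, CL_TRUE,
                                            CL_MAP_READ | CL_MAP_WRITE,
                                            0, nbytes, 0, NULL, NULL, &err);

    /* The application reads/modifies pinnedMemory at host memory bandwidth ... */
    memcpy(pinnedMemory, src, nbytes);

    /* d. ... and pushes it to the device; no pinning cost is incurred here.
       In a real application this modify-and-write pair is usually repeated. */
    clEnqueueWriteBuffer(queue, deviceBuffer, CL_TRUE, 0, nbytes, pinnedMemory,
                         0, NULL, NULL);

    /* e. Unmap and release once the staging buffer is no longer needed. */
    clEnqueueUnmapMemObject(queue, pinnedBuffer, pinnedMemory, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(pinnedBuffer);
    clReleaseMemObject(deviceBuffer);
}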
The second:
clEnqueueRead/WriteBuffer is used directly on the user's memory buffer. Before the data can be copied (host->device), the pages must first be pinned (page-locked); only then is the transfer performed. This path reaches roughly 2/3 of peak interconnect bandwidth.
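For comparison, a sketch of this second variant, again with placeholder names ctx, queue, and nbytes; the runtime pins the pages inside every clEnqueueWriteBuffer call, which is why this path typically stays around 2/3 of peak:

#include <CL/cl.h>
#include <stdlib.h>

/* Path 1, second variant: transfer directly from ordinary malloc'ed memory. */
void upload_from_user_memory(cl_context ctx, cl_command_queue queue, size_t nbytes)
{
    cl_int err;
    void *userPtr = malloc(nbytes);   /* pageable host memory, pinned on the fly */
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);

    /* The pin (page-lock) happens inside this call, before the host->device copy. */
    clEnqueueWriteBuffer(queue, deviceBuffer, CL_TRUE, 0, nbytes, userPtr,
                         0, NULL, NULL);

    clReleaseMemObject(deviceBuffer);
    free(userPtr);
}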
2. Using clEnqueueCopyBuffer() on a pre-pinned host buffer
Similar to 1, clEnqueueCopyBuffer transfers data at peak interconnect bandwidth through a pre-pinned buffer:
a. pinnedBuffer = clCreateBuffer( CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_HOST_PTR )
b. deviceBuffer = clCreateBuffer()
c. void *memory = clEnqueueMapBuffer( pinnedBuffer )
d. Application writes or modifies memory.
e. clEnqueueUnmapMemObject( pinnedBuffer, memory )
f. clEnqueueCopyBuffer( pinnedBuffer, deviceBuffer )
Or, for the read-back direction:
g. clEnqueueCopyBuffer( deviceBuffer, pinnedBuffer )
h. void *memory = clEnqueueMapBuffer( pinnedBuffer )
i. Application reads memory.
j. clEnqueueUnmapMemObject( pinnedBuffer, memory )
Since the pinned memory resides in host memory, the clEnqueueMapBuffer() and clEnqueueUnmapMemObject() calls do not move any data. The CPU can operate on these pinned buffers at host memory bandwidth.
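A sketch of the upload direction (steps a through f), assuming an existing cl_context ctx and cl_command_queue queue; the read-back direction (g through j) simply swaps the copy arguments and maps with CL_MAP_READ. All names are placeholders and error checking is omitted:

#include <CL/cl.h>
#include <string.h>

/* Path 2: staging through a pre-pinned buffer with clEnqueueCopyBuffer. */
void copybuffer_upload(cl_context ctx, cl_command_queue queue,
                       const void *src, size_t nbytes)
{
    cl_int err;

    /* a, b. Pre-pinned staging buffer and destination device buffer. */
    cl_mem pinnedBuffer = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nbytes, NULL, &err);
    cl_mem deviceBuffer = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);

    /* c. Map: no transfer, the pinned memory already resides in host memory. */
    void *memory = clEnqueueMapBuffer(queue, pinnedBuffer, CL_TRUE, CL_MAP_WRITE,
                                      0, nbytes, 0, NULL, NULL, &err);

    /* d. Fill the staging area at host memory bandwidth. */
    memcpy(memory, src, nbytes);

    /* e. Unmap: again no transfer, only the mapping is released. */
    clEnqueueUnmapMemObject(queue, pinnedBuffer, memory, 0, NULL, NULL);

    /* f. Copy pinned -> device at peak interconnect bandwidth. */
    clEnqueueCopyBuffer(queue, pinnedBuffer, deviceBuffer, 0, 0, nbytes, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(pinnedBuffer);
    clReleaseMemObject(deviceBuffer);
}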
3. Using clEnqueueMapBuffer() and clEnqueueUnmapMemObject() on a device buffer
For a buffer whose memory was allocated through malloc or mmap, the transfer cost includes, in addition to the interconnect transfer, a memcpy that copies the buffer into the mapped device buffer.
a. Data transfer from host to device buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
Since the buffer is mapped write-only, no data is transferred from the device to the host, so the map has low cost. A pointer to a pinned host buffer is returned.
2. The application fills the host buffer through memset( ptr ), memcpy( ptr, srcptr ), fread( ptr ), or direct CPU writes. These operations run at full host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
The pre-pinned buffer is transferred to the GPU device at peak interconnect bandwidth.
b. Data transfer from device buffer to host.
1. ptr = clEnqueueMapBuffer(.., buf, .., CL_MAP_READ, .. )
This command initiates a device-to-host transfer. The data is moved at peak interconnect bandwidth into a pre-pinned temporary buffer, and a pointer to that pinned memory is returned.
2. The application reads and processes the data, or runs memcpy( dstptr, ptr ), fwrite( ptr ), or a similar function. Since the buffer resides in host memory, these operations run at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
Since the buffer was mapped read-only, no actual data transfer takes place, so the unmap cost is low.
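A combined sketch of both directions of this path, assuming an existing cl_context ctx and cl_command_queue queue; buf is an ordinary device buffer, and src, dst, and nbytes are placeholders. Error checking is omitted:

#include <CL/cl.h>
#include <string.h>

/* Path 3: host<->device transfers through map/unmap of a regular device buffer. */
void roundtrip_via_map(cl_context ctx, cl_command_queue queue,
                       const void *src, void *dst, size_t nbytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);

    /* a. Host -> device: mapping write-only avoids any device->host transfer. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, nbytes, 0, NULL, NULL, &err);
    /* Fill the mapped region at host memory bandwidth. */
    memcpy(ptr, src, nbytes);
    /* Unmapping pushes the data to the device at peak interconnect bandwidth. */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);

    /* ... kernels would operate on buf here ... */

    /* b. Device -> host: mapping read-only triggers the device->host transfer. */
    ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                             0, nbytes, 0, NULL, NULL, &err);
    /* Read the results at host memory bandwidth. */
    memcpy(dst, ptr, nbytes);
    /* Read-only mapping means this unmap moves no data. */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clFinish(queue);

    clReleaseMemObject(buf);
}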
4. Host direct access to a device zero copy buffer
This access mode allows data transfer and GPU computation to overlap, which is useful for sparse writes or updates.
a. A zero copy device buffer is created with the following command:
buf = clCreateBuffer ( .., CL_MEM_USE_PERSISTENT_MEM_AMD, .. )
The CPU can access this buffer directly through the uncached write-combined (WC) path. A common pattern is double buffering: while the GPU processes the data in one buffer, the CPU fills the other.
A zero copy device buffer can also be used for sparse updates, such as assembling sub-rows of a larger matrix into a smaller, contiguous block for GPU processing. Due to the WC path, it is a good design choice to try to align writes to the cache line size, and to pick the write block size as large as possible.
b. Transfer from the host to the device.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_WRITE, .. )
This operation is low cost because the zero copy device buffer is directly mapped into the host address space.
2. The application transfers data via memset( ptr ), memcpy( ptr, srcptr ), or direct CPU writes.
The CPU writes directly across the interconnect into the zero copy device buffer. Depending on the chipset, the bandwidth can be of the same order of magnitude as the interconnect bandwidth, although it typically is lower than peak.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is low cost because the buffer continues to reside on the device.
c. If the buffer content must be read back later, use clEnqueueReadBuffer( .., buf, ..) or clEnqueueCopyBuffer( .., buf, zero copy host buffer, .. ).
This bypasses slow host reads through the uncached path.
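A sketch of this pattern, assuming an AMD platform that exposes the CL_MEM_USE_PERSISTENT_MEM_AMD extension flag (declared in CL/cl_ext.h) and an existing cl_context ctx and cl_command_queue queue; names and sizes are placeholders, and error checking is omitted:

#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <string.h>

/* Path 4: the host writes directly into a zero copy device buffer over the WC path. */
void fill_zero_copy_device_buffer(cl_context ctx, cl_command_queue queue,
                                  const void *src, size_t nbytes)
{
    cl_int err;

    /* a. Zero copy buffer that stays resident on the device (AMD-specific flag). */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD, nbytes, NULL, &err);

    /* b.1. Low-cost map: the buffer is mapped straight into the host address space. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, nbytes, 0, NULL, NULL, &err);

    /* b.2. CPU writes go directly across the interconnect; because of the uncached
       WC path, prefer large, cache-line-aligned writes and avoid reading back ptr. */
    memcpy(ptr, src, nbytes);

    /* b.3. Low-cost unmap: the data already lives on the device. */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
    clFinish(queue);

    /* c. If the contents must be read back later, prefer clEnqueueReadBuffer(buf)
       rather than reading through the uncached mapped pointer. */
    clReleaseMemObject(buf);
}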
5. GPU direct access to host zero copy memory
This option allows direct reads or writes of host memory by the GPU. A GPU kernel can import data from the host without explicit transfer, and write data directly back to host memory. An ideal use is to perform small I/Os straight from the kernel, or to integrate the transfer latency directly into the kernel execution time.
a. The application creates a zero copy host buffer.
buf = clCreateBuffer( .., CL_MEM_ALLOC_HOST_PTR, .. )
b. Next, the application modifies or reads the zero copy host buffer.
1. ptr = clEnqueueMapBuffer( .., buf, .., CL_MAP_READ | CL_MAP_WRITE, .. )
This operation is very low cost because it is a map of a buffer already residing in host memory.
2. The application modifies the data through memset( ptr ), memcpy( in either direction ), sparse or dense CPU reads or writes. Since the application is modifying a host buffer, these operations take place at host memory bandwidth.
3. clEnqueueUnmapMemObject( .., buf, ptr, .. )
As with the preceding map, this operation is very low cost because the buffer continues to reside in host memory.
c. The application runs clEnqueueNDRangeKernel(), using buffers of this type as input or output. GPU kernel reads and writes go across the interconnect to host memory, and the data transfer becomes part of the kernel execution.
The achievable bandwidth depends on the platform and chipset, but can be of the same order of magnitude as the peak interconnect bandwidth.
For discrete graphics cards, it is important to note that the resulting GPU kernel bandwidth is an order of magnitude lower than that of a kernel accessing a regular device buffer located on the device.
d. Following kernel execution, the application can access data in the host buffer in the same manner as described above.
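A sketch of this last path, assuming a previously built cl_kernel whose first argument is a __global pointer, plus an existing cl_context ctx and cl_command_queue queue; all names are placeholders and error checking is omitted:

#include <CL/cl.h>
#include <string.h>

/* Path 5: the GPU kernel reads and writes a zero copy host buffer directly, so the
   transfer cost is folded into the kernel execution time. */
void run_kernel_on_host_buffer(cl_context ctx, cl_command_queue queue,
                               cl_kernel kernel, const void *src, size_t nbytes,
                               size_t global_work_size)
{
    cl_int err;

    /* a. Zero copy host buffer. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, nbytes, NULL, &err);

    /* b. Cheap map/modify/unmap: the buffer already resides in host memory. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                   0, nbytes, 0, NULL, NULL, &err);
    memcpy(ptr, src, nbytes);
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);

    /* c. The kernel accesses buf across the interconnect; no explicit copy is enqueued. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL,
                           0, NULL, NULL);
    clFinish(queue);

    /* d. Results can be read back afterwards through another cheap map of buf. */
    clReleaseMemObject(buf);
}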