CUDA Memory Model:
On the GPU chip: registers, shared memory;
On-board memory (video memory): local memory, constant memory, texture memory, global memory;
Host memory: pageable memory, pinned memory.
Registers: extremely low access latency;
Basic unit: the register file (32 bits per register);
Compute capability 1.0/1.1 hardware: 8192 registers per SM;
Compute capability 1.2/1.3 hardware: 16384 registers per SM;
The number of registers available to each thread is limited, so avoid declaring too many thread-private variables when programming;
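As a rough way to see the register budget described above, the following sketch asks the runtime how many registers the compiler gave each thread of a kernel and how many registers one SM offers. The kernel scale_values and both helper functions are hypothetical names made up for this illustration, and device 0 is assumed to be the GPU of interest.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to inspect.
__global__ void scale_values(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Registers the compiler assigned to each thread of scale_values, or -1 on error.
int registers_per_thread() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale_values) != cudaSuccess) return -1;
    return attr.numRegs;
}

// 32-bit registers available on one SM of device 0 (assumed device), or -1 on error.
int registers_per_sm() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    return prop.regsPerMultiprocessor;
}
```

Dividing registers_per_sm() by registers_per_thread() gives an upper bound on how many threads of this kernel can be resident on one SM as far as registers are concerned.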
Local memory: when a thread runs out of registers, its data is stored in local memory. Typical candidates:
Large structs or arrays;
Arrays whose size cannot be determined at compile time;
Thread inputs and intermediate variables;
A thread-private array that is initialized at its definition may be allocated in registers (the compiler can only do this when every index into the array is known at compile time);
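The sketch below illustrates the kind of thread-private array that usually ends up in local memory: it is indexed with a value known only at run time. Whether the array is really placed there is the compiler's decision, so the helper only reports what the runtime says. The names dynamic_index and local_bytes_per_thread are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: idx is only known at run time, so scratch cannot be
// mapped to fixed registers and is typically placed in local memory.
__global__ void dynamic_index(const int *idx, float *out) {
    float scratch[64];
    for (int i = 0; i < 64; ++i) scratch[i] = i * 0.5f;
    out[threadIdx.x] = scratch[idx[threadIdx.x] & 63];
}

// Bytes of local memory per thread reserved for dynamic_index, or -1 on error.
long local_bytes_per_thread() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, dynamic_index) != cudaSuccess) return -1;
    return (long)attr.localSizeBytes;
}
```

A non-zero result suggests the array (or spilled registers) went to local memory; a result of zero means the compiler found a way to keep everything in registers.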
Shared memory: access speed is close to that of registers;
Provides the lowest-latency way for threads in a block to communicate;
Typically holds shared counters or results common to the whole block;
On compute capability 1.0 to 1.3 hardware: 16 KB per SM, organized into 16 banks;
Declared with the __shared__ keyword, for example: __shared__ int sdata_static[16];
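To make the "block-wide result" use case concrete, here is a minimal sketch in which 16 threads cooperate through the sdata_static array to sum 16 integers. The kernel block_sum and the host helper sum16_on_device are hypothetical names, and the helper assumes the input holds exactly 16 values.

```cuda
#include <cuda_runtime.h>

#define BLOCK 16  // matches the 16-element sdata_static declaration

// Each block sums BLOCK ints by halving the number of active threads per step.
__global__ void block_sum(const int *in, int *out) {
    __shared__ int sdata_static[BLOCK];
    int tid = threadIdx.x;
    sdata_static[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();  // all loads must finish before any thread reads a neighbor
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride) sdata_static[tid] += sdata_static[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata_static[0];
}

// Hypothetical host helper: sums exactly BLOCK ints on the device.
// Returns -1 on a CUDA error (so it is only a sketch, not a general API).
int sum16_on_device(const int *host_in) {
    int *d_in = nullptr, *d_out = nullptr, result = -1;
    if (cudaMalloc(&d_in, BLOCK * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, host_in, BLOCK * sizeof(int), cudaMemcpyHostToDevice);
    block_sum<<<1, BLOCK>>>(d_in, d_out);
    if (cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        result = -1;
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The __syncthreads() calls are what make the shared array safe to use for communication: without them a thread could read a slot before its neighbor has written it.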
Global memory: resides in video memory and is also called linear memory (video memory can be allocated either as linear memory or as CUDA arrays);
Allocated with cudaMalloc(), released with cudaFree(); cudaMemcpy() transfers data between the host and the device;
To initialize global memory, call cudaMemset();
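The four calls named above can be combined as in the following sketch, which allocates a buffer in global memory, initializes it with cudaMemset(), copies it back to the host, and frees it. The helper name fill_on_device is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: allocates n bytes of global memory, sets every byte to
// `value`, and copies the bytes back. Returns an empty vector on a CUDA error.
std::vector<unsigned char> fill_on_device(size_t n, int value) {
    std::vector<unsigned char> host(n);
    unsigned char *dev = nullptr;
    if (cudaMalloc(&dev, n) != cudaSuccess) return {};
    bool ok = cudaMemset(dev, value, n) == cudaSuccess &&
              cudaMemcpy(host.data(), dev, n, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    if (!ok) host.clear();
    return host;
}
```

Note that cudaMemset() works byte by byte, like memset() on the host, so it is mainly useful for zeroing buffers or filling them with a repeated byte pattern.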
2D and 3D arrays: cudaMallocPitch() and cudaMalloc3D() allocate linear memory with padding so that each row meets the alignment requirements;
cudaMemcpy2D() and cudaMemcpy3D() copy such arrays to and from device memory;
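For the 2D case, a minimal sketch of the allocate-and-copy pattern might look like this. The runtime chooses the row stride (pitch), which may be larger than the row width, so both copies pass the pitch explicitly. The helper name roundtrip_2d is hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: copies a width x height float image into pitched device
// memory and back again. Returns an empty vector on any CUDA error.
std::vector<float> roundtrip_2d(const std::vector<float> &src, size_t width, size_t height) {
    std::vector<float> dst(width * height);
    float *dev = nullptr;
    size_t pitch = 0;                          // device row stride in bytes, set by the runtime
    size_t row_bytes = width * sizeof(float);  // host rows are tightly packed
    if (cudaMallocPitch(&dev, &pitch, row_bytes, height) != cudaSuccess) return {};
    bool ok =
        cudaMemcpy2D(dev, pitch, src.data(), row_bytes, row_bytes, height,
                     cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy2D(dst.data(), row_bytes, dev, pitch, row_bytes, height,
                     cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    if (!ok) dst.clear();
    return dst;
}
```

Inside a kernel, row r of such an allocation starts at (char *)dev + r * pitch, not at dev + r * width.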
Host memory: divided into pageable memory and pinned (page-locked) memory;
Pageable memory: memory allocated through the operating system API (malloc(), new);
Pinned memory: always resides in physical memory and is never paged out to slower virtual memory, so transfers to and from the device can be accelerated with DMA;
cudaHostAlloc() and cudaFreeHost() allocate and release pinned memory;
Advantages of pinned memory: higher data transfer bandwidth between the host and the device;
On some devices it can be mapped into the device address space through the zero-copy feature and accessed directly from the GPU, which saves copying data between host memory and video memory;
Do not allocate too much pinned memory: it reduces the physical memory the operating system has available for paging, which lowers overall system performance. Normally, pinned memory allocated by a CPU thread can be accessed as pinned memory only by that thread;
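A minimal sketch of allocating pinned memory and using it for transfers is shown below. It uses the cudaHostAllocDefault flag and ordinary cudaMemcpy() calls; the helper name pinned_roundtrip is hypothetical, and the sketch only checks correctness, it does not measure bandwidth.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: sends n ints from one pinned buffer to the device and
// back into a second pinned buffer. Returns true if the data is unchanged.
bool pinned_roundtrip(int n) {
    int *h_src = nullptr, *h_dst = nullptr, *dev = nullptr;
    size_t bytes = n * sizeof(int);
    bool ok = cudaHostAlloc((void **)&h_src, bytes, cudaHostAllocDefault) == cudaSuccess &&
              cudaHostAlloc((void **)&h_dst, bytes, cudaHostAllocDefault) == cudaSuccess &&
              cudaMalloc((void **)&dev, bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) h_src[i] = i;
        ok = cudaMemcpy(dev, h_src, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
             cudaMemcpy(h_dst, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (h_dst[i] == i);
    }
    if (dev) cudaFree(dev);
    if (h_dst) cudaFreeHost(h_dst);   // pinned memory must be released with cudaFreeHost
    if (h_src) cudaFreeHost(h_src);
    return ok;
}
```

Replacing the two cudaHostAlloc() calls with malloc() gives the pageable variant, which is a simple way to compare transfer times on a given machine.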
In CUDA 2.3, the pinned memory feature was extended:
Portable memory: lets host threads that control different GPUs operate on the same block of pinned memory, enabling communication between CPU threads. Pass the cudaHostAllocPortable flag when allocating page-locked memory with cudaHostAlloc();
Write-combined memory: speeds up one-way transfers from the CPU to the GPU. The block of pinned memory is not cached in the CPU's L1 and L2 caches, which leaves cache resources to other programs, and it is not snooped by the CPU during PCI-E transfers. Pass the cudaHostAllocWriteCombined flag when calling cudaHostAlloc(). CPU reads from this memory are very slow;
Mapped memory: has two addresses, a host address (in host memory) and a device address (in the device address space). A kernel can access the data in mapped memory directly, with no copy between host memory and video memory; this is the zero-copy feature. The host pointer is returned by cudaHostAlloc(); the device pointer is obtained with cudaHostGetDevicePointer(). The canMapHostMemory property returned by cudaGetDeviceProperties() tells whether the device supports mapped memory. Pass the cudaHostAllocMapped flag when calling cudaHostAlloc() to map the pinned memory into the device address space. Synchronization is required to keep CPU and GPU operations on the same memory in a consistent order. A block of pinned memory can be both portable and mapped. Before performing any other CUDA operation, call cudaSetDeviceFlags() with the cudaDeviceMapHost flag to enable page-locked memory mapping.
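The zero-copy steps for mapped memory can be sketched as follows, assuming device 0 and a hypothetical kernel add_one. The helper zero_copy_add_one is also a made-up name. The return value of cudaSetDeviceFlags() is deliberately ignored here, because the call can fail harmlessly if a context already exists.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that updates the mapped buffer in place.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns true if the kernel updated the mapped host buffer directly.
// Returns false on a CUDA error or if the device cannot map host memory.
bool zero_copy_add_one(int n) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return false;
    cudaSetDeviceFlags(cudaDeviceMapHost);  // request mapping support before other CUDA work

    int *host = nullptr, *dev = nullptr;
    if (cudaHostAlloc((void **)&host, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) host[i] = i;

    // Second address of the same memory, valid inside kernels.
    bool ok = cudaHostGetDevicePointer((void **)&dev, host, 0) == cudaSuccess;
    if (ok) {
        add_one<<<(n + 255) / 256, 256>>>(dev, n);
        ok = cudaDeviceSynchronize() == cudaSuccess;  // wait before the CPU reads the results
        for (int i = 0; ok && i < n; ++i) ok = (host[i] == i + 1);
    }
    cudaFreeHost(host);
    return ok;
}
```

There is no cudaMemcpy() anywhere in this sketch: the kernel reads and writes host memory across the bus, and the cudaDeviceSynchronize() call provides the ordering between GPU writes and CPU reads that the text asks for.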
Constant memory: a read-only address space located in video memory, accelerated by a cache, 64 KB in size. It stores read-only parameters that are accessed frequently. It is defined outside all functions with the __constant__ keyword. There are two ways to use it: initialize it directly at its definition, or define a constant array and assign its values from the host with a function such as cudaMemcpyToSymbol();
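Both ways of using constant memory can be sketched together as below. The variable names gain and coeffs, the kernel apply_coeffs, and the helper constant_demo are all hypothetical; cudaMemcpyToSymbol() is used here as the host-side function that fills the array.

```cuda
#include <cuda_runtime.h>

// Method 1: initialize the constant directly at its definition.
__constant__ float gain = 2.0f;
// Method 2: define a constant array and fill it from the host later.
__constant__ float coeffs[4];

// Hypothetical kernel: every thread reads both constants.
__global__ void apply_coeffs(float *out) {
    int i = threadIdx.x;
    out[i] = gain * coeffs[i];
}

// Returns true if out[i] == 2 * host_coeffs[i] for i = 0..3.
bool constant_demo() {
    const float host_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float result[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    float *dev = nullptr;
    if (cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs)) != cudaSuccess)
        return false;
    if (cudaMalloc(&dev, sizeof(result)) != cudaSuccess) return false;
    apply_coeffs<<<1, 4>>>(dev);
    bool ok = cudaMemcpy(result, dev, sizeof(result), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    for (int i = 0; ok && i < 4; ++i) ok = (result[i] == 2.0f * host_coeffs[i]);
    return ok;
}
```

The kernel never receives gain or coeffs as arguments; they are visible to every kernel in the same file because they are declared at file scope.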
Texture memory: read-only. It is not a dedicated memory but a texture pipeline involving video memory, a two-level texture cache, and texture fetch units; the data is usually stored in video memory as one-, two-, or three-dimensional arrays; accesses are accelerated by the cache; it can be declared much larger than constant memory; it is well suited to image processing and lookup tables; it speeds up random or unaligned access to large amounts of data. Reading texture memory in a kernel is called texture fetching; the coordinates used for a texture fetch can differ from the position of the data in video memory, and the mapping between the two is defined through a texture reference. Associating data in video memory with a texture reference is called texture binding; the data that can be bound to a texture are ordinary linear memory and CUDA arrays; the filtering mode and addressing mode can be set;
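The sketch below shows binding linear memory to a texture and fetching from it in a kernel. Note the substitution: the text above describes texture references, but this example uses the newer texture object API (cudaCreateTextureObject() and tex1Dfetch()), because texture references are deprecated in recent CUDA releases. The kernel fetch_and_double and the helper texture_demo are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical kernel: reads n floats through the texture path and doubles them.
__global__ void fetch_and_double(cudaTextureObject_t tex, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * tex1Dfetch<float>(tex, i);
}

// Returns true if every fetched value matches the data bound to the texture.
bool texture_demo() {
    const int n = 8;
    float host[n], result[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(host)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(host)) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, host, sizeof(host), cudaMemcpyHostToDevice);

    // Describe the linear memory to be bound (the "texture binding" step).
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = sizeof(host);

    cudaTextureDesc td;
    memset(&td, 0, sizeof(td));
    td.readMode = cudaReadModeElementType;  // return the stored floats unchanged

    cudaTextureObject_t tex = 0;
    bool ok = cudaCreateTextureObject(&tex, &res, &td, nullptr) == cudaSuccess;
    if (ok) {
        fetch_and_double<<<1, n>>>(tex, d_out, n);
        ok = cudaMemcpy(result, d_out, sizeof(result), cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (result[i] == 2.0f * host[i]);
        cudaDestroyTextureObject(tex);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

This sketch binds linear memory and fetches by integer index. Binding a CUDA array instead is what allows normalized coordinates and the filtering and addressing modes mentioned above to take effect.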
The source of this article is: http://blog.csdn.net/ouczoe/article/details/5125621