CUDA: Memory Model



Kernel performance cannot be explained by warp execution alone. For example, the previous post showed that setting the block dimension to half the warp size lowers load efficiency, which cannot be explained by warp scheduling or parallelism. The root cause is inefficient access to global memory.

It is well known that memory operations occupy a very important position in any language that emphasizes efficiency. Low latency and high bandwidth are the ideal conditions for high performance, but buying memory that is both large and fast is unrealistic, or at least uneconomical. Therefore, we should rely on the software level to obtain the best latency and bandwidth. CUDA divides the memory model into two systems, device and host, and exposes the memory hierarchy for us to operate on directly, giving users ample flexibility.

Benefits of a Memory Hierarchy

In general, program access to resources is regular; in computer architecture this is referred to as the principle of locality, which is divided into temporal locality and spatial locality. Since most readers are familiar with this material, it is only mentioned briefly here.

The main memory of both the GPU and the CPU is implemented with DRAM, while caches are implemented with lower-latency SRAM. The storage hierarchies of the GPU and the CPU are basically the same. However, CUDA exposes more of the memory hierarchy to the user, allowing more flexible control over program behavior.

CUDA Memory Model

For programmers, memory can be divided into the following two categories:

    • Programmable: we can explicitly control where data is placed.
    • Non-programmable: we cannot control placement; an automatic mechanism tries to achieve good performance.

In the CPU's storage hierarchy, both the L1 and L2 caches are non-programmable. CUDA, by contrast, offers a rich set of programmable memory types:

    • Registers
    • Shared Memory
    • Local Memory
    • Constant Memory
    • Texture Memory
    • Global Memory

Each of these memory spaces has its own scope, lifetime, and caching behavior.

Constant and texture memory are read-only. The bottom three (global, constant, and texture) share the same lifetime.


Registers

Registers are the fastest memory on the GPU. Automatic variables declared in a kernel without any other qualifier are generally placed in registers. Arrays whose indices are constant and can be determined at compile time, and whose elements are of built-in types, may also be placed in registers.

Register variables are private to each thread; once the thread finishes executing, its register variables become invalid. Registers are a scarce resource: on Fermi, each thread is limited to at most 63 registers; on Kepler, the limit is 255. Letting your kernel use fewer registers allows more blocks to reside on an SM, increasing occupancy and improving performance.
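As an illustrative sketch (the kernel name and variables are hypothetical), automatic scalar variables like the ones below are the kind that the compiler typically keeps in registers:

```cuda
// Hypothetical kernel: automatic variables are private per thread and
// are normally placed in registers by the compiler.
__global__ void axpy(const float *x, const float *y, float *out,
                     float a, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // automatic -> register
    if (idx < n) {
        float tmp = a * x[idx];   // held in a register until stored
        out[idx] = tmp + y[idx];
    }
}
```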

Using nvcc's -Xptxas -v,-abi=no option (here -Xptxas indicates that the options following it are passed to ptxas, not to nvcc; -v means verbose; abi stands for application binary interface), we can see the number of registers each thread uses, along with the sizes of the shared memory and constant memory. If a kernel uses more registers than the hardware limit, the excess is placed in local memory instead of registers, the so-called register spilling, which we should try to avoid. The compiler has strategies to minimize register use and avoid spilling. We can also explicitly add information in the code to help the compiler optimize:
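For example (the source file name is hypothetical), a compile line using these flags looks like this; ptxas then prints per-kernel resource usage:

```shell
# -Xptxas forwards the comma-separated options that follow it to ptxas
nvcc -arch=sm_20 -Xptxas -v,-abi=no kernel.cu -o kernel
# ptxas prints resource usage per kernel, e.g. registers used and
# constant/shared memory sizes
```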

__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
kernel(...) {
    // your kernel body
}

maxThreadsPerBlock specifies the maximum number of threads per block that the kernel will be launched with. minBlocksPerMultiprocessor is an optional parameter that indicates the minimum number of resident blocks desired per SM.

We can also use -maxrregcount=32 to specify the maximum number of registers used by kernels. If __launch_bounds__ is used, the value specified here is ignored for that kernel.

Local Memory

Sometimes, when registers are not enough, local memory is used in place of the overflowing register space. In the following scenarios, the compiler may place variables in local memory:

    • Local arrays whose indices cannot be determined at compile time.
    • Large structures or arrays that would consume too many registers.
    • Any variable that exceeds the kernel's register limit.
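A sketch of the first case (names are hypothetical): an array indexed by a value unknown at compile time cannot live in registers, so the compiler is likely to place it in local memory:

```cuda
// Hypothetical kernel: buf is dynamically indexed, so the compiler
// cannot map it onto registers and will likely place it in local memory.
__global__ void localExample(const int *indices, float *out) {
    float buf[32];
    for (int i = 0; i < 32; ++i)
        buf[i] = i * 1.0f;
    int j = indices[threadIdx.x];     // index unknown at compile time
    out[threadIdx.x] = buf[j % 32];   // forces buf into local memory
}
```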

The name "local memory" is misleading: variables in local memory physically reside in the same storage area as global memory, so local memory has very high latency and low bandwidth. On CC 2.0 and above, local memory accesses are cached by the GPU's two-level hierarchy of L1 (per-SM) and L2 (per-device).

Shared Memory

Variables qualified with the __shared__ qualifier are stored in shared memory. Because shared memory is on-chip, it has much higher bandwidth and much lower latency than local memory and global memory. Its role is very similar to a CPU's L1 cache, but it is programmable.

As is customary for memory with this kind of performance, its capacity is limited, and shared memory is allocated per block. We must use shared memory carefully, or we will unknowingly limit the number of active warps.

Unlike registers, shared memory, although declared inside the kernel, has a lifetime that spans the entire block, not a single thread. When the block has finished executing, the shared memory it holds is released and reassigned to another block.
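A minimal sketch of block-level cooperation through shared memory (the kernel name and block size are hypothetical): each block stages its chunk in on-chip memory, synchronizes, and then threads read slots written by other threads:

```cuda
#define BLOCK 64

// Hypothetical kernel: each block reverses its own chunk of the input
// using a shared-memory staging buffer.
__global__ void reverseInBlock(const int *in, int *out) {
    __shared__ int tile[BLOCK];   // one copy per block, lives as long as the block
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * BLOCK + t];
    __syncthreads();              // wait until every thread has written its slot
    out[blockIdx.x * BLOCK + t] = tile[BLOCK - 1 - t];
}
```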

Shared memory is the basic mechanism for thread communication. Threads in the same block cooperate through data in shared memory. Before reading data that another thread has written to shared memory, you must synchronize with __syncthreads(). The L1 cache and shared memory use the same 64KB of on-chip memory, and we can configure the split dynamically with the following API:

cudaError_t cudaFuncSetCacheConfig(const void* func, enum cudaFuncCache cacheConfig);

func identifies the kernel, and cacheConfig selects the cache preference, which can be one of the following:

cudaFuncCachePreferNone: no preference (default)

cudaFuncCachePreferShared: prefer 48KB shared memory and 16KB L1 cache

cudaFuncCachePreferL1: prefer 48KB L1 cache and 16KB shared memory

cudaFuncCachePreferEqual: prefer an equal split of L1 cache and shared memory, 32KB each

Fermi supports only the first three configurations; Kepler supports all of them.
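A brief usage sketch, assuming a kernel named myKernel (the name is hypothetical):

```cuda
// Ask the runtime to favor shared memory for this kernel
// (48KB shared memory / 16KB L1 cache on Fermi and Kepler).
cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
```

This is a per-kernel hint: the runtime may still choose a different configuration if, for example, the kernel's static shared memory requirement exceeds the preferred split.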

Constant Memory

Constant memory resides in device memory and is cached in a dedicated per-SM constant cache. Declarations must be qualified with __constant__. Constant variables have global scope and are visible to all kernels in the same compilation unit; the size is 64KB for all compute capabilities.

Kernels can only read from constant memory, so it must be initialized on the host side with the following function:

cudaError_t cudaMemcpyToSymbol(const void* symbol, const void* src, size_t count);

This function copies count bytes from the address pointed to by src to symbol, which refers to a location in global or constant memory on the device.

Constant memory performs best when all threads in a warp read from the same address, for example the coefficients of a formula. If every thread reads from a different address, and each address is read only once, constant memory is a poor choice, because a single read of constant memory is broadcast to all threads in the warp.
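A sketch of the coefficient use case described above (names are hypothetical): every thread in a warp reads the same coefficient at each step, so each access is served by a single broadcast from the constant cache:

```cuda
// Hypothetical: polynomial coefficients, set once from the host
// with cudaMemcpyToSymbol(coef, hostCoef, sizeof(coef)).
__constant__ float coef[4];

__global__ void poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // all threads in the warp read the same coef[k] -> one broadcast each
        y[i] = coef[0] + v * (coef[1] + v * (coef[2] + v * coef[3]));
    }
}
```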

Texture Memory

Texture memory resides in device memory and is cached in a per-SM read-only cache. Texture memory is actually a region of global memory, but it has its own dedicated read-only cache, which is useful for certain floating-point operations such as filtering. The texture cache is optimized for 2D spatial locality, so threads that access 2D data can use texture memory to achieve high performance. In D3D programming there are two important basic storage spaces, one of which is the texture.

Global Memory

Global memory is the largest, highest-latency, and most fundamental memory on the GPU. "Global" refers to its scope and lifetime: any SM can access it throughout the lifetime of the program. Variables in global memory can be declared either statically or dynamically; the __device__ qualifier marks a static device variable. Dynamic global memory is allocated with the familiar cudaMalloc and released with cudaFree. Global memory resides in device memory and is accessed via 32-byte, 64-byte, or 128-byte memory transactions. These transactions must be naturally aligned, meaning the first address must be a multiple of 32, 64, or 128. Optimizing memory transactions is critical for performance. When a warp performs a memory load/store, the number of transactions required depends on two factors:

    1. The distribution of memory addresses across the threads of the warp.
    2. The alignment of the memory addresses within each transaction.

In general, the more transactions that are required, the more potentially unnecessary data is transferred, reducing throughput efficiency.

For a given warp memory request, the number of transactions and the throughput efficiency are determined by the device's compute capability. On CC 1.0 and 1.1, the requirements for coalesced global memory access are very strict; on later devices, the requirements are relaxed thanks to the caches.
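A sketch contrasting the two factors above (kernel names are hypothetical): consecutive threads reading consecutive addresses coalesce into few transactions, while a strided pattern multiplies them:

```cuda
// Coalesced: thread k of a warp reads element k, so an aligned warp
// load is served by a small number of 128-byte transactions.
__global__ void coalesced(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: thread k reads element k * 32, so the warp's addresses are
// scattered and may require a separate transaction per thread.
__global__ void strided(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * 32];
}
```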

GPU Cache

Like the CPU's caches, the GPU's caches are non-programmable. The GPU contains the following types of caches, as mentioned in the previous article:

    • L1
    • L2
    • Read-only constant
    • Read-only texture

Each SM has an L1 cache, and all SMs share one L2 cache. Both are used to cache local and global memory accesses, including register spills. On Fermi GPUs and on Kepler K40 or later, CUDA allows us to configure whether data read operations use both L1 and L2 or only L2.

On the CPU, both memory loads and stores can be cached. On the GPU, however, only load operations are cached; stores are not.

Each SM has a read-only constant cache and texture cache to improve performance.

CUDA Variable Declaration Summary

The following table summarizes the memory declarations presented earlier:

Qualifier      Declaration        Memory     Scope    Lifetime
(automatic)    float var          register   thread   thread
(automatic)    float var[100]     local      thread   thread
__shared__     float var          shared     block    block
__device__     float var          global     global   application
__constant__   float var          constant   global   application

Static Global Memory

The following code shows how to statically declare a global variable (the previous posts actually used global variables already). The general flow: first declare a float global variable; in checkGlobalVariable, its value is printed and then changed. In main, the value is initialized with cudaMemcpyToSymbol. Finally, after the kernel has changed the global variable, the value is copied back to the host.

#include <cuda_runtime.h>
#include <stdio.h>

__device__ float devData;

__global__ void checkGlobalVariable() {
    // display the original value
    printf("Device: the value of the global variable is %f\n", devData);
    // alter the value
    devData += 2.0f;
}

int main(void) {
    // initialize the global variable
    float value = 3.14f;
    cudaMemcpyToSymbol(devData, &value, sizeof(float));
    printf("Host:   copied %f to the global variable\n", value);

    // invoke the kernel
    checkGlobalVariable<<<1, 1>>>();

    // copy the global variable back to the host
    cudaMemcpyFromSymbol(&value, devData, sizeof(float));
    printf("Host:   the value changed by the kernel to %f\n", value);

    cudaDeviceReset();
    return EXIT_SUCCESS;
}

Compile and run:

$ nvcc -arch=sm_20 globalVariable.cu -o globalVariable
$ ./globalVariable



Once you are familiar with CUDA's basic ideas, this is not hard to understand: although the host and device code are written in the same source file, they execute in completely different worlds. The host cannot directly access device variables, and vice versa.

We might argue that the following code lets the host access the device's global variable:

cudaMemcpyToSymbol(devData, &value, sizeof(float));

However, we should also note the following points:

    • The function is a CUDA runtime API call, implemented using the GPU.
    • devData here is just a symbol, not the address of a device variable.
    • Inside the kernel, devData is used as an ordinary variable.

Moreover, cudaMemcpy cannot be passed the variable as &devData; as stated above, devData is just a symbol, so the operation itself is wrong:

cudaMemcpy(&devData, &value, sizeof(float), cudaMemcpyHostToDevice); // It's wrong!!!

Nevertheless, CUDA provides a way to obtain the address of the variable behind the devData symbol:

cudaError_t cudaGetSymbolAddress(void** devPtr, const void* symbol);

Once you get the address, you can use cudamemcpy:

float *dptr = NULL;
cudaGetSymbolAddress((void**)&dptr, devData);
cudaMemcpy(dptr, &value, sizeof(float), cudaMemcpyHostToDevice);

There is only one way for the host to access GPU memory directly, namely pinned memory, which is described in detail later.

Memory Management

Coming soon...

