Getting started with CUDA: GPU hardware architecture
http://www.cnblogs.com/Fancyboy2004/archive/2009/04/28/1445637.html
Here we briefly introduce the architecture of the NVIDIA GPUs that currently support CUDA, focusing on the part that actually executes CUDA programs (basically, the shader units). The information here combines material published by NVIDIA with material NVIDIA has presented at various seminars and university courses, so there may be errors. The main sources are NVIDIA's CUDA Programming Guide 1.1, NVIDIA's introduction to CUDA sessions at Supercomputing '07, and the CUDA course at UIUC.
Currently, the NVIDIA graphics chips that support CUDA are the G8x/G9x series. The G80 chip supports CUDA 1.0, while G84, G86, G92, G94, and G96 support CUDA 1.1. Basically, apart from the earliest GeForce 8800 Ultra/GTX, the 320 MB/640 MB versions of the GeForce 8800 GTS, and the corresponding Tesla cards, which are CUDA 1.0, the other GeForce 8 and 9 series cards support CUDA 1.1. For details, refer to Appendix A of the CUDA Programming Guide 1.1.
All NVIDIA graphics chips that currently support CUDA are composed of multiple multiprocessors. Each multiprocessor contains eight stream processors, organized as two groups of four; that is, it can be viewed as a SIMD processor with two 4-wide groups. In addition, each multiprocessor has 8,192 registers, 16 KB of shared memory, a texture cache, and a constant cache.
Detailed multiprocessor information can be obtained through the CUDA API, using the cudaGetDeviceProperties() function or cuDeviceGetProperties(). However, there is no way to directly query how many multiprocessors a graphics chip has.
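As an illustration, the properties mentioned above can be printed with a short host program like the following (a minimal sketch using the runtime API; error checking is omitted):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device name:              %s\n", prop.name);
    printf("Registers per block:      %d\n", prop.regsPerBlock);
    printf("Shared memory per block:  %d bytes\n", (int) prop.sharedMemPerBlock);
    printf("Warp size:                %d\n", prop.warpSize);
    printf("Max threads per block:    %d\n", prop.maxThreadsPerBlock);
    return 0;
}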
In CUDA, most basic operations can be performed by the stream processors. Each stream processor contains an FMA (fused multiply-add) unit, which can perform a multiplication and an addition. More complex operations take longer to complete.
When executing a CUDA program, each stream processor corresponds to a thread, and each multiprocessor corresponds to a block. From the previous article it can be seen that a block often has many threads (for example, 256), far more than the stream processors in a multiprocessor. How does that work?
In fact, although a multiprocessor has only eight stream processors, every operation of a stream processor has latency, not to mention the latency of memory accesses. Therefore, when executing a program, CUDA uses warps. In current CUDA devices, a warp contains 32 threads, divided into two half-warps of 16 threads each. Because a stream processor has a latency of at least 4 cycles, for a group of four stream processors, at least 16 threads (that is, a half-warp) are executed at a time to effectively hide the latency of the various operations.
Because a multiprocessor does not have much memory, the state of each thread is kept directly in the multiprocessor's registers. Therefore, the more threads a multiprocessor executes at the same time, the more register space is required. For example, if a block contains 256 threads and each thread uses 20 registers, a total of 256 x 20 = 5,120 registers are needed to hold the state of those threads.
Currently, each multiprocessor in a CUDA device has 8,192 registers. Therefore, if each thread uses 16 registers, a multiprocessor can keep at most 512 threads in flight at the same time. If more threads than this need to run simultaneously, part of their data must be spilled to the graphics card's memory, which reduces execution efficiency.
Editor's note: the register file in NVIDIA's GT200 is doubled in size; 16 K registers are available in fp32 mode and 8 K in fp64 mode.
At present, each multiprocessor in a CUDA device has 16 KB of shared memory. Shared memory is divided into 16 banks. If the threads access different banks at the same time, there is no problem, and accessing shared memory is as fast as accessing a register. However, if two (or more) threads access data in the same bank at the same time, a bank conflict occurs, and those threads must access the shared memory in sequence rather than simultaneously.
Shared memory is divided into banks in units of 4 bytes. Therefore, assume the following data:
__shared__ int data[128];
data[0] is in bank 0, data[1] in bank 1, data[2] in bank 2, ..., data[15] in bank 15, and data[16] wraps back to bank 0. Since a warp is executed half-warp by half-warp, threads belonging to different half-warps cannot cause bank conflicts with each other.
Therefore, if the program accesses shared memory in the following way:
int number = data[base + tid];
then there will be no bank conflict and the highest efficiency can be achieved. However, if the access looks like this:
int number = data[base + 4 * tid];
then thread 0 and thread 4 access the same bank, as do thread 1 and thread 5, and so on, which causes bank conflicts. In this example, among the 16 threads of a half-warp, every four threads access the same bank, so the effective shared memory bandwidth drops to 1/4.
An important exception is that when many threads access the same shared memory address, shared memory can broadcast the 32-bit word at that address to all the reading threads, so no bank conflict occurs. For example:
int number = data[3];
This does not cause a bank conflict, because all threads read from the same address.
In many cases, bank conflicts in shared memory can be resolved by changing how the data is laid out. For example, the following program:
data[tid] = global_data[tid];
...
int number = data[16 * tid];
may cause severe bank conflicts. To avoid this, the data layout and the access pattern can be adjusted slightly:
int row = tid / 16;
int column = tid % 16;
data[row * 17 + column] = global_data[tid];
...
int number = data[17 * tid];
This no longer causes bank conflicts.
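Putting the fragments above together, the canonical use of this padding trick is reading a 16 x 16 tile of shared memory "column-wise", for example in a matrix transpose. The kernel below is only an illustrative sketch under assumed names and parameters (a 16 x 16 thread block, matrix dimensions that are multiples of 16); it is not taken from the original article:

#define TILE_DIM 16

__global__ void transpose16(float* odata, const float* idata, int width, int height)
{
    // Pad each row to 17 floats: the column-wise reads below then fall
    // into 16 different banks, so a half-warp has no bank conflicts.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];   // row-wise write

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    odata[y * height + x] = tile[threadIdx.x][threadIdx.y];  // column-wise read
}

Without the + 1 padding, the 16 threads of a half-warp reading tile[threadIdx.x][threadIdx.y] would all hit the same bank.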
Editor's note: shared memory goes by different names in NVIDIA's documents, such as PDC (parallel data cache) and PBSM (per-block shared memory).
Because the multiprocessors do not cache global memory (if each multiprocessor had its own global memory cache, a cache coherence protocol would be needed, which would greatly increase complexity), the latency of global memory accesses is very long. In addition, as mentioned in the previous article, accesses to global memory should be as contiguous as possible. This is a consequence of how DRAM is accessed.
More precisely, accesses to global memory should be "coalesced". Coalesced means that, besides being contiguous, the starting address must be a multiple of 16 times the size accessed by each thread. For example, if each thread reads 32-bit data, the address read by the first thread must be a multiple of 16 x 4 = 64 bytes.
If some threads do not read memory at all, this does not affect whether the accesses of the other threads are coalesced. For example:
if (tid != 3) {
    int number = data[tid];
}
Although thread 3 does not read any data, because the other threads still satisfy the coalescing conditions (assuming the address of data is a multiple of 64 bytes), this memory read is still coalesced.
In current CUDA 1.1 devices, the data size read by each thread can be 32 bits, 64 bits, or 128 bits. However, 32-bit reads are the most efficient; 64-bit reads are slightly less efficient, and 128-bit reads are much less efficient than 32-bit ones (but still better than non-coalesced reads).
If the data accessed by each thread is not 32, 64, or 128 bits in size, the coalescing conditions cannot be met. For example, the following program:
struct vec3d { float x, y, z; };
...
__global__ void func(struct vec3d* data, float* output)
{
    output[tid] = data[tid].x * data[tid].x +
                  data[tid].y * data[tid].y +
                  data[tid].z * data[tid].z;
}
This is not a coalesced read, because the size of vec3d is 12 bytes rather than 4, 8, or 16 bytes. To solve this problem, the __align__(n) specifier can be used, for example:
struct __align__(16) vec3d { float x, y, z; };
This makes the compiler add 4 bytes of padding at the end of vec3d to bring it up to 16 bytes. Another method is to turn the data structure into three separate contiguous arrays, for example:
__global__ void func(float* x, float* y, float* z, float* output)
{
    output[tid] = x[tid] * x[tid] + y[tid] * y[tid] +
                  z[tid] * z[tid];
}
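On the host side, using this version of the kernel simply means allocating the three arrays separately and launching it as usual; a minimal sketch (the array length N, THREAD_NUM, and the omitted cudaMemcpy calls are assumptions for illustration):

#define N 1024
#define THREAD_NUM 256

float *x, *y, *z, *output;
cudaMalloc((void**) &x, N * sizeof(float));
cudaMalloc((void**) &y, N * sizeof(float));
cudaMalloc((void**) &z, N * sizeof(float));
cudaMalloc((void**) &output, N * sizeof(float));
// ... copy the input data into x, y, and z with cudaMemcpy ...
func<<<N / THREAD_NUM, THREAD_NUM>>>(x, y, z, output);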
If the data structure cannot be changed for other reasons, you can also consider using shared memory to rearrange the data on the GPU. For example:
__global__ void func(struct vec3d* data, float* output)
{
    __shared__ float temp[THREAD_NUM * 3];
    const float* fdata = (float*) data;
    temp[tid] = fdata[tid];
    temp[tid + THREAD_NUM] = fdata[tid + THREAD_NUM];
    temp[tid + THREAD_NUM * 2] = fdata[tid + THREAD_NUM * 2];
    __syncthreads();
    output[tid] = temp[tid * 3] * temp[tid * 3] +
                  temp[tid * 3 + 1] * temp[tid * 3 + 1] +
                  temp[tid * 3 + 2] * temp[tid * 3 + 2];
}
In this example, the data is first read from global memory into shared memory in a contiguous way. Because the access order does not matter for shared memory (though bank conflicts still do; see the previous section), the non-coalesced read problem is avoided.
CUDA supports textures. In a CUDA kernel, the texture units of the graphics chip can be used to read texture data. The biggest differences between textures and global memory are that textures can only be read, not written, and that the graphics chip has a certain amount of texture cache. Therefore, texture reads do not need to follow the coalescing rules to achieve good efficiency. In addition, when reading a texture, the texture filtering hardware of the graphics chip (for example, bilinear filtering) can be used, and data type conversion can be done quickly, for example converting 32-bit RGBA data directly into four 32-bit floating-point numbers.
The texture cache on the graphics chip is designed for typical graphics applications, so it is still best suited to block-style access patterns rather than random access. Therefore, it is best for the threads within a warp to read data at nearby addresses to achieve the highest efficiency.
For data that already complies with the coalesced rules, using global memory is usually faster than using texture.
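As a sketch of how a kernel reads through a texture bound to linear memory with the CUDA 1.x texture reference API (the names and launch configuration here are illustrative assumptions):

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void read_through_texture(float* output, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        output[tid] = tex1Dfetch(texRef, tid);   // cached read, no coalescing rules
}

// Host side: bind a linear device buffer to the texture reference.
// cudaBindTexture(0, texRef, d_data, n * sizeof(float));
// read_through_texture<<<(n + 255) / 256, 256>>>(d_output, n);
// cudaUnbindTexture(texRef);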
The computing unit in a stream processor is basically a floating-point multiply-add unit; that is, it can perform one multiplication and one addition, as follows:
a = b * c + d;
The compiler automatically combines suitable pairs of additions and multiplications into a single FMAD instruction.
Besides floating-point addition and multiplication, integer addition, bitwise operations, comparison, minimum, maximum, and type conversion (floating point to integer or integer to floating point) can all be done at full speed. Integer multiplication cannot be done at full speed, but 24-bit integer multiplication can. In CUDA, the built-in __mul24 and __umul24 functions can be used for 24-bit integer multiplication.
Floating-point division is computed via the reciprocal followed by a multiplication, so its accuracy does not meet the IEEE 754 standard (the maximum error is 2 ulp). The built-in __fdividef(x, y) provides an even faster division with the same precision as the normal division, but it returns an incorrect result when 2^126 < y < 2^128.
In addition, CUDA provides some lower-precision intrinsic functions, including __expf, __logf, __sinf, __cosf, and __powf. These functions are faster but less accurate than the standard versions. For details, refer to Appendix B of the CUDA Programming Guide 1.1.
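A short sketch showing some of these built-ins inside a kernel (the kernel itself is purely illustrative; only the intrinsic functions are from the guide):

__global__ void intrinsics_demo(const float* in, float* out, int n)
{
    // __mul24 is safe here because the operands fit in 24 bits.
    int tid = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (tid < n) {
        float x = in[tid];
        // Faster but lower-precision device functions:
        out[tid] = __expf(x) + __sinf(x) + __fdividef(x, 3.0f);
    }
}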
Data transfer with main memory
In CUDA, the GPU cannot directly access main memory; it can only access the display memory on the graphics card. Therefore, data has to be copied from main memory to display memory for computation, and the results then copied back from display memory to main memory. These copies are limited by the speed of the PCI Express bus. With a PCI Express x16 slot, PCI Express 1.0 provides 4 GB/s of bandwidth in each direction, and PCI Express 2.0 provides 8 GB/s. Of course, these are theoretical values.
When copying data from ordinary (pageable) memory to display memory, because ordinary memory may be moved by the operating system at any time, CUDA first copies the data into an internal page-locked buffer and then uses DMA to copy it to the display memory. To avoid this extra copy, the cudaMallocHost function can be used to allocate page-locked memory in main memory directly. However, allocating too much page-locked memory interferes with the operating system's memory management and may reduce overall system performance.
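For example, a minimal sketch of using page-locked memory for the copies (buffer size and names are assumptions; error checking omitted):

float *h_data, *d_data;
size_t size = 1024 * 1024 * sizeof(float);

// Page-locked host buffer: the driver can DMA from it directly,
// avoiding the extra staging copy described above.
cudaMallocHost((void**) &h_data, size);
cudaMalloc((void**) &d_data, size);

// ... fill h_data ...
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
// ... launch kernels, then copy results back with cudaMemcpyDeviceToHost ...

cudaFree(d_data);
cudaFreeHost(h_data);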