The sample programs that ship with CUDA use #pragma unroll in many places. The material I collected on it is summarized below:
1. Description in the official CUDA C Programming Guide v6.5:
By default, the compiler unrolls small loops with a known trip count. The #pragma unroll directive however can be used to control unrolling of any given loop. It must be placed immediately before the loop and only applies to that loop. It is optionally followed by a number that specifies how many times the loop must be unrolled. For example, in this code sample:
#pragma unroll 5
for (int i = 0; i < n; ++i)
The loop would be unrolled 5 times. The compiler would also insert code to ensure correctness (in the example above, to ensure that there will only be n iterations if n is less than 5, for example). It is up to the programmer to make sure that the specified unroll number gives the best performance.
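The correctness code mentioned above can be pictured as a remainder loop. The following is a hand-written sketch in host C++, not the code nvcc actually generates; the function name fill_unrolled5 and the loop body (storing each index) are assumptions chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written equivalent of a loop unrolled by a factor of 5.
// fill_unrolled5 is a hypothetical name used only in this sketch.
void fill_unrolled5(std::vector<int>& a) {
    const std::size_t n = a.size();
    std::size_t i = 0;

    // Main body: five iterations of the original loop per pass.
    for (; i + 5 <= n; i += 5) {
        a[i]     = static_cast<int>(i);
        a[i + 1] = static_cast<int>(i + 1);
        a[i + 2] = static_cast<int>(i + 2);
        a[i + 3] = static_cast<int>(i + 3);
        a[i + 4] = static_cast<int>(i + 4);
    }

    // Remainder: the last n % 5 elements, or every element when n < 5.
    for (; i < n; ++i) {
        a[i] = static_cast<int>(i);
    }

    // Sanity check: the two loops together ran exactly n iterations.
    assert(i == n);
}
```

With n = 3 the main body never runs and the remainder loop performs all three iterations, which is the "only n iterations if n is less than 5" case from the quoted text.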
#pragma unroll 1 will prevent the compiler from ever unrolling a loop. If no number is specified after #pragma unroll, the loop is completely unrolled if its trip count is constant, otherwise it is not unrolled at all.
In other words: by default the compiler unrolls small loops whose trip count is known, and #pragma unroll can be used to control the unrolling of any given loop. The directive must be placed immediately before the loop it controls and may be followed by an unroll factor.
The compiler guarantees correctness; choosing a factor that performs well is up to the programmer.
With a factor of 1, the compiler does not unroll the loop. With no factor, the compiler fully unrolls the loop if its trip count is a constant and does not unroll it at all otherwise.
2. #pragma unroll usage
The #pragma directive mainly changes how the compiler behaves. Its other options are well documented online, so here I only want to cover #pragma unroll, for which the online material is scarce and mostly vague. Consider the following code:
int main()
{
    int a[100];
#pragma unroll 4
    for (int i = 0; i < 100; i++)
    {
        a[i] = i;
    }
    return 0;
}
Loops account for most of a program's running time. When the compiler encounters #pragma unroll during compilation, it unrolls the loop that follows. For example, a loop with few iterations such as
for (int i = 0; i < 4; i++)
    cout << "Hello World" << endl;
can be unrolled to:
cout << "Hello World" << endl;
cout << "Hello World" << endl;
cout << "Hello World" << endl;
cout << "Hello World" << endl;
This makes the program more efficient because the loop counter updates and branch checks disappear. Of course, most compilers now perform this optimization automatically; #pragma unroll lets you control how far the compiler unrolls a loop. Going back to the program at the very beginning, its loop unrolls to:
for (int i = 0; i < 100; i += 4)
{
    a[i] = i;
    a[i + 1] = i + 1;
    a[i + 2] = i + 2;
    a[i + 3] = i + 3;
}
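To check that the unrolled form really is equivalent to the original loop, the two can be compared directly. This is a small host-side C++ sketch; the names kSize, fill_rolled, and fill_unrolled4 are assumptions introduced here, and the unrolling is done by hand rather than by the compiler.

```cpp
#include <array>
#include <cassert>

constexpr int kSize = 100;  // matches int a[100] in the example above

// Original form: one element per iteration.
std::array<int, kSize> fill_rolled() {
    std::array<int, kSize> a{};
    for (int i = 0; i < kSize; ++i) {
        a[i] = i;
    }
    return a;
}

// Hand-unrolled by 4: four elements per iteration.
std::array<int, kSize> fill_unrolled4() {
    // No remainder loop is needed only because 4 divides the trip count.
    static_assert(kSize % 4 == 0, "trip count must be a multiple of 4");
    std::array<int, kSize> a{};
    for (int i = 0; i < kSize; i += 4) {
        a[i]     = i;
        a[i + 1] = i + 1;
        a[i + 2] = i + 2;
        a[i + 3] = i + 3;
    }
    return a;
}
```

Both functions produce the same array; the unrolled version simply runs 25 loop iterations instead of 100. If the trip count were not a multiple of 4, a remainder loop would be needed for the leftover elements.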
3. CUDA compilation
CUDA's compiler is NVCC. NVCC is a driver that invokes a collection of compilation tools, each implementing a different stage of compilation. The basic workflow of NVCC is to separate the device code from the host code and compile the device code into binary form, a cubin object. At run time, an application can ignore the generated host code and load and execute the device code through the CUDA driver API.
The compiler front end processes CUDA source code according to C++ syntax rules. C++ is fully supported in host code, but only the C subset of C++ is supported in device code. C++ features such as classes, inheritance, and declaring variables inside basic blocks are not allowed in kernels. Because C++ syntax rules apply, a void pointer cannot be assigned to a non-void pointer without a type cast.
For more information on NVCC, see: http://download.csdn.net/source/1173428
__noinline__
__device__ functions are inlined by default. The __noinline__ qualifier can be used as a hint for the compiler not to inline the specified function. The compiler does not honor __noinline__ for functions with pointer parameters or with large parameter lists.
#pragma unroll
By default, the compiler unrolls small loops with a known trip count. #pragma unroll can specify how many times a loop is unrolled (the programmer must ensure that the unrolling is correct), for example
#pragma unroll 5
for (int i = 0; i < n; ++i)
#pragma unroll must be placed immediately before the loop it applies to.
#pragma unroll 1 prevents the compiler from unrolling the loop.
If no number is specified, the loop is fully unrolled when its trip count is constant and is not unrolled at all when the trip count is not known.
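The three forms can be tried side by side in a small host-side C++ sketch. Several things here are assumptions: the LOOP_PRAGMA macro, kCount, and the function names are hypothetical, and the guard assumes that nvcc and clang accept #pragma unroll while other host compilers are simply given a plain loop. Whether unrolling actually happens is up to the compiler; the results are the same either way.

```cpp
#include <cassert>

// Emit the pragma only for compilers assumed to accept `#pragma unroll`
// (nvcc and clang); elsewhere the macro expands to nothing.
#if defined(__CUDACC__) || defined(__clang__)
#define LOOP_PRAGMA(x) _Pragma(#x)
#else
#define LOOP_PRAGMA(x)
#endif

constexpr int kCount = 8;  // constant trip count for the full-unroll case

// Equivalent to writing `#pragma unroll 5` before the loop:
// a request to unroll by 5, with n known only at run time.
int sum_unroll5(const int* data, int n) {
    assert(n >= 0);
    int total = 0;
    LOOP_PRAGMA(unroll 5)
    for (int i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Equivalent to `#pragma unroll 1`: ask the compiler not to unroll.
int sum_no_unroll(const int* data, int n) {
    assert(n >= 0);
    int total = 0;
    LOOP_PRAGMA(unroll 1)
    for (int i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Equivalent to a bare `#pragma unroll`: the trip count kCount is a
// compile-time constant, so the loop is a candidate for full unrolling.
int sum_full_unroll(const int* data) {
    int total = 0;
    LOOP_PRAGMA(unroll)
    for (int i = 0; i < kCount; ++i) {
        total += data[i];
    }
    return total;
}
```

In a .cu file the same directives can be written directly as #pragma unroll lines in front of the loops; the macro only exists so the sketch also builds with host compilers that do not know the pragma.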