Since much of this book overlaps with other books that explain CUDA, I only translate the key points. Time is money. Let's learn CUDA together; if anything here is wrong, please correct me.
I did not have time to look closely at Chapters 1 and 2, so we start from Chapter 3.
I don't like depending on someone else's code, so I don't use the book's header file; I rewrite all the programs myself. Some of the programs are rather dull.
// hello.cu
#include <stdio.h>
#include <cuda.h>

int main(void) {
    printf("Hello, World!\n");
    return 0;
}
This first CUDA program is not a CUDA program in the strict sense; it only includes the CUDA header. Compile it with: nvcc hello.cu -o hello
Run it with: ./hello
Nothing is executed on the GPU.
Second Program
#include <stdio.h>
#include <cuda.h>

__global__ void kernel(void) {}

int main(void) {
    kernel<<<1, 1>>>();
    printf("Hello, World!\n");
    return 0;
}
This program declares a function with the __global__ qualifier, which means the function is called from the CPU (host) and executed on the GPU (device).
What do the parameters inside the triple angle brackets mean? See the next chapter.
#include <stdio.h>
#include <cuda.h>

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void)
{
    int c;
    int *dev_c;
    cudaMalloc((void **)&dev_c, sizeof(int));
    add<<<1, 1>>>(2, 7, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}
cudaMalloc() allocates storage on the GPU. cudaMemcpy() copies data between host and device: with cudaMemcpyDeviceToHost it copies the result from the GPU back to the CPU, and with cudaMemcpyHostToDevice it copies input data from the CPU to the GPU.
cudaFree() releases memory on the GPU, just as free() releases memory on the CPU; only the target differs.
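The example above only copies in one direction (device to host). Below is a minimal sketch of the other direction, using only the calls already introduced; the variable names and the idea of passing the inputs through device memory are my own illustration, not the book's code.

#include <stdio.h>
#include <cuda.h>

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;    // runs on the GPU
}

int main(void)
{
    int a = 2, b = 7, c;
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void **)&dev_a, sizeof(int));
    cudaMalloc((void **)&dev_b, sizeof(int));
    cudaMalloc((void **)&dev_c, sizeof(int));

    // copy the inputs from the CPU (host) to the GPU (device)
    cudaMemcpy(dev_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, 1>>>(dev_a, dev_b, dev_c);

    // copy the result back from the GPU to the CPU
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", a, b, c);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}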
The focus of this chapter (for me) is 3.3, querying the GPU (device).
The chapter's point is that if you don't know the specifications of the GPU you are using, or you are too lazy to open the case and check, or you want your program to run across more hardware environments, you should query the GPU parameters from within the program.
Skipping a lot of filler, here is the useful part.
Many computers today have more than one GPU, especially machines that combine an integrated GPU with a discrete graphics card.
int count;
cudaGetDeviceCount(&count);
This obtains the number of CUDA-capable devices in the system.
The capabilities of each device can then be read from the cudaDeviceProp structure.
The definition below is from CUDA 3.0.
The structure is already defined by the CUDA runtime, so you can use it directly in your own program without defining it yourself.
struct cudaDeviceProp {
    char name[256];               // device name
    size_t totalGlobalMem;        // size of global memory in bytes
    size_t sharedMemPerBlock;     // maximum shared memory usable by a thread block, in bytes; all blocks resident on a multiprocessor share this memory
    int regsPerBlock;             // maximum number of 32-bit registers available to a thread block; all threads resident on a multiprocessor share these registers
    int warpSize;                 // number of threads in a warp
    size_t memPitch;              // maximum pitch in bytes allowed by cudaMallocPitch() for copies involving pitched memory
    int maxThreadsPerBlock;       // maximum number of threads per block
    int maxThreadsDim[3];         // maximum size of each block dimension
    int maxGridSize[3];           // maximum size of each grid dimension
    size_t totalConstMem;         // size of constant memory
    int major;                    // major compute capability number
    int minor;                    // minor compute capability number
    int clockRate;                // clock rate
    size_t textureAlignment;      // alignment requirement for textures
    int deviceOverlap;            // whether the device can execute a cudaMemcpy() and a kernel at the same time
    int multiProcessorCount;      // number of multiprocessors on the device
    int kernelExecTimeoutEnabled; // whether there is a runtime limit on kernel execution
    int integrated;               // whether the GPU is integrated
    int canMapHostMemory;         // whether the device can map host memory into the CUDA device address space
    int computeMode;              // compute mode
    int maxTexture1D;             // maximum size of 1D textures
    int maxTexture2D[2];          // maximum dimensions of 2D textures
    int maxTexture3D[3];          // maximum dimensions of 3D textures
    int maxTexture2DArray[3];     // maximum dimensions of 2D texture arrays
    int concurrentKernels;        // whether the device supports executing multiple kernels concurrently
};
Example program:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main()
{
    int i;
    int count;
    cudaGetDeviceCount(&count);
    printf("The count of CUDA devices: %d\n", count);

    cudaDeviceProp prop;
    for (i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);

        printf("\n--- General information for device %d ---\n", i);
        printf("Name of the CUDA device: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap (simultaneously perform a cudaMemcpy() and kernel execution): ");
        if (prop.deviceOverlap)
            printf("enabled\n");
        else
            printf("disabled\n");
        printf("Kernel execution timeout (whether there is a runtime limit for kernels executed on this device): ");
        if (prop.kernelExecTimeoutEnabled)
            printf("enabled\n");
        else
            printf("disabled\n");

        printf("\n--- Memory information for device %d ---\n", i);
        printf("Total global mem in bytes: %ld\n", prop.totalGlobalMem);
        printf("Total constant mem: %ld\n", prop.totalConstMem);
        printf("Max mem pitch for memory copies in bytes: %ld\n", prop.memPitch);
        printf("Texture alignment: %ld\n", prop.textureAlignment);

        printf("\n--- MP information for device %d ---\n", i);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        printf("Shared mem per MP (block): %ld\n", prop.sharedMemPerBlock);
        printf("Registers per MP (block): %d\n", prop.regsPerBlock);
        printf("Threads in warp: %d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions in a block: (%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max block dimensions in a grid: (%d, %d, %d)\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

        printf("\nIs the device an integrated GPU: ");
        if (prop.integrated)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Whether the device can map host memory into CUDA device address space: ");
        if (prop.canMapHostMemory)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Device's compute mode: %d\n", prop.computeMode);

        printf("\nThe maximum size for 1D textures: %d\n", prop.maxTexture1D);
        printf("The maximum dimensions for 2D textures: (%d, %d)\n",
               prop.maxTexture2D[0], prop.maxTexture2D[1]);
        printf("The maximum dimensions for 3D textures: (%d, %d, %d)\n",
               prop.maxTexture3D[0], prop.maxTexture3D[1], prop.maxTexture3D[2]);
        /* printf("The maximum dimensions for 2D texture arrays: (%d, %d, %d)\n",
               prop.maxTexture2DArray[0], prop.maxTexture2DArray[1], prop.maxTexture2DArray[2]); */

        printf("Whether the device supports executing multiple kernels within the same context simultaneously: ");
        if (prop.concurrentKernels)
            printf("Yes!\n");
        else
            printf("No!\n");
    }

    return 0;
}
Running result:
The count of CUDA devices: 1

--- General information for device 0 ---
Name of the CUDA device: GeForce GTX 470
Compute capability: 2.0
Clock rate: 1215000
Device copy overlap (simultaneously perform a cudaMemcpy() and kernel execution): enabled
Kernel execution timeout (whether there is a runtime limit for kernels executed on this device): enabled

--- Memory information for device 0 ---
Total global mem in bytes: 1341325312
Total constant mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture alignment: 512

--- MP information for device 0 ---
Multiprocessor count: 14
Shared mem per MP (block): 49152
Registers per MP (block): 32768
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions in a block: (1024, 1024, 64)
Max block dimensions in a grid: (65535, 65535, 65535)

Is the device an integrated GPU: No!
Whether the device can map host memory into CUDA device address space: Yes!
Device's compute mode: 0

The maximum size for 1D textures: 65536
The maximum dimensions for 2D textures: (65536, 65535)
The maximum dimensions for 3D textures: (2048, 2048, 2048)
Whether the device supports executing multiple kernels within the same context simultaneously: Yes!
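Not part of the chapter's code, but a common use of these properties is to pick a device at run time instead of hard-coding one. The sketch below is my own illustration under that assumption: it looks for the first device with at least compute capability 1.3 (an arbitrary threshold) and makes it the current device with cudaSetDevice().

#include <stdio.h>
#include <cuda.h>

int main(void)
{
    int count, i, chosen = -1;
    cudaDeviceProp prop;

    cudaGetDeviceCount(&count);
    for (i = 0; i < count; i++) {
        cudaGetDeviceProperties(&prop, i);
        // keep the first device whose compute capability is at least 1.3
        if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3)) {
            chosen = i;
            break;
        }
    }

    if (chosen < 0) {
        printf("No device with compute capability >= 1.3 found.\n");
        return 1;
    }

    cudaSetDevice(chosen);   // subsequent CUDA calls in this host thread use this device
    printf("Using device %d\n", chosen);
    return 0;
}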
Reference: CUDA by Example