Read the book "CUDA by Example: An Introduction to General-Purpose GPU Programming"


Since I need to use GPU/CUDA technology, I wanted an introductory textbook and chose the book by Jason Sanders et al., "CUDA by Example: An Introduction to General-Purpose GPU Programming". It is very good as introductory material. From the perspective of understanding and memorization, I think many of the contents of the book can be omitted, hence this blog post, which records and summarizes my notes on and understanding of the book. Note that this article is not written in the order of the chapters in the book. I have not yet read Chapter 8 (graphics interoperability) or Chapter 11 (CUDA C on multi-GPU systems); I will read them when I have time.

What is CUDA?

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA for its GPUs (Graphics Processing Units), which can be loosely understood as "the graphics card".

With CUDA, GPUs can easily be used for general-purpose computing (somewhat like numerical computation on the CPU). Before CUDA, GPUs were typically used only for graphics rendering (e.g. through OpenGL and DirectX).

Developers can do parallel programming by calling CUDA's API to achieve high-performance computing. To attract more developers, NVIDIA extended CUDA with programming languages such as CUDA C/C++ and CUDA Fortran. Note that CUDA C/C++ can be seen as a new programming language, because NVIDIA provides the corresponding compiler, NVCC; the same applies to CUDA Fortran. For more information, refer to the references at the end of this post.

Installing CUDA 5.5 on 64-bit Ubuntu 12.04

Please click here for specific steps.

My Personal (Rough) Impression of CUDA C

If the objects that the C language works on are thought of as the CPU and system memory (hereafter, host memory), then CUDA C works on the GPU and GPU memory (hereafter, device memory), taking advantage of the GPU's many cores while reducing the difficulty of parallel programming. Typically you read data from the outside world with C, hand the data to CUDA C so it can be computed on the GPU, and then return the results to C for further work, such as additional processing and display, or you repeat this cycle.

Key Concepts and Terms

Host

The CPU and system memory (the RAM) are called the host.

Device

The GPU itself and its memory (device memory) are called the device.

Thread (thread)

A thread is typically processed by a single GPU core. (Thread indices can be one-, two-, or three-dimensional; details below.)

Thread Block (Block)

1. Composed of multiple threads (indices can be one-, two-, or three-dimensional; details below).
2. Blocks execute in parallel; there is no communication between blocks and no guaranteed execution order.
3. Note that the number of thread blocks (in each grid dimension) must not exceed 65535 (a hardware limit).

Thread Grid (Grid)

Consists of multiple thread blocks (indices can be one-, two-, or three-dimensional; details below).

Thread Warp (Warp)

In the CUDA architecture, a warp refers to a collection of 32 threads that are "woven together" and executed in lockstep. At each line of the program, every thread in the warp executes the same instruction, but on different data.

Kernel Function (Kernel)

1. Functions that execute on the GPU are often called kernel functions.
2. A kernel is usually declared with the qualifier __global__ and launched with <<<parameter 1, parameter 2>>>, which specifies how many threads execute the kernel and how those threads are organized.
3. Threads are organized into a grid (Grid); each grid consists of several thread blocks (Block), and each block consists of several threads (thread).
4. Execution is organized by block.
5. A kernel can only be called from host-side code.
6. The execution configuration must be specified when the kernel is called.
7. In programming, you must allocate enough space for the arrays and variables used in the kernel before calling it; otherwise the GPU computation will go wrong, e.g. out-of-bounds accesses or other errors, possibly even a blue screen or crash.

/*
 * @file_name helloworld.cu   (CUDA source files use the .cu suffix)
 */
#include <stdio.h>
#include <cuda_runtime.h>   // CUDA runtime header

// Kernel declaration, preceded by the keyword __global__
__global__ void kernel(void)
{
}

int main(void)
{
    // Kernel call. Note <<<1,1>>>: the first 1 means the grid contains only one
    // thread block; the second 1 means that block contains only one thread.
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}
TD class= "Code" >
 1234567891011121314151617  
dim3 Structure Type

1. dim3 is a vector type based on uint3, essentially a struct of 3 unsigned ints. The uint3 type has three data members: unsigned int x; unsigned int y; unsigned int z.
2. A one-, two-, or three-dimensional index can be used to identify a thread, forming a one-, two-, or three-dimensional thread block.
3. Variables of the dim3 structure type are used in the <<<,>>> of a kernel call.
4. Several related built-in variables:
4.1. threadIdx: as the name implies, the index of the thread within its block. For a one-dimensional block use threadIdx.x; for two dimensions also threadIdx.y; and for three dimensions threadIdx.z.
4.2. blockIdx: the index of the thread block within the grid; blockIdx.x, blockIdx.y, blockIdx.z.
4.3. blockDim: the dimensions of the thread block; likewise blockDim.x, blockDim.y, blockDim.z.
4.4. gridDim: the dimensions of the grid; likewise gridDim.x, gridDim.y, gridDim.z.
5. For a one-dimensional block, the thread index is threadID = threadIdx.x.
6. For a two-dimensional block of size (blockDim.x, blockDim.y), the thread index is threadID = threadIdx.x + threadIdx.y * blockDim.x.
7. For a three-dimensional block of size (blockDim.x, blockDim.y, blockDim.z), the thread index is threadID = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y.
8. When fewer threads are started than there are data elements, each thread increments its index by the total number of started threads, e.g. stride = blockDim.x * gridDim.x; threadId += stride. (See the sketch below.)
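
As a concrete illustration of items 5 through 8, here is a minimal sketch (my own example, in the spirit of the book's vector-addition kernel, not taken verbatim from it) of computing a global thread index and striding by the total number of started threads; the kernel name add and the size N are illustrative choices:

#include <cuda_runtime.h>

#define N 10000

__global__ void add(const int *a, const int *b, int *c)
{
    // Starting offset: block offset plus thread offset within the block.
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        // Stride by the total number of started threads.
        tid += blockDim.x * gridDim.x;
    }
}

Such a kernel would be launched with, for example, add<<<128, 128>>>(dev_a, dev_b, dev_c); i.e. 128 blocks of 128 threads each, regardless of N.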

Function Modifiers

1. __global__ indicates that the modified function executes on the device but is called from the host.
2. __device__ indicates that the modified function executes on the device and can only be called from other __device__ functions or from __global__ functions.
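
A minimal sketch (my own example) showing the two qualifiers together: a __device__ helper that is only callable from GPU code, invoked from a __global__ kernel that the host launches:

// Runs on the GPU, callable only from other device/global functions.
__device__ float square(float x)
{
    return x * x;
}

// Runs on the GPU, launched from host code with <<<...>>>.
__global__ void squareAll(float *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] = square(data[tid]);
}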

Common GPU Memory Functions

cudaMalloc()

1. Function prototype: cudaError_t cudaMalloc(void **devPtr, size_t size).
2. Purpose: like the malloc function in C, except that this function allocates memory in the GPU's memory (device memory).
3. Notes:
3.1. A pointer allocated by cudaMalloc() can be passed to functions that execute on the device.
3.2. A pointer allocated by cudaMalloc() can be used in device code to read and write device memory.
3.3. A pointer allocated by cudaMalloc() can be passed to functions that execute on the host.
3.4. A pointer allocated by cudaMalloc() cannot be used in host code to read or write memory (i.e. it cannot be dereferenced on the host).

cudaMemcpy()

1. Function prototype: cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind).
2. Purpose: like the memcpy function in C, this function copies data between host memory and GPU memory.
3. Parameter: cudaMemcpyKind kind indicates the direction of the copy; for example, kind = cudaMemcpyDeviceToHost means data is copied from device memory to host memory.
4. Like memcpy() in C, it executes synchronously: when the function returns, the copy has completed and the output buffer contains the copied data.
5. The corresponding asynchronous function is cudaMemcpyAsync(); it is discussed in the section on streams below.

cudaFree()

1. Function prototype: cudaError_t cudaFree(void *devPtr).
2. Purpose: like the free() function in C, this function frees memory allocated by cudaMalloc().

The following example illustrates the three functions above.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main(void)
{
    int c;
    int *dev_c;

    // cudaMalloc(): allocate device memory for the result
    cudaMalloc((void**)&dev_c, sizeof(int));

    // Kernel launch
    add<<<1,1>>>(2, 7, dev_c);

    // cudaMemcpy(): copy the result back to the host
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);

    // cudaFree(): release the device memory
    cudaFree(dev_c);
    return 0;
}
GPU Memory Classification

Global Memory

Device memory in the ordinary sense; what is usually meant by "GPU memory".

Shared Memory

1. Location: on the GPU (device).
2. Form: the keyword __shared__ added to a variable declaration, e.g. __shared__ float cache[10];
3. Purpose: for each thread block launched on the GPU, the CUDA C compiler creates a copy of the shared variable. Every thread in the block shares this memory, but a thread cannot see or modify the copy belonging to any other block. This allows the threads within a block to communicate and collaborate on a computation. (See the sketch below.)
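
Below is a minimal sketch (my own example, not taken from the book) of a per-block shared buffer: each block stages its slice of the input in shared memory, and each thread then reads a neighbouring thread's staged value. The call to __syncthreads(), described later, makes the staged data visible to the whole block:

#include <cuda_runtime.h>

#define BLOCK_SIZE 256   // launch with <<<blocks, BLOCK_SIZE>>>

__global__ void shiftWithinBlock(const float *in, float *out, int n)
{
    __shared__ float cache[BLOCK_SIZE];   // one private copy per thread block

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;

    __syncthreads();   // wait until every thread in the block has staged its value

    int neighbour = (threadIdx.x + 1) % blockDim.x;   // index stays within this block
    if (tid < n)
        out[tid] = cache[neighbour];
}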

Constant Memory

1. Location: device memory.
2. Form: the keyword __constant__ added to a variable declaration, e.g. __constant__ float s[10];
3. Purpose: to improve performance. Constant memory is handled differently from standard global memory; in some cases, replacing global memory with constant memory can effectively reduce the required memory bandwidth.
4. Characteristics: constant memory holds data that does not change during kernel execution, and access to it is read-only. NVIDIA hardware provides 64 KB of constant memory. cudaMalloc() and cudaFree() are not needed; instead, space is allocated statically at compile time.
5. Requirement: to copy data into constant memory, use cudaMemcpyToSymbol(); cudaMemcpy() would copy into global memory instead. (See the sketch below.)
6. Reasons for the performance improvement:
6.1. A single read of constant memory can be broadcast to "neighbouring" threads, saving up to 15 read operations. (Why 15? Because "neighbouring" refers to a half-warp, and a warp is a set of 32 threads.)
6.2. Constant memory data is cached, so consecutive reads of the same address generate no additional memory traffic.
7. If the threads in a half-warp request different data from constant memory, the 16 reads are serialized; by contrast, if they read from global memory, the 16 requests can be serviced at the same time.
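
Here is a minimal sketch (my own example) of declaring a constant-memory array and filling it with cudaMemcpyToSymbol(), as described in items 2 and 5 above; the array name coeff and the kernel scale are illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

__constant__ float coeff[10];          // statically allocated, read-only on the device

__global__ void scale(float *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] *= coeff[tid % 10];  // every thread reads from constant memory
}

int main(void)
{
    float host_coeff[10];
    for (int i = 0; i < 10; ++i)
        host_coeff[i] = 0.1f * i;

    // Copy to constant memory: note cudaMemcpyToSymbol, not cudaMemcpy.
    cudaMemcpyToSymbol(coeff, host_coeff, sizeof(host_coeff));

    // ... allocate device data, launch scale<<<...>>>(...), copy results back ...
    return 0;
}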

Texture Memory

1. Location: device memory.
2. Purpose: to reduce memory requests and provide efficient memory bandwidth. It is designed for graphics applications whose memory access patterns exhibit a great deal of spatial locality, meaning that a thread is likely to read from a location "very close" to the locations read by neighbouring threads.
3. The texture variable (texture reference) must be declared as a global variable at file scope.
4. Form: divided into one-dimensional texture memory and two-dimensional texture memory. (See the sketch after this list.)
4.1. One-dimensional texture memory
4.1.1. Declared with the texture<type> type, e.g. texture<float> texIn.
4.1.2. Bound to texture memory with cudaBindTexture().
4.1.3. Read with tex1Dfetch().
4.1.4. Unbound with cudaUnbindTexture().
4.2. Two-dimensional texture memory
4.2.1. Declared with the texture<type, dim> type, e.g. texture<float, 2> texIn.
4.2.2. Bound to texture memory with cudaBindTexture2D().
4.2.3. Read with tex2D().
4.2.4. Unbound with cudaUnbindTexture().
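
The following minimal sketch uses the legacy texture-reference API listed above, which is the API the book uses; it has since been deprecated and removed in recent CUDA releases, so treat it only as an illustration of the book's approach. The buffer names and size are my own:

#include <cuda_runtime.h>

texture<float> texIn;                        // must be a global, file-scope variable

__global__ void copyFromTexture(float *out, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        out[tid] = tex1Dfetch(texIn, tid);   // read through the texture reference
}

int main(void)
{
    const int N = 1024;
    float *devIn, *devOut;
    cudaMalloc((void**)&devIn,  N * sizeof(float));
    cudaMalloc((void**)&devOut, N * sizeof(float));

    // Bind the device buffer to the texture reference ...
    cudaBindTexture(NULL, texIn, devIn, N * sizeof(float));

    copyFromTexture<<<(N + 255) / 256, 256>>>(devOut, N);

    // ... and unbind it when done.
    cudaUnbindTexture(texIn);
    cudaFree(devIn);
    cudaFree(devOut);
    return 0;
}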

Fixed Memory

1. Location: host memory.
2. Concept: also known as page-locked or pinned (non-pageable) memory; the operating system will not page this memory out or swap it to disk, guaranteeing that it always resides in physical memory. The operating system can therefore safely let a device access the memory's physical address, since the memory will not be moved or invalidated.
3. Purpose: to improve transfer speed. Because the GPU knows the physical address of the host memory, data can be copied between the GPU and the host via Direct Memory Access (DMA), which needs no CPU involvement. Fixed memory is therefore essential for DMA copies.
4. Drawback: with fixed memory, the benefits of virtual memory are lost for that allocation, and the system runs out of memory sooner.
5. Recommendation: use fixed memory only for the source or destination buffers of cudaMemcpy() calls, and release it as soon as it is no longer needed.
6. Form: allocated with cudaHostAlloc() and released with cudaFreeHost(). (See the sketch below.)
7. Asynchronous copies (cudaMemcpyAsync()) require fixed host memory.
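
A minimal sketch (my own example) of allocating and releasing fixed (page-locked) host memory with cudaHostAlloc() and cudaFreeHost(); the buffer names and size are illustrative:

#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;
    float *hostBuf, *devBuf;

    // Pinned (page-locked) host allocation instead of malloc().
    cudaHostAlloc((void**)&hostBuf, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&devBuf, N * sizeof(float));

    for (int i = 0; i < N; ++i)
        hostBuf[i] = (float)i;

    // DMA copy from pinned host memory to the device.
    cudaMemcpy(devBuf, hostBuf, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(devBuf);
    cudaFreeHost(hostBuf);   // pinned memory must be released with cudaFreeHost()
    return 0;
}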

Atomicity

1. Concept: if the execution of an operation cannot be broken down into smaller parts, we call an operation that satisfies this constraint an atomic operation.
2. Form: a function call such as atomicAdd(addr, y) generates an atomic sequence of operations: read the value at address addr, add y to it, and store the result back to address addr. (See the sketch below.)
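
A minimal sketch (my own example) of atomicAdd(): every thread increments a single counter, and atomicity guarantees that no update is lost to a race condition:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void countThreads(unsigned int *counter)
{
    // Read-modify-write on *counter performed as one indivisible operation.
    atomicAdd(counter, 1u);
}

int main(void)
{
    unsigned int *devCounter, result = 0;
    cudaMalloc((void**)&devCounter, sizeof(unsigned int));
    cudaMemcpy(devCounter, &result, sizeof(unsigned int), cudaMemcpyHostToDevice);

    countThreads<<<128, 256>>>(devCounter);

    cudaMemcpy(&result, devCounter, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("threads counted: %u\n", result);   // expected 128 * 256 = 32768
    cudaFree(devCounter);
    return 0;
}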

Common Thread Synchronization Functions

1. The synchronization function __syncthreads(): a call to this function guarantees that every thread in the block has executed the statements before __syncthreads() before any thread executes the statements that follow it. (See the sketch below.)
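
Below is a minimal sketch (my own example, in the spirit of the book's dot-product kernel but not copied from it) of a shared-memory reduction whose correctness depends on __syncthreads(); the kernel name blockSum is illustrative:

#include <cuda_runtime.h>

#define THREADS 256   // launch with blockDim.x == THREADS (a power of two)

__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float cache[THREADS];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;

    // Every thread must have written its value before anyone reads a neighbour's.
    __syncthreads();

    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (threadIdx.x < half)
            cache[threadIdx.x] += cache[threadIdx.x + half];
        // All additions at this level must finish before the next level starts.
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = cache[0];
}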

Using Events to Measure Performance

1. Purpose: to measure the time the GPU spends on a task. A CUDA event is essentially a GPU timestamp. Because events are implemented directly on the GPU, they are not suitable for timing mixed code that contains both device code and host code.
2. Form: first create the events, then record them around the work being timed, then compute the elapsed time between the two events, and finally destroy them, as in the following example:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// ... do some work on the GPU ...

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);      // wait until the stop event has actually been recorded

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
Streams

1. A fine distinction: concurrency focuses on running several different tasks within a very short period of time, while parallelism focuses on running tasks at literally the same time.
2. Task parallelism: executing two or more different tasks in parallel, rather than performing the same task on a large amount of data (data parallelism).
3. Concept: a CUDA stream represents a queue of GPU operations that are executed in the order they were issued. We can add operations such as kernel launches, memory copies, and event records to a stream. The order in which operations are added to the stream is the order in which they execute. Each stream can be regarded as a task on the GPU, and these tasks can execute in parallel with one another.
4. Hardware prerequisite: the GPU must support device overlap, i.e. the ability to execute a kernel while simultaneously performing a copy between the device and the host.
5. Declaration and creation: declare with cudaStream_t stream; create with cudaStreamCreate(&stream);.
6. cudaMemcpyAsync(): as mentioned under cudaMemcpy(), this function executes asynchronously. A call to cudaMemcpyAsync() simply places a request to perform a memory copy into the stream specified by the stream parameter. When the call returns, there is no guarantee that the copy has even started, let alone finished. What is guaranteed is that the copy will be performed before the next operation placed into the same stream. The host memory pointer passed to this function must have been allocated with cudaHostAlloc() (streams require fixed memory).
7. Stream synchronization: coordinated with cudaStreamSynchronize().
8. Stream destruction: before exiting the application, destroy the stream that queued the GPU operations by calling cudaStreamDestroy().
9. For multiple streams (see the sketch after this list):
9.1. Remember to synchronize the streams.
9.2. When placing operations into the streams' queues, use a breadth-first rather than a depth-first order. In other words, do not add all the operations of stream 0 first, then all of stream 1, stream 2, and so on. Instead, add the copy of buffer A to stream 0, then the copy of buffer A to stream 1, and continue alternating in that fashion.
9.3. Keep in mind that the order in which operations are queued in the streams affects how the CUDA driver schedules those operations and streams for execution.
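
A minimal sketch (my own example) of a single CUDA stream: pinned host memory, asynchronous copies, a kernel launch into the stream, and synchronization. The kernel doubleValues and the size N are illustrative:

#include <cuda_runtime.h>

#define N (1 << 20)

__global__ void doubleValues(float *data, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n)
        data[tid] *= 2.0f;
}

int main(void)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *host, *dev;
    cudaHostAlloc((void**)&host, N * sizeof(float), cudaHostAllocDefault);  // pinned memory required
    cudaMalloc((void**)&dev, N * sizeof(float));
    for (int i = 0; i < N; ++i) host[i] = 1.0f;

    // All three operations are queued into the same stream and execute in order.
    cudaMemcpyAsync(dev, host, N * sizeof(float), cudaMemcpyHostToDevice, stream);
    doubleValues<<<(N + 255) / 256, 256, 0, stream>>>(dev, N);
    cudaMemcpyAsync(host, dev, N * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);   // wait for everything queued in the stream
    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}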

Tips

1. The best performance is usually achieved when the number of thread blocks launched is twice the number of multiprocessors on the GPU.
2. The first thing a kernel usually computes is the offset into its input data. Each thread's starting offset is a value from 0 to (number of threads - 1); after that, the offset is incremented by the total number of started threads (the stride pattern shown earlier).

Example Programs

Interested readers can download the sample code that accompanies the book.

Reference Documents

1. CUDA by Example: An Introduction to General-Purpose GPU Programming
2. CUDA, Wikipedia
3. PPT (online)
