CUDA Learning (1): Basic Concepts of CUDA Programming

Source: Internet
Author: User
Document directory
  • Function Qualifiers
  • Variable Type Qualifiers
  • Execution Configuration
  • Built-in Variables
  • Time Functions
  • Synchronization Functions
1. Parallel Computing

1) Single-core instruction-level parallelism (ILP): the execution units of a single processor execute multiple instructions simultaneously.

2) Multi-core thread-level parallelism (TLP): multiple processor cores are integrated on one chip to achieve thread-level parallelism.

3) Multi-processor parallelism: multiple processors are installed on a single circuit board to achieve process- and thread-level parallelism.

4) Network-based parallelism: a large-scale cluster or distributed system, in which each node is an independent computer, implements large-scale parallel computing.

Multi-threaded programming can achieve thread-level parallelism across multiple CPU cores, or use hyper-threading and similar technologies to better utilize the resources within each core and fully exploit the computing power of the CPU.

A CUDA-capable GPU can be seen as a supercomputer composed of several vector processors, and its performance can indeed rival that of a small supercomputer.

The GPU and CPU are generally connected through the north bridge (the largest and most important chip on the motherboard, responsible for data exchange between the CPU and the memory and graphics card, often covered by a heat sink or fan) via the AGP or PCI-E bus. They have independent external memory: main memory for the CPU and video memory for the GPU.
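Because host memory and video memory are separate, data must be copied across this bus before and after a kernel runs. Below is a minimal sketch of that round trip using the standard runtime calls cudaMalloc, cudaMemcpy, and cudaFree; the array length of 256 and the doubling kernel are illustrative assumptions, not part of the original article.

// Minimal sketch: copy data to the device, run a kernel, copy the result back.
// The array length 256 and the doubling kernel are illustrative assumptions.
#include <cuda_runtime.h>

#define N 256

__global__ void doubleElements(float* d_data)
{
    int i = threadIdx.x;
    d_data[i] = d_data[i] * 2.0f;
}

int main()
{
    float h_data[N];                // host (CPU) memory
    float* d_data = 0;              // device (GPU) video memory

    for (int i = 0; i < N; ++i)
        h_data[i] = (float)i;

    cudaMalloc((void**)&d_data, N * sizeof(float));                        // allocate video memory
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice); // host -> device over the bus
    doubleElements<<<1, N>>>(d_data);                                      // execute on the GPU
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d_data);
    return 0;
}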

First, determine a window of windowSize * windowSize pixels, where windowSize is odd so that the window has a central pixel. Median filtering repeatedly moves this window, sorts the pixel values of all pixels inside the window by gray level, and assigns the gray level of the middle element of the sorted sequence to the pixel at the window center. This is what median filtering means.
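As an illustration of this description only (the original text gives no code here), a 3 x 3 median filter kernel could be sketched as follows; the image dimensions width and height and the flat row-major layout are assumptions.

// Sketch of a 3 x 3 median filter: each thread handles one pixel.
// width, height and the row-major layout are assumptions for illustration.
__global__ void medianFilter3x3(const unsigned char* in, unsigned char* out,
                                int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return;                              // leave the border untouched

    unsigned char v[9];
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)         // gather the 3 x 3 window
        for (int dx = -1; dx <= 1; ++dx)
            v[k++] = in[(y + dy) * width + (x + dx)];

    for (int i = 1; i < 9; ++i) {            // insertion sort by gray level
        unsigned char key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }
    out[y * width + x] = v[4];               // median of the 9 sorted values
}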

2. CUDA Basic Concepts

Function Qualifiers

__device__: declares a function that is executed on the device and can only be called from the device.
__global__: declares a function as a kernel; it is executed on the device and can only be called from the host.
__host__: declares a function that is executed on the host and can only be called from the host.

Variable Type Qualifiers

__constant__, optionally used together with __device__, declares a variable that resides in constant memory space, has the lifetime of the application, and is accessible by all threads in the grid and, through the runtime library, by the host.

__shared__, optionally used together with __device__, declares a variable that resides in the shared memory space of a thread block, has the lifetime of the block, and is accessible only by the threads of that block.

Execution Configuration

Any call to a __global__ function must specify the execution configuration for that call.

The execution configuration defines the dimensions of the grid and blocks used to execute the function on the device, as well as the associated stream. It is specified by inserting an expression of the form <<<Dg, Db, Ns, S>>> between the function name and the parenthesized argument list:

Dg is of type dim3 and specifies the dimension and size of the grid, so Dg.x * Dg.y equals the number of blocks launched; Dg.z is unused.
Db is of type dim3 and specifies the dimension and size of each block, so Db.x * Db.y * Db.z equals the number of threads per block.
Ns is of type size_t and specifies the number of bytes of shared memory dynamically allocated per block for this call, in addition to the statically allocated memory; this dynamically allocated memory is used by any variable declared as an external array. Ns is optional and defaults to 0.
S is of type cudaStream_t and specifies the associated stream; S is optional and defaults to 0.

Built-in Variables

gridDim: the dimensions of the grid, of type dim3.
blockIdx: the index of the block within the grid, of type uint3.
blockDim: the dimensions of the block, of type dim3.
threadIdx: the index of the thread within the block, of type uint3.

Time Functions

clock_t clock(); returns a counter value that is incremented every clock cycle. Sampling this counter at the beginning and the end of a kernel, taking the difference, and recording the result for each thread measures the number of clock cycles the device takes to completely execute each thread, not the number of clock cycles actually spent executing that thread's instructions. The former is greater than the latter because threads are executed in a time-sliced fashion.

Synchronization Functions

void __syncthreads(); synchronizes all threads in a block. Code after the call is executed only once all threads in the block have reached the synchronization point.
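To tie these concepts together, the following sketch (not from the original article) launches a kernel with an explicit <<<Dg, Db, Ns, S>>> configuration, uses the dynamically allocated shared memory through an external array, times each thread with clock(), and synchronizes with __syncthreads(); the array length of 256 and the use of the default stream are arbitrary choices.

// Sketch combining the qualifiers, execution configuration, clock(), and __syncthreads().
// The array length 256 and the default stream are arbitrary choices for illustration.
#include <cuda_runtime.h>

__global__ void reverse(float* data, clock_t* cycles)
{
    extern __shared__ float buf[];            // backed by the Ns bytes given at launch

    clock_t start = clock();                  // per-thread counter sample

    int i = threadIdx.x;
    int n = blockDim.x;
    buf[i] = data[i];                         // stage the element in shared memory
    __syncthreads();                          // wait until every thread has written buf
    data[i] = buf[n - 1 - i];                 // read an element written by another thread

    cycles[i] = clock() - start;              // cycles used to fully execute this thread
}

int main()
{
    const int n = 256;
    float* d_data;
    clock_t* d_cycles;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMalloc((void**)&d_cycles, n * sizeof(clock_t));

    // Dg = 1 block, Db = n threads, Ns = n * sizeof(float) bytes of shared memory, S = stream 0
    reverse<<<1, n, n * sizeof(float), 0>>>(d_data, d_cycles);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    cudaFree(d_cycles);
    return 0;
}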

3. CUDA Programming Model

When a kernel is called, CUDA executes it in parallel on N different CUDA threads. The number of CUDA threads for each call is specified with the new <<<...>>> syntax at the call site.
// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

Each thread that executes the kernel is assigned a unique thread ID, which can be accessed in the kernel through the built-in threadIdx variable.

The following sample code adds two vectors A and B of size N and stores the result in vector C.

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

Each thread that executes vecAdd() performs one pair-wise addition.

4. CUDA Thread Structure

In general, threadIdx can be treated as a vector containing three components.

The following sample code adds matrix A and matrix B, each of size N * N, and stores the result in matrix C.

__global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock(N, N);
    matAdd<<<1, dimBlock>>>(A, B, C);
}

The thread index has a direct relationship with the thread ID:

For one-dimensional blocks: the two are the same

For a two-dimensional block of size (Dx, Dy): the thread with index (x, y) has ID x + y * Dx

For a three-dimensional block of size (Dx, Dy, Dz): the thread with index (x, y, z) has ID x + y * Dx + z * Dx * Dy
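As a small illustration of these formulas (not taken from the original), a device helper that flattens the built-in index into the thread ID could look like this:

// Flattened thread ID within a block, following the formulas above.
__device__ int threadIdInBlock()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x                 // x + y * Dx
         + threadIdx.z * blockDim.x * blockDim.y;   // + z * Dx * Dy
}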

All threads in a block must be in the same processor core. Therefore, the limited memory resources of a processor core limit the number of threads in each block. In the NVIDIA Tesla architecture, a thread block can contain a maximum of 512 threads.

However, a kernel can be executed by multiple thread blocks of the same size, so the total number of threads equals the number of threads per block multiplied by the number of blocks. These blocks are organized into a one-dimensional or two-dimensional grid of thread blocks.

The dimension of the grid is specified by the first parameter of the <<<...>>> syntax.

Each block in the grid can be identified by a one-dimensional or two-dimensional index, which can be accessed in the kernel through the built-in blockIdx variable. The dimensions of the block can be accessed in the kernel through the built-in blockDim variable.

In this case, the previous sample code should be changed:

__global__ void matAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    matAdd<<<dimGrid, dimBlock>>>(A, B, C);
}

In practice, two-dimensional array data in CUDA should be stored in memory allocated with cudaMallocPitch().
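For example, a pitched allocation and the corresponding row addressing could be sketched as follows; the float element type, the 64 x 64 size, and the scaling kernel are illustrative assumptions, not code from the original article.

// Sketch of a pitched 2-D allocation; float elements, the 64 x 64 size, and the
// scaling kernel are illustrative assumptions.
__global__ void scale(float* devPtr, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // rows are pitch bytes apart, so address them through a char* base
        float* row = (float*)((char*)devPtr + y * pitch);
        row[x] = row[x] * 2.0f;
    }
}

int main()
{
    int width = 64, height = 64;
    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale<<<grid, block>>>(devPtr, pitch, width, height);

    cudaFree(devPtr);
    return 0;
}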

In general:

A grid has multiple blocks, which can be one-dimensional, two-dimensional, and three-dimensional.
A block contains multiple threads, which can be one-dimensional, two-dimensional, and three-dimensional.
The number of blocks in the grid is determined by gridDim (up to two dimensions).
The position of a block in the grid is given by blockIdx.
The number of threads in a block is determined by blockDim (up to three dimensions).
The position of a thread in the block is given by threadIdx.
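Putting the hierarchy together, a common pattern (shown here as a sketch rather than code from the original) is to compute a global index from blockIdx, blockDim, and threadIdx, and to use gridDim for a grid-stride loop:

// Global index and grid-stride loop over an array of length n.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    int stride = gridDim.x * blockDim.x;                 // total number of launched threads
    for (int i = idx; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// Example launch: 4 blocks of 256 threads each.
// saxpy<<<4, 256>>>(n, 2.0f, d_x, d_y);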
