CUDA Learning and Summary 1


I. Basic Concepts

1. CUDA

In 2007, NVIDIA introduced the CUDA (Compute Unified Device Architecture) programming model to make full use of the respective strengths of CPUs and GPUs through joint CPU/GPU execution. The need for such co-execution is also reflected in more recent heterogeneous programming models (OpenCL, OpenACC, C++ AMP).

2. Parallel Programming Languages and Models

The most widely used models are MPI (Message Passing Interface), designed for scalable cluster computing, and OpenMP, designed for shared-memory multiprocessor systems. Many HPC (High-Performance Computing) clusters now adopt heterogeneous CPU/GPU nodes, combining MPI with CUDA to realize a multi-machine, multi-GPU model. Languages that currently support CUDA include C, C++, Fortran, Python, and Java [1]. CUDA adopts the SPMD (Single-Program Multiple-Data) style of parallel programming.

3. Data Parallelism and Task Parallelism

Analysis:

Data parallelism performs the same operation on many data elements at once; in vector addition, for instance, every element-wise sum is independent and can be assigned to its own thread.

Task parallelism usually comes from decomposing an application into distinct tasks. For example, a simple application that must perform both a vector addition and a matrix-vector multiplication can treat each operation as one task; if the two tasks can execute independently, task parallelism is obtained.
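
A minimal sketch of the data-parallel case in CUDA C (the kernel name is illustrative): each thread computes one element of the vector sum, and the host would launch it as VecAdd<<<blocks, threads>>>(a, b, c, n).

    // One thread per element: the data-parallel pattern behind vector addition.
    __global__ void VecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n) {                                    // guard the tail block
            c[i] = a[i] + b[i];
        }
    }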

4. CUDA's Extensions to C Function Declarations

Analysis:

(1) __device__ float DeviceFunc(): executes on the device and can only be called from the device.

(2) __global__ void KernelFunc(): executes on the device and can only be called from the host; a __global__ function must return void.

(3) __host__ float HostFunc(): executes on the host and can only be called from the host.

Description

If no CUDA extension keyword is specified in a function declaration, the function defaults to a host function.
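
A minimal sketch showing the three qualifiers side by side (function names are illustrative):

    __device__ float Square(float x) {                // device-side helper
        return x * x;
    }

    __global__ void SquareAll(float* data, int n) {   // kernel: launched by the host
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = Square(data[i]);         // device calling device
    }

    __host__ void FillInput(float* data, int n) {     // ordinary CPU function
        for (int i = 0; i < n; ++i) data[i] = (float)i;
    }

Note that __host__ and __device__ may be combined on one function so that it is compiled for both sides.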

5. Thread, Block, Grid, Warp, SP, SM

Analysis:

(1) Grid, block, thread: in CUDA programming, a grid is divided into multiple blocks, and a block is divided into multiple threads.

(2) SP (Streaming Processor): the most basic processing unit; concrete instructions and tasks are ultimately executed on SPs.

(3) SM (Streaming Multiprocessor): multiple SPs plus other resources (such as shared memory and registers) constitute an SM.

(4) Warp: the GPU's scheduling unit when executing a program. CUDA's warp size is currently 32; threads in the same warp execute the same instruction on different data (see the sketch after this list).
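
A minimal sketch of how this hierarchy maps to index variables inside a kernel (the kernel name is illustrative; device-side printf requires compute capability 2.0 or later):

    #include <cstdio>

    __global__ void WhoAmI() {
        int global_id = blockIdx.x * blockDim.x + threadIdx.x;  // thread within the grid
        int warp_id   = threadIdx.x / warpSize;                 // warp within the block
        int lane_id   = threadIdx.x % warpSize;                 // thread within its warp
        if (lane_id == 0)                                       // one line per warp
            printf("block %d, warp %d starts at global id %d\n",
                   blockIdx.x, warp_id, global_id);
    }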

6. CUDA Kernel Functions

The complete execution configuration of a kernel launch has the form <<<Dg, Db, Ns, S>>>, where:

(1) The parameter Dg defines the dimension and size of the entire grid, that is, how many blocks the grid has.

(2) The parameter Db defines the dimension and size of each block, that is, how many threads a block has.

(3) The parameter Ns is an optional argument giving the amount of shared memory, in bytes, dynamically allocated per block, in addition to any statically allocated shared memory. It is 0 or omitted when no dynamic allocation is needed.

(4) The parameter S is an optional argument of type cudaStream_t, defaulting to 0, that indicates which stream the kernel runs in (see the sketch after this list).
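
A minimal, self-contained sketch exercising all four configuration parameters (kernel and variable names are illustrative; error checking is omitted for brevity):

    #include <cuda_runtime.h>

    __global__ void StagedCopy(const float* in, float* out, int n) {
        extern __shared__ float tile[];   // sized at launch by the Ns parameter
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = in[i];    // each thread uses only its own slot,
            out[i] = tile[threadIdx.x];   // so no __syncthreads() is needed here
        }
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        dim3 Db(256);                      // Db: 256 threads per block
        dim3 Dg((n + Db.x - 1) / Db.x);    // Dg: enough blocks to cover n elements
        size_t Ns = Db.x * sizeof(float);  // Ns: dynamic shared memory per block, in bytes
        StagedCopy<<<Dg, Db, Ns, stream>>>(in, out, n);  // S: the stream to run in

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        cudaFree(in);
        cudaFree(out);
        return 0;
    }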

7. CUDA Storage System

(1) Registers: the fastest storage, private to each thread.

(2) Local memory: per-thread storage that physically resides off-chip; register spills are placed here.

(3) Shared memory: on-chip storage shared by all threads of one block.

(4) Global memory: large off-chip storage visible to all threads and to the host.

(5) Constant memory: read-only, cached storage for values read uniformly by all threads (see the sketch after this list).
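
A minimal kernel sketch touching each of these spaces (names are illustrative; any register spills to local memory are decided by the compiler, and the kernel assumes blocks of at most 256 threads):

    __constant__ float kScale = 2.0f;      // constant memory: read-only, cached

    __global__ void ScaleViaTile(const float* x, float* y, int n) {  // x, y: global memory
        __shared__ float tile[256];        // shared memory: per-block, on-chip
        float tmp;                         // register: per-thread, fastest
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = x[i];      // global -> shared
            tmp = kScale * tile[threadIdx.x];
            y[i] = tmp;                    // register -> global
        }
    }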

8. CUDA SDK

9. CUDA Software Stack

Description

The CUDA software stack contains several layers: the device driver, the application programming interface (API) with its runtime, and two higher-level general-purpose math libraries, cuFFT and cuBLAS. The CUDA driver API is the low-level API, while the CUDA runtime API is a high-level API built on top of the driver.
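
As a minimal sketch of the library layer sitting on top of the runtime API, the following uses cuBLAS to compute y = alpha*x + y (SAXPY). Error checking is omitted for brevity; it would be built with nvcc and linked with -lcublas:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int n = 4;
        float hx[] = {1, 2, 3, 4}, hy[] = {10, 20, 30, 40};
        float *dx, *dy;
        cudaMalloc(&dx, n * sizeof(float));            // runtime API: device allocation
        cudaMalloc(&dy, n * sizeof(float));
        cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;                         // library layer: cuBLAS
        cublasCreate(&handle);
        const float alpha = 2.0f;
        cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);  // y = alpha*x + y
        cublasDestroy(handle);

        cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) printf("%g ", hy[i]);  // prints: 12 24 36 48
        printf("\n");
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }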

References:

[1] Java bindings for CUDA: http://jcuda.org/
