CUDA Software System


The CUDA software stack consists of three layers: the CUDA libraries, the CUDA runtime API, and the CUDA driver API. The core of CUDA is the CUDA C language, which consists of a minimal set of extensions to the C language plus a runtime library; source files that use these extensions or the runtime library must be compiled with the nvcc compiler.

Compiling CUDA C only generates the GPU-side code. To manage the GPU, allocate video memory on it, and launch kernel functions, you must use either the CUDA runtime API or the CUDA driver API. Within a single program, only one of the two may be used; the runtime API and the driver API cannot be mixed.

I. CUDA C Language

The CUDA C language provides a way to write device code in C. It consists of a set of necessary extensions to C plus a runtime library. CUDA's C extensions mainly cover the following aspects:

1. Function type qualifiers are introduced. They specify whether a function executes on the host or on the device, and whether it can be called from the host or from the device. These qualifiers are __device__, __host__, and __global__.

2. Variable type qualifiers are introduced. They specify the kind of memory in which a variable is stored. For a traditional program running on the CPU, the compiler automatically decides whether to keep a variable in a register or in memory; the CUDA programming model, by contrast, abstracts a total of eight different kinds of memory. To distinguish among them, qualifiers are introduced: __device__, __shared__, and __constant__. Note that __device__ as a variable qualifier here is different from the __device__ function qualifier in the previous item.

3. Built-in vector types are introduced. For example, char4, ushort3, double2, and dim3 are vector types derived from the basic integer and floating-point types. Their components are accessed through the fields x, y, z, and w. In device code, these vector types are subject to alignment requirements.

4. Built-in variables are introduced. blockIdx and threadIdx index thread blocks and threads; gridDim and blockDim describe the dimensions of the thread grid and of thread blocks; warpSize queries the number of threads in a warp.

5. The <<< >>> operator is introduced. It specifies the grid and block dimensions, passing the execution configuration to a kernel launch.

6. Several classes of functions are introduced: memory fence functions, synchronization functions, mathematical functions, texture functions, timing functions, atomic functions, and warp vote functions. (A short sketch after this list illustrates several of these extensions.)
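The following is a minimal sketch of several of the extensions above; the kernel and variable names are invented for illustration. It uses the __global__ function qualifier, a __shared__ array, the built-in variables blockIdx, blockDim, and threadIdx, the built-in vector type dim3, the <<< >>> execution configuration, and the synchronization function __syncthreads():

```
#include <cuda_runtime.h>

// __global__: executed on the device, callable from the host.
__global__ void reverseBlock(const float *in, float *out, int n)
{
    // __shared__: one copy per thread block, visible to all its threads.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in variables
    if (i < n)
        tile[threadIdx.x] = in[i];

    __syncthreads();                                 // synchronization function

    // Write the elements of this block back in reverse order.
    int j = blockIdx.x * blockDim.x + (blockDim.x - 1 - threadIdx.x);
    if (j < n)
        out[j] = tile[threadIdx.x];
}

int main(void)
{
    const int n = 1024;
    float *dIn, *dOut;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));

    dim3 block(256);                                 // built-in vector type
    dim3 grid((n + 255) / 256);
    reverseBlock<<<grid, block>>>(dIn, dOut, n);     // execution configuration

    cudaDeviceSynchronize();
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```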

The extensions above come with a number of restrictions. If a restriction is violated, nvcc usually emits an error or warning message, but in some cases it reports nothing and the program simply fails to run.

II. The nvcc Compiler

Depending on how it is configured, the nvcc compiler compiles CUDA C code into three different kinds of output: PTX, CUDA binary (cubin) objects, and standard C. nvcc is a compilation driver: through command-line options, it invokes different tools at the different stages of compilation to carry out the build.

The basic flow of nvcc is to first use cudafe to separate the host code and the device code in a source file, and then invoke different compilers to compile each part. The device code is compiled by nvcc into PTX code or binary (cubin) code. The host code is emitted as a C file and compiled by a general-purpose high-performance compiler such as ICC, GCC, or another suitable compiler; alternatively, the host code can be handed to another compiler at the final stage of compilation to produce .obj or .o files. During compilation, the device code can be linked into the generated host code, embedding the cubin object as a global initialized data array; in this case, the kernel's execution configuration is also translated into CUDA runtime startup code that loads and launches the compiled kernel function. When using the CUDA driver API, you can instead load and execute the PTX code or cubin object directly and ignore the host code compiled by nvcc.
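A hedged sketch of these outputs follows. The file and kernel names are invented; -ptx and -cubin are standard nvcc options. The comments indicate which compiler handles each part of the file:

```
// pipeline.cu -- sketch of the nvcc outputs described above.
// Assumed invocations:
//   nvcc -ptx   pipeline.cu -o pipeline.ptx    (PTX assembly for the device code)
//   nvcc -cubin pipeline.cu -o pipeline.cubin  (CUDA binary object)
//   nvcc        pipeline.cu -o pipeline        (full build: cudafe splits host and
//                                               device code; the host part goes to
//                                               the system C/C++ compiler)
#include <cuda_runtime.h>

// Device code: compiled by nvcc into PTX or cubin.
__global__ void scale(float *v, float s)
{
    v[blockIdx.x * blockDim.x + threadIdx.x] *= s;
}

// Host code: emitted as C/C++ and compiled by the host compiler; the
// execution configuration below is rewritten into runtime startup code.
int main(void)
{
    float *d;
    cudaMalloc(&d, 256 * sizeof(float));
    scale<<<1, 256>>>(d, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```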

The compiler front end processes CUDA source files according to C++ syntax rules. CUDA host code supports the full C++ syntax, while device code does not.

Kernel functions may be written directly in PTX, but they are usually written in a high-level language such as CUDA C. Kernels written in PTX or CUDA C must be compiled into binary code by the nvcc compiler. Some PTX instructions can only execute on hardware of sufficiently high compute capability: for example, 32-bit atomic operations on global memory require compute capability 1.1 or higher, and double-precision arithmetic requires compute capability 1.3 or higher. nvcc uses compilation options to specify the compute capability targeted by the emitted PTX code, so for double-precision computation the option -arch sm_13 (or a higher compute capability) must be supplied; otherwise, the double-precision computation is compiled into single-precision computation.
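As a hedged illustration (the kernel and file names are invented), a kernel that uses double precision must be built for compute capability 1.3 or higher, or its doubles are demoted as described above:

```
// daxpy.cu -- compile with: nvcc -arch=sm_13 daxpy.cu -o daxpy
// Without -arch=sm_13 (or a higher target), the double-precision
// arithmetic below is compiled into single-precision arithmetic.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // double-precision multiply-add
}
```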

III. Runtime API and Driver API

The CUDA runtime API and the CUDA driver API both provide device management, context management, memory management, code module management, execution control, texture reference management, and interoperability with the OpenGL and Direct3D application interfaces.

The CUDA runtime API is a wrapper built on top of the CUDA driver API. It hides some implementation details, which makes programming more convenient and the code more concise. The runtime API is packaged in the cudart library, and all of its functions carry the cuda prefix. The CUDA runtime has no dedicated initialization function: initialization happens automatically on the first API call. When timing a CUDA program that uses runtime functions, you must therefore take care to exclude this initialization time. Programming against the runtime API is simple, and it is the interface normally used for development.
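A minimal sketch of runtime-API usage follows (error checking omitted; the daxpy kernel repeats the one sketched in the previous section). Note the cuda-prefixed calls and the absence of any explicit initialization step:

```
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(double);
    double *hX = (double *)malloc(bytes);
    double *hY = (double *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hX[i] = 1.0; hY[i] = 2.0; }

    double *dX, *dY;
    cudaMalloc(&dX, bytes);      // first runtime call: implicit initialization
    cudaMalloc(&dY, bytes);
    cudaMemcpy(dX, hX, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dY, hY, bytes, cudaMemcpyHostToDevice);

    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, dX, dY);

    cudaMemcpy(hY, dY, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hY[0]);   // expect 4.0

    cudaFree(dX); cudaFree(dY);
    free(hX); free(hY);
    return 0;
}
```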

The CUDA driver API is a handle-based, lower-level interface: objects are referenced through handles. It can load kernel modules in binary or assembly form, set their parameters, and launch them. Programming against the driver API is more complex, but it can sometimes implement more sophisticated functionality, or achieve higher performance, by operating closer to the hardware. Because the device-side code it consumes is binary or assembly code, it can be called from a variety of languages. The driver API is packaged in the nvcuda library, and all of its functions are prefixed with cu.
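A hedged sketch of the driver-API flow follows, using the cuLaunchKernel entry point from later CUDA driver versions (the module file daxpy.ptx and the kernel name are assumptions carried over from the earlier sketch; error checking omitted). Note the explicit cuInit call and the cu-prefixed, handle-based calls:

```
#include <cuda.h>

int main(void)
{
    cuInit(0);                      // explicit initialization, unlike the runtime API

    CUdevice  dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load a module compiled separately (e.g. nvcc -ptx daxpy.cu); the kernel
    // should be declared extern "C" in daxpy.cu so its name is not mangled.
    CUmodule   mod;
    CUfunction fn;
    cuModuleLoad(&mod, "daxpy.ptx");
    cuModuleGetFunction(&fn, mod, "daxpy");

    int n = 1 << 20;
    double a = 2.0;
    CUdeviceptr dX, dY;
    cuMemAlloc(&dX, n * sizeof(double));
    cuMemAlloc(&dY, n * sizeof(double));

    // Launch through the function handle, passing parameters explicitly.
    void *args[] = { &n, &a, &dX, &dY };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1,   // grid dimensions
                       256, 1, 1,               // block dimensions
                       0, NULL, args, NULL);
    cuCtxSynchronize();

    cuMemFree(dX);
    cuMemFree(dY);
    cuCtxDestroy(ctx);
    return 0;
}
```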

IV. CUDA Function Libraries

Currently, CUDA ships three function libraries: CUFFT, CUBLAS, and CUDPP, which provide simple and efficient implementations of commonly used functions. In the future, CUDA will also provide video codec and image processing libraries, such as cuvid, to extend its functionality further.

CUFFT is a library for performing Fourier transforms on the GPU, with an interface similar to that of the widely used FFTW library. The difference is that FFTW operates on data stored in host memory, while CUFFT operates on data stored in video memory, so the two cannot be swapped directly: data transfers between host memory and video memory must be added when replacing FFTW with CUFFT.

CUBLAS is a basic matrix and vector computation library. It provides a BLAS-like interface and can be used for simple matrix computations, or serve as a foundation for building more complex packages such as LAPACK. The data CUBLAS operates on is likewise stored in video memory, so it too needs wrapping before it can replace BLAS functions.

CUDPP provides many common basic parallel operations, such as sorting and searching, and can serve as a building block for quickly assembling parallel programs.

By calling these libraries, programmers can achieve high performance without having to design complex algorithms around hardware characteristics, greatly shortening development time. The drawback is that these libraries are somewhat less flexible, and they may incur extra memory accesses.
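As a hedged sketch of the CUFFT interface (a 1-D complex-to-complex transform; error checking omitted), note that the input and output pointers refer to video memory, so host data must be copied in and out explicitly, as described above:

```
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1024;
    cufftComplex *dData;
    cudaMalloc(&dData, n * sizeof(cufftComplex));
    // ... copy the input from host memory into dData with cudaMemcpy ...

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);              // 1 transform of length n
    cufftExecC2C(plan, dData, dData, CUFFT_FORWARD);  // in-place FFT on the GPU

    // ... copy the result back to host memory with cudaMemcpy ...
    cufftDestroy(plan);
    cudaFree(dData);
    return 0;
}
```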
