Introduction to CUDA C Programming
1.1. From Graphics Processing to General-Purpose Parallel Computing
Driven by the huge market demand for real-time, high-definition 3D graphics, the programmable graphics processing unit (GPU) has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational power and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.
Figure 1: Floating-point operations per second for the CPU and GPU
Figure 2: Memory bandwidth for the CPU and GPU
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about) and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.
Figure 3: The GPU devotes more transistors to data processing
More specifically, the GPU is especially well-suited to problems that can be expressed as data-parallel computations (the same program executed on many data elements in parallel) with high arithmetic intensity (a high ratio of arithmetic operations to memory operations). Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because the program is executed on many data elements with high arithmetic intensity, memory access latency can be hidden with calculations instead of large data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing and physics simulation to computational finance and computational biology.
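To make the mapping of data elements to threads concrete, here is a minimal sketch in CUDA C; the kernel name `vecAdd` and the launch configuration are illustrative choices, not part of the original text. Each thread computes exactly one output element:

```cuda
#include <stdio.h>

// Each thread processes one data element: C[i] = A[i] + B[i].
__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard threads past the end of the array
        C[i] = A[i] + B[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *A, *B, *C;

    // Unified (managed) memory keeps the sketch short; explicit
    // cudaMalloc/cudaMemcpy would work equally well.
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(A, B, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);   // expect 3.0
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Because each element is independent, the kernel needs no flow control beyond a bounds check, which matches the low flow-control requirement described above.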
1.2. CUDA: A General-Purpose Parallel Computing Platform and Programming Model
In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. CUDA comes with a software environment that allows developers to use C as a high-level programming language, as shown in Figure 4. Other languages, application programming interfaces, and directives-based approaches are supported as well, such as Fortran, DirectCompute, and OpenACC.
Figure 4: GPU computing applications. CUDA is designed to support various languages and application programming interfaces
1.3. A Scalable Programming Model
The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems, and their parallelism continues to scale with Moore's law. The challenge for application software is to transparently scale its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.
The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.
At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization, which are exposed to the programmer as a minimal set of language extensions.
These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently and in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by the threads within a block. A sketch combining these ideas follows.
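As an illustration of how the three abstractions combine, consider this sketch of a two-level array sum; the kernel name `blockSum` and the fixed power-of-two block size of 256 are assumptions made for this example, not part of the original text. Each block cooperatively reduces its slice of the input in shared memory, synchronizing with barriers, and the per-block partial sums are the coarse-grained, independent sub-problems:

```cuda
// Each block cooperatively sums its slice of the input using shared
// memory and barrier synchronization, then writes one partial sum.
// Assumes a launch with exactly 256 threads per block, e.g.
// blockSum<<<numBlocks, 256>>>(in, partial, n).
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float cache[256];        // shared by the threads of one block

    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                    // barrier: cache is fully populated

    // Tree reduction within the block; each step halves the active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                // barrier before the next step
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0]; // one coarse result per block
}
```

The fine-grained cooperation (shared memory plus barriers) happens only inside a block; blocks themselves never synchronize with each other, which is exactly what makes the coarse level independent.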
This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability: each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so a compiled CUDA program can execute on any number of multiprocessors, as illustrated by Figure 5. Only the runtime system needs to know the physical multiprocessor count.
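In practice this means a host program never hard-codes the multiprocessor count: the launch configuration is derived from the problem size, and the physical count can be queried from the runtime if desired. A small sketch using the standard CUDA runtime API:

```cuda
#include <stdio.h>

int main(void)
{
    // Only the runtime needs to know the physical multiprocessor count;
    // kernels are written against the problem size, not the hardware.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device 0 has %d multiprocessors\n", prop.multiProcessorCount);

    // A launch configuration derived purely from the problem size:
    // the same grid runs unchanged on a GPU with 2 SMs or with 80 SMs.
    int n = 1 << 20, threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    printf("Would launch %d independent blocks\n", blocks);
    return 0;
}
```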
This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions: from high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs (see the CUDA-Enabled GPUs appendix).
Figure 5: Automatic scalability. A GPU is built around an array of streaming multiprocessors (SMs) (more details can be found in the Hardware Implementation chapter). A multithreaded program is partitioned into blocks of threads that execute independently of each other, so a GPU with more multiprocessors automatically executes the program in less time than a GPU with fewer multiprocessors.
1.4. Document Structure
The document is organized into the following chapters:
- Chapter 1: Introduction - a general introduction to CUDA.
- Chapter 2: Programming Model - outlines the CUDA programming model.
- Chapter 3: Programming Interface - describes the C language interface for CUDA programming.
- Chapter 4: Hardware Implementation - describes the hardware implementation of the GPU.
- Chapter 5: Performance Guidelines - gives some guidance on how to maximize performance.
- Appendix: CUDA-Enabled GPUs - lists all CUDA-enabled devices.
- Appendix: C Language Extensions - describes in detail CUDA's extensions to the C language.
- Appendix: Mathematical Functions - lists the mathematical functions supported by CUDA.
- Appendix: C/C++ Language Support - lists the C++ features supported in device code.
- Appendix: Texture Fetching - gives more details on texture fetching.
- Appendix: Compute Capabilities - gives the technical specifications and architectural details of various devices.
- Appendix: Driver API - describes the low-level driver API.
- Appendix: CUDA Environment Variables - lists all CUDA environment variables.