Introduction to CUDA C Programming


1.1. From Graphics Processing to General Parallel Computing

Driven by the enormous market demand for real-time, high-definition 3D graphics, the programmable graphics processing unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational power and very high memory bandwidth, as illustrated by Figures 1 and 2.

  

Figure 1 Floating-point operations per second for the CPU and GPU

 

Figure 2 Memory bandwidth for the CPU and GPU

The discrepancy in floating-point capability between the CPU and the GPU exists because the GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is) and is therefore designed so that more transistors are devoted to data processing rather than to data caching and flow control, as shown in Figure 3.

Figure 3 The GPU devotes more transistors to data processing

More specifically, the GPU is especially well suited to problems that can be expressed as data-parallel computations (the same program executed on many data elements in parallel) with high arithmetic intensity (a high ratio of arithmetic operations to memory operations). Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because the program is executed on many data elements with high arithmetic intensity, memory access latency can be hidden with calculations instead of large data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets use a data-parallel programming model to speed up the computation. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing and physics simulation to computational finance and computational biology.
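To make this mapping concrete, the following minimal sketch maps one image pixel to one CUDA thread. The kernel name brighten, the image size, and the scale factor are illustrative assumptions, not taken from the text:

    #include <cuda_runtime.h>

    // Each thread processes exactly one pixel: data elements are
    // mapped one-to-one onto parallel threads.
    __global__ void brighten(unsigned char *img, int n, float scale)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = img[i] * scale;
            img[i] = v > 255.0f ? 255 : (unsigned char)v;
        }
    }

    int main(void)
    {
        const int n = 1 << 20;                  // assumed: 1M-pixel grayscale image
        unsigned char *d_img;
        cudaMalloc(&d_img, n);
        cudaMemset(d_img, 100, n);              // fill with a mid-gray value

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        brighten<<<blocks, threads>>>(d_img, n, 1.5f);
        cudaDeviceSynchronize();

        cudaFree(d_img);
        return 0;
    }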

 

1.2. CUDA: A General Parallel Computing Platform and Programming Model

In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. The software environment provided by CUDA allows developers to use C as a high-level programming language, as illustrated by Figure 4. Other languages, application programming interfaces, and directive-based approaches are also supported, such as FORTRAN, DirectCompute, and OpenACC.
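To show what "using C as a high-level language" looks like in practice, here is a minimal sketch of a complete CUDA C program, the canonical vector addition. The kernel name vecAdd and the array size are illustrative assumptions:

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Device code: the __global__ qualifier marks a function that runs
    // on the GPU; each thread adds one pair of elements.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1024;                      // assumed problem size
        size_t bytes = n * sizeof(float);
        float h_a[1024], h_b[1024], h_c[1024];
        for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_b, bytes);
        cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Host code launches the kernel with the <<<grid, block>>> syntax.
        vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

        printf("c[100] = %.1f\n", h_c[100]);     // expect 300.0
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }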

 

Figure 4 GPU computing applications. CUDA is designed to support various languages and application programming interfaces.

 

1.3. A Scalable Programming Model

The advent of multicore CPUs and many-core GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

At its core are three key abstractions (a hierarchy of thread groups, shared memories, and barrier synchronization) that are exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently and in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all the threads within a block, as the sketch below illustrates.
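The following sketch exercises all three abstractions at once: a grid of thread blocks handles the coarse sub-problems, while the threads within each block cooperate through shared memory and barrier synchronization to sum one chunk of an array. The kernel name blockSum and the chunk size of 256 are illustrative assumptions:

    #include <cuda_runtime.h>

    // Coarse level: each block independently sums one 256-element chunk.
    // Fine level: the threads of a block cooperate via shared memory
    // and __syncthreads() barriers.
    __global__ void blockSum(const float *in, float *out, int n)
    {
        __shared__ float buf[256];          // visible to one thread block
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        buf[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                    // barrier: all loads are done

        // Tree reduction within the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            out[blockIdx.x] = buf[0];       // one partial sum per block
    }

    int main(void)
    {
        const int n = 1 << 16;              // assumed input size
        int blocks = (n + 255) / 256;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, blocks * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float));
        blockSum<<<blocks, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }

Each block's result is independent of every other block's, which is exactly what lets the hardware schedule the blocks in any order.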

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, while at the same time enabling automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors, as illustrated by Figure 5; only the runtime system needs to know the physical multiprocessor count.
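One common way to write kernels that scale this way is a grid-stride loop: the kernel is correct for any grid size, so the same binary runs unchanged whether the GPU has two multiprocessors or eighty. The sketch below (kernel name and sizes are assumptions) also queries the physical SM count, purely as a launch-tuning hint:

    #include <cuda_runtime.h>
    #include <stdio.h>

    // Grid-stride loop: correctness does not depend on the grid size,
    // so blocks can be spread over however many SMs the device has.
    __global__ void scaleAll(float *x, int n, float s)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)
            x[i] *= s;
    }

    int main(void)
    {
        int sms = 0;
        cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
        printf("device 0 has %d multiprocessors\n", sms);

        const int n = 1 << 20;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemset(d_x, 0, n * sizeof(float));

        // The grid size is only a tuning choice; the kernel would be
        // equally correct with 1 block or 10,000 blocks.
        scaleAll<<<sms * 4, 256>>>(d_x, n, 2.0f);
        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }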

This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions: from high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs (see the CUDA-Enabled GPUs appendix).

Figure 5 Automatic scalability. A GPU is built around an array of streaming multiprocessors (SMs); see the Hardware Implementation chapter for more details. A multithreaded program is partitioned into blocks of threads that execute independently of each other, so a GPU with more multiprocessors automatically executes the program in less time than a GPU with fewer multiprocessors.

 

1.4. Document Structure

The document is organized into the following chapters:

  1. Chapter 1: Introduction - a general introduction to CUDA.
  2. Chapter 2: Programming Model - outlines the CUDA programming model.
  3. Chapter 3: Programming Interface - describes the C language interface for CUDA programming.
  4. Chapter 4: Hardware Implementation - describes the hardware implementation of the GPU.
  5. Chapter 5: Performance Guidelines - provides some guidance on how to maximize performance.
  6. Appendix: CUDA-Enabled GPUs - lists all CUDA-enabled GPUs.
  7. Appendix: C Language Extensions - describes CUDA's extensions to the C language in detail.
  8. Appendix: Mathematical Functions - lists the mathematical functions supported by CUDA.
  9. Appendix: C/C++ Language Support - lists the C++ features supported in device code.
  10. Appendix: Texture Fetching - gives additional details on texture fetching.
  11. Appendix: Compute Capabilities - gives the technical specifications and architectural details of various devices.
  12. Appendix: Driver API - describes the low-level driver API.
  13. Appendix: CUDA Environment Variables - lists all CUDA environment variables.

 
