CUDA Study Notes: A Preliminary Understanding of CUDA


With the development of graphics cards, GPUs have become more and more powerful, and their raw computing power has surpassed that of general-purpose CPUs. Such a powerful chip would be wasted if it served only as a video card, so NVIDIA launched CUDA to let the graphics card be used for purposes other than image rendering, such as the general-purpose parallel computing discussed here. CUDA (Compute Unified Device Architecture) is NVIDIA's architecture for general-purpose parallel computing on the GPU. It includes the CUDA instruction set architecture (ISA) and the parallel computing engine inside the GPU. Developers can write programs for the CUDA architecture in C, OpenCL, Fortran, and C++. The figure below shows the relationship between these languages and CUDA:


The figure reflects the relationship between CUDA, the application programming interfaces (APIs), and the compilers for the various languages. The DirectX 11 compute interface is also known as DirectCompute. C language extensions, the OpenCL API, Fortran, and even C++ handled by the CUDA compiler can all run on the CUDA architecture, and more languages will be supported in the future. With the joint efforts of the entire industry, GPU computing is promising!

The CUDA architecture consists of three parts: the development libraries, the runtime, and the driver.

(1) The development libraries are application development libraries built on CUDA technology.

(2) The runtime provides the application development interface and runtime components, including the definitions of basic data types and functions for computation, type conversion, memory management, device access, and execution scheduling. (A short sketch of such runtime calls appears after this list.)

(3) The driver is the device abstraction layer for CUDA-enabled GPUs and provides abstract access interfaces to the hardware. The CUDA runtime implements its functions through this layer. Currently, applications developed with CUDA can only run on NVIDIA's CUDA-enabled hardware. The relationships between the CPU, GPU, application, CUDA development libraries, runtime, and driver are shown in the figure below.
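As a rough illustration of what the runtime layer exposes, here is a minimal sketch (not taken from the original article) that uses a few CUDA runtime API calls for device access and memory management; the device index and buffer size are arbitrary.

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);           // how many CUDA-enabled GPUs are present
    printf("CUDA-enabled devices: %d\n", deviceCount);

    if (deviceCount > 0) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);      // query the first device
        printf("Device 0: %s, %zu bytes of global memory\n",
               prop.name, prop.totalGlobalMem);

        cudaSetDevice(0);                       // select the device for subsequent calls
        void *d_buf = nullptr;
        cudaMalloc(&d_buf, 1024);               // allocate 1 KB of video card memory
        cudaFree(d_buf);                        // release it
    }
    return 0;
}
```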


In the CUDA architecture, a program is divided into two parts: the host side and the device side. The host side is the part executed on the CPU, while the device side is the part executed on the display chip (GPU); a device-side program is also called a "kernel". Typically, the host program copies the data to the video card's memory, the display chip then executes the device program, and finally the host program retrieves the results from the video card's memory. Because the CPU can only access the video card's memory through the PCI Express interface, which is relatively slow (the theoretical bandwidth of PCI Express x16 is 4 GB/s in each direction), such transfers should not be performed too frequently, or efficiency will suffer.
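The flow just described (copy input to video memory, run the kernel on the GPU, copy the result back) can be sketched as follows. This is a minimal, illustrative example; the scale kernel, array size, and launch configuration are assumptions, not part of the original article.

```
#include <cuda_runtime.h>

// Device-side program (kernel): each thread scales one element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = new float[n];                    // host (system) memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);                      // video card memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // host -> device over PCI Express

    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);            // GPU executes the kernel

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // device -> host: retrieve results

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```

Error checking of the CUDA calls is omitted here for brevity.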

In the CUDA architecture, the smallest unit of execution on the display chip is a thread. Multiple threads form a block; the threads within a block can access the same shared memory and can synchronize with each other quickly. Threads in different blocks cannot access the same shared memory, so they cannot directly exchange data or synchronize, and the degree of cooperation between threads in different blocks is therefore low. With this model, however, the program does not have to worry about how many threads the display chip can actually execute at the same time. For example, a display chip with few execution units may execute the threads of each block sequentially rather than simultaneously. Multiple blocks form a grid, and different grids can execute different programs (kernels). The relationship between grid, block, and thread is shown in the figure below:
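The following kernel is a minimal sketch (an assumed example, not from the original article) of how threads in one block cooperate through shared memory and fast in-block synchronization; it presumes a launch with 256 threads per block and an input large enough to cover the whole grid.

```
// Each block sums its 256 input elements using shared memory and __syncthreads().
__global__ void blockSum(const float *in, float *blockResults) {
    __shared__ float partial[256];                    // shared by the threads of this block only

    int tid = threadIdx.x;                            // index of the thread within its block
    int i   = blockIdx.x * blockDim.x + tid;          // global index across the grid

    partial[tid] = in[i];
    __syncthreads();                                  // fast synchronization inside a block

    // Tree reduction within the block; no synchronization is possible across blocks here.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockResults[blockIdx.x] = partial[0];   // one result per block
}
```

A host-side launch such as blockSum<<<numBlocks, 256>>>(d_in, d_blockResults) produces one partial sum per block; combining the per-block results requires another kernel launch or a copy back to the host, precisely because blocks cannot synchronize with each other.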


Each thread has its own registers and local memory. All threads in the same block share one shared memory. In addition, all threads (including threads in different blocks) share the same global memory, constant memory, and texture memory, while different grids have their own global, constant, and texture memory. This is shown in the figure below:
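These memory spaces can be sketched in one kernel; the names and sizes below are illustrative assumptions, and the kernel presumes a launch with 128 threads per block.

```
__constant__ float c_coeff[4];                 // constant memory: read-only, visible to all threads

__global__ void memorySpaces(const float *g_in, float *g_out) {
    __shared__ float s_tile[128];              // shared memory: one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r_val = g_in[i];                     // register: private to this thread

    s_tile[threadIdx.x] = r_val * c_coeff[0];  // threads of the same block cooperate here
    __syncthreads();

    g_out[i] = s_tile[threadIdx.x];            // global memory: visible to all threads
}
```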

Because the display chip is designed for large-scale parallel computation, it handles problems differently from a general-purpose CPU. The main differences include:

1. Memory access latency: the CPU usually uses caches to reduce the number of accesses to main memory and thus avoid the impact of memory latency on execution efficiency. The display chip mostly has no cache (or only a small one) and instead hides memory latency through parallel execution; that is, while the first thread is waiting for a memory read to complete, the chip switches to executing the second thread, and so on.

2. Branch instructions: the CPU usually relies on branch prediction and similar techniques to reduce the pipeline bubbles caused by branch instructions. The display chip mostly handles branches in a way similar to how it hides memory latency; even so, display chips usually handle branches relatively poorly.

Therefore, the problems best suited to CUDA are those with massive parallelism, which can effectively hide memory latency and make full use of the large number of execution units on the display chip. When using CUDA, it is normal to run thousands of threads simultaneously; if a problem cannot be decomposed into a large number of parallel tasks, CUDA cannot achieve its best efficiency. In this model, the CPU is responsible for controlling GPU execution, scheduling and assigning tasks, and doing some simple computation, while the bulk of the work that requires parallel computation is handed to the GPU. In addition, because the CPU can only access the video memory through the PCI Express interface, which is slow, such transfers should not be performed frequently, or efficiency will drop. Typically, the data is copied to the GPU memory at the beginning of the program, all computation is then done on the GPU, and the results are copied back to system memory only once they are needed.
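A minimal sketch of that pattern follows, with an assumed step kernel standing in for the real computation: the data is uploaded once, iterated on entirely in video memory, and downloaded once at the end.

```
#include <cuda_runtime.h>

// Placeholder per-element update; a real application would do its actual work here.
__global__ void step(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

void runOnGpu(float *h_data, int n, int iterations) {
    size_t bytes = n * sizeof(float);
    float *d_data = nullptr;

    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // one upload at the start

    for (int it = 0; it < iterations; ++it) {
        step<<<(n + 255) / 256, 256>>>(d_data, n);              // data stays in video memory
    }

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // one download at the end
    cudaFree(d_data);
}
```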


Reprinted from http://blog.csdn.net/carson2005/article/details/7694605

