CUDA: Behind the Phantom of Software Abstraction (Part 2)

This article originally appeared on my homepage, planckscale.info, and is reproduced here. Copyright notice: this is original work; reprinting is welcome, but reprints must indicate the source (planckscale.info), the author information, and this statement in the form of hyperlinks, otherwise legal liability will be pursued.

The previous article identified two factors crucial to CUDA's computational power: data parallelism, and the use of multithreading to hide latency. Next we will go deeper into their hardware implementations and see how these mechanisms actually work.

It is often said that a GPU has hundreds or even thousands of CUDA cores, which invites comparison with multicore CPUs. In fact, the two kinds of "core" are not the same concept: a CUDA core is the counterpart of an execution unit in a processor, responsible for carrying out arithmetic instructions, and it includes no control unit. The closer analogue of a CPU core is the streaming multiprocessor (SM; called SMX on Kepler and SMM on Maxwell). A GPU usually contains several SMs, and each SM holds dozens or hundreds of CUDA cores plus several warp schedulers (the equivalent of control units). For example, GM204 has 16 SMs, each with 128 CUDA cores and 4 warp schedulers.

Figure 1. SM structure diagram of GM204
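As an aside not in the original article, these per-device figures can be read at runtime. Below is a minimal sketch using the standard cudaGetDeviceProperties call (all field names come from the CUDA runtime's cudaDeviceProp structure):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device             : %s\n", prop.name);
    printf("SM count           : %d\n", prop.multiProcessorCount);
    printf("Warp size          : %d\n", prop.warpSize);
    printf("Max threads per SM : %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```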

Each SM contains a large register file; in the case of GM204 there are 64K 32-bit registers per SM, enough to feed thousands of threads. Another important resource in the SM is shared memory, which is exactly the hardware counterpart of shared memory in the software abstraction. In GM204, each SM has 96 KB of shared memory.
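To make this concrete, here is a minimal illustrative kernel (not from the original article; the kernel and its parameters are hypothetical). Ordinary local variables such as i live in the SM's registers, while the __shared__ array is allocated from the SM's shared memory and is visible to all threads of the block:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales n elements of `in` by `factor` into `out`.
// Assumes a 1-D launch with blockDim.x <= 256.
__global__ void scale(const float* in, float* out, float factor, int n)
{
    __shared__ float tile[256];                      // carved out of the SM's shared memory, one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // local variables like i live in registers

    if (i < n) tile[threadIdx.x] = in[i];            // stage the element through shared memory
    __syncthreads();                                 // every thread of the block reaches this barrier
    if (i < n) out[i] = tile[threadIdx.x] * factor;
}
```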

At this point the counterpart of the SM in the software abstraction should be apparent: it is the block. Let us start from this correspondence:
Block <-> SM
Thread execution <-> CUDA cores
Thread data <-> registers / local memory

Different blocks of the same grid are distributed to different SMs for execution. Several blocks may be resident on one SM at the same time, and they do not necessarily come from the same kernel function. Each thread's local variables are mapped to the SM's registers, and thread execution is carried out by the CUDA cores.

SM?" This is determined by the consumption of the hardware resources: Each SM occupies a certain number of registers and shared Memory, so the number of blocks that are simultaneously surviving on SM should not exceed the limits of these hardware resources. Since the SM can have a block from different kernel at the same time, sometimes even if the remaining resources on SM are not enough to accommodate a block of kernel A, it may still accommodate the next block of kernel B.

The next important question is how a block is actually executed. As we have seen, the CUDA cores on an SM are finite in number: they represent how many threads can truly run in physical parallel. In the software abstraction, all threads in a block execute in parallel; this is a logically unassailable abstraction, but in reality it is impossible to supply a block of arbitrary size with an equally large array of CUDA cores to execute all of its threads at once.
This is where the concept of the warp comes in: physically, a block is divided into chunks that are mapped onto the CUDA core array in turn, and each such chunk is called a warp. In current CUDA hardware, a warp is a group of 32 threads with consecutive threadIdx values, starting from threadIdx = 0; even if the final group has fewer than 32 threads left over, it still occupies a full warp. When configuring a kernel we often set the block size to an integer multiple of 32, precisely so that the block divides exactly into whole warps (the deeper reason is related to memory-access performance, but that is a separate matter from the warp size itself).
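A small sketch, not part of the original article, of how this division looks from inside a kernel: with a one-dimensional block, each thread can compute which warp it belongs to and its lane within that warp, and the launch configuration below uses a block size that is a multiple of 32 so the block divides into whole warps:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoAmI()
{
    int warpId = threadIdx.x / warpSize;   // which warp of the block this thread is in
    int lane   = threadIdx.x % warpSize;   // position within that warp, 0..31
    if (lane == 0)
        printf("block %d: warp %d starts at thread %d\n", blockIdx.x, warpId, threadIdx.x);
}

int main() {
    whoAmI<<<4, 128>>>();      // 4 blocks of 128 threads = 4 warps per block
    cudaDeviceSynchronize();   // wait so device-side printf output is flushed
    return 0;
}
```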
In the SM diagram of GM204 we can see that the SM is divided into four identical processing blocks, each with its own warp scheduler and 32 CUDA cores; this is where warps are executed. The execution of a warp is very similar to SIMD: the active threads in a warp are driven by the warp scheduler and execute in lockstep. In GM204, each group of 32 CUDA cores shares one warp scheduler. Some of the more complicated issues that can arise during warp execution are left for later.

Now the picture can be assembled: several blocks reside on an SM, each block owning its own registers and shared memory, and each block divided into warps of 32 threads. In this way, a large number of warps live on the SM, waiting to be dispatched onto the CUDA core array for execution.

The warp scheduler, as its name suggests, is the dispatcher in this world of warps. Whenever a warp stalls during execution (waiting on a memory read or write, for example), the warp scheduler quickly switches to the next runnable warp and issues instructions to it, until that warp stalls in turn, and so on. This is the hardware picture of what the previous article described as "masking latency with multithreading."

Figure 2. The GPU uses many warps to hide latency, vs. the CPU's model of computation

This figure is taken from the slide deck "CUDA Overview" by Cliff Woolley, NVIDIA.

As the figure shows, the GPU rapidly switches among many warps to hide latency, whereas the CPU relies on fast registers to reduce latency. An important difference between the two is the size of the register file: CPU registers are fast but few, so a context switch is expensive; GPU registers may be somewhat slower, but there are so many of them that every thread's context stays resident, which makes thread switching very fast.

How many threads does it take to hide a common latency? For GPUs, the most common latency is the read-after-write dependency on a register: a local variable is assigned and then read shortly afterwards, which incurs a delay of roughly 24 clock cycles. To cover this latency we need at least 24 warps taking turns: while one warp sits out its stall, the other 23 warps execute during the idle cycles and keep the hardware busy. On compute capability 2.0 hardware an SM has 32 CUDA cores, so assuming an average of one instruction per cycle, we need 24 × 32 = 768 threads to hide this latency.
Keeping the hardware busy is, in CUDA terms, maintaining full occupancy; occupancy is an important metric when optimizing CUDA programs.
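As a sketch (not from the original article; someKernel is a placeholder), the arithmetic above can be written out, and the standard runtime call cudaOccupancyMaxPotentialBlockSize can suggest a block size that maximizes occupancy for a given kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only so the occupancy API has something to inspect.
__global__ void someKernel(float* data) { }

int main() {
    // Back-of-the-envelope figure from the text: ~24-cycle register read-after-write
    // latency, 32 CUDA cores per SM, ~1 instruction per core per cycle.
    const int latencyCycles = 24;
    const int coresPerSM    = 32;
    printf("Threads per SM needed to hide the latency: %d\n", latencyCycles * coresPerSM); // 768

    // Ask the runtime for a block size that maximizes occupancy for this kernel.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, someKernel, 0, 0);
    printf("Suggested block size for maximum occupancy: %d\n", blockSize);
    return 0;
}
```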

(To be continued.)
