Reprint: NVIDIA GPU Architecture

Source: Internet
Author: User

http://blog.itpub.net/23057064/viewspace-629236/

NVIDIA graphics cards on the market are based on the Tesla architecture, divided into the G80, G92, and GT200 series. The Tesla architecture is a scalable processor array. Each GT200 GPU contains 240 stream processors (streaming processors, SPs), and every 8 stream processors make up one stream multiprocessor (streaming multiprocessor, SM), for a total of 30 stream multiprocessors. When the GPU works, the workload is passed from the CPU to GPU memory over the PCI-E bus and is then distributed downward through the levels of the architecture's hierarchy. Under the PCI-E 2.0 specification, each lane transfers data at 5.0 Gbit/s in each direction, so a PCI-E 2.0 x16 slot provides 5.0 × 16 = 80 Gbit/s = 10 GB/s of raw bandwidth per direction; after the 8b/10b line encoding is accounted for, the effective bandwidth is 8 GB/s. The PCI-E 3.0 specification roughly doubles this, to about 16 GB/s for an x16 slot. In practice, PCI-E packet overhead reduces the usable bandwidth further, to roughly 5-6 GB/s for PCI-E 2.0 x16.
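For readers who want to see the effective PCI-E bandwidth of their own system, a minimal sketch using the CUDA runtime is below; it assumes a CUDA toolkit is installed, and the 256 MB transfer size is an arbitrary choice. Pinned (page-locked) host memory is used, since pageable-memory transfers are noticeably slower.

```cuda
// Minimal sketch: measure effective host-to-device PCI-E bandwidth.
// The 256 MB buffer size is an arbitrary choice for this example.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const size_t bytes = 256u << 20;           // 256 MB transfer
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);             // pinned memory for full PCI-E speed
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host->Device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```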
In the GT200 architecture, every 3 SMs make up a TPC (thread processing cluster), while in the G80 architecture a TPC contains 2 SMs. G80 has 8 TPCs, so it has 128 stream processors (2 × 8 × 8); in GT200 the TPC count grows to 10, giving 240 stream processors (3 × 10 × 8). Each TPC also contains a texture pipeline.
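The SM count of an installed device can be queried at run time with the standard cudaGetDeviceProperties call; on Tesla-architecture parts, multiplying it by 8 gives the SP count. A minimal sketch:

```cuda
// Minimal sketch: query the multiprocessor (SM) count of device 0.
// On Tesla-architecture parts, SP count = SM count x 8 (e.g., 30 x 8 = 240 on GT200).
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
    return 0;
}
```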
Much of the time the streaming processor is simply called a "processor", which is not really accurate: that name implicitly puts it on the same footing as a CPU, but a CPU has its own complete input/output machinery while a streaming processor does not; one symptom is that you could not call printf from GPU code on this architecture. It is more appropriate to compare the SM to a CPU core: like a modern CPU core, an SM has a complete front end.
Each SM of GT200 and G80 contains 8 stream processors. Stream processors also go by other names, such as thread processors or "cores", and the latest Fermi architecture gives them a new name: CUDA cores. An SP is not a standalone processor core: it has its own registers and program counter (PC), but it lacks the fetch and dispatch units that would form a complete front end (the SM provides those). As a result, an SP is more akin to a single pipeline in a modern multithreaded CPU. Each time the SM issues an instruction, the 8 SPs each execute it 4 times, so a warp of 32 threads is the smallest execution unit of the Tesla architecture. Because the SPs run at twice the frequency of the other units in the SM, on-chip memory can be accessed once every two SP cycles, so the 32 threads of a warp are split into two half-warps for memory access; the half-warp is thus the basic granularity of a memory operation. The warp size affects both operation latency and memory-access latency, and the warp size of 32 is the result of NVIDIA's overall tradeoff.
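The warp grouping is visible directly in the thread indexing: threads 0-31 of a block form warp 0, threads 32-63 form warp 1, and so on. A minimal illustrative sketch (the kernel and parameter names are made up for this example):

```cuda
// Minimal sketch: derive warp id and lane id from the thread index.
// Threads with the same warp id execute in lockstep on the SM's 8 SPs.
__global__ void show_warp_layout(int *warp_id, int *lane_id) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warp_id[tid] = threadIdx.x / 32;   // which warp within the block
    lane_id[tid] = threadIdx.x % 32;   // position within the warp
}
```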
An SM's main execution resources are its 8 32-bit ALU and MAD (multiply-add) units. They operate on IEEE-compliant single-precision floating-point numbers (corresponding to float) and 32-bit integers (corresponding to int or unsigned int). Each operation takes 4 clock cycles (SP cycles, not core cycles). Because a four-stage pipeline is used, in each clock cycle the ALU/MAD units can fetch the operands for 8 of the 32 threads in a warp, and then perform the operation and write back the result over the next 3 clock cycles.
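In CUDA C the MAD unit is reached through the ordinary multiply-add pattern; for this architecture the compiler typically folds x * a + b into a single MAD instruction. A minimal sketch (names are illustrative):

```cuda
// Minimal sketch: one multiply-add per thread; on this hardware the
// pattern x * a + b typically compiles to a single MAD instruction.
__global__ void saxpy_like(float *x, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = x[i] * a + b;   // multiply and add in one MAD operation
}
```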
Each SM also contains a shared memory. In general-purpose parallel computing, shared memory is used for sharing data and for communication among the threads of a block; because it is on-chip it is extremely fast, so it is also widely used to optimize program performance.
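A typical use, sketched below, is staging data in shared memory so the threads of a block can combine it, with __syncthreads() ordering the accesses. This minimal block-wide sum assumes a block size of 256 threads (a power of two):

```cuda
// Minimal sketch: block-level sum using shared memory for intra-block
// communication. Assumes blockDim.x == 256 (a power of two).
__global__ void block_sum(const float *in, float *out) {
    __shared__ float buf[256];                 // on-chip, shared by the block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // all loads done before reading

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];     // pairwise partial sums
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];              // one partial sum per block
}
```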
Each SM also has two special function units (special function unit, SFU) that execute transcendental functions and attribute interpolation (interpolating per-pixel values from vertex attributes). Most instructions executed by the SFU have a latency of 16 clock cycles, while complex operations composed of multiple instructions, such as square root or exponentiation, take 32 or more clock cycles. The interpolation part of the SFU contains several 32-bit floating-point multipliers, which can perform multiplications independently of the floating-point unit (float processing unit, FPU). The SFU actually contains two execution units, each serving 4 of the SM's 8 pipelines. Multiplication instructions issued to the SFU likewise take only 4 clock cycles.
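In CUDA C, the fast SFU-backed versions of transcendental functions are exposed as intrinsics such as __sinf and __expf (the standard sinf/expf are slower but more accurate software routines). A minimal sketch contrasting the two:

```cuda
// Minimal sketch: SFU-backed fast intrinsics vs. the accurate library calls.
__global__ void transcendental(const float *x, float *fast, float *precise, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]    = __expf(x[i]) * __sinf(x[i]);  // SFU intrinsics: fast, less accurate
        precise[i] = expf(x[i]) * sinf(x[i]);      // software routines: slower, accurate
    }
}
```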
In GT200, each SM also has one double-precision unit for double-precision calculations, but its throughput is less than 1/8 that of single precision.
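As a minimal sketch of using that unit: double-precision code must be compiled for compute capability 1.3 (nvcc -arch=sm_13); otherwise nvcc demotes double to float with a warning.

```cuda
// Minimal sketch: the same multiply-add in double precision. On GT200 this
// requires nvcc -arch=sm_13; throughput is a fraction of the float version,
// since each SM has only one double-precision unit.
__global__ void axpy_double(double *x, double a, double b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = x[i] * a + b;
}
```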
Control-flow instructions (CMP, compare instructions) are executed by the branch unit. The GPU has no branch-prediction mechanism: when the threads of a warp diverge at a branch, the warp executes each branch path serially, suspending the threads that are not on the current path until all paths have been executed, which can greatly degrade performance.
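Divergence only costs performance when threads within one warp take different paths; a condition that is uniform across each warp branches without serialization. A minimal sketch contrasting the two cases (names are illustrative; assumes the block size is a multiple of 32):

```cuda
// Minimal sketch: divergent vs. warp-uniform branching.
__global__ void branching(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent: odd and even lanes of the same warp take different paths,
    // so the warp executes both paths one after the other.
    if (i % 2 == 0) x[i] *= 2.0f;
    else            x[i] += 1.0f;

    // Warp-uniform: all 32 threads of a warp share one (i / 32) value,
    // so each warp takes exactly one path and nothing serializes.
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;
    else                   x[i] += 1.0f;
}
```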
