Life of a Triangle - NVIDIA's logical pipeline



By Christoph Kubisch, posted Mar at 12:52pm. Tags: GameWorks, GameWorks Expert Developer, DX12, DX11

Since the release of the groundbreaking Fermi architecture almost 5 years have gone by, so it might be time to refresh the principal graphics architecture beneath it. Fermi was the first NVIDIA GPU implementing a fully scalable graphics engine, and its core architecture can be found in Kepler as well as Maxwell. The following article, and especially the "compressed pipeline knowledge" image below, should serve as a primer based on the various public materials, such as whitepapers or GTC tutorials about the GPU architecture. This article focuses on the graphics viewpoint of how the GPU works, although some principles, such as how shader program code gets executed, are the same for compute.

    • Fermi Whitepaper
    • Kepler Whitepaper
    • Maxwell Whitepaper
    • Fast Tessellated Rendering on Fermi GF100
    • Programming Guidelines and GPU Architecture Reasons Behind Them
Pipeline Architecture Image

GPUs are super parallel work distributors

Why all this complexity? In graphics we have to deal with data amplification that creates lots of variable workloads. Each drawcall may generate a different amount of triangles. The amount of vertices after clipping is different from what our triangles were originally made of. After back-face and depth culling, not all triangles may need pixels on the screen. The screen size of a triangle can mean it requires millions of pixels or none at all.

As a consequence modern GPUs let their primitives (triangles, lines, points) follow a logical pipeline, not a physical pipeline. Before G80's unified architecture (think DX9 hardware, PS3, Xbox 360), the pipeline was represented on the chip with the different stages, and work would run through it one after another. G80 essentially reused some units for both vertex and fragment shader computations, depending on the load, but it still had a serial process for the primitives/rasterization and so on. With Fermi the pipeline became fully parallel, which means the chip implements a logical pipeline (the steps a triangle goes through) by reusing multiple engines on the chip.

Let's say we have two triangles A and B. Parts of their work could be in different logical pipeline steps. A has already been transformed and needs to be rasterized. Some of its pixels could be running pixel-shader instructions already, while others are being rejected by the depth buffer (Z-cull), others could already be written to the framebuffer, and some may actually wait. And next to all that, we could be fetching the vertices of triangle B. So while each triangle has to go through the logical steps, lots of them could be actively processed at different steps of their lifetime. The job (getting the drawcall's triangles on screen) is split into many smaller tasks and even subtasks that can run in parallel. Each task is scheduled to the resources that are available, which isn't limited to tasks of a certain type (vertex-shading parallel to pixel-shading).

Think of a river that fans out: parallel pipeline streams that are independent of each other, each on its own timeline, some branching more than others. If we would color-code the units of a GPU based on the triangle, or drawcall, each is currently working on, it would be multi-colored blinkenlights :)

GPU Architecture

Since Fermi, NVIDIA has used a similar principal architecture. There is a Giga Thread Engine which manages all the work that's going on. The GPU is partitioned into multiple GPCs (Graphics Processing Cluster), each of which has multiple SMs (Streaming Multiprocessor) and one Raster Engine. There are lots of interconnects in this process, most notably a Crossbar that allows work migration across GPCs or other functional units like the ROP (Render Output Unit) subsystems.

The work a programmer thinks of (shader program execution) is done on the SMs. An SM contains many cores which do the math operations for the threads. One thread could be a vertex- or pixel-shader invocation, for example. Those cores and other units are driven by warp schedulers, which manage a group of 32 threads as a warp and hand over the instructions to be performed to dispatch units. The code logic is handled by the scheduler and not inside a core itself, which just sees something like "sum register 4234 with register 4235 and store in 4230" from the dispatcher. A core itself is rather dumb compared to a CPU, where a core is pretty smart. The GPU puts the smartness into higher levels; it conducts the work of an entire ensemble (or multiple, if you will).
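To make the thread/warp terminology concrete, here is a minimal CUDA sketch (my addition, not from the article): each launched thread plays the role of one shader invocation, and the hardware, not the code, groups 32 consecutive threads into a warp whose instructions the warp scheduler issues together.

    // CUDA sketch: one thread per "vertex"; warps of 32 are formed by hardware.
    __global__ void transformVertices(const float4* in, float4* out, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i >= count) return;                         // excess threads masked off
        float4 v = in[i];
        // trivial stand-in for a vertex-shader: scale the position
        out[i] = make_float4(v.x * 2.0f, v.y * 2.0f, v.z * 2.0f, v.w);
    }
    // launch example:
    // transformVertices<<<(count + 255) / 256, 256>>>(dIn, dOut, count);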

How many of these units are actually on the GPU (how many SMs per GPC, how many GPCs, ...) depends on the chip configuration itself. As you can see above, GM204 has 4 GPCs with 4 SMs each, but Tegra X1 for example has 1 GPC and 2 SMs, both with Maxwell design. The SM design itself (number of cores, instruction units, schedulers, ...) has also changed over time from generation to generation (see first image) and helped make the chips so efficient they can be scaled from high-end desktop to notebook to mobile.
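The SM count (though not the GPC grouping) can be queried at runtime; a minimal host-side CUDA sketch (my addition):

    // Host-side CUDA sketch: query how many SMs the current device exposes.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0
        // e.g. a full GM204 reports 16 SMs (4 GPCs x 4 SMs), Tegra X1 reports 2
        std::printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
        return 0;
    }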

The logical pipeline

For the sake of simplicity several details are omitted. We assume the drawcall references some index- and vertexbuffer that is already filled with data and lives in the DRAM of the GPU, and that it uses only a vertex- and pixelshader (GL: fragment shader).



    1. The program makes a drawcall in the graphics API (DX or GL). This reaches the driver at some point, which does a bit of validation to check if things are "legal", and inserts the commands in a GPU-readable encoding inside a pushbuffer (see the API sketch after this list). A lot of bottlenecks can happen here on the CPU side of things, which is why it's important that programmers use APIs and techniques that leverage the power of today's GPUs.
    2. After a while, or after explicit "flush" calls, the driver has buffered up enough work in a pushbuffer and sends it to be processed by the GPU (with some involvement of the OS). The Host Interface of the GPU picks up the commands, which are processed via the Front End.
    3. We start our work distribution in the Primitive Distributor by processing the indices in the indexbuffer and generating triangle work batches that we send out to multiple GPCs.
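For step 1, the API-side trigger can be as plain as a classic OpenGL draw. A minimal host-side sketch (my addition; it assumes a vertex array object `vao` and a shader `program` were created earlier, which the setup above implies but does not show):

    // The drawcall the driver validates and encodes into the pushbuffer.
    glUseProgram(program);
    glBindVertexArray(vao);                    // index- and vertexbuffer bindings
    glDrawElements(GL_TRIANGLES, indexCount,   // triangles via the indexbuffer
                   GL_UNSIGNED_INT, nullptr);
    glFlush();                                 // hint to submit buffered work now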


    4. Within a GPC, the Poly Morph Engine of one of the SMs takes care of fetching the vertex data from the triangle indices (Vertex Fetch).
    5. After the data has been fetched, warps of threads are scheduled inside the SM and will be working on the vertices.
    6. The SM's warp scheduler issues the instructions for the entire warp in order. The threads run each instruction in lock-step and can be masked out individually if they should not actively execute it. There can be multiple reasons for requiring such masking. For example when the current instruction is part of the "if (true)" branch and the thread-specific data evaluated to "false", or when a loop's termination criterion has been reached in one thread but not in another. Therefore having lots of branch divergence in a shader can increase the time spent for all threads in the warp significantly (see the divergence sketch after this list). Threads cannot advance individually, only as a warp! Warps, however, are independent of each other.
    7. A warp's instruction may be completed at once or may take several dispatches. For example, the SM typically has fewer units for load/store than for basic math operations.
    8. As some instructions take longer to complete than others, especially memory loads, the warp scheduler may simply switch to another warp that is not waiting for memory. This is the key concept of how GPUs overcome the latency of memory reads: they simply switch out groups of active threads. To make this switching very fast, all threads managed by the scheduler have their own registers in the register file. The more registers a shader program needs, the fewer threads/warps have space. The fewer warps we can switch between, the less useful work we can do while waiting for instructions to complete (foremost memory fetches).
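Because a warp executes in lock-step, a divergent branch makes it pay for both paths, with the inactive threads masked out. A minimal CUDA sketch of this (my addition; the odd/even condition is arbitrary and chosen to force divergence inside every warp):

    // CUDA sketch of intra-warp branch divergence: threads with an even
    // index take one path, odd ones the other, so each warp serially
    // executes BOTH paths with half its threads masked out each time.
    __global__ void divergent(float* data, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= count) return;
        if (i % 2 == 0)
            data[i] = sqrtf(data[i]);        // odd threads masked out here
        else
            data[i] = data[i] * data[i];     // even threads masked out here
    }

If the condition were uniform per warp (for example, based on the warp index), only one of the two paths would be issued.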


      9. Once the warp has completed all instructions of the vertex-shader, its results are processed by the Viewport Transform. The triangle gets clipped by the clipspace volume and is ready for rasterization. We use the L1 and L2 caches for all this cross-task communication data.
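The viewport transform itself is standard math; a minimal sketch (my addition, following the usual GL-style conventions) of taking a clip-space output position to window coordinates:

    // Sketch: clip space -> NDC -> window coordinates (GL conventions).
    struct Vec3 { float x, y, z; };
    struct Vec4 { float x, y, z, w; };

    Vec3 viewportTransform(Vec4 clip, float vpX, float vpY,
                           float vpW, float vpH)
    {
        // perspective divide yields normalized device coordinates in [-1, 1]
        Vec3 ndc = { clip.x / clip.w, clip.y / clip.w, clip.z / clip.w };
        // scale and bias into the viewport rectangle; depth into [0, 1]
        return { vpX + (ndc.x * 0.5f + 0.5f) * vpW,
                 vpY + (ndc.y * 0.5f + 0.5f) * vpH,
                 ndc.z * 0.5f + 0.5f };
    }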


      10. Now it gets exciting: our triangle is about to be chopped up and potentially leaves the GPC it currently lives on. The bounding box of the triangle is used to decide which raster engines need to work on it, as each engine covers multiple tiles of the screen. It sends the triangle out to one or multiple GPCs via the Work Distribution Crossbar. We effectively split our triangle into lots of smaller jobs now.
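A minimal sketch of that routing decision (my addition; TILE_SIZE and the rectangle-of-tiles ownership model are illustrative, not the hardware's actual parameters):

    // Sketch: which screen tiles does a triangle's bounding box overlap?
    #include <algorithm>
    #include <cmath>

    const int TILE_SIZE = 16;  // hypothetical tile edge length in pixels

    // Writes the inclusive tile-coordinate rectangle covered by the triangle;
    // each raster engine owning a tile in this range receives the triangle.
    void tilesForTriangle(float x0, float y0, float x1, float y1,
                          float x2, float y2,
                          int& tMinX, int& tMinY, int& tMaxX, int& tMaxY)
    {
        tMinX = (int)std::floor(std::min({x0, x1, x2})) / TILE_SIZE;
        tMinY = (int)std::floor(std::min({y0, y1, y2})) / TILE_SIZE;
        tMaxX = (int)std::floor(std::max({x0, x1, x2})) / TILE_SIZE;
        tMaxY = (int)std::floor(std::max({y0, y1, y2})) / TILE_SIZE;
    }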


    11. Attribute Setup at the target SM will ensure that the interpolants (for example the outputs we generated in a vertex-shader) are in a pixel-shader-friendly format.
    12. The Raster Engine of a GPC works on the triangles it received and generates the pixel information for those sections that it is responsible for (it also handles back-face culling and Z-cull).
    13. Again we batch up pixel threads, or better said, 8 times 2×2 pixel quads, which is the smallest unit we will always work with in pixel shaders. This 2×2 quad allows us to calculate derivatives for things like texture mip-map filtering (a big change in texture coordinates within the quad causes a higher mip; see the sketch after this list). Those threads within the 2×2 quad whose sample locations do not actually cover the triangle are masked out (gl_HelperInvocation). One of the local SM's warp schedulers will manage the pixel-shading task.
    14. The same warp scheduler instruction game that we had in the vertex-shader logical stage is now performed on the pixel-shader threads. The lock-step processing is particularly handy because we can access the values within a pixel quad almost for free, as all threads are guaranteed to have their data computed up to the same instruction point (NV_shader_thread_group).
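A minimal sketch (my addition) of the mip-selection math the 2×2 quad enables: texel-space coordinate deltas between horizontally and vertically adjacent quad pixels approximate the derivatives, and their magnitude picks the mip level, roughly as in the GL specification.

    // Sketch: mip level from texcoord derivatives across a 2x2 quad.
    // du*/dv* are texel-space coordinate deltas to the horizontal (dx)
    // and vertical (dy) neighbor inside the quad.
    #include <algorithm>
    #include <cmath>

    float mipLevel(float dudx, float dvdx, float dudy, float dvdy)
    {
        float lenX = std::sqrt(dudx * dudx + dvdx * dvdx);  // |ddx|
        float lenY = std::sqrt(dudy * dudy + dvdy * dvdy);  // |ddy|
        float rho  = std::max(lenX, lenY);      // worst-case texel footprint
        return std::max(0.0f, std::log2(rho));  // bigger change -> higher mip
    }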


    15. Are we there yet? Almost. Our pixel-shader has completed the calculation of the colors to be written to the rendertargets, and we also have a depth value. At this point we have to take the original API ordering of triangles into account before we hand that data over to one of the ROP (Render Output Unit) subsystems, which in itself has multiple ROP units. Here depth-testing, blending with the framebuffer and so on are performed. These operations need to happen atomically (one color/depth set at a time) to ensure we don't have one triangle's color and another triangle's depth value when both cover the same pixel. NVIDIA typically applies memory compression to reduce memory bandwidth requirements, which increases "effective" bandwidth (see GTX 980 pdf).
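A minimal sketch (my addition; a software restatement rather than the hardware's actual implementation, and blending is omitted) of the per-pixel update the ROP must perform atomically:

    // Sketch: the ROP's depth-test-then-write, restated in software.
    // In hardware, this read-test-write is atomic per pixel so that two
    // triangles covering the same pixel cannot interleave color and depth.
    struct Pixel { float depth; unsigned color; };

    void ropWrite(Pixel& dst, float srcDepth, unsigned srcColor)
    {
        if (srcDepth < dst.depth)    // LESS depth test
        {
            dst.depth = srcDepth;
            dst.color = srcColor;    // color and depth are updated together
        }
    }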

Puh! We are done: we have written some pixels into a rendertarget. I hope this information was helpful for understanding some of the work/data flow within a GPU. It may also help in understanding another side-effect: why synchronization with the CPU is really hurtful. One has to wait until everything is finished and no new work is submitted (all units become idle), which means that when sending new work, it takes a while until everything is fully under load again, especially on the big GPUs.

In the image below you can see how we rendered a CAD model and colored it by the different SMs or warp IDs that contributed to the image (NV_shader_thread_group). The result would not be frame-coherent, as the work distribution will vary frame to frame. The scene was rendered using many drawcalls, of which several may also be processed in parallel (using Nsight one can see some of that drawcall parallelism as well).



Further reading
    • A Trip Through the Graphics Pipeline by Fabian Giesen
    • Performance Optimization Guidelines and the GPU Architecture Behind Them by Paulius Micikevicius
    • Pomegranate: A Fully Scalable Graphics Architecture, which describes the concept of parallel stages and work distribution between them.

