Graphics Pipeline Tour, Part 3


Original: "A trip through the Graphics Pipeline 2011". Translation: Sword of the Past. Please credit the source when reprinting.

At this point, we have followed a draw call from the application through several driver layers and the command processor. Now, at last, we get to do some actual graphics work. In this part, we look at the vertex pipeline. But before we start... some names.

The 3D pipeline we are dealing with consists of several stages, each with a specific function. Let me name the stages we will be talking about - mostly following D3D10/11 naming conventions - along with their usual abbreviations. We will meet all of them over the course of this journey, but it will take a while to see them all in action, so here is a one-line summary of what each stage does:
    • IA - Input Assembler. Reads index and vertex data.
    • VS - Vertex Shader. Gets input vertex data, writes out processed vertex data for the next stage.
    • PA - Primitive Assembly. Reads the vertices and assembles them into primitives to pass on.
    • HS - Hull Shader. Accepts patch primitives, writes transformed (or untransformed) patch control points to the Domain Shader, plus extra data that drives tessellation.
    • TS - Tessellator stage. Creates vertices and connectivity for tessellated lines or triangles.
    • DS - Domain Shader. Takes the shaded control points and extra data from the HS and the tessellated positions from the TS, and turns them into vertices again.
    • GS - Geometry Shader. Takes primitives, optionally with adjacency information, and outputs different primitives.
    • SO - Stream-Out. Writes GS output (i.e. transformed primitives) to a memory buffer.
    • RS - Rasterizer. Rasterizes primitives.
    • PS - Pixel Shader. Gets interpolated vertex data, outputs pixel colors. Can also write to UAVs (Unordered Access Views).
    • OM - Output Merger. Gets shaded pixels from the PS, performs blending, and writes them back to the render target.
    • CS - Compute Shader. A pipeline all by itself. Its only inputs are constant buffers and the thread ID; it can write to buffers and UAVs.
Now that that is out of the way, here is a list of the data flow paths through the pipeline, in the order I will cover them (I am not counting the IA, PA, RS and OM stages here; they are not related to the subject in the same way - they do not do much to the data, they mostly just rearrange and route it, like glue):
    1. VS→PS: The classic path with a long history. In the D3D9 era, this was the whole pipeline. It is still the most important path for regular rendering. I will walk through it from start to finish, then circle back for the fancier paths.
    2. VS→GS→PS: Geometry shading (new in D3D10).
    3. VS→HS→TS→DS→PS, VS→HS→TS→DS→GS→PS: Tessellation (new in D3D11).
    4. VS→SO, VS→GS→SO, VS→HS→TS→DS→GS→SO: Stream-out (with and without tessellation).
    5. CS: Compute (new in D3D11).
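For reference, the stage chains above can be written out as plain data. This is just a summary table in code form; the dictionary name and the labels are mine, not part of any API:

```python
# Data flow paths through the D3D10/11 pipeline, keyed by an informal label.
PIPELINE_CONFIGS = {
    "classic":          ("VS", "PS"),
    "geometry":         ("VS", "GS", "PS"),
    "tessellation":     ("VS", "HS", "TS", "DS", "PS"),
    "tessellation_gs":  ("VS", "HS", "TS", "DS", "GS", "PS"),
    "stream_out":       ("VS", "SO"),
    "gs_stream_out":    ("VS", "GS", "SO"),
    "tess_stream_out":  ("VS", "HS", "TS", "DS", "GS", "SO"),
    "compute":          ("CS",),
}
```

Note that every chain ends either in the PS (rasterized output), in SO (a memory buffer), or is the standalone CS pipeline.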
Now you know what is coming; let's get started on vertex shaders.

Input Assembler stage

The first thing that happens here is loading indices from the index buffer - if this is an indexed batch. If not, just pretend there was an identity index buffer (0 1 2 3 4 ...) and use that instead. If there is an index buffer, its contents are not read from memory directly: the IA stage usually accesses the index/vertex buffers through a data cache. Note also that index buffer reads (in fact, all resource accesses in D3D10 and up) are bounds checked; if you reference elements outside the index buffer (say, by executing a DrawIndexed with IndexCount set to 6 on an index buffer holding only 5 indices), all out-of-bounds reads return 0. Which (in this particular case) is completely useless, but well defined. Similarly, you can issue a DrawIndexed with a NULL index buffer bound - this behaves the same as an index buffer of length 0, i.e. all reads are out of bounds and simply return 0. With D3D10 and up, you have to work a bit harder to get into the realm of undefined behavior. :)

Once we have the indices, we have the per-vertex and per-instance data to read from the input vertex streams (the instance ID at this stage is just a simple counter). This part is straightforward - we have a declaration of the data layout; we read it from the cache/memory and unpack it into the floating-point format that the shader core takes as input. However, this read does not happen immediately: the hardware runs a cache of shaded vertices, so that vertices referenced by multiple triangles (in a regular closed mesh, each vertex is referenced by about 6 triangles!) need not be shaded again every time - we simply reference the data that has already been shaded.
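The identity-index fallback and the bounds-checked reads just described can be modeled in a few lines. This is a toy CPU-side sketch, not the D3D API: the function names and the plain-list index buffer are my own assumptions.

```python
def ia_fetch_index(index_buffer, i):
    # D3D10+ semantics: reads past the end of the buffer (or from a
    # NULL buffer) are bounds checked and return 0.
    if index_buffer is None or i >= len(index_buffer):
        return 0
    return index_buffer[i]

def ia_read_indices(index_buffer, index_count, indexed=True):
    # For a non-indexed draw, pretend there was an identity index
    # buffer (0 1 2 3 4 ...) and use that instead.
    if not indexed:
        return list(range(index_count))
    return [ia_fetch_index(index_buffer, i) for i in range(index_count)]
```

For example, a draw with IndexCount = 6 on a 5-entry index buffer yields the 5 stored indices followed by a 0 for the out-of-bounds read.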
Vertex Caching and Shading

Note: this section is in part conjecture. It is based on comments from people knowledgeable about modern GPUs, but they only told me the "what", not the "why", so everything here is inference on my part. Also, I am simply guessing some of the details. In other words, I am only describing things that I consider trustworthy; I cannot guarantee that actual hardware is implemented this way, and I have surely missed some tricks and details.

For a long time (up to and including the shader model 3.0 generation of GPUs), vertex and pixel shaders were implemented with different units that had different performance trade-offs, and vertex caches were a fairly simple affair: usually just a FIFO holding a small number of vertices, with enough space for a worst-case number of output attributes, each entry tagged with its vertex index. Fairly straightforward stuff.

Then unified shaders came along. If you unify the handling of two different kinds of shaders, the design is necessarily a compromise. On the one hand, a vertex shader deals with maybe a million vertices per frame, while a pixel shader filling the screen at 1920x1200 touches at least 2.3 million pixels per frame - and more if you actually want to render something interesting. So guess which of the two units ends up calling the shots?

The workaround: instead of the old vertex shader unit that shaded only a few vertices at a time, use a large number of unified shader units designed for maximum throughput rather than low latency, which expect work in sizable batches (how big? currently, the sweet spot seems to be somewhere between 16 and 64 vertices shaded per batch).

So, if you do not want shading efficiency to drop, you need to collect 16-64 vertex cache misses before you can dispatch one vertex shading load. However, the whole FIFO idea does not play well with batching up cache misses and shading them all in one go. The problem is this: if you shade a whole batch of vertices at once, you can only start assembling triangles after all of those vertices have finished shading.
At that point, you have just added a whole batch of vertices (say 32) to the tail of the FIFO, which means 32 old vertices now fall out of the front - but each of those 32 vertices might be a vertex cache hit for one of the triangles we are assembling in the current batch! Oops, that does not work. Clearly, we cannot count the 32 oldest vertices in the FIFO as vertex cache hits, because by the time we would reference them, they would be gone. So how big do we want this FIFO to be? If we shade 32 vertices per batch, it needs at least 32 entries; but since we cannot use the 32 oldest entries (we would be pushing them out as we go), each batch effectively starts with an empty FIFO. Make it bigger then - 64 entries? That is pretty big. And note that every vertex cache lookup involves comparing the tag (vertex index) against all the tags in the FIFO - fully parallel, sure, but also power hungry; we are effectively implementing a fully associative cache here. Also, what do we do between dispatching a shading load of 32 vertices and receiving the results - just wait? Shading takes a few hundred cycles; waiting would be a terrible idea! Maybe keep two shading loads in flight, in parallel? But now the FIFO needs to be at least 64 entries long, and we cannot count the last 64 entries as cache hits, because they will all have fallen out of the queue by the time we receive the results. And would one FIFO really feed a large number of shader cores? Amdahl's law - putting a strictly serial component (one that cannot be parallelized) into a parallel pipeline is a surefire way to make it the bottleneck.

The whole FIFO approach just does not fit this environment, so - out it goes. Back to the drawing board. What do we actually want? Batches of a reasonable size to shade, without shading vertices unnecessarily.
So, keep it simple: reserve enough buffer space for 32 vertices (one batch), plus a cache tag array with space for 32 entries. Start with an empty "cache", i.e. all tag entries invalid. For every primitive in the index buffer, look up each of its indices; if it hits the cache, fine. If it misses, allocate a slot in the current batch and add the new index to the cache tag array. Once we no longer have enough space left to add a new primitive, dispatch the whole batch for vertex shading, save the cache tag array (i.e. the 32 indices of the vertices we just shaded), and start the next batch with a fresh, empty cache - which ensures that the batches are completely independent.

Each batch keeps a shader unit busy for a while (probably at least a few hundred cycles!). But that is no problem, because we have plenty of shader units - just pick a different one to execute each batch! We get the results back efficiently and in parallel. Once they arrive, we can use the saved cache tags and the original index buffer data to assemble primitives and send them down the pipeline (this is the "primitive assembly" I will get to later in this part).

By the way, when I said "get the results back" - where do they end up? There are two main options: 1. a dedicated buffer, or 2. some general cache/scratchpad memory. It used to be option 1, with a fixed layout organized around the vertex data (say, 16 float4 attributes per vertex), but lately GPUs have been moving towards option 2, i.e. "just memory". This is more flexible, and has the significant advantage that other shader stages can use the same memory, whereas a dedicated vertex cache is useless for, say, pixel shading or compute work.
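The batching scheme described above can be sketched as follows. This is a toy model under my own assumptions (an indexed triangle list as input, a Python list standing in for the cache tag array); real hardware will differ in the details:

```python
def batch_vertices(indices, batch_size=32):
    # Split an indexed triangle list into independent vertex-shading
    # batches. Each batch carries its own cache tag array (the vertex
    # indices it shades) plus its triangles, remapped to slot numbers
    # within the batch.
    batches = []
    tags, local_tris = [], []          # cache tag array, remapped triangles
    for t in range(0, len(indices) - 2, 3):
        tri = indices[t:t + 3]
        # Cache lookup: which of this triangle's indices miss?
        missing = [i for i in dict.fromkeys(tri) if i not in tags]
        if len(tags) + len(missing) > batch_size:
            # Not enough room left: dispatch this batch for shading and
            # start the next one with a fresh, empty cache.
            batches.append((tags, local_tris))
            tags, local_tris = [], []
            missing = list(dict.fromkeys(tri))
        tags.extend(missing)           # allocate slots for the misses
        local_tris.append(tuple(tags.index(i) for i in tri))
    if tags:
        batches.append((tags, local_tris))
    return batches
```

Primitive assembly then only needs each batch's saved tag array: the remapped triangles index directly into that batch's shaded output, and no batch ever references another batch's vertices.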
(Diagram omitted: the data flow for vertex shading as described so far.)

Shader Unit Internals

In short: shader units look about the way you would expect after seeing the disassembly output of compiled HLSL (fxc's disassembly dump is your friend). They are processors that are good at executing that kind of code, and the hardware takes care of translating the shader bytecode into its internal instruction format. Unlike the things I have been describing so far, this area is well documented - if you are interested, look up conference presentations from AMD and NVIDIA, or read the documentation for the CUDA/Stream SDKs.

To summarize: fast ALUs built mostly around an FMAC (floating multiply-accumulate) unit, some hardware support for reciprocal, reciprocal square root, log2, exp2, sin and cos; optimized for high throughput and high density, not low latency; a large number of threads in flight to cover that latency; very few registers per thread (because there are so many threads running!); very good at executing straight-line code, bad at branches (especially incoherent ones).

All of the above is common to pretty much all implementations. There are some differences: AMD hardware used to stick close to the 4-wide SIMD implied by HLSL/GLSL and shader bytecode (though it has moved away from that recently), while NVIDIA switched from 4-way SIMD to scalar instructions a while back. Again, all of this is documented on the web.

Closing Remarks

Once more, the disclaimer for the "Vertex Caching and Shading" section: part of it is conjecture on my part, so take it with a grain of salt. I am not going into the details of how the cache/buffers are managed; the buffer size depends on the batch size and the number of vertex output attributes. Buffer sizing and management matter a lot for performance, but I cannot meaningfully explain them here, and would not want to; while interesting, this part is very specific to whichever hardware we happen to be talking about.
