Our application has now made its way through the various driver layers and the command processor. What we are going to do today is some real graphics processing: I will introduce the vertex pipeline. But before we start, let's get familiar with the cast of characters.
Abbreviations
The 3D pipeline has multiple stages, each of which does a specific job. The names below will come up again later; most of them match the official D3D10/11 names, with a few extra abbreviations added. I will cover each of them one by one in later chapters, over the course of this series. OK, first the abbreviations and their brief descriptions.
- IA -- Input Assembler: reads index and vertex data.
- VS -- Vertex Shader: takes input vertex data and outputs processed vertex data for the next stage.
- PA -- Primitive Assembly: reads vertices, assembles them into primitives, and passes them on.
- HS -- Hull Shader: receives patch primitives and outputs (transformed or untransformed) patch control points as input for the Domain Shader, plus extra data that drives tessellation.
- TS -- Tessellator stage: creates new vertices and connectivity for tessellated lines and triangles.
- DS -- Domain Shader: takes the shaded control points and extra data from the HS and the tessellated positions from the TS and turns them into a stream of vertices.
- GS -- Geometry Shader: its input is primitives, optionally with adjacency information, and its output is (possibly different) primitives. It mainly serves as a hub in the pipeline.
- SO -- Stream-Out: writes GS output (i.e., transformed primitives) to buffers in memory.
- RS -- Rasterizer: rasterizes primitives.
- PS -- Pixel Shader: takes interpolated vertex data and outputs pixel colors. It can also write to UAVs (unordered access views).
- OM -- Output Merger: takes pixels from the PS, performs alpha blending, and writes them back to the render targets.
- CS -- Compute Shader: a pipeline of its own. Its only inputs are constant buffers and the thread ID; it can write to buffers and UAVs.
The following lists the various data flow paths; I will cover them in order.
- VS -> PS: the old programmable pipeline. In the D3D9 era, this was all you got to control. It is still by far the most important path for regular rendering. I will walk this path from start to finish first, and then cover the more advanced paths (a minimal binding sketch follows this list).
- VS -> GS -> PS: adds geometry shading (new in D3D10).
- VS -> HS -> TS -> DS -> PS, VS -> HS -> TS -> DS -> GS -> PS: adds tessellation (new in D3D11).
- VS -> SO, VS -> GS -> SO, VS -> HS -> TS -> DS -> GS -> SO: adds stream output (with tessellation optional).
- CS: adds compute shaders (new in D3D11).
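To make the basic VS -> PS path concrete, here is a minimal D3D11 binding sketch. It assumes the device context, the compiled shaders, the buffers and the input layout have already been created; those objects, the vertex stride and the function name are placeholders for this example, not anything from the original text.

```cpp
// Minimal sketch of driving the classic VS -> PS path in D3D11.
// Assumes device/context creation, shader compilation and buffer
// creation (vs, ps, vb, ib, inputLayout) have already happened.
#include <d3d11.h>

void DrawOneBatch(ID3D11DeviceContext* ctx,
                  ID3D11InputLayout* inputLayout,
                  ID3D11Buffer* vb, ID3D11Buffer* ib,
                  ID3D11VertexShader* vs, ID3D11PixelShader* ps,
                  UINT indexCount)
{
    const UINT stride = 32;  // bytes per vertex; must match the input layout
    const UINT offset = 0;

    // IA: describe where indices/vertices come from and how to decode them.
    ctx->IASetInputLayout(inputLayout);
    ctx->IASetVertexBuffers(0, 1, &vb, &stride, &offset);
    ctx->IASetIndexBuffer(ib, DXGI_FORMAT_R16_UINT, 0);
    ctx->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);

    // VS and PS: the two programmable stages on this path.
    ctx->VSSetShader(vs, nullptr, 0);
    ctx->PSSetShader(ps, nullptr, 0);

    // Kick off the draw; everything discussed below (index fetch,
    // vertex caching, shading, primitive assembly) happens from here on.
    ctx->DrawIndexed(indexCount, 0, 0);
}
```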
Now you know what's coming. Let's get started with vertex shading.
Input Assembler stage
The first thing that happens here is loading indices from the index buffer -- assuming this is an indexed batch. If it isn't, we pretend there is an identity index buffer (0 1 2 3 4 ...) and use that instead. The contents of the index buffer are generally not read straight from memory; the IA usually has a data cache in front of index/vertex buffer accesses. Note that index buffer reads are bounds-checked (as are all resource accesses in D3D10+): if you reference something outside the original index buffer (for example, calling DrawIndexed with IndexCount = 6 on a buffer that holds only five indices), all out-of-bounds reads return zero. Similarly, you can call DrawIndexed with a NULL index buffer bound, which behaves like an index buffer of size 0: every read is out of bounds and returns zero. With the indices in hand, we fetch all the required per-vertex and per-instance data from the input vertex streams (at this stage, the current instance ID is just a counter -- very simple). This part is quite straightforward: we have a declaration of the data layout, so we just read from the buffers/memory and unpack the data into the floating-point format the shader core expects. However, this read doesn't happen immediately for every vertex: the hardware runs a cache of shaded vertices, so a vertex referenced by multiple triangles doesn't need to be shaded every time -- we can simply reference the already-shaded data.
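To make the fetch rules above concrete, here is a minimal C++ sketch, assuming a toy in-memory representation; the names (IndexBuffer, FetchIndex) are invented for illustration and don't correspond to any real API or hardware.

```cpp
#include <cstdint>
#include <vector>

// Illustrative model of the IA index fetch rules described above:
// out-of-bounds reads return 0, and a non-indexed (or NULL index
// buffer) draw behaves like an identity index stream 0, 1, 2, 3, ...
struct IndexBuffer {
    std::vector<uint32_t> data;  // an empty vector models a NULL index buffer
};

uint32_t FetchIndex(const IndexBuffer* ib, uint32_t i)
{
    if (ib == nullptr)              // non-indexed draw: identity indices
        return i;
    if (i >= ib->data.size())       // D3D10+ bounds check: OOB reads return 0
        return 0;
    return ib->data[i];
}
```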
Vertex caching and shading
Note: this part is, to some extent, "speculation". Both topics in the heading are based on public comments made by people in the know about current GPUs, but those comments only cover the "what", not the "why", so some personal inference is involved, and I am simply guessing at a few details. In other words, what I describe here is my own understanding -- but I am confident it is plausible and trustworthy.

For a long time (up to and including shader model 3.0 GPUs), vertex shaders and pixel shaders ran on different units with different performance trade-offs, and vertex caching was fairly simple: usually a small FIFO (a dozen or two dozen vertices), with enough space for the worst-case number of output attributes, using vertex indices as tags. As I said, quite simple and straightforward.

Then unified shaders arrived. If you unify two different kinds of shader, this design has to change. Think about it: on the one hand, for a typical application the vertex shading units have to deal with on the order of 1 million vertices per frame; on the other hand, the pixel shading units need at least 2.3 million pixels per frame just to fill a 1920x1200 screen once -- and more than that if you want to render anything fancier. Which unit ends up being the bottleneck?

OK, here's the solution: drop the outdated vertex shading unit that processes one vertex at a time and replace it with a powerful unified shading unit that is designed for maximum throughput rather than low latency. Since then, we have been working in large batches (how large? currently 16 to 64 vertices per batch).

If you don't want shading efficiency to drop, you need to buffer 16 to 64 vertices until you can dispatch a vertex shading load -- but you can't shade the whole FIFO in one go. Problem: if you shade a whole batch of vertices at a time, you have to wait until all of them are shaded before you can assemble any of them into triangles. By then you have appended a whole batch (say 32) of vertices at the end of the FIFO, which means 32 vertices have been pushed out the other end -- and each of them might have been a vertex cache hit for the triangle we are assembling right now. That obviously doesn't work: we cannot count the 32 oldest vertices in the FIFO, the ones about to be evicted, as vertex cache hits.

Also, how big does the FIFO have to be? If we shade 32 vertices per batch, it needs at least 32 entries, and we can't make use of the oldest 32 entries (because they are being shaded/evicted), which means every batch starts with an effectively empty FIFO. So make it bigger -- 64 entries? That's getting fairly large. Note that every vertex cache lookup compares the tag (vertex index) against all tags in the FIFO; this is highly parallel, but also expensive -- we are effectively implementing a fully associative cache here. And what do we do during the time between dispatching a 32-vertex shading load and receiving the results -- just wait? Shading takes a few hundred cycles, and stalling would be silly. Run two shading loads in parallel, then? But now the FIFO needs at least 64 entries, and we cannot count the last 64 entries as vertex cache hits, because they will have been evicted by the time we receive the results. And how does one FIFO stack up against many shader cores? Don't forget that Amdahl's law applies here.
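To picture the old-style design described above, here is a toy C++ model of such a FIFO of shaded vertices tagged by vertex index -- a sketch under the assumptions above, not how any particular GPU implements it; all names are made up.

```cpp
#include <cstdint>
#include <deque>
#include <optional>

// Toy model of the classic pre-unified-shader vertex cache:
// a small FIFO of shaded vertices, tagged by vertex index.
struct ShadedVertex { /* worst-case space for all output attributes */ };

class FifoVertexCache {
public:
    explicit FifoVertexCache(size_t capacity) : capacity_(capacity) {}

    // Returns the shaded vertex on a hit, nothing on a miss.
    std::optional<ShadedVertex> Lookup(uint32_t index) const {
        for (const auto& e : fifo_)               // compare against every tag
            if (e.tag == index) return e.vertex;  // (fully associative lookup)
        return std::nullopt;
    }

    // On a miss, the newly shaded vertex is pushed in and the
    // oldest entry falls out of the FIFO.
    void Insert(uint32_t index, const ShadedVertex& v) {
        if (fifo_.size() == capacity_) fifo_.pop_front();
        fifo_.push_back({index, v});
    }

private:
    struct Entry { uint32_t tag; ShadedVertex vertex; };
    std::deque<Entry> fifo_;
    size_t capacity_;
};
```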
This kind of unified FIFO just isn't a good fit for that environment, so let's start over. What do we actually want? To get batches of vertices of a reasonable size to shade, without shading more vertices than necessary.

So, simply put: reserve enough buffer space for a batch of 32 vertices (the batch size), plus cache tag space for 32 entries. Start with an empty "cache", i.e. all entries invalid. For every primitive in the index buffer, look up all of its indices. On a cache hit, fine; on a miss, allocate a slot in the current batch and add the new index to the cache tag array. Once there is no longer enough space left to add a new primitive, dispatch the whole batch to a vertex shading unit, save the cache tag array (i.e., the 32 indices of the vertices we just shaded), and start the next batch with an empty cache, so that batches are fully independent of one another.

Each batch will keep a shading unit busy for a while (at least a few hundred cycles). But that's no problem, because we have plenty of shading units -- just pick another unit to run the next batch. Plenty of parallelism. Eventually the results come back, at which point we use the saved cache tags and the original index buffer data to assemble primitives and send them down the pipeline -- this is "primitive assembly", which I will get to later.

By the way, what does "the results come back" mean -- where do they end up? There are two options: 1. a dedicated buffer; 2. some general cache/scratchpad memory. It used to be option 1, with a fixed layout designed around vertex data (16 float4 attribute slots per vertex), but recent GPUs lean towards option 2. Option 2 is more flexible and has an obvious advantage: that memory can be reused by the other shading stages, whereas a dedicated vertex buffer is useless for pixel shading or the compute pipeline.

That is the vertex shading data flow as described so far.
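Here is a small C++ sketch of that batching scheme, assuming a batch size of 32 and a triangle list; the function names and the dispatch callback are invented for illustration, and real hardware does this with dedicated logic rather than anything resembling this code.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Illustrative model of batched vertex shading: each batch shades at most
// kBatchSize unique vertices, and batches are fully independent (the cache
// tag array is cleared between batches).
constexpr size_t kBatchSize = 32;

void BuildVertexBatches(
    const std::vector<uint32_t>& indices,          // triangle-list indices
    const std::function<void(const std::vector<uint32_t>&)>& dispatchBatch)
{
    std::vector<uint32_t> tags;                    // cache tags for this batch
    tags.reserve(kBatchSize);

    // Count how many new slots a triangle would need in the current batch.
    auto missCount = [&](const uint32_t* tri) {
        size_t misses = 0;
        for (int k = 0; k < 3; ++k) {
            bool hit = false;
            for (uint32_t t : tags) if (t == tri[k]) { hit = true; break; }
            // duplicates within the triangle itself also count as hits
            for (int j = 0; j < k && !hit; ++j) if (tri[j] == tri[k]) hit = true;
            if (!hit) ++misses;
        }
        return misses;
    };

    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        const uint32_t* tri = &indices[i];
        if (tags.size() + missCount(tri) > kBatchSize) {
            dispatchBatch(tags);                   // shade this batch somewhere
            tags.clear();                          // next batch: empty cache
        }
        for (int k = 0; k < 3; ++k) {              // allocate slots for misses
            bool hit = false;
            for (uint32_t t : tags) if (t == tri[k]) { hit = true; break; }
            if (!hit) tags.push_back(tri[k]);
        }
    }
    if (!tags.empty()) dispatchBatch(tags);        // flush the final batch
}
```

Note how the cache tag array is thrown away and rebuilt for every batch; that is exactly what makes batches independent and easy to spread across many shading units.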
Inside the shader units
The short version: this is pretty much what you would expect given the kind of code HLSL compilers produce -- a processor that is very good at running that particular type of code, and whose job in hardware is to execute the compiled shader bytecode. Unlike the things I have covered so far, this part is well documented; if you are interested, have a look at the AMD/NVidia conference presentations or read the documentation in the CUDA/Stream SDKs.

Execution overview: fast ALUs built mostly around an FMAC (Floating Multiply-ACcumulate) unit; some hardware also supports reciprocal, reciprocal square root, log2, exp2, sin and cos; optimized for high throughput and high density, not for low latency; a very large number of threads in flight to hide that latency; only a few registers per thread (because there are so many threads); very good at straight-line code, bad at branchy code.

Almost all of the above is common to every implementation. There are some differences, of course: AMD hardware has traditionally stuck with the 4-wide SIMD implied by HLSL/GLSL and shader bytecode (although that seems to have changed recently), while NVidia switched to turning the 4-wide SIMD into scalar instructions a while ago; there is middle ground between the two as well (a tiny illustration follows at the end of this section).

More interesting are the differences between the shading stages. The short answer is: there are surprisingly few. All the arithmetic and logic instructions are identical across all stages; some constructs (such as derivative instructions and the interpolated attributes in pixel shaders) exist only in certain stages; but mostly the difference is just the kind of data passed in and out.

There is one special topic related to shaders that is big enough to deserve its own chapter, though: texture sampling (and texture units). Which happens to be the next chapter.
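As promised above, here is a tiny C++ sketch of the FMAC building block and the vec4-versus-scalar distinction, using std::fma; this is purely a conceptual illustration under my own naming (float4, mad4, mad_scalar), not actual shader ISA.

```cpp
#include <cmath>

// Conceptual view of the shader core's basic FMAC operation, d = a * b + c,
// applied either as one 4-wide SIMD instruction or as four scalar ones.
struct float4 { float x, y, z, w; };

// "4-wide SIMD" view: one mad over a float4 register.
float4 mad4(const float4& a, const float4& b, const float4& c) {
    return { std::fma(a.x, b.x, c.x), std::fma(a.y, b.y, c.y),
             std::fma(a.z, b.z, c.z), std::fma(a.w, b.w, c.w) };
}

// "Scalarized" view: the same work expressed as four independent scalar
// FMAs, which a compiler/scheduler can reorder freely.
void mad_scalar(const float a[4], const float b[4], const float c[4], float d[4]) {
    for (int i = 0; i < 4; ++i)
        d[i] = std::fma(a[i], b[i], c[i]);
}
```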
Conclusion
Once again, the "Vertex caching and shading" section is partly my own guesswork, so don't take all of it on faith. I also haven't gone into the details of how the cache/scratch memory is managed; its size depends (mainly) on the batch size and the number of vertex output attributes you expect. Cache size and management matter a lot for performance, but I don't want to go into that here: it is specific to each piece of hardware and not particularly deep. See you next time.