Graphics Pipeline Tour, Part 2

Source: Internet
Author: User
Tags: gtx

Original: "A trip through the Graphics Pipeline 2011" Translation: Sword of the past Reprint Please indicate the source It's not that fast .In the previous article, we described the various stages that render commands go through before being processed by the GPU. In short, it's more complicated than you think. Next, I'll talk about the command processor that I've mentioned, and what I've done with the commands buffer in the end. What? Where did you talk about this--to deceive you--. This is the first time that this article mentions the command processor, but remember that all command buffer accesses both PCIe and local memory through memory or system RAM. We'll go through the pipeline sequentially, so we'll talk about memory before we get to the command processor. Memory System GPU has no regular memory system, which is different from your common CPU or other hardware, because it is designed for a variety of purposes. On common machines   You can see there are two essential differences: First, the GPU memory system bandwidth is fast and quite fast. The Core i7 2600K can barely reach the 19gb/s bandwidth. GeForce gtx480  's bandwidth is close to 180gb/s--, which is an order of magnitude! 2nd, the GPU memory system is slow and quite slow. The cache miss for Nehalem (first-generation Core i7) main memory is approximately 140 clock cycles, which is the data (AnandTech given) that is based on the clock frequency divided by memory latency. The memory latency of the GeForce GTX 480 I just mentioned is approximately 400~800 clock cycles, which is 4 times times more memory latency than the Core i7. In addition, the Core i7 clock frequency is 2.93GHz, and the GTX 480 shader clock Frequency table is 1.4ghz--that is, there are still twice times the gap. Wow, it's an order of magnitude! Damn, it's funny, I'm a little excited. It must be a trade-off and continue to listen.   Yes,--gpu is a huge increase in bandwidth, but they have to pay a lot to increase memory latency (it turns out to be quite a drain, but beyond the scope of this article). The throughput of--GPU on this mode is limited by latency. Don't blindly wait for the result, do something about it!   Above is what you need to know about GPU memory, except for DRAM anecdotes, which are also important: the DRAM chips are organized into 2D meshes-both logically and physically. There are (horizontal) lines and (vertical) row lines, each of these lines having a transistor and a capacitor at each intersection. If you want to know how to make memory with these materials, you can access (Https://tokyo.zxproxy.com/browse.php?u=V%2FmbvKGmGdz9QE4KTynGv27LALUJtfhzT4wC%2FQA%3D &b=14#operation_principle). In short, the point is that the address of the DRAM is separated by the row address and the column address, and the DRAM's one-time internal read/write always accesses all the columns of the trip. It is much less expensive to access all columns than on a row of memory to access the same number of multiline memory. This is only a small knowledge of DRAM, but it is very important for follow-up. Note: Look here again later. Link here, including the previous content, read only a few memory bytes, and can not reach the maximum memory bandwidth, if you want to saturate the memory bandwidth, you should read a full line of DRAM at a time.   PCIe Host InterfaceAccording to the graphic Programmer's point of view, this part of the hardware is of little meaning. In fact, this is also the hardware architecture of the GPU. 
PCIe Host Interface

From a graphics programmer's point of view, this piece of hardware isn't very interesting; in fact, the same probably goes for the GPU hardware architects. You only start caring about it once it's slow enough to become a bottleneck, at which point you get good people working on it and make sure that doesn't happen. Beyond that, it lets the CPU read/write GPU registers and a chunk of video memory, and lets the GPU access (parts of) main memory. The annoying thing is that these transfers have even worse latency than memory accesses, because the signals have to leave the chip, go through the socket, across the motherboard, and over to the CPU; a long trip. The bandwidth is decent, though: up to 8 GB/s peak on the 16-lane PCIe 2.0 interfaces that most GPUs use, which is somewhere between a third and half of the CPU's aggregate memory bandwidth. A workable ratio. And unlike earlier standards such as AGP, this is a symmetric point-to-point link: the bandwidth goes both ways. AGP had a fast channel from the CPU to the GPU, but not the other way around.

One Last Part about Memory

Honestly, we are now very, very close to actually seeing 3D commands! So close you can smell it. But there's one more thing to get out of the way first. We now have two kinds of memory: (local) video memory and mapped system memory. One is a day's journey to the north; the other is a week's trek south along the PCIe highway. Which road do we take? The simplest solution: just add an extra address line that tells you which way to go. This is simple, very effective, and has been in use for a long time. Or you might be on a unified memory architecture, as on some game consoles (but not PCs); in that case there's no choice, there's just memory, and that's where you go. If you want something fancier, you add an MMU (memory management unit), which gives you a fully virtualized address space and allows nice tricks such as keeping the frequently accessed parts of a texture in video memory (where access is fast), other parts in system memory, and most of it not mapped at all; the unmapped parts are usually filled in by a disk read, which takes, oh, about 50 years. That's no exaggeration: on a scale where a memory access takes a day, that's how long a hardware disk read takes, and that's a fairly fast disk at that. Fucking disks! But I digress...

The MMU also lets you defragment the video memory address space without actually copying anything when memory runs short. Nice thing, and it makes it easier for multiple processes to share the same GPU. Using an MMU is certainly allowed, but I'm not sure whether it's required, although it's quite handy (can anyone help me out here? I'll update this article if I find out, but right now I simply don't know). In short, an MMU/virtual memory isn't something you can just add on the side (not in an architecture with caches and memory-consistency concerns anyway), and it isn't specific to any particular pipeline stage; I had to mention it somewhere, so I put it here.

There's also a DMA engine that can copy memory without involving any of our precious 3D hardware/shader cores. Typically, it can at least copy between system memory and video memory (in both directions). It can often also copy from video memory to video memory (useful if you need to defragment video memory). It usually can't copy from system memory to system memory, because this is a GPU, not a memory-copy unit; do your system memory copies on the CPU, where they don't need a round trip over PCIe.

I've drawn a diagram to show more detail: by now, your GPU has multiple memory controllers, each of which controls multiple memory banks, all fronted by a fat hub. Whatever it takes to get that bandwidth. :)
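As a toy illustration of that "extra address line" trick, here's a minimal C++ sketch where one (invented) address bit selects the pool; the bit position and all names are assumptions, not any real GPU's layout:

    #include <cstdint>
    #include <cstdio>

    // Invented layout: one address bit says which pool an address lives in.
    // Real GPUs differ; this just illustrates the idea.
    enum class MemPool { LocalVram, MappedSystem };

    constexpr uint64_t kPoolBit = 1ull << 40;  // assumed selector bit

    MemPool poolOf(uint64_t gpuAddr) {
        // 0 = local video memory ("a day's journey north"),
        // 1 = mapped system memory ("a week down the PCIe highway").
        return (gpuAddr & kPoolBit) ? MemPool::MappedSystem
                                    : MemPool::LocalVram;
    }

    int main() {
        uint64_t a = 0x123456789ull;         // local VRAM address
        uint64_t b = kPoolBit | 0x42000ull;  // mapped system memory address
        printf("a -> %s\n", poolOf(a) == MemPool::LocalVram ? "VRAM" : "system");
        printf("b -> %s\n", poolOf(b) == MemPool::LocalVram ? "VRAM" : "system");
    }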
OK, checklist time. On the CPU side, we have a prepared command buffer. We have the PCIe host interface, so the CPU can actually tell us about it and write to registers. We have the logic to turn an address in such a register into a load that actually returns data: from system memory over PCIe, or, if we'd rather keep the command buffer in video memory, via a DMA transfer that the KMD sets up, so neither the CPU nor the GPU's shader cores need to care about it. Then we can get at the data through our copy in video memory via the memory subsystem. All the paths are set up, so at last, let's look at the command buffer.

At Long Last, the Command Processor!

Everything we've done before reaching the command processor can be summarized in one word: "buffering". As mentioned above, our memory paths are high-bandwidth but also high-latency. For most of the later bits of the GPU pipeline, the workaround is to run lots of independent threads. But here we have only a single command processor, and it has to chew through the command buffer strictly in order (because that buffer contains state changes and rendering commands that must be executed in the correct sequence). So we do the next best thing: add a buffer that is big enough, and prefetch far enough ahead, to avoid hiccups.

From that buffer, commands reach the actual command-processing front end, which is basically a state machine that knows how to parse commands (in a hardware-specific format). Some commands deal with 2D rendering operations; unless there's a separate command processor for 2D, in which case the 3D front end never even sees them. Either way, current GPUs still carry dedicated 2D hardware hidden somewhere, just as there's a VGA chip somewhere on the die that still supports text mode, 4-bit/pixel bit-plane modes, smooth scrolling and the like. Good luck finding any of it without a microscope. Anyway, that stuff exists, but I won't mention it again. :)

Then there are the commands that actually hand primitives to the 3D/shader pipeline; I'll talk about them in the next parts. There are also commands that go to the 3D/shader pipeline but never render anything, for various reasons (and in various pipeline configurations); those come even later.
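Here's a minimal C++ sketch of what such a front end could look like; the opcode set and encoding are invented for illustration, since every real GPU has its own hardware-specific command format:

    #include <cstdint>
    #include <queue>

    // Invented command encoding, for illustration only.
    enum class Op : uint8_t { Draw2D, Draw3D, SetState, Sync, Jump };

    struct Cmd {
        Op       op;
        uint32_t payload;
    };

    // The prefetch buffer sitting in front of the decoder; big enough that
    // the high-latency memory path doesn't starve it.
    std::queue<Cmd> fifo;

    void dispatch2D(uint32_t)   {}  // hand off to the 2D block
    void dispatch3D(uint32_t)   {}  // hand primitives to the 3D front end
    void applyState(uint32_t)   {}  // state changes; see the list below
    void doSync(uint32_t)       {}  // fences/waits; covered further down
    void setFetchAddr(uint32_t) {}  // command buffer jump/call

    // One decode step: strictly in-order consumption of the command stream.
    void commandProcessorStep() {
        if (fifo.empty()) return;   // buffer ran dry; wait for the prefetcher
        Cmd c = fifo.front();
        fifo.pop();
        switch (c.op) {
            case Op::Draw2D:   dispatch2D(c.payload);   break;
            case Op::Draw3D:   dispatch3D(c.payload);   break;
            case Op::SetState: applyState(c.payload);   break;
            case Op::Sync:     doSync(c.payload);       break;
            case Op::Jump:     setFetchAddr(c.payload); break;
        }
    }

    int main() {
        fifo.push({Op::SetState, 7});   // e.g. "set state value 7"
        fifo.push({Op::Draw3D, 42});    // e.g. "draw batch 42"
        while (!fifo.empty()) commandProcessorStep();
    }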
Then there are the commands that change state. As a programmer, you think of them as just changing a variable. But a GPU is a massively parallel computer, and in a parallel system you can't just change a global variable and hope it all works out; if you can't guarantee that nothing in flight still depends on the old value, you will eventually end up with a bug. There are several popular approaches, and basically every chip uses a different mix of them for different types of state:
    • Whenever you change a state, require that all pending work that might refer to it finishes first (that is, flush part of the pipeline). Historically, this is how graphics chips handled state changes: it's simple, and it costs little when batches are few, triangles are few, and the pipeline is short. The overhead grows with batch and triangle counts, so this approach survives only for state that changes infrequently (where flushing part of the pipeline once in a while doesn't hurt much) or that is simply too expensive/difficult to handle any other way.
    • You can make a hardware unit completely stateless. Just pass the state-change command through to the stage that cares about it, then have that stage append the current state to everything it sends downstream. The state isn't stored anywhere; yet it's always around, because if a later pipeline stage wants to look at the state bits, it can, since they were passed in (and will then be handed on to the next stage). If your state is only a few bits, this is fairly cheap and practical. If it's the full set of active textures plus texture sampling state, not so much.
    • Sometimes storing only one copy of the state means every change has to flush everything that uses it. But if you store two copies (or four), things get much better, because the front end can run ahead with state setting. Suppose you have enough registers (slots) to store two versions of each state, and some active job references slot 0: then you can safely modify slot 1 without stopping or otherwise interfering with that job. Now you don't send the entire state through the pipeline; a single instruction bit selects slot 0 or 1. Of course, if slots 0 and 1 are both in use when a state change arrives, you do have to wait after all, but you get to work one step ahead. The same technique works with more than two slots (see the sketch after this list).
    • For state like sampler or shader resource view (SRV) settings, you could in theory set a huge number of them at once, but in practice you won't. You don't want to reserve state space for 2*128 active textures just because you're tracking two in-flight state sets. For such cases you can use a register-renaming scheme: a pool of, say, 128 physical texture descriptors. If someone actually needs 128 textures in one shader, state changes are going to be very slow. But in the common case, an application uses fewer than 20 textures, and you have quite a bit of headroom to keep multiple versions around.
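And here's the multi-slot state idea from the list above as a small C++ sketch; the two-slot layout and the busy-tracking are simplified assumptions:

    #include <cassert>
    #include <cstdint>

    // Two copies ("slots") of one piece of pipeline state.
    struct StateSlots {
        uint32_t value[2];  // the actual state registers, slot 0 and slot 1
        bool     busy[2];   // is in-flight work still referencing the slot?
    };

    // Change a state. Returns the slot that subsequent draw commands should
    // reference (a single bit in the command encoding selects the slot).
    int changeState(StateSlots& s, uint32_t newValue) {
        for (int slot = 0; slot < 2; ++slot) {
            if (!s.busy[slot]) {
                // Free slot: write the new state without stalling work that
                // is still running off the other slot.
                s.value[slot] = newValue;
                s.busy[slot]  = true;
                return slot;
            }
        }
        // Both slots referenced by in-flight work: only now do we stall.
        assert(!"stall: wait until one slot drains");
        return -1;
    }

    // The pipeline reports that all work referencing `slot` has finished.
    void retireSlot(StateSlots& s, int slot) { s.busy[slot] = false; }

    int main() {
        StateSlots blend = {};            // both slots start out free
        int a = changeState(blend, 0xA);  // lands in slot 0
        changeState(blend, 0xB);          // lands in slot 1, no stall
        retireSlot(blend, a);             // slot 0's work has drained
        changeState(blend, 0xC);          // reuses slot 0, still no stall
    }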
This list isn't comprehensive, but the point is this: something that looks as simple as changing a variable in your application (even down through the UMD/KMD and the command buffer) may actually need a serious amount of supporting hardware behind it just to keep performance from collapsing.

Synchronization

The final family of commands handles CPU/GPU and GPU/GPU synchronization. Typically, these take the form "if event X occurs, do Y". I'll start with the "do Y" part. There are two sensible options: push-model notifications, where the GPU yells at the CPU ("hey CPU, I'm entering the vertical blanking interval on display 0 right now, so if you want to flip buffers without tearing, now's the time!"), and pull-model ones, where the GPU just records what happened and the CPU can ask about it later ("say, GPU, which command buffer fragment did you process most recently?" "Let me check... sequence id 303."). The former is implemented via interrupts and is used only for infrequent, high-priority events, because interrupts are expensive. For the latter, you need some CPU-visible GPU registers and a way to write values into them from the command buffer once a certain event occurs.

For example, suppose you have 16 such registers, and register 0 is assigned the meaning currentCommandBufferSeqId. We assign a sequence number to every command buffer submitted to the GPU (this step happens in the KMD), and then at the start of each command buffer we add a command saying "if you get here, write my sequence id to register 0". Look, now the GPU knows which command buffer we were chewing on. And we know the command processor executes commands strictly in order, so if the first command of command buffer 303 has executed, then all command buffers up to and including sequence id 302 are complete and can now be reclaimed, freed, modified, or otherwise handled by the KMD.

As for what "event X" can be: "if you get here" is one example, and maybe that's all you need. Other examples: "if all shaders have finished all texture reads coming from batches before this command" (this marks a safe point to reclaim texture/render target memory), "if rendering to all active render targets/UAVs has completed" (this marks a point where they can actually be used safely as textures), "if all operations up to this point are fully complete", and so on.

Such operations are usually called "fences". There are different methods of choosing the values written into the status registers, but as far as I'm concerned, the only sane way is to use a sequential counter (probably stealing some of its bits for other information). Yes, I'm dropping that remark here without further context, because I think you should know about it; I may explain it in more detail some other time.

So we have one half of it: we can now report status back from the GPU to the CPU, which allows us to do sane memory management in our drivers (notably, we can now find out when it's actually safe to reuse vertex buffers, command buffers, textures, and other resources).
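Here's a sketch of that sequence-id fence as the KMD side might model it; the register, the names, and the bookkeeping are invented for illustration:

    #include <cstdint>
    #include <deque>

    // The CPU-visible GPU register the command processor writes. Here it's
    // just a variable; on hardware it gets written by a "if you get here,
    // write N into register 0" command at the start of each buffer.
    volatile uint32_t gpuReg0_currentCmdBufSeqId = 0;

    struct CmdBuf {
        uint32_t seqId = 0;
        // ... the actual commands would live here ...
    };

    std::deque<CmdBuf> inFlight;  // KMD's list of submitted command buffers
    uint32_t nextSeqId = 1;

    // KMD submit path: stamp every buffer with a sequence number and prepend
    // the "write my seqId to register 0" command (not modeled here).
    void submit(CmdBuf buf) {
        buf.seqId = nextSeqId++;
        inFlight.push_back(buf);
        // ... hand the buffer to the GPU ...
    }

    // KMD cleanup path: the command processor runs strictly in order, so
    // everything with a seqId below the value in register 0 has finished
    // and can be reused, freed, or patched.
    void reclaimFinished() {
        uint32_t current = gpuReg0_currentCmdBufSeqId;
        while (!inFlight.empty() && inFlight.front().seqId < current)
            inFlight.pop_front();
    }

    int main() {
        submit({}); submit({}); submit({});  // sequence ids 1, 2, 3
        gpuReg0_currentCmdBufSeqId = 3;      // GPU reached start of buffer 3
        reclaimFinished();                   // buffers 1 and 2 reclaimed
    }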
But that's not all of it; there's a piece missing. What if we need to synchronize purely on the GPU side? Go back to the render target example: we can't use a render target as a texture until the rendering to it has actually finished (and some additional steps have taken place; there are details involving the texture units here). The solution is a "wait" instruction: "wait until register M contains value N". The comparison could be equals, less-than, or something fancier; for simplicity, let's stick with equals. "Wait" lets us synchronize a render target before submitting a batch. It also lets us build a full GPU flush operation: "if all pending work is complete, set register 0 to ++seqId" / "wait until register 0 contains seqId". Done; GPU/GPU synchronization: solved. Until the finer-grained synchronization that arrived with DX11 compute shaders, this was usually the only synchronization mechanism on the GPU side. For regular rendering, you simply don't need more.

By the way, if you can write these registers from the CPU side, you can use the trick the other way around too: submit a partial command buffer that includes a wait for a particular value, and then change the register from the CPU instead of the GPU. This can be used to implement D3D11-style multithreaded rendering, where you submit a batch that references vertex/index buffers that are still locked on the CPU side (probably being written to by another thread). You simply put the wait instruction just before the actual render call, and then once the vertex/index buffers are unlocked, the CPU changes the register's contents. If the GPU never got that far in the command buffer, the wait is a no-op; if it did get there, it spends some (command processor) time spinning until the data is actually there. Nifty, no? In fact, if you can modify the command buffer after submitting it, you can implement this even without CPU-writable status registers, as long as there's a command buffer "jump" instruction. The details are left to the interested reader. :)

Of course, you don't strictly need the set-register/wait-register model; for GPU/GPU synchronization, you could get by with a "render target barrier" instruction that ensures a render target is safe to use, plus a "flush everything" instruction. But I prefer the set-register style, because it kills two birds with one stone (reporting in-use resources back to the CPU, and GPU self-synchronization).
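To make the set-register/wait-register idiom concrete, here's a sketch of how the two flush primitives might be emitted into a command stream; the encoding and the emitter names are assumptions:

    #include <cstdint>
    #include <vector>

    // Invented encoding for the two sync commands described above; real
    // command formats are hardware-specific.
    enum class SyncOp : uint8_t { WriteRegOnIdle, WaitRegEquals };

    struct SyncCmd {
        SyncOp   op;
        int      reg;
        uint32_t value;
    };

    std::vector<SyncCmd> cmdStream;  // the command buffer being built

    // "Once all pending work has finished, write `value` into `reg`."
    void emitWriteRegOnIdle(int reg, uint32_t value) {
        cmdStream.push_back({SyncOp::WriteRegOnIdle, reg, value});
    }

    // "Stall the command processor until `reg` contains `value`."
    void emitWaitRegEquals(int reg, uint32_t value) {
        cmdStream.push_back({SyncOp::WaitRegEquals, reg, value});
    }

    uint32_t seqId = 0;

    // A full GPU flush built from the two primitives: once the wait clears,
    // everything submitted before this point has finished, so it is now
    // safe to, e.g., sample the previous render target as a texture.
    void emitFullGpuFlush() {
        uint32_t id = ++seqId;
        emitWriteRegOnIdle(/*reg=*/0, id);
        emitWaitRegEquals(/*reg=*/0, id);
    }

    int main() {
        emitFullGpuFlush();  // cmdStream now holds the write and the wait
    }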

Here, I've drawn a diagram. It came out a bit complicated, so let me walk you through the details. The basic idea: the command processor starts with a first-in, first-out queue (FIFO), followed by the command decode logic; execution is handled by various blocks that talk directly to the 2D unit, the 3D front end (regular 3D rendering), or the shader units (compute shaders); there's a block that handles synchronization/wait commands (including the publicly visible registers I talked about), and a unit that handles command buffer jump/call instructions (which changes the current fetch address that feeds the FIFO). All the units we hand work to need to send completion events back to us, so we know when, say, textures are no longer being used and their memory can be reclaimed.

Concluding Remarks

Next time, we finally get to some actual rendering work. Three parts into this series, and we're at last about to look at some vertex data! (No, no triangles have been rasterized yet. That will take a while longer.) In fact, at this stage the pipeline has already branched: if we're running a compute shader, the next step would be the compute shader stage. But we won't go there for now, because compute shaders come in the later parts! Regular rendering first.
