Graphics Pipeline Tour, Part 2: GPU Memory Architecture and the Command Processor

In part 1, I described the journey a 3D rendering command takes on the PC before it is actually handed to the GPU, and ended on a cliffhanger right in front of the command processor. In this part we will indeed meet the command processor, but keep in mind that everything in the command buffer reaches it through memory, whether that is system memory or (local) video memory. Since we are following the pipeline in order, we need to spend some time on memory first.

The memory subsystem
GPUs do not have a conventional memory subsystem. It looks different from what you see in general-purpose CPUs and other hardware, because it is designed for very different usage patterns. There are two fundamental ways in which a GPU's memory subsystem differs from a regular machine's.

The first is that it is fast. Seriously fast. A Core i7 2600K can reach maybe 19 GB/s of memory bandwidth at the limit. A GeForce GTX 480, on the other hand, has a total memory bandwidth close to 180 GB/s, nearly an order of magnitude more.

The second is that it is slow. Seriously slow. A cache miss all the way out to main memory on Nehalem (the first-generation Core i7) costs around 140 cycles. The GTX 480 just mentioned has a memory access latency of roughly 400-800 cycles. So, measured in cycles, the GTX 480 has about 4x the memory latency of a Core i7. And that Core i7 is clocked at 2.93 GHz while the GTX 480 shader clock runs at 1.4 GHz, which is another 2x gap right there. Again, nearly an order of magnitude. As you can see, GPUs gain a massive increase in bandwidth, but they pay for it with a massive increase in latency. This is part of a general pattern: GPUs are all about throughput over latency. Do not sit around waiting for a result that is not there yet; do something else instead!

That is almost everything you need to know about GPU memory, except for one item about DRAM that will be important later. DRAM chips are organized as a 2D grid, both logically and physically: there are (horizontal) row lines and (vertical) column lines, and at each intersection sits a transistor and a capacitor. If you want to know more, see Wikipedia (http://en.wikipedia.org/wiki/DRAM#Operation_principle). The key point is that a DRAM address is split into a row address and a column address, and reads/writes inside DRAM always access all columns of a given row at the same time. This means that accessing a stretch of memory that maps onto a single DRAM row is much cheaper than accessing the same amount of memory spread across multiple rows. Right now this may seem like a random bit of DRAM trivia, but it will matter later. In connection with the earlier point: you will never get anywhere near peak memory bandwidth by reading a byte here and there; if you want to saturate memory bandwidth, read full DRAM rows at a time.

The PCIe host interface
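To make the row-locality point more concrete, here is a minimal CPU-side sketch, assuming the same DRAM principle carries over. On a real CPU, caches and TLBs dominate what you measure, so treat any numbers as illustrative only; the 4096-byte stride is just an assumption chosen to hop across DRAM rows/pages on typical hardware.

```cpp
// A minimal sketch of "full rows vs. scattered touches". DRAM row sizes and
// controller behavior are hardware-specific; stride 4096 is an assumption.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

static void measure(const std::vector<uint8_t>& buf, size_t stride) {
    using clock = std::chrono::steady_clock;
    uint64_t sum = 0;
    auto t0 = clock::now();
    // Touch one byte every `stride` bytes; stride 1 streams along rows,
    // a large stride pays a fresh row/page activation for most touches.
    for (size_t i = 0; i < buf.size(); i += stride) sum += buf[i];
    auto t1 = clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("stride %5zu: %zu touches in %.3f ms (sum=%llu)\n",
                stride, buf.size() / stride, ms, (unsigned long long)sum);
}

int main() {
    std::vector<uint8_t> buf(256u << 20, 1); // 256 MiB, bigger than any cache
    measure(buf, 1);    // sequential: amortizes row activations
    measure(buf, 4096); // strided: a new row/page for almost every touch
}
```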
From a graphics programmer's point of view, this bit of hardware is rather boring. In fact, the same probably goes for the GPU hardware architects: it only becomes interesting once it is slow enough to be a bottleneck, at which point you start to worry. So what you do is put good people on it and make sure nothing goes wrong. Functionally, it gives the CPU read/write access to video memory and a bunch of GPU registers, and it gives the GPU read/write access to (part of) main memory. The headache is that the latency of all these accesses is even worse than memory latency, because the signal has to leave the chip, go into the slot, travel across the motherboard, and arrive at some spot on the CPU. The bandwidth is decent though: up to about 8 GB/s (theoretical) aggregate peak bandwidth over the 16-lane PCIe 2.0 connection that most GPUs use right now.
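As a sanity check on that 8 GB/s figure, here is the back-of-the-envelope arithmetic in code form. The inputs are published PCIe 2.0 spec numbers (5 GT/s signaling per lane, 8b/10b line coding), not measurements:

```cpp
// PCIe 2.0 x16 peak bandwidth, back of the envelope.
#include <cstdio>

int main() {
    const double gigatransfers_per_sec = 5.0;      // PCIe 2.0 raw rate per lane
    const double encoding_efficiency = 8.0 / 10.0; // 8b/10b line code overhead
    const int lanes = 16;
    // 5 GT/s * 0.8 = 4 Gbit/s = 0.5 GB/s usable per lane, per direction.
    double gb_per_lane = gigatransfers_per_sec * encoding_efficiency / 8.0;
    std::printf("per lane: %.2f GB/s, x%d lanes: %.1f GB/s per direction\n",
                gb_per_lane, lanes, gb_per_lane * lanes);
}
```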
That works out to somewhere between a third and a half of the aggregate CPU memory bandwidth, which is a usable ratio. And unlike earlier standards such as AGP, PCIe is a symmetric point-to-point link: that bandwidth is available in both directions. AGP had a fast channel from the CPU to the GPU, but not the other way around.

Some final memory bits and pieces
We are now very close to actually seeing 3D commands! But first we need to get one thing out of the way. We have two kinds of memory in front of us: (local) video memory and mapped system memory. One is a day's journey to the north; the other is a week-long trek to the south along PCI Express Highway. Which road do we pick?

The simplest solution: add an extra address line that tells you which way to go. This is simple, works fine, and has been done plenty of times. Or maybe you are on a unified memory architecture, as on some game consoles (but not PCs); in that case there is no choice, memory is just where you go, period. If you want something fancier, you add an MMU (memory management unit), which gives you a fully virtualized address space and lets you pull nice tricks, such as keeping the frequently accessed parts of a texture in video memory (where it is fast), other parts in system memory, and most of it not mapped at all, conjured out of thin air. An MMU also lets you defragment the video-memory address space without actually copying anything around when you start running out of video memory, and it makes it much easier for multiple processes to share a single GPU. Using an MMU is certainly an option, though I am not sure whether it is a requirement. Either way, an MMU/virtual memory is not something you can simply bolt on later (not in an architecture with caches and memory-consistency concerns, anyway), but it is not really specific to any particular pipeline stage either; I have to mention it somewhere, so I put it here.

There is also a DMA engine that can copy memory around without involving any of the 3D hardware/shader cores. Usually it can at least copy between system memory and video memory (in both directions). It often can also copy from video memory to video memory (which is useful if you need to defragment video memory), but it usually cannot copy from system memory to system memory, because this is a GPU, not a memory-copying unit; do your system-to-system copies on the CPU, where they do not need to make a round trip through PCIe.

Some more detail here: modern GPUs have multiple memory controllers, each of which controls multiple memory banks, with a fat hub in front. Whatever it takes to get the bandwidth.

OK, checklist. We have a command buffer prepared on the CPU, and we have the PCIe host interface, so the CPU can actually tell the GPU about the buffer and write its address into a register. We have logic that turns such an address into loads that return data: if the address comes from system memory, the data goes over PCIe; if we decide we want the command buffer in video memory instead, the KMD can set up a DMA transfer, so neither the CPU nor the shader cores on the GPU need to spend cycles on it. Then we can fetch the data from our copy in video memory through the memory subsystem. Everything is in place; we can finally look at some commands!

Finally, the command processor!
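To illustrate the "extra address line" option, here is a minimal sketch assuming a single hypothetical selector bit. The bit position and widths below are invented for illustration, since every real memory controller defines its own layout:

```cpp
// One bit of the GPU-visible address decides whether an access is routed to
// local video memory or over PCIe to mapped system memory. The bit position
// (bit 40) and widths are made up; real hardware differs.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kApertureBit = 1ull << 40; // hypothetical selector bit

enum class Route { VideoMemory, SystemMemoryOverPcie };

struct Decoded {
    Route route;
    uint64_t offset; // address within the selected memory space
};

Decoded decode(uint64_t gpuAddress) {
    if (gpuAddress & kApertureBit)
        return { Route::SystemMemoryOverPcie, gpuAddress & (kApertureBit - 1) };
    return { Route::VideoMemory, gpuAddress };
}

int main() {
    Decoded d = decode(kApertureBit | 0x1000); // a system-memory access
    std::printf("route=%d offset=0x%llx\n", (int)d.route,
                (unsigned long long)d.offset);
}
```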
After all this buildup, we finally get to the command processor. And the discussion starts, appropriately, with a single word: buffering.

As mentioned above, both of our memory paths have high bandwidth but also high latency. For most of the later bits of the GPU pipeline, the workaround is to run lots of independent threads. But here we have a single command processor that needs to chew through the command buffer in order (since that buffer contains things like state changes and rendering commands that must be executed in the right sequence). So the next thing we do is add a buffer that is big enough, and prefetch far enough ahead, to avoid hiccups.

From that buffer, commands reach the actual command-processing frontend, which is basically a state machine that knows how to parse commands (in a hardware-specific format). Some commands deal with 2D rendering operations, unless there is a separate command processor for 2D so the 3D frontend never sees any of it. Either way, dedicated 2D hardware is still hiding somewhere on a modern GPU, just as there is a VGA chip somewhere on the die that still supports text mode, 4-bit/pixel bit-plane modes, smooth scrolling and the like. Good luck finding any of that without a microscope. These things do exist, but I will not mention them again.

Then there are commands that actually hand primitives over to the 3D/shader pipeline. There are also commands that go to the 3D/shader pipeline without rendering anything; those will be covered in later parts.

And then there are commands that change state. As a programmer, you think of these as simply changing a variable, and that is roughly what happens. But a GPU is a massively parallel computer, and you cannot just change a global variable in a parallel system and hope that everything works out; if you cannot guarantee that everything will keep working after you force the change, you will eventually have a bug. In practice there are several popular approaches, and basically every chip uses a different mix depending on the type of state:
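To give a rough feel for what "a state machine that knows how to parse commands" means, here is a toy command-buffer walker. The packet format and opcodes are made up for illustration, since every GPU defines its own hardware-specific format:

```cpp
// Toy command-processor frontend: walk a command buffer strictly in order
// and dispatch each packet. Format: [opcode][payload-size in words][payload].
#include <cstdint>
#include <cstdio>
#include <vector>

enum Opcode : uint32_t { SET_STATE = 0, DRAW = 1, FENCE = 2, END = 3 };

void runCommandBuffer(const std::vector<uint32_t>& buf) {
    size_t pc = 0;
    while (pc + 2 <= buf.size()) {
        uint32_t op = buf[pc], n = buf[pc + 1];
        const uint32_t* payload = buf.data() + pc + 2;
        switch (op) {
        case SET_STATE: // payload: state register index, value
            std::printf("state[%u] = %u\n", payload[0], payload[1]);
            break;
        case DRAW:      // payload: primitive count
            std::printf("draw %u primitives\n", payload[0]);
            break;
        case FENCE:     // payload: sequence id to report back
            std::printf("signal fence %u\n", payload[0]);
            break;
        case END:
            return;
        }
        pc += 2 + n; // strictly in order, one packet at a time
    }
}

int main() {
    runCommandBuffer({SET_STATE, 2, 7, 1,  DRAW, 1, 100,  FENCE, 1, 303,  END, 0});
}
```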
  • Whenever you change a state, you can require the command to finish/flush all pending work that might refer to that state (essentially a partial pipeline flush). Historically, this is how graphics chips handled most state changes; it is simple and cheap when batch counts and triangle counts are low and pipelines are short. As batch and triangle counts grow and pipelines get longer, the cost of this approach climbs rapidly. It is still in use, but only for state that either changes rarely (a handful of flushes per frame is no big deal) or is simply too expensive/difficult to handle any other way.
  • You can make hardware units completely stateless. Just pass the state-change command down to the stage that cares about it, and then have that stage append the current state to everything it sends downstream, every cycle. The state is never stored anywhere; it flows through each pipeline stage and rushes on to the next, so any stage that wants to look at a few bits of state can, because they are always being passed along. If your state happens to be just a handful of bits, this is cheap and practical. If it happens to be the full set of active textures along with their sampling state, not so much.
  • Sometimes storing just one copy of the state, and having to flush every time it changes, serializes things too much. But if you keep two copies (or four), your state-setting frontend can get ahead. Say you have enough registers (slots) to store two versions of each state, and some currently active batch references slot 0. You can then safely modify slot 1 without stopping that batch or otherwise disturbing it at all. Now you do not need to push the whole state through the pipeline; all a command needs is a single bit selecting slot 0 or 1. Of course, if both slots are busy when a state-change command comes in, you do have to wait after all, but you get to run one step ahead. The same technique scales to more than two slots (see the sketch after this list).
  • For some kinds of state, such as sampler or shader resource view state, you could in theory set a very large number of them at once, but in practice you do not. You do not want to reserve state space for 2x128 active textures just because you might be tracking two in-flight state sets. For such cases you can use a register-renaming scheme: a pool of, say, 128 physical texture descriptors. If a shader actually needed 128 textures at the same time, state changes would get very slow. But in the far more common case of an app using fewer than 20 textures, you have plenty of headroom to keep several versions of the state in flight.
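Below is the sketch promised above: a toy model of the two-slot ("state shadowing") scheme. On a real chip, slot management lives in the command processor and the pipeline stages; all names here are invented for illustration.

```cpp
// Double-buffered state: write the inactive slot while in-flight batches
// still reference the other one; each draw carries only a 1-bit slot index.
#include <array>
#include <cstdio>

struct RasterState { int cullMode; int fillMode; };

class StateShadow {
    std::array<RasterState, 2> slots_{};
    std::array<int, 2> pendingBatches_{}; // batches still reading each slot
    int active_ = 0;
public:
    // Returns false if we would have to stall: both copies are still in use.
    bool trySetState(const RasterState& s) {
        int other = active_ ^ 1;
        if (pendingBatches_[other] != 0) return false;
        slots_[other] = s;   // write the free copy, no pipeline disturbance
        active_ = other;     // subsequent draws reference the new slot
        return true;
    }
    int issueDraw() {        // draw just references the active slot
        ++pendingBatches_[active_];
        return active_;
    }
    void retireDraw(int slot) { --pendingBatches_[slot]; }
};

int main() {
    StateShadow ss;
    int d0 = ss.issueDraw();                             // batch on slot 0
    std::printf("set ok: %d\n", ss.trySetState({1, 0})); // writes slot 1 -> 1
    int d1 = ss.issueDraw();                             // batch on slot 1
    std::printf("set ok: %d\n", ss.trySetState({2, 0})); // slot 0 busy  -> 0
    ss.retireDraw(d0);
    std::printf("set ok: %d\n", ss.trySetState({2, 0})); // now fine     -> 1
    ss.retireDraw(d1);
}
```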
This list is not exhaustive, but the main point is that something that looks as simple as changing a variable in your application may actually need a non-trivial amount of supporting hardware behind it just to keep working.

Synchronization
Finally, the last family of commands deals with CPU/GPU and GPU/GPU synchronization.

These generally take the form "if event X happens, do Y". Let me deal with the "do Y" part first; there are two sensible options for what Y can be. It can be a push-model notification, where the GPU yells at the CPU ("Hey CPU! I'm entering the vertical blanking interval on display 0 right now, so if you want to flip buffers without tearing, this would be the time!"). Or it can be a pull-model thing, where the GPU memorizes that something happened and the CPU can later ask about it ("Say, GPU, what was the most recent command buffer fragment you processed?" "Let me check... sequence ID 303."). The former is usually implemented with an interrupt and is reserved for infrequent, high-priority events, because interrupts are expensive. All the latter needs is a few GPU registers that the CPU can read, plus a command that writes a value into one of them once a certain point in the command buffer is reached.

Say we have 16 such registers. Assign register 0 to hold currentCommandBufferSeqId. We assign a sequence ID to every command buffer we submit to the GPU, and at the start of each command buffer we add a command saying "if you reach this point in the command buffer, write <seqId> to register 0". And voila, now we know which command buffer the GPU is currently chewing on. And since we know the command processor executes and finishes commands strictly in order, if the first command of command buffer 303 has executed, then
all command buffers with sequence IDs up to and including 302 are finished, and can now be reclaimed, freed, or otherwise modified by the KMD.

We also now have an example of what X can be: "if you get here" is perhaps the simplest example, but already a useful one. Other examples: "if all shaders have finished all texture reads from batches before this point in the command buffer" (this marks it safe to reclaim texture/render-target memory), "if rendering to all active render targets/UAVs has completed", "if all operations up to this point are fully finished", and so on. Such operations are usually called fences. There are different ways to pick the values you write into the status registers, but as far as I am concerned the only sane way is to use a sequential counter. I realize this is a fairly random statement dropped here without explanation; I think you can work out why it holds.

So that is half of it: we can report status back from the GPU to the CPU, which allows sane memory management in the driver (notably, we can now find out when it is actually safe to reclaim memory used for vertex buffers, command buffers, textures and other resources). But that is not all of it; there is something missing. What if we need to synchronize purely on the GPU side? Go back to the render-target example: we cannot use a render target as a texture until rendering to it has actually finished. The solution is a "wait"-style instruction: "wait until register M contains value N". The comparison could be for equality, or less-than (mind the wraparound!), or fancier things; to keep it simple I will assume equality. This lets us synchronize render targets before submitting a batch. It also lets us build a full GPU flush: "if all pending work is done, write <value> to register 0" followed by "wait until register 0 contains <value>". Done.
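To tie the fence and wait mechanisms together, here is a toy model of both directions. The names and register file are invented; on real hardware the "registers" are CPU-visible GPU registers, and the functions below correspond to commands executed by the command processor:

```cpp
// Sequence-ID fences: the GPU-side command stream writes a sequence ID into
// a CPU-visible status register (pull model), and a "wait" command stalls
// the command processor until a register reaches a value (GPU-side sync).
#include <atomic>
#include <cstdint>
#include <cstdio>

std::atomic<uint32_t> statusRegister[16]; // CPU-visible GPU registers

// Executed by the command processor when it reaches a fence command.
void gpuSignalFence(int reg, uint32_t seqId) {
    statusRegister[reg].store(seqId, std::memory_order_release);
}

// Executed for a "wait until register >= value" command; real hardware
// stalls command parsing, this toy version just spins.
void gpuWait(int reg, uint32_t value) {
    while (statusRegister[reg].load(std::memory_order_acquire) < value) {
        /* stall */
    }
}

// CPU side (e.g. in the KMD): everything with seqId <= this is reclaimable.
uint32_t cpuQueryCompleted(int reg) {
    return statusRegister[reg].load(std::memory_order_acquire);
}

int main() {
    gpuSignalFence(0, 302);  // command buffer 302 finished
    std::printf("reclaim up to %u\n", cpuQueryCompleted(0));
    gpuSignalFence(0, 303);  // next buffer starts executing
    gpuWait(0, 303);         // returns immediately: 303 >= 303
    std::printf("reclaim up to %u\n", cpuQueryCompleted(0));
}
```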
And with that, GPU/GPU synchronization is essentially solved; until the introduction of compute shaders in DX11, which brought another form of finer-grained synchronization, a single "wait" instruction was usually the only synchronization mechanism available on the GPU side. For regular rendering, you simply do not need more.

By the way, if you can write these registers from the CPU side, you can also use this the other way around: submit a partial command buffer that includes a wait for a particular value, then change the register from the CPU instead of the GPU. This kind of trick can be used to implement D3D11-style multithreaded rendering, where you submit a batch that references vertex/index buffers that are still locked on the CPU side (perhaps still being written by another thread). You stick the wait just in front of the actual draw call, and the CPU changes the register contents once the vertex/index buffers are actually unlocked. If the GPU never got that far in the command buffer, the wait becomes a no-op; if it did, it spends some (command processor) time spinning until the data is really there. Pretty nifty, no? In fact, you can implement the same thing even without CPU-writable status registers, provided you can modify the command buffer after submission and there is a command-buffer "jump" instruction.

Of course, you do not strictly need this register/wait-register model. For GPU/GPU synchronization alone, a "render target barrier" instruction that ensures a render target is safe to use, plus a "flush everything" command, would do the job. But I prefer the register-style model, because it kills two birds with one stone: reporting resources in use back to the CPU, and GPU self-synchronization.

A block diagram of all this gets complicated, so let me just describe it. The command processor has a FIFO in front, followed by the command-decode logic. Execution proceeds by talking directly to the 2D unit, to the 3D frontend (regular 3D rendering), and to the shader units (compute shaders). There is a block that handles sync/wait commands (and owns the publicly visible registers I talked about), and a unit that handles command-buffer jumps/calls (i.e., changes the current fetch address). All the units we dispatch work to need to send completion events back to us, so we know when, say, a texture is no longer in use and its memory can be reclaimed.

Conclusion

Next part, we will finally start looking at some actual rendering work. At this point the pipeline already branches: if we ran a compute shader, the next step would be the compute shader stage. But we will not, because compute shaders are a topic for a later part. Regular rendering first! Note that I am only describing the general framework here; I have glossed over plenty of details to keep things understandable, and you can dig into them yourself if you are interested.
