1. R600 3D Engine
The R600 is an important AMD GPU core. It introduced a unified shader architecture whose registers and instruction set are completely different from those of earlier GPUs, so programming it differs substantially as well.
Figure 1 shows the hardware block diagram of the R600 GPU core. The R600 contains a data-parallel processor array (DPP array), a command processor, a memory controller, and other logic. The command processor reads the commands written by the driver and parses them; it also sends hardware-generated "soft interrupts" to the CPU. The memory controller can access all of the R600's local memory (VRAM) as well as the system memory that the user has configured for GPU use (GTT memory). To serve the GPU's read and write requests, the R600 also acts as a DMA controller.
Programs running on the CPU cannot write directly to the R600's local memory, but they can command the R600 to copy data or programs into local memory, or to copy data from local memory back to system memory. A complete R600 application consists of two parts: a program that runs on the host (CPU) and a program that runs on the R600 processor. The latter is called a shader program in graphics applications and a kernel when the GPU is used for general-purpose computing.
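The copy-by-command model described above can be illustrated with a small toy simulation. This is only a sketch: the ToyGpu class and its method names are hypothetical and do not correspond to any driver or hardware interface. It only shows that the host queues copy commands and the GPU side carries them out.

```python
# Toy model only: class and method names are hypothetical, not a driver API.

class ToyGpu:
    def __init__(self, vram_size):
        self.vram = bytearray(vram_size)   # local memory, owned by the GPU
        self.queue = []                    # commands submitted by the host

    def submit_copy_to_vram(self, src, dst_offset):
        """Host side: queue a copy command instead of touching VRAM directly."""
        self.queue.append(("to_vram", bytes(src), dst_offset))

    def submit_copy_to_host(self, src_offset, length, dst):
        """Host side: queue a copy from local memory into a host bytearray."""
        self.queue.append(("to_host", src_offset, length, dst))

    def run(self):
        """GPU side: execute the queued copies in submission order."""
        for cmd in self.queue:
            if cmd[0] == "to_vram":
                _, data, off = cmd
                self.vram[off:off + len(data)] = data
            else:
                _, off, length, dst = cmd
                dst[:length] = self.vram[off:off + length]
        self.queue.clear()


gpu = ToyGpu(vram_size=16)
readback = bytearray(4)
gpu.submit_copy_to_vram(b"\x01\x02\x03\x04", dst_offset=8)
gpu.submit_copy_to_host(src_offset=8, length=4, dst=readback)
gpu.run()
```

After run() returns, readback holds the four bytes that were first copied into the toy VRAM, without the host ever indexing gpu.vram itself.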
Figure 1
The material in this section comes from the AMD manual "R600 Instruction Set Architecture".
2. R600 Graphics Pipeline
The material in this section comes from the document "Radeon R6xx/R7xx Acceleration"; it is worth reading that document before studying the 3D graphics pipeline.
Input data flows through the graphics hardware in roughly the following order: vertex processing, primitive assembly, rasterization, fragment processing, and output. Later shader models add a geometry shader (Shader Model 4.0) and tessellation shaders (Shader Model 5.0); these two parts are not considered for the time being. Figure 2 shows the graphics pipeline of the AMD R600:
Figure 2
Command Processing
The command processor handles the command streams in the ring command buffer and in indirect buffers. Processing these streams typically results in a series of register writes (some registers can be both read and written; others can only be set through the command stream and cannot be read back). The driver sets up the index buffer (in GTT memory) and tells the hardware where it is. A command then triggers the vertex grouper and tessellator (VGT), which uses the index buffer address to send the index data to the SPI (shader pipe interpolator, which allocates resources for shaders) and sends the primitive connectivity information to the primitive assembler (PA).
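The way a command stream turns into register writes and draw triggers can be sketched with a toy interpreter. The command format below is hypothetical and deliberately simplified; it is not the real packet encoding used by the hardware, and the register names INDEX_BASE and RENDER_STATE are made up for illustration.

```python
# Toy command format (hypothetical, not the real hardware packet encoding):
#   ("SET_REG", name, value)    write one register
#   ("INDIRECT", buffer_name)   run the commands stored in an indirect buffer
#   ("DRAW", index_count)       trigger the VGT

def process_commands(ring, indirect_buffers):
    registers = {}   # register state produced by the command stream
    draws = []       # draw triggers forwarded to the VGT

    def run(stream):
        for cmd in stream:
            op = cmd[0]
            if op == "SET_REG":
                registers[cmd[1]] = cmd[2]
            elif op == "INDIRECT":
                run(indirect_buffers[cmd[1]])
            elif op == "DRAW":
                # Record the index buffer address in effect at draw time.
                draws.append((registers.get("INDEX_BASE"), cmd[1]))
            else:
                raise ValueError(f"unknown command: {op}")

    run(ring)
    return registers, draws


indirect_buffers = {
    "ib0": [("SET_REG", "INDEX_BASE", 0x1000), ("DRAW", 3)],
}
ring = [("SET_REG", "RENDER_STATE", 1), ("INDIRECT", "ib0")]
registers, draws = process_commands(ring, indirect_buffers)
```

Here the ring buffer sets one register and then jumps into an indirect buffer, which sets the index buffer address and issues a draw; the draw is recorded together with the address that was current when it was issued.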
Vertex Processing
All shader processing (that is, every computation performed on the GPU cores) takes place in the unified shader block, which contains the sequencer (SQ, which controls shader execution) and the shader pipe (SP) modules. Each shader program has access to a number of general-purpose registers (GPRs). The SPI allocates these registers dynamically before the shader program runs and loads the appropriate parameters into them, including the base address of the vertex data. The SPI then asks the SQ to launch the program. The first thing the shader program does is fetch the vertex data, after which it runs the shader code for that vertex. The output of the vertex shader is placed in the shader export block (SX). Because pixel processing and vertex processing run on the same hardware, the pixel shader's output also goes to the SX. The R600's vertex output has two parts: the position cache holds the vertex coordinates, and the parameter cache holds the remaining vertex attributes.
Without a geometry shader, vertex data is processed as follows:
- The VGT receives a pointer to the index buffer (in immediate mode, is the index buffer perhaps generated temporarily by the hardware?), iterates through all the indices, and sends them to the SPI. (Note: the steps in this list are translated from r6xx r7xx acceleration.pdf.)
- The SPI groups the index data in its input buffer into a vector called a wavefront (a wavefront holds up to 64 vertices)
- When the wavefront is ready, the SPI allocates GPRs (general-purpose registers; each is 32*4 = 128 bits wide and can hold a four-component floating-point vector) and thread space according to the sizes supplied by the driver (which the driver writes to the SQ_PGM_RESOURCE_VS register). The indices are then loaded into the GPRs (it is not clear what assigns the GPR IDs), and the shader core is notified that a new wavefront is ready
- The shader core runs the vertex shader program for each vertex in the wavefront
- The vertex shader fetches vertex data using the index in the GPR (with fetch instructions or a separate fetch program)
- The vertex data is loaded into GPRs
- The rest of the shader program continues to run
- The shader program allocates space in the SX position cache and exports the vertex coordinates (XYZW) to it
- The shader program allocates space in the SX parameter cache, exports the other vertex attributes (color, texture coordinates) to it, and then exits
- The SPI is told that all vertices of the wavefront have been processed, and it releases the GPRs
Once the render state has been configured, the procedure above is transparent to the user (the driver) except for the 4th step, which runs according to the shader program the user has written.
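The vertex steps listed above can be sketched as a small software model. This is an illustration under stated assumptions, not how the hardware is implemented: the function names, the dictionary-based vertex layout, and the pass-through shader are all hypothetical. Only the wavefront size of 64 and the split into a position cache and a parameter cache come from the text.

```python
WAVEFRONT_SIZE = 64  # from the text: a wavefront holds up to 64 vertices

def build_wavefronts(indices, size=WAVEFRONT_SIZE):
    """SPI step: group the index stream into wavefronts."""
    return [indices[i:i + size] for i in range(0, len(indices), size)]

def run_vertex_stage(indices, vertex_buffer, vertex_shader):
    position_cache = []   # stands in for the SX position cache
    parameter_cache = []  # stands in for the SX parameter cache
    for wavefront in build_wavefronts(indices):
        gprs = list(wavefront)              # indices are loaded into GPRs
        for index in gprs:
            vertex = vertex_buffer[index]   # fetch vertex data by index
            position, params = vertex_shader(vertex)
            position_cache.append(position)
            parameter_cache.append(params)
        # The GPRs would be released here once the wavefront completes.
    return position_cache, parameter_cache

def passthrough_shader(vertex):
    """Hypothetical shader: extend xyz to xyzw and forward the color."""
    x, y, z = vertex["pos"]
    return (x, y, z, 1.0), {"color": vertex["color"]}

vertex_buffer = [
    {"pos": (0.0, 0.0, 0.0), "color": (1.0, 0.0, 0.0)},
    {"pos": (1.0, 0.0, 0.0), "color": (0.0, 1.0, 0.0)},
    {"pos": (0.0, 1.0, 0.0), "color": (0.0, 0.0, 1.0)},
]
positions, params = run_vertex_stage(
    [0, 1, 2, 2, 1, 0], vertex_buffer, passthrough_shader
)
```

The model processes vertices one at a time, whereas the hardware runs all vertices of a wavefront in parallel; the sequential loop is only there to keep the data flow easy to follow.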
Fragment Processing
When vertex processing is complete, the vertex data is sent to the primitive assembler (PA) for primitive assembly (recall that the vertex connectivity information was already sent to the PA during command processing). The output of the PA goes to the scan converter (SC) for scan conversion (rasterization and interpolation). The SC checks the depth buffer (DB) to determine whether each fragment is visible; this is the Early-Z, Re-Z, and HiZ processing. (As I understand it, the SC checks the Z buffer: if the fragment's depth value is greater than the value stored in the depth buffer, the fragment is occluded. With blending disabled, such a fragment can be discarded and is not processed further; with blending enabled, it still goes through the subsequent processing.) The rasterized fragments are sent to the SPI and then into the shader core for the final fragment processing, which performs texture fetches, ALU computations, and memory reads and writes. When it finishes, the fragment's geometric information (screen-space coordinates and depth value) and color information are sent through the SX (to which the vertex shader also outputs) to the DB and CB for final processing.
After vertex processing, the data enters the rasterization stage, where the scan converter interpolates the attribute data to produce fragment data. The fragment data then passes through the fragment processing stage to become candidate pixels (that is, fragments).
Because the R600 is a unified shader architecture, vertex processing and fragment processing run on the same hardware, so the two follow similar steps. The R600 fragment processing phase, including rasterization, proceeds as follows:
- The primitive assembler (PA) reads the vertex coordinates from the SX position cache and the vertex connectivity information from the VGT; with these two pieces of information it can assemble the primitives
- The assembled primitives are sent to the SC for a coarse scan conversion (what does the coarse scan conversion do? Presumably it divides large primitives into small tiles)
- The tiles produced by the coarse scan conversion are sent to the SPI for final interpolation
- The SPI allocates GPRs and thread space (according to the sizes specified by the driver)
- The SC and SPI read the vertex attribute data from the SX parameter cache
- The SPI computes each pixel's attributes by interpolating the vertex attributes
- The interpolated attributes are loaded into GPRs
- The shader core is told that a pixel wavefront has arrived and is ready for the pixel shader to execute
- The shader core runs the pixel shader for each fragment in the wavefront; the end of the pixel shader contains instructions that export the fragment attributes (colors) to the SX
- The SPI is told that all fragments in the wavefront have been processed, and it releases the GPRs and thread space
The pixel shader program exports its results to the SX, from which they are sent to the specified render targets; up to 8 render targets can be configured at a time.
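The interpolation step in the list above (computing per-pixel attributes from vertex attributes) can be sketched with barycentric weights. This is an assumption made for illustration: the source does not say which interpolation scheme the SPI uses, and the sketch works in 2D screen space without perspective correction. The function names are hypothetical.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def interpolate(p, verts, attrs):
    """Return the attribute at pixel p, or None if p is outside the triangle."""
    v0, v1, v2 = verts
    area = edge(v0, v1, v2)
    if area == 0:
        return None  # degenerate triangle, nothing to rasterize
    # Barycentric weight of each vertex; dividing by area handles both windings.
    w0 = edge(v1, v2, p) / area
    w1 = edge(v2, v0, p) / area
    w2 = edge(v0, v1, p) / area
    if min(w0, w1, w2) < 0:
        return None  # pixel lies outside the triangle
    # Weighted sum of the three vertex attributes, component by component.
    return tuple(w0 * a0 + w1 * a1 + w2 * a2 for a0, a1, a2 in zip(*attrs))


verts = [(0, 0), (4, 0), (0, 4)]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
inside = interpolate((1, 1), verts, colors)    # (0.5, 0.25, 0.25)
outside = interpolate((4, 4), verts, colors)   # None
```

A pixel at a vertex gets exactly that vertex's attribute, and pixels in between get a weighted mix, which is the behavior the interpolated GPR contents would reflect.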
Final Rendering
The output of the pixel shader goes to the DB and CB for final processing (corresponding to the raster operation and merging stages in Figure 1, Figure 2, and Figure 3), which includes alpha testing, depth testing, and final blending.
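The three final operations can be sketched for a single fragment as follows. The specific choices are assumptions for illustration, since the hardware makes all of them configurable: the alpha test passes when alpha is greater than a reference value, the depth test passes when the fragment is nearer than the stored depth, and blending is source-over. The function name output_merge is hypothetical.

```python
def output_merge(fragment, depth_buffer, color_buffer, alpha_ref=0.0, blend=True):
    """Apply alpha test, depth test, and blending to one fragment.

    Returns True if the fragment was written. Assumed settings: alpha test
    is 'greater than alpha_ref', depth test is 'less than', and blending is
    source-over (src * a + dst * (1 - a)).
    """
    x, y, depth = fragment["x"], fragment["y"], fragment["depth"]
    r, g, b, a = fragment["color"]
    if a <= alpha_ref:
        return False                       # alpha test failed
    if depth >= depth_buffer[y][x]:
        return False                       # depth test failed
    depth_buffer[y][x] = depth
    if blend:
        dr, dg, db = color_buffer[y][x]
        color_buffer[y][x] = (r * a + dr * (1 - a),
                              g * a + dg * (1 - a),
                              b * a + db * (1 - a))
    else:
        color_buffer[y][x] = (r, g, b)
    return True


depth_buffer = [[1.0]]                 # one-pixel buffers, cleared to far
color_buffer = [[(0.0, 0.0, 0.0)]]
near = {"x": 0, "y": 0, "depth": 0.25, "color": (1.0, 0.0, 0.0, 0.5)}
far = {"x": 0, "y": 0, "depth": 0.75, "color": (0.0, 1.0, 0.0, 1.0)}
wrote_near = output_merge(near, depth_buffer, color_buffer)
wrote_far = output_merge(far, depth_buffer, color_buffer)
```

The near fragment passes both tests and is blended at half opacity over the cleared background; the far fragment arrives afterwards and is rejected by the depth test, so the buffers stay unchanged.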
[Original] Graphics Systems in the Linux Environment and AMD R600 Graphics Programming (9): R600 3D Engine and Graphics Pipeline