NVIDIA and ATI graphics cards: a detailed look at the GPU workflow

Source: Internet
Author: User
Contents:


Chapter 1: Introduction to the workflow of second-generation and later GPUs

Chapter 2: The traditional pipeline of DirectX 8 and DirectX 9 GPUs

Chapter 3: Vertex and pixel operation instructions

Chapter 4: Instruction execution in traditional GPUs

Chapter 5: Unified rendering architecture

Chapter 6: Implementation of the unified rendering architecture in G80 and R600

Chapter 7: Performance comparison between G80 and R600

Chapter 8: The awkward mid-range -- an analysis of the GeForce 8600


The first four chapters briefly introduce the workflow and instruction handling of the GPU, the graphics processing unit at the core of DirectX 8/9 video cards. The later chapters discuss the unified rendering architecture and the features of next-generation DirectX 10 GPUs: the specific architectural implementations of the G80/GeForce 8800 and the R600/Radeon HD 2900 XT and the differences between them. Finally, a simple analysis is performed on the most important mid-range part, the GeForce 8600.

Chapter 1: Introduction to the workflow of second-generation and later GPUs


Put simply (and not necessarily rigorously): the GPU mainly processes 3D graphics, that is, it generates and renders images.


The GPU graphics (processing) pipeline completes the following tasks (not necessarily in this order):

Vertex processing: At this stage the GPU reads the vertex data that describes the appearance of the 3D scene and, based on that data, determines the shapes and positional relationships of the 3D objects, building the skeleton of the 3D image. In GPUs that support DX8 and DX9, these operations are performed in hardware by the vertex shader.

Rasterization: The image actually shown on the display is made of pixels, so the points and lines generated above must be converted into the corresponding pixels by some algorithm. The process of converting a vector image into a set of pixels is called rasterization. For example, a mathematically ideal diagonal line segment is eventually converted into a stair-stepped run of continuous pixels.

Texture mapping: The polygons produced by the vertex units only form the outline of a 3D object; texture mapping pastes the corresponding images onto the polygon surfaces, producing a "real" looking image. The TMU (texture mapping unit) is responsible for this task.

Pixel processing: In this stage (during the rasterization of each pixel), the GPU performs the pixel calculations that determine each pixel's final attributes. In GPUs that support the DX8 and DX9 specifications, these operations are performed in hardware by the pixel shader.

Final output: The ROP (raster operations unit, also called the raster engine) completes the final output of the pixels. After a frame has been rendered, it is sent to the frame buffer in video memory.
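As a concrete illustration of the rasterization step, here is a minimal Python sketch of my own (real GPU rasterizers work on triangles using edge functions, so this is only an analogy) that converts an ideal diagonal line segment into the stair-stepped run of pixels described above:

def rasterize_line(x0, y0, x1, y1):
    """Toy DDA-style line rasterizer: turn an ideal segment into pixel coordinates."""
    steps = max(abs(x1 - x0), abs(y1 - y0)) or 1   # number of pixel steps to emit
    pixels = []
    for i in range(steps + 1):
        t = i / steps
        # sample the ideal line and snap the sample to the nearest pixel centre
        pixels.append((round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0))))
    return pixels

# A diagonal segment becomes a "staircase" of discrete pixels:
print(rasterize_line(0, 0, 7, 3))
# [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3)]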

Summary:
In general, the GPU generates the 3D image, maps it to the corresponding pixels, computes each pixel to determine its final color, and completes the output.

Chapter 2: The traditional pipeline of DirectX 8 and DirectX 9 GPUs

The workflow described above already explains most of this; in this chapter we simply summarize it. A traditional GPU's functional units can be divided into vertex units and pixel pipelines. The vertex units are implemented by a number of hardware vertex shaders. A traditional pixel pipeline consists of several groups of PSU (pixel shader unit) + TMU + ROP. So, in a traditional GPU, the vertex units generate the polygons and the pixel pipelines are responsible for pixel rendering and output.

One note about the pixel pipeline: although a traditional pipeline is considered to be 1 PSU + 1 TMU + 1 ROP, this ratio is not constant. For example, the Radeon X1000 series (excluding the X1800) is widely described as a 3:1 "golden architecture": the PSU : TMU : ROP ratio is 3:1:1. A typical X1900 video card has 48 PSUs, 16 TMUs and 16 ROPs. This design reflects the fact that in today's games the number of pixel instructions is far greater than the number of texture instructions. With this architecture, ATI successfully defeated the GeForce 7 and took the lead in 3D performance in the later stages of the DX9 era.

Summary:
In a traditional GPU, several vertex units generate the polygons, and pixels are rendered and output in the pixel pipelines. A pixel pipeline contains a PSU, a TMU and a ROP (some sources do not count the ROP as part of the pipeline). The ratio is usually 1:1:1, but it is not fixed.

Chapter 3: Vertex and pixel operation instructions

The GPU executes the corresponding instructions to complete vertex and pixel operations.


Anyone familiar with OpenGL or Direct3D programming knows that a pixel is generally described by four channels (attributes): the RGB primary colors plus an alpha value. A vertex is usually described by four channels (attributes): X, Y, Z and W. Executing one vertex or pixel instruction therefore usually requires four computations; we call such an instruction a 4D vector instruction (4 dimensions). Of course, not all instructions are 4D vector instructions; in actual processing there are also large numbers of 1D scalar instructions as well as 2D and 3D instructions.
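To make the 4-channel idea concrete, here is a toy Python sketch (my own representation, not any real shader API): it models a 4-tuple attribute and shows that one 4D vector add amounts to four component-wise computations:

from collections import namedtuple

# A pixel carries four channels (R, G, B, A); a vertex carries (X, Y, Z, W).
Vec4 = namedtuple("Vec4", "x y z w")

def add4(a, b):
    """One 4D vector 'add' instruction = four component-wise additions."""
    return Vec4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w)

pixel_a = Vec4(0.2, 0.4, 0.6, 1.0)   # an RGBA colour
pixel_b = Vec4(0.1, 0.1, 0.1, 0.0)
print(add4(pixel_a, pixel_b))        # four additions produce the four output channels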

Summary:
Since vertices and pixels usually use 4-tuples to represent their attributes, vertex and pixel operations are usually 4D vector operations, although scalar operations also occur.

Chapter 4: Instruction execution in traditional GPUs


Traditional GPUs are based on a SIMD architecture. SIMD stands for Single Instruction, Multiple Data: one instruction operates on multiple pieces of data.

This is actually easy to understand. The ALU (arithmetic logic unit) in a traditional VS or PS (usually each VS or PS has one ALU, though not always; the G70 and R5xx, for example, have two) can complete the operation on all four components within one cycle (that is, simultaneously). For example, when a 4D instruction is executed, the ALU in the PS or VS computes the four attribute values of the vertex or pixel addressed by that instruction at the same time. This is the origin of the name SIMD, and such an ALU is called a 4D ALU.

Note that although the 4D SIMD architecture is well suited to 4D instructions, its efficiency on a 1D instruction drops to 1/4, leaving 3/4 of the ALU resources idle. To improve resource utilization when the VS or PS runs 1D, 2D and 3D instructions, DirectX 9 era GPUs generally use 1D+3D or 2D+2D ALUs. This is the co-issue technique. Such an ALU still performs as well as a traditional 4D ALU on 4D instructions, but it is far more efficient on 1D, 2D and 3D instructions. For example:

add r0.xyz, r0, r1

// This instruction adds the x, y and z components of vectors r0 and r1 and writes the result to r0.

add r3.x, r2, r3

// This instruction adds the x components of vectors r2 and r3 and assigns the result to r3.

For a traditional 4D ALU, these two instructions obviously take two cycles: ALU utilization is 75% in the first cycle and 25% in the second. For a 1D+3D ALU, the two instructions can be combined into a single 4D issue, so they complete in one cycle with 100% ALU utilization. However, even with co-issue, ALU utilization cannot always reach 100%; this depends on the parallelism and dependencies among the instructions. More intuitively, the two instructions above clearly cannot be completed in one cycle by a 2D+2D ALU, and two 2D instructions cannot be completed in one cycle by a 1D+3D ALU. Traditional GPUs are clearly not flexible when handling non-4D instructions.
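To make the utilization argument concrete, here is a simplified Python model of my own (it ignores instruction dependencies, register ports and everything else a real scheduler must handle) comparing a plain 4D SIMD ALU with a 1D+3D co-issue ALU on a stream of instruction widths:

def cycles_plain_4d(widths):
    # A plain 4D SIMD ALU issues exactly one instruction per cycle,
    # no matter how many components that instruction actually uses.
    return len(widths)

def cycles_coissue_1d_plus_3d(widths):
    # A 1D+3D ALU can issue either one 4D instruction per cycle, or one
    # instruction of width <= 3 together with one 1D instruction.
    pending = list(widths)
    cycles = 0
    while pending:
        w = pending.pop(0)
        cycles += 1
        if w <= 3 and 1 in pending:
            pending.remove(1)      # co-issue a pending 1D instruction in the spare slot
    return cycles

def utilization(widths, cycles):
    return sum(widths) / (4 * cycles)   # 4 lanes are available every cycle

stream = [3, 1]                         # the two instructions from the example above
for name, fn in [("plain 4D ALU", cycles_plain_4d),
                 ("1D+3D co-issue ALU", cycles_coissue_1d_plus_3d)]:
    c = fn(stream)
    print(f"{name}: {c} cycle(s), utilization {utilization(stream, c):.0%}")
# plain 4D ALU: 2 cycle(s), utilization 50%
# 1D+3D co-issue ALU: 1 cycle(s), utilization 100%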


Summary:
In traditional GPUs, vertex and pixel processing are completed by the VS and PS respectively. Each usually has a 4D ALU that can perform a 4D vector operation in one cycle, but such an ALU is inefficient for 1D, 2D and 3D operations. To compensate, the ALUs in DX9 video cards are usually organized as 1D+3D or 2D+2D.

Chapter 5: Unified rendering architecture


Compared with DirectX 9, the biggest improvement of DirectX 10 is the unified rendering architecture, the unified shader. Traditional GPUs have always used a separate architecture: vertex processing is done by the vertex shaders and pixel processing by the pixel shaders, so once the GPU core design is finished, the numbers of PS and VS units are fixed. Different games, however, place different demands on the two, and a fixed PS-to-VS ratio is obviously not flexible enough. To solve this problem, the DirectX 10 specification proposes the unified rendering architecture. Vertex data and pixel data have a great deal in common computationally: for example, both are 4D vectors, and the ALU work on both is the same kind of floating-point arithmetic, with no essential difference. This is what makes unified rendering feasible. In a unified rendering architecture, the PS units and VS units are replaced by general-purpose US (unified shader) units; NVIDIA's implementation calls them stream processors. A US unit can process both vertex data and pixel data, so the GPU can allocate units flexibly according to the actual workload, effectively avoiding the uneven VS/PS loading of the traditional separate architecture.
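As a rough, hypothetical illustration of why flexible allocation helps (the workload numbers and the fixed 8 VS + 24 PS split below are invented for the example, not taken from any real chip), compare how long a frame's worth of vertex and pixel work takes on a fixed split versus a unified pool of the same total size:

import math

def frame_time_fixed(vertex_work, pixel_work, num_vs=8, num_ps=24):
    # Separate architecture: each pool only handles its own kind of work,
    # so the slower pool determines when the frame is finished.
    return max(math.ceil(vertex_work / num_vs), math.ceil(pixel_work / num_ps))

def frame_time_unified(vertex_work, pixel_work, num_us=32):
    # Unified architecture: all 32 units can take either kind of work.
    return math.ceil((vertex_work + pixel_work) / num_us)

# A geometry-heavy frame: lots of vertex work, relatively little pixel work.
print(frame_time_fixed(1600, 800))    # 200 time units: the 8 VS units are the bottleneck
print(frame_time_unified(1600, 800))  # 75 time units: same 32 units, nothing left idle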

Summary:
A unified rendering architecture replaces the traditional fixed numbers of VS and PS units with US (usually called SP) units. A US unit can perform both vertex and pixel operations, so units can be allocated flexibly according to the needs of the game, improving resource utilization.

Chapter 6: Implementation of the unified rendering architecture in G80 and R600


In the following we focus on the unified shader units of the G80 and R600, setting aside the texture units, ROPs and other factors.

In the G80 GPU, 128 unified scalar shaders are arranged in 16 groups; they are called stream processors, referred to as SPs below. Each SP contains one full-function 1D ALU, which can complete a multiply-add (MADD) operation in one cycle. Some readers may have noticed that in traditional GPUs both the VS and PS ALUs are 4D, whereas here every SP's ALU is a 1D scalar ALU. That is correct: this is the MIMD (multiple instruction, multiple data) architecture mentioned in many documents. The G80 takes a fully scalar route, splitting the ALU down to the most basic 1D scalar ALU and implementing 128 such 1D scalar SPs. A 4D vector operation that a traditional GPU completes in one cycle therefore takes four cycles on one of these scalar SPs, or, put differently, one 4D operation requires four SPs working in parallel. The biggest benefit of this design is flexibility: whether an instruction is 1D, 2D, 3D or 4D, the G80 treats it the same way, splitting everything into 1D instructions, and the splitting works in the same manner regardless of the instruction's vector width.

For example, consider the 4D vector instruction add r0.xyzw, r0, r1, which adds vectors r0 and r1 and writes the result to r0.

The G80 compiler splits it into four 1D scalar instructions and assigns them to four SPs:

add r0.x, r0, r1
add r0.y, r0, r1
add r0.z, r0, r1
add r0.w, r0, r1


To sum up, the G80 architecture can be described as 128 x 1D.
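To give the fully scalar idea a concrete (and deliberately simplified) form, here is a small Python throughput model of my own: it treats each instruction purely as a count of scalar components, spreads those components over 128 scalar SPs, and ignores scheduling, dependencies and latency entirely:

import math

def g80_cycles(instruction_widths, num_sps=128):
    # Idealized model of a fully scalar architecture: every instruction is
    # split into 1D operations, and the scalar operations (from the many
    # vertices and pixels in flight) fill the available SPs.
    scalar_ops = sum(instruction_widths)
    return math.ceil(scalar_ops / num_sps)

# 128 independent 4D vector adds = 512 scalar operations
# -> 4 cycles on 128 scalar SPs, matching the splitting described above.
print(g80_cycles([4] * 128))              # 4

# A mixed stream of 1D/2D/3D instructions wastes nothing in this model:
print(g80_cycles([1, 2, 3, 1, 1] * 64))   # 512 scalar ops -> 4 cycles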

[Figure: G80 core structure]

The R600's implementation is very different from the G80's: it retains a SIMD architecture. The R600 core contains 64 stream processors arranged in four groups, but each processor has one 5D ALU, or, more precisely, five 1D ALUs, because the ALUs in each stream processor can be combined as any of 1+1+1+1+1, 1+4, 2+3 and so on (earlier GPUs could only do 1D+3D or 2D+2D). ATI calls these ALUs streaming processing units, which is why ATI claims the R600 has 320 SPUs. Note that each R600 stream processor can issue only one instruction per cycle, yet it contains five 1D ALUs. To improve ALU utilization, ATI adopts a VLIW (very long instruction word) design: several short instructions are merged into one long instruction group and handed to the stream processor for execution. For example, the R600 can merge five 1D instructions into one group as a 5D VLIW instruction.

For the following instructions:

add r0.xyz, r0, r1    // 3D
add r4.x, r4, r5      // 1D
add r2.x, r2, r3      // 1D

the R600 can likewise merge them into a single VLIW instruction and complete them in one cycle.
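Below is a minimal Python sketch of this kind of packing, written as a greedy model of my own rather than AMD's actual compiler; it ignores dependencies and the restrictions on what each of the five lanes can do, and simply bundles instruction widths into groups of at most five components, counting one cycle per bundle:

def r600_cycles(instruction_widths, lanes=5):
    # Greedy VLIW packing model: fill 5-wide bundles with instruction
    # components; each bundle costs one cycle on one stream processor.
    pending = sorted(instruction_widths, reverse=True)
    cycles = 0
    while pending:
        cycles += 1
        room = lanes
        packed = []
        for w in pending:
            if w <= room:          # this instruction still fits in the current bundle
                packed.append(w)
                room -= w
        for w in packed:           # remove what was packed this cycle
            pending.remove(w)
    return cycles

print(r600_cycles([3, 1, 1]))   # 1 cycle: the example above fills all 5 lanes
print(r600_cycles([4]))         # 1 cycle, but only 4 of the 5 lanes are used (80%)
print(r600_cycles([4, 4]))      # 2 cycles if 4D instructions cannot be split and recombined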

In summary, the R600 architecture can be described as 64 x 5D.


[Figure: R600 core structure and SP structure]


Summary:
The G80 fully scalarizes its computation: it has 128 built-in 1D scalar SPs, each with a 1D ALU that processes one 1D operation per cycle, and a 4D vector operation is split into four 1D scalar operations. The R600 keeps a SIMD architecture and has 64 SPs, each with five 1D ALUs, which is why the R600 is usually said to have 320 SPUs; each SP can issue only one instruction per cycle, so ATI uses a VLIW design to merge short instructions into long VLIW instructions and improve resource utilization. For example, five 1D scalar instructions can be merged into one VLIW instruction and completed by an SP in one cycle.

Chapter 7: Performance comparison between G80 and R600


From the previous chapter we can see that the R600's ALU scale, 64 x 5D = 320, is clearly larger than the G80's 128 x 1D = 128. Why, then, in actual tests does the R600-based Radeon HD 2900 XT fail to gain a performance advantage over the G80/GeForce 8800 GTX? This chapter tries to find the answer in the differences between the two stream processor designs; texture units and memory bandwidth are not the focus here (in fact, the R600's memory bandwidth is greater than the G80's).

We will explain the problem in terms of frequency and execution efficiency:

1. Frequency: the G80 has only 128 1D stream processors, an absolute disadvantage in scale, so NVIDIA runs the shader clock asynchronously from the core clock to improve performance. Although the GeForce 8800 GTX's core clock is only 575 MHz, its shader clock is as high as 1350 MHz; that is, the SPs run at more than twice the core clock. The R600, comparatively conservatively, keeps the shader and core clocks synchronized: on the Radeon HD 2900 XT both are 740 MHz. The G80's shader clock is thus almost twice the R600's, which is roughly equivalent to doubling the G80's SP count to 256 at the same clock, much closer to the R600's 320. For multiply-add (MADD) instructions, the theoretical peak floating-point rate of the 740 MHz R600 is 740 MHz x 64 x 5 x 2 = 473.6 GFLOPS, while the G80 with its 1350 MHz shader clock reaches 1350 MHz x 128 x 1 x 2 = 345.6 GFLOPS. The gap between the two is not nearly as large as the difference in SP count suggests.
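The peak numbers above follow from a one-line formula (clock x number of ALUs x ALU width x 2 FLOPs per MADD). A short Python check, using only the figures already quoted in the text:

def peak_gflops(clock_mhz, num_units, lanes_per_unit, flops_per_lane=2):
    # flops_per_lane = 2 because a MADD counts as one multiply plus one add.
    return clock_mhz * num_units * lanes_per_unit * flops_per_lane / 1000.0

print(peak_gflops(740, 64, 5))     # R600:        473.6 GFLOPS
print(peak_gflops(1350, 128, 1))   # G80 shaders: 345.6 GFLOPS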


2. Execution efficiency: although the G80's shader clock is very high, asynchronous clocking alone cannot close the theoretical throughput gap created by the large difference in ALU count, so we must look at the designs of the two stream processors themselves. In the G80, every vector operation is split into 1D scalar operations and distributed across different SPs; if we set aside instruction dependencies and similar issues, all of the G80's SPs can be kept fully busy. The R600 is not so lucky: because each stream processor can process only one instruction at a time, the R600 must merge short instructions into VLIW instructions in order to make full use of the 5D ALU resources in each SP, and this merging does not always succeed. There is currently no information indicating that the R600 can split and recombine instructions; in other words, the R600 cannot break instructions apart and re-splice them into 5D bundles to keep its 5D SPs loaded. Under this assumption, when processing pure 4D instructions that cannot be split and recombined, each R600 SP handles only one 4D instruction per cycle, a utilization of 80%, whereas the G80's instructions are split into 1D operations and its SPs can be 100% utilized at any time. In addition, the R600 structure places heavy demands on the compiler, which must do its best to find parallelism among the shader instructions and splice them into suitable long instructions, while the G80 simply splits everything. It should also be noted that the five 1D ALUs in each R600 SP are not all full-function: according to the available information, only one of the five can perform function operations, floating-point and multiply operations, but it cannot perform ADD operations, while the remaining four perform MADD operations. Every 1D ALU in the G80 is full-function. This also affects the R600's efficiency to a certain extent.

Summary:
Although the R600's ALU scale is much larger than the G80's, the G80's SPs run at nearly twice the clock, and the G80's fully scalar design gives higher resource utilization and execution efficiency, so its overall performance is not inferior to the R600's.

Chapter 8: The awkward mid-range -- an analysis of the GeForce 8600

Among the new generation of mid-range graphics cards, NVIDIA's G84-based GeForce 8600 series was released first, but relative to its high price its performance is disappointing: in many tests it even scores lower than the previous generation's high-end GeForce 7900 GS. This chapter briefly analyzes the SP processing capability of the G84 core using the conclusions discussed above. The G84 is a heavily cut-down version of the G80 core: the number of SPs drops from the G80's 128 to 32, and the memory bus width also drops to one third (from 384-bit to 128-bit). Setting aside the memory bus width, the TMUs and the ROPs, we focus on the SPs. The G84's SP clock also differs from its core clock: on the 8600 GT, for example, the core runs at only 540 MHz while the shader clock is as high as 1242 MHz, more than twice the core clock. Treating this roughly as a factor of two, the G84 core is equivalent to 64 (1D scalar) SPs running synchronously with the core. Since the VS and PS ALUs in a traditional card are 4D, the G84's computing power is roughly equivalent to a traditional card with VS + PS = 64/4 = 16 units, which puts it, in a rough comparison, close to the GeForce 7600 (PS + VS = 17). There is a problem with this comparison, however, because in the G7x each PS has two 4D ALUs, so the 7600's computing power is actually higher than that of a traditional PS + VS = 17 card.

The following calculation illustrates the problem (MADD operations):

For the 7600 GT, each VS ALU is 4D+1D and each PS ALU is 4D+4D, with a 560 MHz core clock, so the theoretical peak floating-point rate is 560 MHz x (12 x (4 + 4) + 5 x (1 + 4)) x 2 = 135.52 GFLOPS. For the 8600 GT: 1242 MHz x 32 x 1 x 2 = about 79.5 GFLOPS. The peak rate of the 8600 GT is thus much lower than that of the previous-generation 7600 GT, not to mention the 7900 GS. However, because of the limitations of the traditional architecture, the G7x's ALUs can essentially never be fully loaded, and its real computing speed is far below the theoretical value, whereas the G8x architecture executes far more efficiently and gets much closer to its theoretical limit. In addition, the G8x, which supports SM4.0, has far more registers than the G7x, and these and other efficiency advantages are enough for the GeForce 8600 GT to beat the previous generation's 7600 GT with only a small number of SPs. But for a DX10 card, merely beating the 7600 GT is obviously not the end goal: with only 32 SPs, it performs very poorly in DX10 games, whose computational demands are unprecedented, and cannot satisfy players.
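The same peak-rate formula used for the G80 and R600 above reproduces these two numbers; here is a quick Python check of the arithmetic in this chapter, using the clock speeds and unit counts as quoted in the text:

def peak_gflops(clock_mhz, total_alu_lanes, flops_per_lane=2):
    # total_alu_lanes = sum of ALU widths over all shader units; MADD = 2 FLOPs.
    return clock_mhz * total_alu_lanes * flops_per_lane / 1000.0

# GeForce 7600 GT: 12 PS with 4D+4D ALUs, 5 VS with 4D+1D ALUs, 560 MHz core.
print(peak_gflops(560, 12 * (4 + 4) + 5 * (4 + 1)))   # 135.52 GFLOPS

# GeForce 8600 GT: 32 scalar SPs at the 1242 MHz shader clock.
print(peak_gflops(1242, 32 * 1))                      # about 79.5 GFLOPS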

Summary:
Thanks to its efficient unified rendering architecture, the 8600 GT barely achieves the goal of replacing the 7600 GT in performance, but its small number of SPs makes it difficult to beat the previous generation's high-end cards, let alone run DX10 games smoothly, and its high price makes things worse. In the final analysis, NVIDIA's stinginess with the G84's SP count and the high price positioning have created the GeForce 8600 GT's awkward situation. As things stand, the 8600 series is clearly not as cost-effective as the GeForce 7900 and Radeon X1950 GT.

 

Original article address:

http://bbs.tyloogaming.com/redirect.php?tid=838&goto=lastpost
