Beware of GPU memory bandwidth
Some time ago I wrote a series of post-process effects, including motion blur, refraction, and screen-space scattering. Most of the shaders are very simple: nothing more than rendering a full-screen quad to the screen, usually with no more than 10 lines of pixel shader code, no branch or loop instructions, and only SM 1.4 required. However, when I enabled several post-process effects at the same time on a GeForce 7300GT for testing, the FPS turned out to be very poor. I checked all the code and made sure it was optimized, but the problem persisted. The preliminary conclusion was a fill-rate bottleneck; after all, whenever people explain post-process techniques, they mention fill rate as the biggest bottleneck.
To find out how many post effects can run at 60fps, I created a blank project and checked how many textured full-screen quads could be rendered at 1024*768 resolution:
// C# code
protected override void Draw(GameTime gameTime)
{
    GraphicsDevice.Clear(Color.Black);

    effect.Begin();
    effect.CurrentTechnique.Passes[0].Begin();

    for (int i = 0; i < maxQuadCount; i++)
        quad.Draw();

    effect.CurrentTechnique.Passes[0].End();
    effect.End();

    base.Draw(gameTime);
}
// Shader
float2 screenparam;
texture sourcetexture;

sampler sourcespl : register(s0) = sampler_state
{
    Texture = <sourcetexture>;
    MinFilter = None;
    MagFilter = Linear;
    MipFilter = Linear;
    AddressU = Clamp;
    AddressV = Clamp;
};

void fullscreenquadvs(float3 iposition : POSITION,
                      out float4 oposition : POSITION,
                      inout float2 texcoord : TEXCOORD0)
{
    oposition = float4(iposition, 1);
    texcoord += 0.5 / screenparam;   // D3D9 half-texel offset so texels align with pixels
}

float4 filmscratchps(float2 texcoord : TEXCOORD0) : COLOR
{
    float3 color = tex2D(sourcespl, texcoord).xyz;
    return float4(color, 0.02);
}

technique Default
{
    pass P0
    {
        VertexShader = compile vs_1_1 fullscreenquadvs();
        PixelShader = compile ps_2_0 filmscratchps();
    }
}
The result was quite unexpected: with more than 20 quads, the FPS dropped below 60. At 1024*768, each full-screen quad covers 786,432 pixels; at 60 frames per second with 20 quads per frame, that is a fill rate of roughly 0.943 billion pixels/sec. But the specified fill rate of the 7300GT is 2.8 billion pixels/sec! Of course the spec sheet is only a theoretical value, but the gap is far too large: the measured figure is only about 1/3 of the spec.
So what is the bottleneck? Is the hardware simply not reaching its rated performance, or is it overhead inside XNA? I thought the latter was more likely, so I posted on the Creators Club forums, and someone told me this was most likely a bandwidth bottleneck.
Is that true? Let's calculate the bandwidth used by the code above. For each pixel rendered we read the RGBA texture (4 bytes), write the color into the back buffer (4 bytes), and do one depth read plus one depth write (2*4 bytes), 16 bytes in total. At 60 frames per second and 20 quads per frame, that comes to about 15 GB/s. The memory bandwidth of the 7300GT is 10.7 GB/s, so on paper this is indeed a bandwidth bottleneck. But notice something odd: the measured throughput is actually higher than the theoretical maximum. At that point I still wasn't sure it was a bandwidth bottleneck, so I replaced the original 1024*768 texture with a 2*2 texture and tested again. The FPS improved dramatically, and now I was sure it was a bandwidth bottleneck.
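The two back-of-the-envelope calculations above can be written out explicitly. This is a standalone C# sketch, separate from the XNA project; the byte counts are exactly the assumptions stated above:

// C# code: fill-rate and bandwidth estimate for the test above
using System;

long pixelsPerQuad = 1024 * 768;        // 786,432 pixels per full-screen quad
int quadsPerFrame  = 20;
int framesPerSec   = 60;

// Fill rate: pixels drawn per second
long pixelsPerSecond = pixelsPerQuad * quadsPerFrame * framesPerSec;
Console.WriteLine(pixelsPerSecond);     // 943,718,400, roughly 1/3 of the 2.8 Gpixel/s spec

// Bandwidth: 4-byte texture read + 4-byte color write + 8-byte depth read/write per pixel
int bytesPerPixel = 4 + 4 + 8;
double gbPerSecond = (double)pixelsPerSecond * bytesPerPixel / 1e9;
Console.WriteLine(gbPerSecond);         // about 15.1 GB/s, above the 7300GT's 10.7 GB/s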
You may wonder: why does this prove it is a bandwidth problem, and how can the measured value be higher than the theoretical one? The answer lies in the GPU hardware architecture. In advertisements and spec sheets we usually only see how much video memory a card has, but besides the video memory the GPU, like a CPU, has a small on-chip cache, usually only a few KB. Only when a GPU cache miss occurs does the access go out to video memory and generate bandwidth traffic. A 2*2 texture is at most 16 bytes, so it always stays in the cache. Reducing only the bandwidth and seeing the FPS rise is enough to show that the bottleneck we hit was indeed bandwidth, and it also partly explains why our calculated figure came out higher than the spec.
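For reference, the 2*2 replacement texture can be created directly in code. This is a minimal sketch assuming the XNA 3.x Texture2D API; tinyTexture is a placeholder name and the parameter name simply mirrors the shader above:

// C# code: swap the 1024*768 source texture for a 2*2 one
// A 2*2 RGBA texture is only 16 bytes, so every sample stays in the on-chip cache
// and texture reads no longer generate video-memory traffic.
Texture2D tinyTexture = new Texture2D(GraphicsDevice, 2, 2, 1,
                                      TextureUsage.None, SurfaceFormat.Color);
tinyTexture.SetData(new[] { Color.White, Color.White, Color.White, Color.White });
effect.Parameters["sourcetexture"].SetValue(tinyTexture);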
If you now think that for a full-screen quad you can simply disable depth read/write to improve performance, you may run into another interesting problem: disabling depth read/write gives almost no improvement! On the 8800GT, toggling depth read/write on and off makes a difference of only about 10 FPS. Where is the problem? Is the bandwidth calculation wrong, or was the earlier reasoning completely false? I don't have a definitive answer, but I did find information that can explain it. Because the color buffer and depth buffer are read and written so frequently, GPUs actually store them in video memory in a compressed format, and the hardware supports very fast compression and decompression; the specific algorithms are complex, so Google them if you are interested. Here is a simple analogy for why compression increases efficiency and reduces bandwidth: if a 1024*1024 buffer is divided into a 64*64 grid of blocks, then when the data written to a region is all the same, for example a depth of 1, you don't have to write 1024*1024 values one by one; you only need to mark the 64*64 blocks as 1 (which is also why Clear() is so fast on modern graphics cards). Our case happens to be the ideal one: the two triangles that make up the quad lie on the same depth plane and cover the whole screen, which makes depth read/write exceptionally fast, and it also partly explains why the actual bandwidth observed above is lower than the theoretical calculation.
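For completeness, toggling depth read/write in that test looks like this (assuming the XNA 3.x RenderState API):

// C# code: disable depth test and depth writes before drawing the full-screen quads.
// On the 8800GT this made only about a 10 FPS difference, presumably because the
// compressed depth buffer already costs almost no bandwidth in this ideal case.
GraphicsDevice.RenderState.DepthBufferEnable = false;
GraphicsDevice.RenderState.DepthBufferWriteEnable = false;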
So far, three conclusions are drawn:
1. At 60fps and 1024*768 resolution, rendering 20 textured full-screen quads already reaches the limit of the 7300GT (even though the hardware is optimized to reduce bandwidth, this number is still disappointing).
2. Pixel fill rate is a very unreliable parameter. It is affected not only by vertex processing capability (though DX10 hardware, with its unified architecture, does not have this problem), but also by bandwidth, sampling latency and many other factors. In modern games almost all geometry is textured, so you may hit the bandwidth bottleneck long before you ever reach the fill-rate bottleneck.
3. Although the actual bandwidth cost may be lower than the theoretical calculation, bandwidth is still very prone to becoming a bottleneck, especially on lower-end cards. The computing power of graphics cards grows rapidly every year, but bandwidth grows relatively slowly: GeForce 6800 Ultra 35.3 GB/s, 7950GT 44.8 GB/s, 8800GTX 86.4 GB/s, 9800GTX+ 70.4 GB/s (yes, you read that right, the 9800's bandwidth actually dropped), GTX 280 141 GB/s. Only the top card of each NV series is listed here; the bandwidth of low-end products is far lower.
In actual rendering, calculating bandwidth is much more complex than for the full-screen quad above, and all of the following operations generate data traffic:
1. Writing data to the back buffer
2. Enabling alpha blending, which also reads back-buffer data
3. Floating-point textures, which double the data volume
4. Reading/writing the depth/stencil buffer
5. Reading textures (next-gen games usually use a large number of textures to render objects; this is the largest bandwidth consumer)
6. Trilinear mipmapped filtering, which may need to read 8 texels per sample
7. Vertex data (we ignored vertices in the discussion above; in fact vertices also consume a considerable amount of bandwidth)
Don't forget that in practice the same pixel is very likely to be rendered multiple times (we cannot guarantee rendering always proceeds strictly front to back), so the traffic can easily exceed dozens of GB per second. Also keep in mind that bandwidth efficiency never reaches 100%; in practice it is around 80% of the quoted figure.
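As a rough illustration of how quickly this adds up, here is the same kind of back-of-the-envelope estimate for a more realistic frame. The overdraw factor, texture count, and per-sample cost below are illustrative assumptions, not measurements:

// C# code: rough per-frame traffic estimate at 1024*768, 60 fps (assumed numbers)
using System;

long pixels          = 1024 * 768;
double overdraw      = 3.0;      // assume each pixel is shaded ~3 times on average
int colorReadWrite   = 4 + 4;    // alpha blend: read + write the back buffer
int depthReadWrite   = 4 + 4;    // depth read + write
int texturesPerPixel = 4;        // e.g. diffuse, normal, specular, lightmap
int bytesPerSample   = 4 * 8;    // trilinear filtering: up to 8 texel reads per sample

double bytesPerFrame = pixels * overdraw *
                       (colorReadWrite + depthReadWrite + texturesPerPixel * bytesPerSample);
Console.WriteLine(bytesPerFrame * 60 / 1e9);  // about 20 GB/s, ignoring cache hits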
Of course, there are some simple ways to reduce bandwidth usage:
1. Use mipmaps and hardware-supported compressed formats such as DXT. This is the simplest and most effective method. Don't be misled by item 6 above: for 3D rendering, mipmapping usually reduces the amount of texture data fetched per frame, and compressed formats make the best use of the cache, so the required texels may already be available after one or two memory reads.
2. Disable depth read/write and alpha blending whenever possible.
3. Render from front to back as much as possible.
4. Perform an extra depth pass to take full advantage of early-Z rejection of invisible pixels.
5. Separate vertex data into multiple streams, for example placing positions, normals, and texture coordinates in different streams; when rendering the shadow map, bind only the stream containing positions (see the sketch below).
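A minimal sketch of item 5, assuming the XNA 3.x multi-stream API; positionVB and normalUvVB are hypothetical vertex buffers filled elsewhere:

// C# code: stream 0 holds positions only, stream 1 holds normals + texture coordinates
VertexElement[] positionOnly =
{
    new VertexElement(0, 0, VertexElementFormat.Vector3,
                      VertexElementMethod.Default, VertexElementUsage.Position, 0),
};
VertexElement[] full =
{
    new VertexElement(0, 0, VertexElementFormat.Vector3,
                      VertexElementMethod.Default, VertexElementUsage.Position, 0),
    new VertexElement(1, 0, VertexElementFormat.Vector3,
                      VertexElementMethod.Default, VertexElementUsage.Normal, 0),
    new VertexElement(1, 12, VertexElementFormat.Vector2,
                      VertexElementMethod.Default, VertexElementUsage.TextureCoordinate, 0),
};

// Shadow-map pass: bind only the position stream, so normals and UVs never cross the bus
GraphicsDevice.VertexDeclaration = new VertexDeclaration(GraphicsDevice, positionOnly);
GraphicsDevice.Vertices[0].SetSource(positionVB, 0, 12);

// Main pass: bind both streams
GraphicsDevice.VertexDeclaration = new VertexDeclaration(GraphicsDevice, full);
GraphicsDevice.Vertices[0].SetSource(positionVB, 0, 12);
GraphicsDevice.Vertices[1].SetSource(normalUvVB, 0, 20);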