Optimizing the 3D Graphics Pipeline

Last Update:2018-12-07 Source: Internet

Author: User

Running the program under the NVIDIA PerfHUD 5 Launcher made it obvious that CPU time and GPU time were not balanced, so optimization was worth considering.
The following is a summary based on NVIDIA's OGP.
Optimizing code usually means identifying the bottleneck and then optimizing it. CPU-internal optimization methods are not covered here; these notes mainly record how to detect and remove bottlenecks in the CPU -> GPU 3D rendering pipeline.
If you only want to optimize the CPU side, you can use auxiliary tools such as Intel's Intel(R) VTune(TM) Performance Analyzer, Intel(R) Thread Profiler 3.1, and AMD CodeAnalyst.
The optimization steps are as described above: 1: identify the bottleneck, 2: optimize it.
The most common and most effective way to find the bottleneck is to pick a core stage, reduce its clock cycles and load, and check whether program performance changes significantly. Most optimizations then move work away from the stage that limits performance and onto stages that are idle, so that the overall time consumption is balanced.
Let's look at the general flow of a rendering pipeline.
1: The CPU reads geometry vertices from system memory -> delivers them to the GPU vertex cache -> GPU vertex shading -> GPU triangle setup -> GPU matrix transformation -> GPU rasterization -> 3
2: The CPU reads texture data from system memory -> delivers it to GPU video memory -> 3
3: Fragment shading of the rasterized output -> output to the GPU back buffer for display.
Several of these stages can become bottlenecks.

1: CPU logic processing capability limit.

2: CPU -> GPU memory transfer capability limit
(1) Vertices
(2) Textures
3: Transfer bandwidth limit between GPU video memory and the GPU cache
(1) Texture bandwidth limit (video memory -> cache)
(2) Frame buffer bandwidth limit after rasterization (cache -> video memory)
Note: The peak transfer bandwidth limit is not considered here, because it is very rarely the limiting factor.
4: Processing capability limits inside the GPU.
(1) Vertex transform and shading capability limit.
(2) Maximum vertex count limit.
(3) Triangle setup limit.
(4) Rasterization limit.
(5) Pixel shading limit.
5: System memory is too small.
6: Video memory is too small, or other hardware caps are hit.

These are the common bottlenecks in a 3D rendering pipeline, and we test for them one by one. The simple method is to change one setting at a time and watch the FPS. Numbers such as 3.2 below refer to entries in the bottleneck list above (3.2 is sub-item (2) of item 3).
Note 1: many bottlenecks change when the hardware changes.
Note 2: bottlenecks in debug mode and release mode may not be the same.
Note 3: when measuring FPS, vertical synchronization must be disabled.
1: Change the color depth between 16-bit and 32-bit. This directly changes the size of the frame buffer. If the FPS changes significantly, the cause is the 3.2 frame buffer bandwidth limit.
Note: the color depth of all render targets must be changed.
2: Change the texture size and the texture filtering mode. If the FPS changes significantly, the cause is the 3.1 texture bandwidth limit or the 2.2 texture transfer limit.
Note: For texture filtering, point filtering is faster than bilinear filtering, which is faster than trilinear filtering, which is faster than anisotropic filtering. If changing only the filtering mode raises the FPS, the cause is the 3.1 texture bandwidth limit, because filtering happens while texture data is fetched from video memory into the GPU texture cache.
3: Change the display resolution. If the FPS changes significantly, the cause is the 4.4 rasterization limit or the 4.5 pixel shading limit.
To tell the two apart, reduce the number of pixel shader instructions. If the FPS changes significantly, the cause is the 4.5 pixel shading limit; if it does not, the cause is the 4.4 rasterization limit.
4: Reduce the number of vertex shader instructions. If the FPS changes significantly, the cause is the 4.1 vertex transform and shading limit.
5: Reduce the number of vertices and the amount of vertex data transferred. If the FPS changes significantly, the cause is the 4.2 vertex count limit or the 2.1 vertex AGP transfer limit.
6: If none of the above changes the FPS, the cause is the item 1 CPU logic limit.
Note: NVIDIA PerfHUD can also show this through CPU and GPU idle time. If the GPU is idle much of the time, the cause is the CPU processing capability or the AGP transfer capability.
It can also be checked simply by replacing the CPU while keeping the same GPU.
7: Check CPU usage and memory usage in the system resource monitor to see whether the item 1 CPU logic limit or the item 5 memory limit applies.
8: The Caps Viewer provided with the DirectX SDK shows what the video card supports, which allows a more accurate judgment.
9: Change AGP to 1x mode in the BIOS. If the FPS changes significantly, the cause is the 2.1 or 2.2 transfer limit.
10: Reduce the GPU configuration and test again. Two settings matter: the GPU clock frequency, and the GPU memory performance and size. This helps isolate GPU-side issues.
11: Remove code that takes a lot of CPU time, such as the game's physics, AI, and logic, to narrow the search.
12: Add rendering switches for characters, terrain, static models, and shadows to pinpoint the problem more clearly.
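The detection steps above amount to a sequence of "change one setting, compare FPS" experiments. The following is a minimal sketch of that procedure, not a profiler: the `Experiment` type, the `classify` function, and the 10% threshold for a "significant" change are all assumptions made for illustration, and the FPS figures are made up.

```python
from dataclasses import dataclass

# Hypothetical threshold: relative FPS change treated as "significant".
SIGNIFICANT = 0.10


@dataclass
class Experiment:
    name: str          # which setting was changed
    fps_before: float  # FPS with the original setting
    fps_after: float   # FPS after the change
    suspects: str      # bottleneck implied by a significant change


def relative_change(e: Experiment) -> float:
    """Relative FPS change caused by the experiment."""
    return abs(e.fps_after - e.fps_before) / e.fps_before


def classify(experiments) -> str:
    """Return the suspects of the first experiment with a significant change.

    If no experiment moves the FPS, fall back to the CPU logic limit,
    as step 6 of the article does.
    """
    for e in experiments:
        if relative_change(e) >= SIGNIFICANT:
            return e.suspects
    return "1: CPU logic limit"


if __name__ == "__main__":
    runs = [
        Experiment("color depth 32 -> 16 bit", 60.0, 61.0,
                   "3.2 frame buffer bandwidth"),
        Experiment("texture size halved", 60.0, 85.0,
                   "3.1 texture bandwidth or 2.2 texture transfer"),
    ]
    print(classify(runs))
```

The experiments are evaluated in the order they are listed, so they should be supplied in the same order as the numbered steps.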

Optimization Methods:
I. Overall optimization.
1: Reduce small batches
(1) Put more vertices in each vertex buffer. (More than 1024 vertices is suitable.)
(2) Issue fewer draw calls. (Render more triangles per call to reduce the number of calls.)
(3) Combine small texture files into larger texture files wherever possible, to reduce the number of small textures.
(4) Use vertex shader constants to batch closely related geometry. (vs_2_0 already provides 256 4D vector constants.)
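As an illustration of item 1, the sketch below merges several small meshes into one vertex list and one index list so that they could be submitted in a single draw call. It is plain Python with no graphics API calls; `merge_meshes` is a hypothetical helper, and it assumes the meshes already share the same texture, render state, and coordinate space.

```python
def merge_meshes(meshes):
    """Merge (vertices, indices) pairs into one vertex list and one index list.

    Each mesh's indices are rebased by the number of vertices already
    merged, so they keep pointing at the right vertices.
    """
    merged_vertices = []
    merged_indices = []
    for vertices, indices in meshes:
        base = len(merged_vertices)  # position of this mesh's first vertex
        merged_vertices.extend(vertices)
        merged_indices.extend(base + i for i in indices)
    return merged_vertices, merged_indices


if __name__ == "__main__":
    tri_a = ([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [0, 1, 2])
    tri_b = ([(2, 0, 0), (3, 0, 0), (2, 1, 0)], [0, 2, 1])
    vertices, indices = merge_meshes([tri_a, tri_b])
    print(len(vertices), indices)
```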
2: Logical sorting optimization
(1) Sort vertices at the logic layer where possible, to reduce reordering in the GPU cache.
(2) Sort rendering objects by depth at the logic layer, from the screen inward (front to back), to reduce unnecessary depth processing.
(3) Use indexed strips or indexed lists whenever possible.
(4) Sort by texture, based on render state and rendering objects.
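The sorting rules in item 2 can be combined into a single sort key. The sketch below is one possible arrangement, assuming that state changes are more expensive than imperfect depth order: it groups by shader, then texture, and orders by depth only inside each group. `DrawItem` and both functions are hypothetical names, not part of any API.

```python
from typing import NamedTuple


class DrawItem(NamedTuple):
    name: str
    shader_id: int
    texture_id: int
    depth: float  # distance from the camera


def sort_for_submission(items):
    """Group by shader, then texture; draw near objects first within a group."""
    return sorted(items, key=lambda d: (d.shader_id, d.texture_id, d.depth))


def count_state_changes(items):
    """Count how often the (shader, texture) pair changes between draws."""
    changes = 0
    previous = None
    for d in items:
        state = (d.shader_id, d.texture_id)
        if state != previous:
            changes += 1
            previous = state
    return changes


if __name__ == "__main__":
    scene = [
        DrawItem("A", 1, 1, 5.0),
        DrawItem("B", 2, 1, 1.0),
        DrawItem("C", 1, 1, 2.0),
        DrawItem("D", 2, 1, 3.0),
    ]
    ordered = sort_for_submission(scene)
    print([d.name for d in ordered])
    print(count_state_changes(scene), count_state_changes(ordered))
```

If the scene is fill-limited rather than state-change-limited, putting depth first in the key may be the better trade-off.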
3: Reduce unnecessary rendering (basic CPU-side culling such as binary and quad trees is not covered here)
(1) In multi-pass rendering, run a query on each rendering object during the first pass. If the number of pixels the object renders in the first pass does not reach a set threshold, skip the object in later passes.
(2) Repeated rendering (such as sun glare effects) should be counted; once the count is reached, stop rendering or spread the work out.
(3) For complex models, test the basic bounding box first to determine whether the model needs to be rendered at all.
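The first rule in item 3 can be sketched as a filter over per-object pixel counts from the first pass. This is only the decision logic: how the counts are obtained (for example from a hardware occlusion query) is outside the sketch, and `MIN_VISIBLE_PIXELS` and `plan_later_passes` are hypothetical names with a made-up threshold.

```python
# Hypothetical threshold: objects covering fewer pixels are skipped later.
MIN_VISIBLE_PIXELS = 50


def plan_later_passes(first_pass_pixels):
    """Return the objects worth rendering in the passes after the first.

    first_pass_pixels maps an object name to the number of pixels that
    object rendered in the first pass.
    """
    return [name for name, pixels in first_pass_pixels.items()
            if pixels >= MIN_VISIBLE_PIXELS]


if __name__ == "__main__":
    counts = {"hero": 12000, "far_rock": 12, "tree": 50}
    print(plan_later_passes(counts))
```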
4: Reduce unnecessary waiting caused by resource locks
(1) When the CPU locks a resource that the GPU is still rendering from, the common practice is to wait for the GPU to finish, and the CPU is often idle in the meantime. Give the CPU other work during this time, for example preparing the next resource or running game logic.
5: Reduce or evenly distribute CPU load (in fact, most programs are limited by CPU logic)
(1) CPU load typically comes from AI, I/O, networking, and complex logic. Test each of these for CPU bottlenecks to determine the direction of optimization.
(2) Optimization policy: it is best to relieve CPU load while the GPU is busy.
(3) Use the tools mentioned at the beginning of the article to find unnecessary empty loops and unnecessary idle time on the CPU.
II. Local optimization.
6: AGP transfer bottleneck
(1) When too much data is transferred from system memory to GPU memory over AGP 8x, the following methods help.
[1] Reduce the number of vertices.
[2] Reduce the number of dynamic vertices and use vertex shader animation instead.
[3] Use the API correctly, with correct parameters, when creating and managing dynamic vertex and texture buffers.
[4] Choose appropriate frame buffer, texture buffer, and static vertex buffer sizes based on the hardware configuration.
(2) Avoid disordered or irregular data transfer.
[1] The vertex size should be a multiple of 32 bytes. (Vertex compression, with decompression in the vertex shader, can be used to adjust the size.)
[2] Keep vertices in order. (Sort them at the CPU logic layer before transfer; NVTriStrip can generate optimized, efficiently ordered mesh vertex data.)
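The multiple-of-32 rule in (2) is taken here to be about the size of one vertex in bytes, which is the reading under which vertex compression helps; that reading is an assumption of this sketch. The vertex layouts are made-up examples described with Python `struct` format strings, and `padded_stride` is a hypothetical helper.

```python
import struct


def padded_stride(fmt: str, alignment: int = 32) -> int:
    """Size of one vertex rounded up to the next multiple of `alignment`."""
    size = struct.calcsize(fmt)
    return (size + alignment - 1) // alignment * alignment


# "<" means little-endian with no implicit padding between fields.
PLAIN = "<3f3f2f"         # position, normal, uv as 32-bit floats
WITH_COLOR = "<3f3f2f4B"  # the same plus a 4-byte RGBA color
COMPRESSED = "<3f4b2e4B"  # normal packed into 4 signed bytes, uv as half floats

if __name__ == "__main__":
    for fmt in (PLAIN, WITH_COLOR, COMPRESSED):
        print(fmt, struct.calcsize(fmt), padded_stride(fmt))
```

In this example, adding a color pushes the uncompressed vertex past 32 bytes, so padding would double its stride, while the compressed layout carries the same attributes within a single 32-byte stride.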
(3) Geometry transfer at the API layer
[1] For static geometry, create a write-only vertex buffer and write to it only once.
[2] For dynamic geometry, create a dynamic vertex buffer at program start. Lock it with DISCARD at the start of each frame, then lock with NOOVERWRITE rather than DISCARD for the rest of the frame. DISCARD costs far more than NOOVERWRITE.
[3] The basic principle is to create fewer buffers and reuse them, and to reduce the number of locks.
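The lock policy for dynamic geometry can be modeled as simple cursor bookkeeping. The sketch below makes no Direct3D calls: the class `DynamicBuffer` is hypothetical, and the strings "DISCARD" and "NOOVERWRITE" stand in for the D3DLOCK_DISCARD and D3DLOCK_NOOVERWRITE lock flags. It also discards when the buffer runs out of room mid-frame, which is an assumption added here rather than something the article states.

```python
class DynamicBuffer:
    """Tracks the write cursor of one reusable dynamic vertex buffer."""

    def __init__(self, capacity: int):
        self.capacity = capacity      # size in vertices
        self.cursor = 0               # next free vertex slot
        self.discard_next = True      # first lock of a frame discards

    def begin_frame(self):
        """Mark that the next lock should discard the old contents."""
        self.discard_next = True

    def append(self, count: int):
        """Reserve `count` vertices; return (offset, lock flag to use)."""
        if count > self.capacity:
            raise ValueError("request larger than the buffer")
        if self.discard_next or self.cursor + count > self.capacity:
            self.cursor = 0           # start over in a fresh buffer
            flag = "DISCARD"
            self.discard_next = False
        else:
            flag = "NOOVERWRITE"      # append after data still in use
        offset = self.cursor
        self.cursor += count
        return offset, flag


if __name__ == "__main__":
    buf = DynamicBuffer(100)
    buf.begin_frame()
    for n in (40, 40, 40):
        print(buf.append(n))
```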
7: Vertex transform bottleneck (GPUs have powerful vertex processing, so this is rarely a bottleneck, but if it is...)
(1) Too many vertices
[1] Use level-of-detail (LOD) models; 2-3 levels are generally enough.
(2) Vertex processing is too complex
[1] Reduce the number of lights and the lighting complexity (directional lights are cheaper than point lights, which are cheaper than spotlights).
[2] Reduce the number of vertex shader instructions; avoid more than 128 instructions and avoid heavy branching.
[3] Sort vertices logically on the CPU side.
[4] Do on the CPU whatever can be computed on the CPU, and pass the result to the GPU as a constant.
[5] Reduce or avoid mov instructions in Cg/HLSL, and watch them closely where they are unavoidable.
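Item 7 (1) suggests a small number of detail levels per model. A minimal way to pick a level is by camera distance, as sketched below. The switch distances in `LOD_DISTANCES` are made-up values and `select_lod` is a hypothetical helper; real engines often use projected screen size instead of raw distance.

```python
import bisect

# Hypothetical switch distances in world units:
# level 0 (full detail) below 20, level 1 below 60, level 2 beyond that.
LOD_DISTANCES = [20.0, 60.0]


def select_lod(distance: float) -> int:
    """Return the detail level (0 = most detailed) for a camera distance."""
    return bisect.bisect_right(LOD_DISTANCES, distance)


if __name__ == "__main__":
    for d in (5.0, 20.0, 59.9, 60.0, 1000.0):
        print(d, select_lod(d))
```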
8: In most cases the 4.3 triangle setup limit and the 4.4 rasterization limit do not become bottlenecks. They can, however, when there are too many triangles or when the vertex data of each triangle is too complex. In that case, reducing the total number of triangles, and using the vertex shader or Z-cull to reject triangles, is effective.
9: Pixel shader bottleneck (up to DX7 everything was a fixed-function pipeline, and data transfer and shading cost were generally balanced; DX8 introduced the programmable pipeline, after which pixel shader computation grew while the data transfer volume usually stayed relatively small.)
(1) Too many fragments to process
[1] On the CPU side, submit objects front to back (from the screen inward in Z) and render them in that order.
[2] In multi-pass rendering, consider turning off special effects in the first pass and letting that pass only fill the Z-buffer. Later passes then avoid shading fragments that will never be visible.
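The effect of submission order on fragment count can be shown with a toy one-dimensional screen. The sketch assumes the depth test rejects hidden fragments before the pixel shader runs; `shaded_fragments` is a hypothetical function and the scene is made up.

```python
def shaded_fragments(objects, width):
    """Count fragments that pass the depth test, in submission order.

    Each object is (depth, start, end) and covers pixels start..end-1
    of a one-dimensional screen. Smaller depth means closer.
    """
    zbuffer = [float("inf")] * width
    shaded = 0
    for depth, start, end in objects:
        for x in range(start, end):
            if depth < zbuffer[x]:   # depth test: closer fragment wins
                zbuffer[x] = depth
                shaded += 1          # the pixel shader would run here
    return shaded


if __name__ == "__main__":
    layers = [(1.0, 0, 8), (2.0, 0, 8), (3.0, 0, 8)]  # three full-width layers
    print(shaded_fragments(layers, 8))                 # front to back
    print(shaded_fragments(list(reversed(layers)), 8)) # back to front
```

With three overlapping layers, back-to-front submission shades every fragment of every layer, while front-to-back submission shades each pixel once.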
(2) Processing of each fragment is too complex
[1] Long shader programs greatly reduce efficiency; keep shaders as short as possible.
[2] Use vector operations and co-issue to reduce the number of instructions.
[3] Pair simple texture instructions with combiner instructions.
[4] Use the alpha blender to improve performance.
[5] Compute shadows with level of detail (LOD) as well.
[6] In DX10, consider moving the vertex buffer to a pixel buffer.
(3) Additional optimization methods
[1] Use fx12 precision.
[2] Use fp16 instructions.
[3] Enable the ps_2_a profile when using pixel shader 2.0.
[4] Reduce temporary register usage.
[5] Reduce unnecessary precision requirements.
[6] Use earlier shader versions as much as possible (but avoid vs 1.0, which was dropped as of vs 3.0).
10: Bottlenecks caused by texture mapping
(1) Optimization methods.
[1] Avoid trilinear and anisotropic filtering unless they are specifically required; bilinear filtering usually does well enough.
[2] If anisotropic filtering is used, lower the anisotropy ratio, and keep trilinear filtering to a minimum while it is on.
[3] Reduce texture resolution and avoid unnecessarily high-resolution textures.
[4] Reduce the texture color depth, for example for environment and shadow textures. Use 16 bits as much as possible.
[5] Use texture compression. The DXT formats compress textures effectively and are well supported by GPUs.
[6] Avoid non-power-of-two textures.
[7] Do not sharpen a texture with a negative LOD bias, which can cause artifacts in the distance. Sharpen with anisotropic filtering instead.
[8] For dynamic textures, create the texture with D3DUSAGE_DYNAMIC and D3DPOOL_DEFAULT and lock it with D3DLOCK_DISCARD. Do as much as possible per lock rather than locking and unlocking frequently, and never read from such a texture.
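The savings from items [3] to [5] are easy to estimate with arithmetic. The sketch below computes the top-level size of a texture in uncompressed and DXT form, using the standard block sizes (DXT1 stores each 4x4 block in 8 bytes, DXT3 and DXT5 in 16 bytes). The function names are hypothetical and mipmaps are ignored for simplicity.

```python
def uncompressed_bytes(width: int, height: int, bits_per_texel: int) -> int:
    """Size of one uncompressed mip level."""
    return width * height * bits_per_texel // 8


def dxt_bytes(width: int, height: int, block_bytes: int) -> int:
    """Size of one DXT-compressed mip level.

    block_bytes is 8 for DXT1 and 16 for DXT3/DXT5; each block covers
    4x4 texels, and partial blocks are rounded up.
    """
    blocks_w = max(1, (width + 3) // 4)
    blocks_h = max(1, (height + 3) // 4)
    return blocks_w * blocks_h * block_bytes


if __name__ == "__main__":
    w = h = 1024
    print("32-bit      :", uncompressed_bytes(w, h, 32))
    print("16-bit      :", uncompressed_bytes(w, h, 16))
    print("DXT1        :", dxt_bytes(w, h, 8))
    print("DXT5        :", dxt_bytes(w, h, 16))
    print("32-bit, half:", uncompressed_bytes(w // 2, h // 2, 32))
```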
11: Bottlenecks caused by the frame buffer
(1) Optimization methods
[1] Turn off Z-write whenever possible. In general the Z-buffer can be filled completely in one rendering pass, and Z-write should be disabled in the passes that follow. Alpha-blended objects do not need Z-write enabled either.
[2] Enable alpha test whenever possible. It improves efficiency rather than reducing it.
[3] Avoid floating-point frame buffers.
[4] When high depth precision is not required, use a 16-bit Z-buffer.
[5] Avoid render-to-texture, or reduce the render target size.
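A rough estimate of frame buffer write traffic shows why color depth, Z-buffer depth, and render target size matter. The formula below is a simplification made for illustration: it counts one color write and one depth write per drawn fragment and ignores blending reads, depth-test reads, and any hardware compression. `framebuffer_bytes_per_second` and all of the input figures are hypothetical.

```python
def framebuffer_bytes_per_second(width, height, color_bits, depth_bits,
                                 overdraw, fps):
    """Approximate bytes written to the frame buffer per second."""
    bytes_per_pixel = (color_bits + depth_bits) // 8
    return width * height * bytes_per_pixel * overdraw * fps


if __name__ == "__main__":
    full = framebuffer_bytes_per_second(1024, 768, 32, 32, overdraw=3, fps=60)
    half = framebuffer_bytes_per_second(1024, 768, 16, 16, overdraw=3, fps=60)
    print(full, half)
```

Under these assumptions, halving both the color and depth formats halves the write traffic, and the same is true of halving the overdraw.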
The programmable pipeline gives us more freedom to implement special effects, but it also brings more potential bottlenecks and more complexity. When problems appear, we need to identify the bottleneck correctly and think carefully about how to balance the load across the stages, so that no stage is either overloaded or idle.

For more information, see NVIDIA's gpu_programming_guide, which has been translated as "The Essence of GPU Programming".

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
