Optimizing Unity 5's internal rendering, part 1: Introduction
Translated from Aras's blog; three articles in total describe the process of optimizing Unity 5's own renderer.
There is a lot of real debugging and optimization experience to learn from here.
At work we formed a small "strike team" to optimize the CPU side of Unity's rendering.
A big disclaimer up front: in many cases I will have to say "this code is very bad!". When you are trying to improve code, you first have to find what is bad about it.
In general, that does not mean the codebase is bad, or that good things cannot be made with it. Just this March, Pillars of Eternity, Ori and the Blind Forest and Cities: Skylines were among the top rated PC games, all of them made with Unity. Not bad for an engine supposedly "only suitable for mobile prototyping".
The truth is, any codebase that has been developed for a long time and worked on by more than a few people is "bad" in some sense. There are strange parts of the code that no one remembers how they work, because they were written years ago and the reasons for them were never recorded or revisited. In a large enough codebase, no one can know every detail of how it works, so some decisions end up conflicting with others in subtle ways. To borrow someone's phrase: "there are codebases that suck, and there are codebases that aren't used" :)
It is important to keep working to improve the codebase, and we do; we have made many improvements in all areas. But frankly, the rendering code has seen comparatively little of that in recent years; maintaining and improving it has never been anyone's full-time job. Now it is!
Several times I will point at some code and say "ha ha! that's stupid!", and it will turn out to be code I wrote myself, or code shaped by various factors at the time (lack of time, etc.). Maybe I was stupid back then; maybe in five years I will say the same thing about the code I write today.
The wish list: a high-throughput rendering system without bottlenecks.
(As of Unity 5.0) the current CPU code for rendering, the shader runtime and the graphics APIs is not very efficient. It has a number of issues we want to address as much as possible:
1. Graphics (Gfx) device (our rendering API abstraction)
A. The abstraction is mostly designed around DX9-style (partly DX10) concepts. For example, constant/uniform buffers do not fit it well now.
B. It has grown increasingly messy over the years and needs cleaning up in places.
C. It should allow parallel command buffer creation on modern APIs (such as GLES, DX12, Metal or Vulkan).
2. Rendering loops
A. Lots of inefficiencies and redundant decision-making that should be optimized away.
B. Run the loops in parallel and/or jobify parts of them. Where possible, use native command buffer creation APIs.
C. Make the code simpler and more uniform; factor out common functionality; more testability.
3. Shader/material runtime
A. The data layout in memory is far from ideal.
B. Convoluted code that we want to clean up; more testability.
C. The concept of "fixed function shaders" should not exist at runtime. See [Generating Fixed Function Shaders at Import Time].
D. A text-based shader format is wasteful. See [Binary Shader Serialization].
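To make item 1A concrete, here is a minimal sketch (all names are mine, not Unity's) of why DX9-style "set one uniform at a time" maps poorly onto constant/uniform buffers: on modern APIs you want per-material parameters grouped into one POD struct that mirrors the GPU-side buffer layout, so the whole thing can be uploaded in a single copy instead of many individual set calls.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical sketch: a POD struct mirroring a GPU constant buffer
// (cbuffer/uniform block) layout. All parameters of a material live
// together, respecting 16-byte packing like HLSL does.
struct PerMaterialCB {
    float color[4];      // e.g. a _Color property
    float mainTexST[4];  // e.g. a _MainTex scale/offset
    float cutoff;        // e.g. a _Cutoff property
    float pad[3];        // explicit padding to keep 16-byte alignment
};

// Simulated GPU-visible buffer: the whole struct goes up in one memcpy,
// instead of N separate SetFloat/SetVector-style calls.
struct ConstantBuffer {
    std::vector<unsigned char> storage;
    void Upload(const void* src, size_t size) {
        storage.resize(size);
        std::memcpy(storage.data(), src, size);
    }
};
```

The design point is that the CPU-side data layout matches the GPU-side one exactly, so "apply material parameters" degenerates into one contiguous copy.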
Constraints: whatever we optimize or clean up, we should try to keep existing functionality working. Some rarely used features or special cases might be changed or broken, but only as a last resort.
Another thing to keep in mind: if some code looks complicated, it can be for several reasons. One is "someone wrote overly complicated code" (great! let's simplify it). Another is "the code had a reason to be complicated in the past, but no longer does" (great! let's simplify it).
However, it is also possible that the code does genuinely complicated things, for example handling some tricky cases. It is tempting to "rewrite it from scratch", but in some cases, once you make your new, nice code do everything the old code did, it may end up just as complicated.
The plan: there are two ways to make a piece of CPU code perform better: 1) "just make it faster" and 2) make it more parallel.
I want to focus on the "just make it faster" part first, because along the way I also want to simplify the code and tick off many of the wish-list items: simplify the data, make the data flow clearer, make the code simpler. That also makes step 2 ("make it more parallel") easier.
First I will look at the higher-level rendering logic ("rendering loops") and the material/shader operations. Others on the team will look at simplifying the rendering API abstraction, and will try out the "more parallel" approach.
To test rendering performance, we need some realistic content to test on. I took several existing games and demos and made them CPU-limited (by reducing the GPU load: rendering at lower resolution; reducing polygon counts and shadow map resolution; reducing or removing post-processing; reducing texture resolution). To raise the CPU load further, I duplicated parts of the scenes, so there is more to render than in the originals.
Making a simple rendering test is easy: "hey, I can render 100000 cubes!" But that is not a realistic case. "Lots of objects using the same material" is a very different rendering scenario from thousands of materials with different parameters, hundreds of different textures, dozens of render target changes, shadow map & regular rendering, alpha-blended objects, dynamically generated geometry, and so on.
On the other hand, testing on a "full game" is also cumbersome, especially if it requires interaction to get anywhere, is slow to load all the levels, or is not CPU-limited to begin with.
It helps to test CPU performance on more than one device. I usually work on a Windows PC (Core i7 5820K), a Mac laptop (2013 rMBP) and iOS devices (currently testing on an iPhone 6). Testing on consoles would be great; I keep hearing they have excellent profiling tools and more or less fixed clocks and CPUs, but I do not have any devkits. Maybe that means I should get one.
Next I ran the test projects, looked at the profiling data (from both the built-in Unity profiler and third-party profilers), and read the code to see what it does. At this stage, every time I saw something strange I wrote it down for later investigation:
Notes from this first pass:
SetPassWithShader could avoid PPtr dereferences in some cases. Right now it seems to always do the PPtr dereference and then just call SetPass (another dereference).
Material display lists are constantly being rebuilt when there is more than one per-pixel light; that material data should not need to be stored in the lists at all.
GetTextureDecodeValues is called many times (by whatever sets up per-pixel light cookies) and ends up doing useless linear/gamma conversions.
Material display lists are also constantly rebuilt, in a chain reaction, because of an unassigned global texture property (_Cube with no value assigned). When a property goes missing like this, we should also find out why it failed and log it.
GpuProgramParameters: what does MakeReady do, and why is it a separate step?
Why do the property sheets use STL maps?
1. Use only sane & simple data layouts.
2. Related oddities:
1. Why is TexEnv separate from the property sheet?
2. SetRectTextureID - why?
3. Parts of the texture size / HDR decoding values are computed in the device state.
4. NotifyMipBiasChanged does a lot of complicated things for unclear reasons.
IsPassSuitable is called over and over again in the render loop. Maybe the render loop should build a direct list of pass pointers instead?
Set all textures in one go instead of one SetTexture call at a time.
Lay out the property sheet data in memory to match the actual constant buffer layouts. Several layouts may be needed, per shader keyword set.
TextureID was made long (64-bit on mac/linux) specifically for PS4, in commit 3cbd28d4d6cd.
1. It looks like a PS4-only optimization that stores a pointer directly in the TextureID. If possible, do that everywhere; otherwise make it long (intptr_t would be better) only on PS4.
Why are ChannelAssigns and VertexComponent passed around everywhere? They seem unnecessary.
Sorting in the render loops is very expensive. Could it be done based on hashes instead (allocated per render loop)?
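Several of the notes above (the STL maps question, the "sane & simple data layouts" wish) point in the same direction. As a minimal sketch of that direction (my own names and data structures, not Unity's actual ones): keep material properties in one flat array, sorted by a name hash computed once up front, so lookups become a cache-friendly binary search over contiguous memory instead of pointer-chasing through a string-keyed map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One property entry: a precomputed name hash plus its value.
struct Property {
    uint32_t nameHash;  // hash computed once when the property is created
    float    value;
};

// FNV-1a: a simple, well-known string hash, used here as a stand-in
// for whatever hash the engine actually uses.
inline uint32_t HashName(const char* s) {
    uint32_t h = 2166136261u;
    for (; *s; ++s) { h ^= (uint8_t)*s; h *= 16777619u; }
    return h;
}

// Hypothetical flat property sheet: a single sorted vector instead of
// an STL map, so the whole sheet is one contiguous allocation.
struct PropertySheet {
    std::vector<Property> props;  // kept sorted by nameHash

    void Set(uint32_t hash, float v) {
        auto it = std::lower_bound(props.begin(), props.end(), hash,
            [](const Property& p, uint32_t h) { return p.nameHash < h; });
        if (it != props.end() && it->nameHash == hash) it->value = v;
        else props.insert(it, Property{hash, v});  // keep sorted order
    }

    const float* Find(uint32_t hash) const {
        auto it = std::lower_bound(props.begin(), props.end(), hash,
            [](const Property& p, uint32_t h) { return p.nameHash < h; });
        return (it != props.end() && it->nameHash == hash) ? &it->value : nullptr;
    }
};
```

The same precomputed hashes could also serve as cheap sort keys for the render loop sorting mentioned above; whether that is what the notes intend is my assumption.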
Some of the oddities above may exist for good reasons, in which case I add comments explaining them. Others had reasons once, but no longer do. In both cases the log/annotate ("blame") features of source control help a lot, as does asking the author of the code why it is the way it is. Half of the list above may have been written by me years ago, which means I have to remember those reasons myself, even if they only "seemed like a good idea at the time".
That's it for the introduction. Next time: actually doing something about the list above!
Next article to be translated
------ Translated from wolf96 http://blog.csdn.net/wolf96