Optimizing Unity 5's Internal Renderer, Part 1: Introduction


This is the first of three articles from Aras's blog describing how Unity 5's own renderer was optimized. A good chance to learn from an expert's debugging and optimization experience.

At work we formed a small "strike team" to optimize the CPU side of Unity's rendering.
Backstory / Parental Warning

In many places I'm going to be harsh and say something like "this code sucks!". When trying to improve the code, you naturally look for the bad parts; that's the whole point.
In general this doesn't mean the codebase is bad, or that good things can't be made with it. Just this March we had Pillars of Eternity, Ori and the Blind Forest, and Cities: Skylines among the top-rated PC games, and all of them were made with Unity. Not bad for "that mobile engine that's only good for prototyping".
The fact is, any codebase that has been developed over many years has genuinely bad places in it. Code is a strange thing: no one remembers how some part works, because it was written years ago, for reasons no one recalls, and no one has touched it since. In a large enough codebase, no single person can know every detail of how it works, so decisions end up conflicting with each other in subtle ways. To borrow someone's phrase: "there are codebases that suck, and there are codebases that aren't used" :)
It's important to keep trying to improve the codebase, and we do keep improving it. We've made big improvements in many areas, and frankly the rendering code has gotten better over recent years too, but no one had ever made maintaining and upgrading it their full-time job. Now we have!
Several times I'll point at some code and say "haha! that's stupid!", and it will be code that I wrote. Or code that ended up that way due to various factors (lack of time, etc.). Maybe I was stupid back then; maybe in five years I'll say the same thing about the code I write today.
Wishlist

What we want is a high-throughput rendering system, without stupid bottlenecks in the work.
The current (Unity 5.0) CPU code for rendering, the shader runtime, and the graphics API is not terribly efficient. It has a number of issues that we want to address as far as possible:
1. Graphics API abstraction ("GfxDevice", our rendering API abstraction)
A. The abstraction was designed mostly around DX9/partly DX10 era concepts. For example, constant/uniform buffers do not fit it well.
B. It has grown messy over the years; some places need cleaning up.
C. Allow parallel command buffer creation on modern APIs (consoles, DX12, Metal, Vulkan).
2. Render loops
A. Lots of inefficiencies and redundant decision-making that should be optimized.
B. Run the loops in parallel and/or jobify parts of them. Produce native graphics API command buffers where possible.
C. Make the code simpler and more uniform; factor out common functionality; make it more testable.
3. Shader/material runtime
A. The data layout in memory is not particularly good.
B. Clean up convoluted code; make it more testable.
C. The concept of "fixed function shaders" should not exist at runtime. See "Generating Fixed Function Shaders at Import Time".
D. The text-based shader format is silly. See "Binary Shader Serialization".
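Items 1.C and 2.B above boil down to letting several threads record rendering commands into independent buffers, which are then submitted in a fixed order. The sketch below is a made-up, minimal illustration of that idea (the `Cmd`/`CommandBuffer` types and function names are hypothetical, not Unity's actual API):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical command token; a real buffer would hold encoded GPU commands.
struct Cmd { uint32_t type; uint32_t arg; };

using CommandBuffer = std::vector<Cmd>;

// Each worker records into its own buffer, so no locking is needed...
static void RecordRange(CommandBuffer& out, uint32_t first, uint32_t count) {
    for (uint32_t i = 0; i < count; ++i)
        out.push_back(Cmd{/*type=*/1, /*arg=*/first + i}); // e.g. "draw object i"
}

// ...and the main thread joins the workers and submits the buffers in a
// fixed order, so the result is deterministic regardless of scheduling.
static CommandBuffer RecordInParallel(uint32_t total, uint32_t workers) {
    std::vector<CommandBuffer> buffers(workers);
    std::vector<std::thread> threads;
    uint32_t per = total / workers;
    for (uint32_t w = 0; w < workers; ++w)
        threads.emplace_back(RecordRange, std::ref(buffers[w]), w * per,
                             (w + 1 == workers) ? total - w * per : per);
    for (auto& t : threads) t.join();

    CommandBuffer merged;
    for (auto& b : buffers)
        merged.insert(merged.end(), b.begin(), b.end());
    return merged;
}
```

The key property is that recording is lock-free per thread and ordering is decided only at submission time, which is exactly what the modern APIs named above expose natively.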
Limits

Whatever we optimize or clean up, we should try to keep existing functionality working. Some rarely used features or corner cases might be changed or broken, but only as a last resort.
Another thing to keep in mind: if some code looks complicated, there can be several reasons. One is "someone wrote overly complicated code" (great! I want to simplify it). Another is "the code was complicated for some reason in the past, but that reason is gone now" (great! I want to simplify it).
But it's also possible that the code genuinely does something complicated, for example it has to handle some tricky cases. It's very easy to start rewriting from scratch, only to find that the rewrite becomes just as complicated once your new, nice code has to do everything the old code did.
Plan

Given a piece of CPU code, there are several ways to improve its performance: 1) "just make it faster" and 2) make it more parallel.
I want to focus on the "just make it faster" part first, because I also want to simplify the code and remove a lot of the tricky bits. Simplifying the data, making the data flow clearer, and making the code simpler often makes the second step ("more parallel") easier too.
I'll start with the higher-level rendering logic ("render loops") and the material/shader runtime, while others on the team look at simplifying the rendering API abstraction and at experimenting with "more parallel" approaches.

To test rendering performance we need some actual content. I took several existing games and demos and made them CPU-limited (by reducing the GPU load: rendering at a low resolution, reducing polygon counts, lowering shadow map resolution, reducing or eliminating post-processing, and reducing texture resolution). To raise the CPU load further, I duplicated parts of the scenes, so more is rendered than in the originals.


A very simple rendering test ("hey, I have 100,000 cubes!") is not a realistic case. "Lots of objects all using the same material" is a very different rendering situation from thousands of materials with different parameters, hundreds of different shaders, dozens of render target changes, shadow map and regular rendering passes, alpha-blended objects, dynamically generated geometry, and so on.

On the other hand, testing on a "full game" is also cumbersome, especially if it requires interaction to get anywhere, is slow to load levels, or is not CPU-limited.


When testing CPU performance, it helps to test on more than one device. I usually test on a Windows PC (Core i7 5820K), a Mac laptop (Retina MacBook Pro), and whatever iOS device I have at hand (currently an iPhone 6). Testing on consoles would be great; I keep hearing they have excellent profiling tools, more or less fixed clocks, and weak CPUs, but I don't have devkits. Maybe that means I should get one.
Notes

Next, I ran the test projects, looked at the profiling data (from the Unity profiler and third-party profilers such as Sleepy and Instruments), and read the code to see what it does. Whenever I saw something strange, I wrote it down for later investigation:

SetPassWithShader had an optimization at some point to avoid a PPtr dereference. Now it seems to always do the PPtr deref, and then just calls SetPass (which does the deref again).
Material display lists are constantly rebuilt when there is more than one per-pixel light. Is there any need to store more than one display list per material?
GetTextureDecodeValues is called many times (by whatever creates per-pixel light cookies) and ends up doing a useless linear-to-gamma conversion.
Material display lists are continually rebuilt due to a chain reaction from an unassigned global texture property (_Cube, never assigned); we should find out why the lookup fails and log it when a property is missing.
What does GpuProgramParameters::MakeReady do? Why is it a separate step?
Why do the property sheets use STL maps?
1. Use only sane & simple data layouts.
2. Related strange things:
1. Why is TexEnv data separate from the property sheet?
2. SetRectTextureID: why, and what is it?
3. Texture texel size / HDR decode values are partially handled in device state.
4. NotifyMipBiasChanged does a lot of complicated things, for reasons unknown.
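The "why STL maps?" note above is usually answered by replacing the map with a flat, sorted array: one contiguous allocation, binary search lookups, and memory that can later be copied straight into a constant buffer. A minimal sketch of the idea (the `PropertySheet`/`FloatProperty` names are made up for illustration, not Unity's actual types):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One property: a name hash (e.g. hash of "_Shininess") plus its value.
struct FloatProperty {
    uint32_t nameHash;
    float    value;
};

// A property sheet stored as a flat vector kept sorted by nameHash,
// instead of an std::map with one heap node per entry.
class PropertySheet {
public:
    // Set or overwrite a property, keeping the vector sorted.
    void SetFloat(uint32_t nameHash, float value) {
        auto it = std::lower_bound(
            m_Floats.begin(), m_Floats.end(), nameHash,
            [](const FloatProperty& p, uint32_t h) { return p.nameHash < h; });
        if (it != m_Floats.end() && it->nameHash == nameHash)
            it->value = value;
        else
            m_Floats.insert(it, FloatProperty{nameHash, value});
    }

    // Binary search over contiguous memory: cache-friendly lookups.
    const float* FindFloat(uint32_t nameHash) const {
        auto it = std::lower_bound(
            m_Floats.begin(), m_Floats.end(), nameHash,
            [](const FloatProperty& p, uint32_t h) { return p.nameHash < h; });
        return (it != m_Floats.end() && it->nameHash == nameHash)
                   ? &it->value : nullptr;
    }

private:
    std::vector<FloatProperty> m_Floats; // one flat allocation, sorted by hash
};
```

Inserts shift elements, but material properties are set far more rarely than they are read during rendering, which is the trade-off that favors the flat layout.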
IsPassSuitable is called over and over in the render loop. Maybe build a direct table of pass pointers for the render loop instead?
Set all textures at once, instead of one SetTexture call at a time.
Arrange the property sheet data in memory to match the actual constant buffer layouts. Several different layouts may be needed for different keyword sets.
TextureID was changed to a long (64-bit on Mac/Linux) in a PS4-specific commit 3CBD28D4D6CD.
1. Looks like this is only an optimization on PS4, which stores a pointer directly in the TextureID. Either do that everywhere if it works, or make it a long only on PS4 (intptr_t would be better).
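The intptr_t suggestion in the note above can be illustrated: a pointer-sized integer handle lets one platform stash a raw pointer in the field while others keep a small integer ID, without forcing 64-bit storage on 32-bit targets. This is a hypothetical sketch, not Unity's actual TextureID:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pointer-sized texture handle. intptr_t matches the platform
// pointer width: a pointer fits on platforms that want one (e.g. the PS4
// optimization mentioned above), and a small index fits everywhere else.
struct TextureID {
    intptr_t handle = 0;

    static TextureID FromPointer(void* p) {
        return TextureID{reinterpret_cast<intptr_t>(p)};
    }
    static TextureID FromIndex(int idx) {
        return TextureID{static_cast<intptr_t>(idx)};
    }
    void* AsPointer() const { return reinterpret_cast<void*>(handle); }
    int   AsIndex()   const { return static_cast<int>(handle); }
};
```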
Why are ChannelAssigns and VertexComponent passed around everywhere? They don't seem to be useful.
Sorting in the render loop is very expensive. Profile it, and consider switching to hash-based sort keys assigned per render loop.
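A common way to make render loop sorting cheap, as the sorting note above suggests, is to pack all the sort criteria (layer, shader, material, depth, etc.) into a single integer key, so ordering draw calls becomes one integer comparison instead of a chain of branches. A minimal sketch, with a made-up bit layout:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 64-bit sort key; higher bits sort with higher priority.
// Bit layout (made up for illustration): [layer:8][shader:16][material:16][depth:24]
static uint64_t MakeSortKey(uint32_t layer, uint32_t shaderID,
                            uint32_t materialID, uint32_t depthBits) {
    return (static_cast<uint64_t>(layer      & 0xFFu)       << 56) |
           (static_cast<uint64_t>(shaderID   & 0xFFFFu)     << 40) |
           (static_cast<uint64_t>(materialID & 0xFFFFu)     << 24) |
           (static_cast<uint64_t>(depthBits  & 0xFFFFFFu));
}

struct DrawCall {
    uint64_t sortKey;
    int      objectIndex; // payload: whatever the renderer needs per call
};

// One integer compare per pair; keys could also be radix-sorted.
static void SortDrawCalls(std::vector<DrawCall>& calls) {
    std::sort(calls.begin(), calls.end(),
              [](const DrawCall& a, const DrawCall& b) {
                  return a.sortKey < b.sortKey;
              });
}
```

Because the key groups draws by layer, then shader, then material, sorting also minimizes state changes as a side effect.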

Some of these oddities may exist for good reasons, in which case I'll add comments explaining them. For others there may once have been a reason that is gone now. In both cases, the source control log/annotate feature is very helpful, as is asking whoever wrote the code why it is that way. Half of the items above may well be code I wrote myself over the years, which means I get to remember the reasons, even if they only "seemed like a good idea at the time".
That's it for the introduction. Next time: actually doing something about the list above!


------translated from wolf96 http://blog.csdn.net/wolf96

