Unity 5 internal rendering optimization 2: cleanups



Translated from Aras's blog; there are three articles in total describing how Unity 5 optimized its own renderer.
The goal is to learn from the debugging and optimization experience and understand how the Unity 5 internal renderer was improved.

Previous article: Unity 5 internal rendering optimization 1: Introduction


After the introduction, let's proceed with the actual work.

 
As mentioned in the previous article, I first try to understand the existing code, run some profiling, and write down anything notable.
Profiling several projects mainly reveals two things:
1. Making the rendering code more multithreaded than our current "one main thread plus one render thread" setup would really help. Here is a screenshot of the Unity 5 timeline profiler:
 
In this particular case the CPU bottleneck is the render thread, and most of its time is spent in glDrawElements (this is on a MacBook Pro, running a GPU-simplified scene of the Butterfly Effect demo with about 6000 draw calls). The main thread just ends up waiting for the render thread to catch up. Depending on hardware, platform and graphics API, the bottleneck can be anywhere: for example, the same scene on a faster PC under DX11 spends roughly equal time on the main and render threads.
The culling sliver looks quite good; that is how we eventually want all of our rendering code to look. Zooming into the culling part:
 
2. There is no single "optimize this one function and everything gets twice as fast" spot. It is going to be a long journey of shuffling data around, removing redundant decisions, and shaving small things off here and there until we can reach "twice as fast on each thread". If we can.
The render thread profile is not particularly interesting. Most of the time (everything highlighted below) is inside the OpenGL runtime/driver. I made some notes about silly things we do that make the driver do extra work (there is no good reason to keep switching between different vertex layouts, etc.), but nothing jumped out. Most of the remaining time is spent in dynamic batching.
 
Looking at the most expensive functions on the main thread, we get this:
 
Of course there are problems (why so many hash table lookups? why does sorting take so long? and so on; see the list above), but the key point is that there is no single place where one optimization would bring magical performance benefits and a pony (that is a metaphor).
Observation 1: the material "display lists" are constantly being rebuilt. In our code, a "display list" is a small command buffer that the render thread can pre-record for a material ("set this raster state object, set this shader, set this texture"). The important part is that they store all the parameters (final texture values, shader uniform values, and so on). When a display list is "applied", the render thread just replays it and does not need to look up material property values or anything else.
Everything works fine, except that the recorded display list is invalidated whenever something in the material changes. In Unity, each shader contains many shader variants, and selecting a different variant requires a different display list. If a scene causes the same material to keep alternating between different lists, then there is a problem.
And that is exactly what happened in these benchmark projects. The short story: it is caused by multiple per-pixel lights rendered in forward rendering. It turned out we already had code to handle this case, it just needed to be finished off, so I dug it up and got it working in the current code base. A material can now pre-record more than one display list, and the problem disappears.
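To make the idea concrete, here is a minimal sketch of a "more than one display list per material" cache. It is not Unity's actual code; all names (DisplayList, VariantKey, MaterialDisplayListCache, RecordDisplayList) are hypothetical, and the key is assumed to be some hash of the currently enabled shader keywords.

#include <cstdint>
#include <memory>
#include <unordered_map>

struct DisplayList { /* pre-recorded GPU commands plus all parameter values */ };
using VariantKey = uint64_t;  // e.g. a hash of the enabled shader keywords

class MaterialDisplayListCache {
public:
    // Returns the cached list for this variant, recording one on first request.
    // With one cache entry per variant, alternating variants no longer forces rebuilds.
    DisplayList* Get(VariantKey variant) {
        auto it = m_Lists.find(variant);
        if (it == m_Lists.end())
            it = m_Lists.emplace(variant, RecordDisplayList(variant)).first;
        return it->second.get();
    }

    // Called when material values change: every recorded list is now stale.
    void Invalidate() { m_Lists.clear(); }

private:
    std::unique_ptr<DisplayList> RecordDisplayList(VariantKey /*variant*/) {
        // In the real engine this would record the "set state / shader / texture"
        // commands and the final parameter values for this variant.
        return std::make_unique<DisplayList>();
    }

    std::unordered_map<VariantKey, std::unique_ptr<DisplayList>> m_Lists;
};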
On PC (Core i7 5820K), one scene went from 9.52 ms to 7.25 ms on the main thread. Pretty great.
Spoiler: in the affected scenes, this change brought the biggest benefit of everything I did during these almost two weeks. And I did not even "write" the code; I just resurrected it from a neglected branch. So, yay, a very simple change gave a 30% performance improvement!
Observation 2: there are way too many hash table lookups. From the profile above came the question "why are there so many hash table lookups?"
Many years ago I added something like this to the rendering code:
Material::SetPassWithShader(Shader* shader, ...)
The calling code already knows which shader is going to be set. The material also knows its shader, but it stores it as a PPtr ("persistent pointer"), which is essentially a handle. Passing the pointer directly avoids a handle-to-pointer lookup (currently a hash table lookup; for various complicated reasons it is hard to make it an array-based system).
It turned out that after many changes, Material::SetPassWithShader ended up doing the handle-to-pointer lookup twice, even though it already had the actual pointer as an argument! The fix:
 
Translation of the commit message:
Clean up and optimize material SetPass.
SetPassWithShader was added at some point to avoid a PPtr deref, but it ended up still doing the m_Shader PPtr deref, twice! Not a great optimization.
So that is all cleaned up; there is now a Material::SetShaderPass that directly takes what it needs, so it does not have to chase down into the subshader, and the pass pointers are cached directly alongside the display lists. This also allows removing the special-cased duplicate code for the shadow caster pass.
Viking Village static workbench project, i7 5820K, main thread Camera.Render 7.25 ms -> 6.59 ms. 17 files changed, 135 insertions, 213 deletions.
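A minimal sketch of the problem and the fix, with hypothetical names (PPtr::Deref, SetupShaderState, SetupPassState are illustrative, not Unity's real API): the handle-to-pointer lookup is the expensive hash table lookup, and the fix is simply to trust the pointer the caller already resolved.

struct Shader;

template <typename T>
struct PPtr {
    int instanceID = 0;
    T* Deref() const;  // handle -> pointer: a hash table lookup today
};

class Material {
public:
    // Before: ignored the argument and re-resolved the handle -- twice.
    void SetPassWithShader_Old(Shader* /*shader*/, int pass) {
        SetupShaderState(m_Shader.Deref());      // hash table lookup #1
        SetupPassState(m_Shader.Deref(), pass);  // hash table lookup #2, redundant
    }

    // After: use the pointer the caller already has; zero handle lookups.
    void SetPassWithShader(Shader* shader, int pass) {
        SetupShaderState(shader);
        SetupPassState(shader, pass);
    }

private:
    void SetupShaderState(Shader* shader);
    void SetupPassState(Shader* shader, int pass);
    PPtr<Shader> m_Shader;
};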
OK, so these were good, measurable and very easy performance wins, and the code base got a little smaller too.
Small tweaks. In the Mac render thread profile above, 2.3% of the time was in our own BindDefaultVertexArray, which felt like too much. It turned out it was looping over every possible vertex component type and checking things; I changed it so the loop only covers the vertex components actually used. A little faster.
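A minimal sketch of that kind of tweak, with made-up names (kMaxVertexComponents, SetupVertexComponent) standing in for the real ones: instead of visiting every possible component and asking "is it used?", walk only the set bits of a "components used" mask.

#include <bit>       // std::countr_zero (C++20)
#include <cstdint>

enum { kMaxVertexComponents = 16 };
void SetupVertexComponent(int comp);  // binds one attribute stream (placeholder)

// Before: always kMaxVertexComponents iterations, used or not.
void BindVertexComponents_Old(uint32_t usedMask) {
    for (int comp = 0; comp < kMaxVertexComponents; ++comp)
        if (usedMask & (1u << comp))
            SetupVertexComponent(comp);
}

// After: only as many iterations as there are used components.
void BindVertexComponents(uint32_t usedMask) {
    while (usedMask) {
        int comp = std::countr_zero(usedMask);  // index of lowest set bit
        SetupVertexComponent(comp);
        usedMask &= usedMask - 1;               // clear that bit
    }
}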
One project called GetTextureDecodeValues a lot. It computes the color-space decode constants for HDR and lightmap textures. It has an optional "intensity multiplier" argument that is explicitly set to 1.0 at every call site except one, and it does a series of sRGB math; I removed some of the pow() calls for that common case. Added to the "look later" list: why are we calling this function so often in the first place?
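A minimal sketch of the special-casing idea. The signature and the exact sRGB math here are assumptions (the blog only says the multiplier is 1.0 at almost every call site and that some pow() calls were removed); the point is just to pay for the gamma math only at the one unusual call site.

#include <cmath>

struct Vector4f { float x, y, z, w; };

Vector4f GetTextureDecodeValues(float intensityMultiplier /* = 1.0f almost everywhere */) {
    float scale = 1.0f;
    if (intensityMultiplier != 1.0f) {
        // Hypothetical sRGB-style conversion: only the unusual call site pays for pow().
        scale = std::pow(intensityMultiplier, 2.2f);
    }
    // ...the rest of the decode constants would be derived here...
    return { scale, scale, scale, 1.0f };
}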
Some code in the render loop computes where draw call batches need to break (that is, where to switch to a new shader and so on), and it compared a bunch of state as separate bools. I packed them into a bit field and compared them as a single integer (see the sketch after the note below). No measurable performance difference, but the code got smaller, so it is a win :)
(Translator's note: a bit field stores information in just a few bits, or even a single bit, instead of a whole byte. For example, a boolean switch has only two states, 0 and 1, and needs only one binary digit. A "bit field" divides the bits of a byte or word into several named regions of declared widths, and the program can operate on each region by name. This way several different flags can be packed into a single integer.)
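A minimal sketch of the packing described above. The field names are illustrative, not Unity's actual batch-break state; the gain is comparing one 32-bit value instead of several separate bools.

#include <cstdint>
#include <cstring>

struct BatchBreakState {
    uint32_t shaderID        : 16;  // which shader/pass is active
    uint32_t lightmapIndex   : 8;
    uint32_t castsShadows    : 1;
    uint32_t receivesShadows : 1;
    uint32_t unused          : 6;   // must be zero (value-initialize: BatchBreakState s{};)
};
static_assert(sizeof(BatchBreakState) == sizeof(uint32_t), "fits in one 32-bit word");

inline bool SameBatch(const BatchBreakState& a, const BatchBreakState& b) {
    // One integer comparison instead of comparing each flag on its own.
    uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua == ub;
}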
I also noticed that the data an object queries to figure out which vertex buffer and vertex layout to use was spread very far apart in memory. I sorted the data members by usage type (rendering data, collision data, animation data, etc.).
This also reduced padding holes in the structs. msinilo's excellent CruncherSharp helped here (and I made a few tweaks to it along the way :)). I hear there is a similar tool for Linux (pahole). On Mac there is struct_layout, but it took forever to run on Unity's executable and the Python script kept failing with an overflow exception.
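A small illustration (not Unity's actual mesh layout) of the two ideas above: group members by how they are used, and order them so that padding holes, the kind of thing CruncherSharp or pahole reports, disappear. The sizes in the comments assume a typical 64-bit ABI.

#include <cstdint>

struct BadLayout {            // 1 + 7 pad + 8 + 1 + 3 pad + 4 = 24 bytes on a typical 64-bit ABI
    uint8_t  flags;
    void*    renderBuffer;
    uint8_t  lodCount;
    uint32_t vertexCount;
};

struct GoodLayout {           // 8 + 4 + 1 + 1 + 2 pad = 16 bytes
    void*    renderBuffer;    // hot rendering data grouped together, largest members first...
    uint32_t vertexCount;
    uint8_t  flags;           // ...small bookkeeping fields at the end
    uint8_t  lodCount;
};

static_assert(sizeof(GoodLayout) < sizeof(BadLayout), "reordering removed padding");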
 
While browsing the code, we also found that the way each texture's mipmap bias was tracked was extremely convoluted. A texture tracked all the material property sheets it was used in, and notified them whenever its mip bias changed; the bias was then fetched from the property sheet and applied together with the texture each and every time a texture was set on the graphics device. Ugh. Fixed. This changed the interface of the graphics abstraction API, which meant touching all 11 rendering backends; the changes were trivial, but it still felt scary (I cannot even build half of them locally). No fear: we have a build farm to catch compile errors, and a test suite to catch regressions!
 
Translation of the commit message:
Make texture mip bias sane.
Instead of mip bias being a weird property of the texture stage (or whatever it was), it is now just part of the texture's filter/wrap/aniso settings, and it is applied only when it changes.
Gone:
the bias being applied in each and every GfxDevice::SetTexture call,
Texture::NotifyMipBiasChanged,
TexEnv::TextureMipBiasChanged,
TexEnvData::mipBias.
80 files changed, 228 insertions, 263 deletions.
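A minimal sketch of the "after" shape (hypothetical names; GfxDevice is the only identifier taken from the text): the mip bias lives with the rest of the sampler settings and is sent to the graphics API only when those settings actually change, instead of being looked up and re-applied inside every SetTexture call.

#include <unordered_map>

struct TextureSamplerState {
    int   filterMode = 0;
    int   wrapMode   = 0;
    int   anisoLevel = 1;
    float mipBias    = 0.0f;

    bool operator==(const TextureSamplerState& o) const {
        return filterMode == o.filterMode && wrapMode == o.wrapMode &&
               anisoLevel == o.anisoLevel && mipBias == o.mipBias;
    }
};

class GfxDevice {
public:
    void SetTexture(int unit, int textureID) {
        BindTexture(unit, textureID);       // note: no per-call mip bias work any more
    }
    void SetSamplerState(int textureID, const TextureSamplerState& s) {
        if (s == m_Cached[textureID])
            return;                         // unchanged: skip the driver call entirely
        ApplySamplerState(textureID, s);    // filter + wrap + aniso + mip bias, together
        m_Cached[textureID] = s;
    }
private:
    void BindTexture(int unit, int textureID);
    void ApplySamplerState(int textureID, const TextureSamplerState& s);
    std::unordered_map<int, TextureSamplerState> m_Cached;
};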
No obvious performance difference, but it feels much less convoluted. Added to the "look later" list: we track a lot of per-texture data that is the same for every texture; some of it is UV scaling for non-power-of-two (NPOT) texture restrictions. I suspect there is no reason for it these days; if so, keep digging and remove it.
A few other similar local tweaks were made, each very simple and each making some particular spot nicer, but none of them produced a measurable performance gain. They would probably need to be magnified a hundred times to show a visible effect; more likely, we need to rework some bigger things to get better results.
Material property sheet layout. How material properties are stored has always bugged me. Every time I showed the code base to a new programmer, I would roll my eyes and say, "yeah, we store a material's textures, matrices, colors and so on in separate STL maps. That is annoying."
(Translator's note: std::map is one of the standard associative containers. A map is a sequence of key-value pairs and provides fast key-based lookup; each key is unique within a map. It provides bidirectional iterators (iterator and reverse_iterator). A map requires that operator< be defined on the key type and keeps its elements ordered by key, so iteration over a map is in ascending key order. If ordering is not needed, a hash map can be used instead. See http://www.cnblogs.com/skynet/archive/2010/06/18/1760518.html)
The popular belief is that C++ STL containers have no place in high-performance code and that games should not use them (not true), and that if you do use them you must be stupid and deserve ridicule (I don't know... maybe?). So hey, how about I replace these maps with a better data layout? Surely that will make everything a million times better, right?
In Unity, shader parameters can come from two places: per-material data, or "global" shader parameters. The former are things like "diffuse texture", the latter things like "fog color" or "camera projection". (Translator's note: the diffuse texture is unique to each material, while the fog color in Unity's RenderSettings is shared by every shader.) (Per-instance parameters, such as MaterialPropertyBlock, add some complexity, but let's ignore them for now.)
Our previous data layout is like this:

map<PropertyName, float> m_Floats;
map<PropertyName, Vector4f> m_Vectors;
map<PropertyName, Matrix4x4f> m_Matrices;
map<PropertyName, TextureProperty> m_Textures;
map<PropertyName, ComputeBufferID> m_ComputeBuffers;

set<PropertyName> m_IsGammaSpaceTag; // which properties come as sRGB values


I replaced it with this (simplified to show only the data members; dynamic_array is a lot like std::vector, but more EASTL-style):


struct NameAndType { PropertyName name; PropertyType type; };

// Data layout:
// - Array of name+type information for lookups (m_Names). Do
//   not put anything else in it; only have info needed for lookups!
// - Location of property data in the value buffer (m_Offsets).
//   Uses 4 byte entries for smaller data; don't use size_t!
// - Byte buffer with actual property values (m_ValueBuffer).
// - Additional per-property data in the m_GammaProps and
//   m_TextureAuxProps bit sets.
//
// All the arrays need to be kept in sync (same sizes; all
// indexed by the same property index).
dynamic_array<NameAndType> m_Names;
dynamic_array<int> m_Offsets;
dynamic_array<UInt8> m_ValueBuffer;

// A bit set for each property that should do gamma->linear
// conversion when in linear color space
dynamic_bitset m_GammaProps;
// A bit set for each property that is aux for a texture
// (e.g. *_ST for texture scale/tiling)
dynamic_bitset m_TextureAuxProps;

When a new property is added to a property sheet, it is simply appended to all the arrays. The name/type information and the property's location in the data buffer are kept separate, so a property lookup does not even touch data that is not needed for the lookup.
The biggest external change is that before this, you could find a property value and store a direct pointer to it (this was used when pre-recording material display lists, so that global shader property values could be "patched in" just before replaying them). Now those pointers are invalidated whenever the arrays grow, so all the code that used to store pointers has to store offsets into the property sheet instead. Quite a bit of code had to be changed for that.
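A minimal sketch of the packed property sheet described above, simplified far beyond the real thing: std::string and std::vector stand in for PropertyName and dynamic_array, the bit sets are omitted, and the method names (AddProperty, FindOffset, GetValue) are hypothetical. Callers keep an offset, not a pointer, because the byte buffer may reallocate when properties are added.

#include <cstring>
#include <string>
#include <vector>

struct PropertySheet {
    struct NameAndType { std::string name; int type; };

    std::vector<NameAndType>   m_Names;       // lookup info only
    std::vector<int>           m_Offsets;     // where each value lives in m_ValueBuffer
    std::vector<unsigned char> m_ValueBuffer; // raw property values, back to back

    // Append a value; returns the offset to keep (a pointer would go stale on resize).
    int AddProperty(const std::string& name, int type, const void* data, size_t size) {
        int offset = (int)m_ValueBuffer.size();
        m_Names.push_back({name, type});
        m_Offsets.push_back(offset);
        const unsigned char* bytes = (const unsigned char*)data;
        m_ValueBuffer.insert(m_ValueBuffer.end(), bytes, bytes + size);
        return offset;
    }

    // Linear scan over a contiguous array: fine in practice for the ~5-30
    // properties a typical material has.
    int FindOffset(const std::string& name) const {
        for (size_t i = 0; i < m_Names.size(); ++i)
            if (m_Names[i].name == name)
                return m_Offsets[i];
        return -1;
    }

    const void* GetValue(int offset) const { return &m_ValueBuffer[offset]; }
};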
 
Property lookup has therefore changed from an O(log N) operation (a map lookup) to an O(N) one (a linear scan). If you have studied computer science, you were taught that this is a bad thing. However, looking at real projects, a typical property sheet has 5-30 properties in total (mostly around 10), and a linear scan through data that sits right next to the other data you need in memory is not nearly as bad as chasing STL map nodes scattered arbitrarily around memory (where each node access can be a CPU cache miss). I profiled several projects, and "find property" is at least as fast as before on a PC, a laptop and an iPhone.
But did this change bring a magical performance improvement? No. It slightly improves the average frame time and uses a bit less memory, especially when there are many different materials. But is there a magical win from "replacing STL maps with packed arrays"? No. Well, at least I no longer have to roll my eyes when showing this code to people. That is something.
All this took about two weeks of work (and I would guess only 75% of that went into it; the rest went to unrelated fixes, code reviews, etc.). Everything builds and passes tests on all platforms; the pull request is ready. 40 commits, 135 files, about 2000 lines of code changed.
 
Translation of the pull request description:
Most of the gains come from better caching of GfxDevice display lists, especially in the case where a material keeps alternating between different shader keywords (for example, the Viking Village scene goes from 12 ms to 8.5 ms on PC). Other cases are a bit faster, but not dramatically so (see the linked Google doc for details). Overall the resulting frame rate is more stable, probably because far less memory is allocated while rendering.
Materials can cache more than one display list.
Better data layout for the property sheets (6 std::map -> 3 dynamic_array and 2 bit sets). This means pointers to values can no longer be stored; all code now stores offsets instead. More unit tests and performance tests added!
Made texture mip bias sane; it is now just part of the filter/wrap/aniso state instead of being tracked separately and applied in every SetTexture call.
SetPassWithShader was an optimization to avoid a PPtr deref at some point, but it ended up always doing the deref twice! Cleaned that up.
Removed a redundant world matrix setup for light properties before the forward loop; it seemed useless.
Slightly optimized display list data patching on the worker thread.
OpenGL (and GLES): slightly optimized BindDefaultVertexArray.
Special-cased GetTextureDecodeValues for the common call sites.
According to the platform folks, mesh buffers are never lost on Android/Tizen.
Changed TextureID to 32 bits on all platforms except PS4.
More tightly packed structs/classes in a few places.
Removed unneeded things from ShaderLab (e.g. MatrixVal) and added comments in many places.
Misc: command line support for packaged player builds, also on Mac/Linux.



Performance: one benchmark project improved a lot (it was the one most affected by the "display list rebuilding" problem): total rendering time went from 11.8 ms to 8.5 ms on PC, and from 29.2 ms to 26.9 ms on a laptop. Other projects that did not suffer from that problem changed much less (7.8 ms to 7.3 ms on PC; 15.2 ms to 14.1 ms on iPhone).
Most of the performance gains came from two places (the display list rebuilding, and avoiding useless hash table lookups). I am not sure the other changes paid off in raw performance, but I feel they were the right changes, if only because I now understand the code base much better and have added a large number of comments explaining what things do and why. I also have a long list of "this place is weird and should be improved".
It took me nearly two weeks to get these results. Was it worth it? Hard to say. Some weeks I feel like I got nothing done at all, so it is certainly better than that :)
Overall, I am not even sure whether "optimization" is my strong suit. I think I am really good at only a few things:
1. Debugging hard problems: I can quickly come up with plausible hypotheses and ways to narrow a problem down.
2. Understanding the implications of a change or of a system: which other systems will be affected, and what could or will cause problematic interactions.
3. Having a good grasp of how things in the code base relate to one another: I can often spot that several people are working on overlapping things and tell them "yo, you two should coordinate and merge this".
Are these useful optimization skills? I don't know. I certainly cannot juggle instruction latencies, execution ports and TLB misses in my head. But maybe I would get better with practice? Who knows?

I don't know which path to take next. I can see several possibilities:
1. Keep doing incremental improvements and hope that most of them add up. Some may be disappointing, since the trade-offs are genuinely hard.
2. Start looking at the bigger picture and figure out which large chunks of the work we do could be avoided entirely; that means more serious restructuring.
3. Once some of the cleanup is done, switch to helping others with the multithreading work.
4. Optimization is too hard! Let's play more Rocksmith until the situation improves.
I think I will discuss it with a few people and do a bit of everything. See you next time!



The translator's conclusion: Aras has a very good coding style and writes plenty of comments, which is why his optimization work is so efficient and precise. He also makes very good use of the profiler... In short, I benefited a lot from reading this. I hope Unity keeps getting better and better.


To be translated...

---- Translated from wolf96 http://blog.csdn.net/wolf96
