I recently read an Ogre 2.0 presentation that struck a chord. Ogre has long been criticized for its performance, and our own engine has the same problem. Although we often hold up the efficiency of Ogre and Gamebryo as cautionary examples, we have not pushed optimization to the limit ourselves either; compared with GPU optimization, CPU-side performance gains are much harder to come by. It is much like books on game development: many people talk about APIs and rendering, and few talk about architecture and logic. Many people think engine development is just graphics development. That may look true for games in China, but once you actually build an engine, resource management, scene management, animation, physics, AI, UI, audio, scripting, and even skill systems all run very deep. The result is that many in-house engines optimize GPU performance well yet still do not run smoothly, and most of the time the bottleneck is in data processing.

The first thing that overturned my assumptions was a Battlefield 3 presentation: it abandoned the traditional tree/map-based scene management, and for more than 15,000 objects, brute-force culling over a flat linear array, executed in parallel, was three times faster than the tree-based approach, with only one fifth of the code. Why? Because today's hardware architecture makes data access the dominant bottleneck: there is a very fast cache between the CPU and main memory, and data that is already in the cache is fetched far faster than a load from memory. How much faster? Let's just say it is not the same order of magnitude. This is quite different from what "Data Structures" and "Introduction to Algorithms" teach at university; on current hardware, reasoning purely in terms of O(n) complexity can be misleading, because those "optimal" algorithms are often branch-heavy and full of cache misses. In some cases a binary search is slower than a linear traversal. Add the brainwashing of various object-oriented doctrines, and the whole field has taken a wrong turn. Of course, Battlefield 3 also exploited parallelism; it is not only about being cache-friendly.

Having said all this about efficiency, what I really want to highlight is one sentence (from the Ogre 2.0 PPT): SIMD, parallel, cache-friendly algorithms are the industry standard today! When VTune's hot functions show nothing obvious, when GPA's GPU bar chart shows nothing unusual, yet performance is still not good enough, isn't that maddening? Think about that sentence, and you have a direction for optimization.
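To make the flat-array culling idea concrete, here is a minimal sketch (my own illustration, not the Battlefield 3 or Ogre 2.x code; the types and names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

struct Plane  { float nx, ny, nz, d; };     // frustum plane: normal + distance
struct Sphere { float x, y, z, radius; };   // bounding sphere per object

// Brute-force culling over one contiguous array. Every sphere is visited in
// order, so the hardware prefetcher keeps the cache warm, and there is no
// pointer chasing as in a tree/map-based scene graph.
void cullSpheres(const std::vector<Sphere>& spheres,
                 const Plane frustum[6],
                 std::vector<unsigned char>& visible)
{
    visible.resize(spheres.size());
    for (std::size_t i = 0; i < spheres.size(); ++i)
    {
        const Sphere& s = spheres[i];
        unsigned char inside = 1;
        for (int p = 0; p < 6; ++p)
        {
            float dist = frustum[p].nx * s.x + frustum[p].ny * s.y
                       + frustum[p].nz * s.z + frustum[p].d;
            inside &= (dist > -s.radius) ? 1 : 0;   // branch-light test
        }
        visible[i] = inside;
    }
}
```

The loop is also trivially easy to split across threads or rewrite with SIMD, which is exactly why this layout scales better than a tree.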
- SIMD, cache friendly
-
- In fact, when most people think about engine optimization, many will say, "I'm familiar with SSE."
- Then I go and look at the code they wrote, and even the data structures are not memory-aligned, yet they still happily claim "I know SSE well"......
- In addition, store data of the same type in contiguous memory and access it sequentially whenever possible.
- If necessary, you can even use prefetch instructions to pull data into the cache ahead of time.
- A programmer who is fond of if-else is not a good programmer. (A minimal SSE sketch of the points above follows this list.)
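A minimal sketch, assuming 16-byte-aligned data and an element count that is a multiple of 4 (both simplifying assumptions; the function and buffer names are hypothetical):

```cpp
#include <xmmintrin.h>   // SSE intrinsics
#include <cstddef>

// Scale a contiguous float array in place with SSE. The data is walked
// strictly in order, four floats at a time.
void scaleArray(float* data, std::size_t count, float factor)
{
    const __m128 f = _mm_set1_ps(factor);
    for (std::size_t i = 0; i < count; i += 4)
    {
        // Optional: hint a cache line a few iterations ahead. Only worth it
        // when profiling shows the hardware prefetcher is not keeping up.
        _mm_prefetch(reinterpret_cast<const char*>(data + i + 16), _MM_HINT_T0);

        __m128 v = _mm_load_ps(data + i);   // aligned load of 4 floats
        v = _mm_mul_ps(v, f);
        _mm_store_ps(data + i, v);          // aligned store
    }
}

// The allocation side must actually provide that alignment, e.g.:
// float* buf = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
// ... use buf ...
// _mm_free(buf);
```

If the data is not 16-byte aligned, `_mm_load_ps`/`_mm_store_ps` will fault and you have to fall back to the slower unaligned variants (`_mm_loadu_ps`/`_mm_storeu_ps`), which is exactly the "not even aligned" situation mocked above.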
- SOA vs AOS
-
- In many cases SOA (structure of arrays) is faster than AOS (array of structures), because most of the time we traverse an array of structs while only touching one of the fields.
- The SOA/AOS split is also one of the concrete differences between object-oriented and data-oriented programming; a small sketch follows this list.
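A minimal sketch of the difference, using a hypothetical particle system as the example:

```cpp
#include <vector>

// AOS: one struct per particle. A pass that only updates 'life' still drags
// the unused position/velocity fields through the cache with every element.
struct ParticleAoS { float px, py, pz; float vx, vy, vz; float life; };

// SOA: one array per field. The same pass now reads a single dense float
// array, and it also maps directly onto SIMD loads.
struct ParticlesSoA
{
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
    std::vector<float> life;
};

void ageAoS(std::vector<ParticleAoS>& ps, float dt)
{
    for (auto& p : ps) p.life -= dt;   // strided access, 28-byte stride
}

void ageSoA(ParticlesSoA& ps, float dt)
{
    for (float& l : ps.life) l -= dt;  // contiguous access
}
```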
- Class vs struct
-
- This is the object-oriented vs data-oriented difference expressed at the language level.
- We first realized that classes could be a performance problem from the N3 code: floh explains that the engine's platform abstraction layer does not use abstract classes because virtual functions perform poorly on console hardware; in essence it is still the cache-miss problem.
- When I first started programming, I saw someone say they used struct at the beginning, then class, and finally struct again at the end. I think this is also the progression from getting started -> improving the design -> improving performance. A small sketch of the contrast follows this list.
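To illustrate the argument (my own sketch, not code from N3), compare updating objects through a virtual interface with updating plain structs in one array:

```cpp
#include <vector>

// Object-oriented style: each object is updated through a virtual call, which
// costs a vtable lookup per object and usually a cache miss per scattered
// heap allocation.
class Updatable
{
public:
    virtual ~Updatable() = default;
    virtual void update(float dt) = 0;
};

void updateAll(std::vector<Updatable*>& objects, float dt)
{
    for (Updatable* o : objects) o->update(dt);   // pointer chase + indirect call
}

// Data-oriented style: plain structs in one contiguous vector, updated by a
// non-virtual function the compiler can inline and vectorize.
struct Transform { float x, y, z, vx, vy, vz; };

void updateTransforms(std::vector<Transform>& ts, float dt)
{
    for (auto& t : ts)
    {
        t.x += t.vx * dt;
        t.y += t.vy * dt;
        t.z += t.vz * dt;
    }
}
```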
- Parallel
-
- CPUs now have more and more cores, and even phones ship with 4 or 8 of them; our game already defines dual-core as the minimum configuration.
- Open the task manager and look at CPU usage: one core is maxed out while the others sit idle -_-
- So at the engine-architecture level, the first step of performance optimization is to go parallel.
- Generally there are two approaches:
-
- Division by module: one thread for IO, one for rendering, one for physics, one for logic, and so on.
- Division by task: animation evaluation, scene culling, AI pathfinding, particle simulation, and so on can be split into small tasks and thrown into a task system (essentially a thread pool; on PS3 this can be the SPUs) for computation. A minimal sketch follows this list.
- Many stutter problems are caused by API calls that take a long time; these can be moved onto background threads, e.g. disk IO, shader compilation, and DirectX API calls.
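Here is a minimal sketch of the task-division idea, using `std::async` as a stand-in for a real task system / thread pool (the `parallelFor` helper and the chunk size are my own illustration):

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Split one big loop (animation, culling, particles, ...) into chunks and hand
// each chunk to a worker. A production engine would reuse a thread pool or job
// system instead of spawning through std::async.
template <typename Fn>
void parallelFor(std::size_t count, std::size_t chunk, Fn&& fn)
{
    std::vector<std::future<void>> tasks;
    for (std::size_t begin = 0; begin < count; begin += chunk)
    {
        const std::size_t end = (begin + chunk < count) ? begin + chunk : count;
        tasks.push_back(std::async(std::launch::async,
                                   [begin, end, &fn] { fn(begin, end); }));
    }
    for (auto& t : tasks) t.get();   // join before the results are consumed
}

// Usage: evaluate all skeletons in chunks of 64 elements.
// parallelFor(skeletons.size(), 64, [&](std::size_t b, std::size_t e) {
//     for (std::size_t i = b; i < e; ++i) skeletons[i].evaluate(dt);
// });
```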
- Memory, bandwidth
-
- Another optimization direction is to minimize memory usage.
- First, reduce the amount of data that needs to be processed.
- Second, reduce the memory each unit of data occupies, which improves memory-access efficiency (a small sketch follows this list).
- Memory alignment also matters; see the SIMD section above.
- Bandwidth considerations are more on the GPU side: reducing vertex and texture data, framebuffer upsampling, pixel-format compression, UV usage (affects the texture cache), reducing overdraw (fill-rate optimization), reducing shader branches (affects the cache), and so on.
- Moving from forward shading to deferred shading is a matter of algorithmic complexity; moving from deferred shading to deferred lighting is a matter of bandwidth and flexibility; and now there is tile-based rendering as well, because changes in hardware drive changes in algorithms and architecture.
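On the "less memory per unit of data" point, a small sketch (the struct and its fields are hypothetical, and the sizes assume a typical 64-bit compiler; exact padding is implementation-defined):

```cpp
#include <cstdint>

// Field width and field order decide how much memory, and therefore cache
// bandwidth, every object costs.
struct RenderableLoose
{
    bool    visible;      // 1 byte + 7 bytes padding before the double
    double  sortKey;      // does the sort key really need 64-bit precision?
    bool    castsShadow;  // another byte followed by padding
    void*   material;
};  // typically 32 bytes per object

struct RenderablePacked
{
    void*    material;
    float    sortKey;     // a 32-bit key is usually enough for sorting
    uint8_t  visible;
    uint8_t  castsShadow;
};  // typically 16 bytes: half the memory touched per object
```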
In general, hardware keeps being upgraded, and our thinking needs to be updated to keep up with it. Object-oriented technology speeds up development, but it is not machine-friendly, so in performance-critical fields it is not a great fit. In the end, it is a game between people and machines.