I recently came across a very good essay on Weibo about optimizing the performance of C/C++ code. It comes from ray tracing in computer graphics, but its advice applies broadly, so I translated it into Chinese in the hope that it helps you write high-quality code.
1. Keep Amdahl's law in mind: $\text{speedup} = \dfrac{1}{(1 - \text{func\_cost}) + \text{func\_cost} / \text{func\_speedup}}$
- Here func_cost is the fraction of the total run time spent in the function func, and func_speedup is the factor by which that function is sped up.
- For example, if you optimize the function triangleintersect(), which accounts for 40% of the total run time, so that it runs twice as fast, the whole program runs 25% faster (1 / (0.6 + 0.4 / 2) = 1.25).
- This tells us that rarely used code (such as a scene loader) needs little optimization, or none at all.
- In short: "make the common case fast and the rare case correct".
2. Get the code correct first, then optimize!
- This does not mean you should spend 8 weeks writing a complete ray tracer and then another 8 weeks optimizing it!
- Optimize in several steps.
- First write correct code, then look for the functions that are called often and optimize them.
- Finally, look for the bottlenecks of the whole program and remove them by optimizing or changing the algorithm. Improving an algorithm often shifts the bottleneck, sometimes to a function you would not expect. This is why every function you know to be called frequently deserves explicit optimization.
3. The code gurus I know claim to spend twice as long optimizing their code as they spend writing it.
4. Jumps and branches are expensive. Use them as little as possible.
- A function call involves two jumps, in addition to manipulating memory on the stack.
- Prefer iteration over recursion.
- Declare short functions inline to eliminate the function call overhead.
- Move loops inside functions (for example, for (i = 0; i < 100; i++) dosomething(); can become dosomething() { for (i = 0; i < 100; i++) { ... } }, which reduces the number of function calls).
- A long if ... else if ... else if ... else chain requires many jumps to reach the later conditions. If possible, convert it to a switch statement, which the compiler can implement as a single jump through a jump table. If a switch is not possible, put the most likely conditions at the front of the chain.
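The "move the loop inside the function" advice can be sketched as follows. The function names scale_one and scale_all are hypothetical, and a modern compiler may well inline the first version anyway, so treat this as an illustration rather than a guaranteed win.

```cpp
#include <cstddef>

// Per-element helper: a caller looping over an array pays one call per
// element unless the compiler inlines it.
inline float scale_one(float x, float k) {
    return x * k;
}

// Loop moved inside the function: one call processes the whole array.
void scale_all(float* data, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= k;
    }
}
```

A caller replaces for (i = 0; i < n; i++) data[i] = scale_one(data[i], k); with a single scale_all(data, n, k);.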
5. Think carefully about the order of array indices.
- A two-dimensional or higher-dimensional array is stored in memory as a one-dimensional array. This means that (for C/C++ arrays) the elements array[i][j] and array[i][j+1] are adjacent in memory, whereas array[i][j] and array[i+1][j] may be far apart.
- Accessing array data in roughly linear order in memory usually makes the code noticeably faster (sometimes by an order of magnitude or more)!
- When a modern CPU loads data from memory into the cache, it does not load just the one value being accessed; it loads a whole block of memory (a cache line). This means that when array[i][j] is in the cache, array[i][j+1] is very likely there too, whereas array[i+1][j] is probably still in main memory.
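To make the index-order point concrete, here is a small sketch with two functions that compute the same sum over a two-dimensional array. The sizes ROWS and COLS and both function names are assumptions chosen for illustration; the actual speed difference depends on the machine and on the array size.

```cpp
#include <cstddef>

constexpr std::size_t ROWS = 256;  // illustrative sizes
constexpr std::size_t COLS = 256;

// Cache-friendly: the inner loop walks j, so consecutive accesses
// touch adjacent memory locations.
double sum_row_major(const double (&a)[ROWS][COLS]) {
    double sum = 0.0;
    for (std::size_t i = 0; i < ROWS; ++i)
        for (std::size_t j = 0; j < COLS; ++j)
            sum += a[i][j];
    return sum;
}

// Cache-unfriendly: the inner loop walks i, so each access jumps
// COLS elements ahead in memory.
double sum_column_major(const double (&a)[ROWS][COLS]) {
    double sum = 0.0;
    for (std::size_t j = 0; j < COLS; ++j)
        for (std::size_t i = 0; i < ROWS; ++i)
            sum += a[i][j];
    return sum;
}
```

Both functions visit every element exactly once; only the order of the visits, and therefore the cache behavior, differs.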
6. Think about instruction-level parallelism.
- Although many programs are still single-threaded, modern CPUs already execute in parallel within a single core. This means a single core can simultaneously perform 4 floating-point multiplications, wait on 4 memory requests, and process an upcoming branch.
- To take advantage of this, blocks of code (between jumps) need enough independent instructions (instructions with no data, control, or name dependences between them).
- Consider loop unrolling.
- Consider inlining functions.
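One common way to expose independent instructions is to unroll a loop and use several accumulators. The sketch below does this for a dot product; dot_unrolled is a hypothetical name, and the unroll factor of 4 is an assumption rather than a tuned value. Because the additions are regrouped, the floating-point result can differ slightly from a plain sequential loop.

```cpp
#include <cstddef>

// Dot product unrolled by 4 with independent accumulators, so the four
// multiply-adds in each iteration do not depend on one another.
float dot_unrolled(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        sum += a[i] * b[i];
    }
    return sum;
}
```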
7. Avoid local variables, or reduce their number.
- Local variables are normally stored on the stack. However, if there are few enough of them, they can be kept in registers instead. The function then gets much faster access than it would from memory, and it also avoids setting up a stack frame.
- However, do not turn them all into global variables just to reduce the number of locals.
8. Reduce the number of function parameters.
- The reason is the same as for local variables: parameters are also stored on the stack.
9. Pass structures by reference, not by value.
- As far as I know, nothing in ray tracing code needs to be passed by value (not even simple structures such as vectors, points, and colors).
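A minimal sketch of the difference, using a hypothetical Vector3 type. For a struct this small a modern compiler may pass the value in registers, so the gain is not guaranteed; the by-reference form matters more as structures grow.

```cpp
struct Vector3 {
    float x, y, z;
};

// By value: both arguments are copied for the call.
float dot_by_value(Vector3 a, Vector3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// By const reference: no copy is made, and the callee cannot modify
// the caller's objects.
float dot_by_ref(const Vector3& a, const Vector3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```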
10. If a function does not need to return a value, do not define one.
11. Avoid casts as much as possible.
- Integer and floating-point instructions usually operate on different registers, so a cast means the data has to be copied.
- Shorter integer types (char and short) still occupy a full register. They are converted back to the smaller size when stored to memory, so weigh the memory saved against the time these conversions cost.
12. Be careful when declaring C++ object variables.
- Use initialization instead of assignment (Color c(black); is faster than Color c; c = black;).
13. Make default class constructors as lightweight as possible.
- This matters most for very small but frequently used classes (such as Color, Vector, Point, etc.).
- Default constructors are often called in places where you do not expect it.
- Use constructor initializer lists (write Color::Color() : r(0), g(0), b(0) {} rather than Color::Color() { r = g = b = 0; }).
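The cost of "declare, then assign" can be made visible by counting calls. The sketch below is a hypothetical Color class instrumented with counters, which exist only for illustration and would not appear in real code. It also uses initializer lists in every constructor.

```cpp
struct Color {
    // Call counters, for demonstration only.
    static int default_ctor_calls;
    static int copy_ctor_calls;
    static int assign_calls;

    float r, g, b;

    // Lightweight default constructor using an initializer list.
    Color() : r(0.0f), g(0.0f), b(0.0f) { ++default_ctor_calls; }
    Color(float red, float green, float blue) : r(red), g(green), b(blue) {}
    Color(const Color& o) : r(o.r), g(o.g), b(o.b) { ++copy_ctor_calls; }
    Color& operator=(const Color& o) {
        r = o.r; g = o.g; b = o.b;
        ++assign_calls;
        return *this;
    }
};

int Color::default_ctor_calls = 0;
int Color::copy_ctor_calls = 0;
int Color::assign_calls = 0;
```

With this class, Color c(black); makes one copy-constructor call, while Color d; d = black; makes a default-constructor call followed by an assignment call.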
14. Where possible, use the shift operations >> and << instead of integer multiplication and division.
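A minimal sketch for powers of two, using hypothetical helper names. It is restricted to unsigned integers, where a right shift and a division always agree; most optimizing compilers already perform this rewrite on their own when the constant is known.

```cpp
#include <cstdint>

// x * 8 expressed as a left shift by 3 (8 == 2^3).
inline std::uint32_t times_eight(std::uint32_t x) {
    return x << 3;
}

// x / 4 expressed as a right shift by 2 (4 == 2^2); the result rounds
// down, exactly like unsigned integer division.
inline std::uint32_t divide_by_four(std::uint32_t x) {
    return x >> 2;
}
```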
15. Be careful with lookup tables (this applies mainly to ray tracing).
- Many people recommend replacing complex functions (such as trigonometric functions) with tables of precomputed values. In ray tracing, this is usually unnecessary. Memory lookups are expensive, which can cancel out the time saved by not recomputing the trigonometric function (especially since the lookup table pollutes the CPU cache).
- In other settings, lookup tables are often helpful. In GPU programming, a table lookup is commonly preferred over a complex computation.
16. For most classes, use the operators +=, -=, *=, and /= instead of +, -, *, and /.
- The simple operators have to create unnamed, temporary intermediate objects.
- For example, Vector v = Vector(1,0,0) + Vector(0,1,0) + Vector(0,0,1); creates 5 unnamed, temporary Vectors: Vector(1,0,0), Vector(0,1,0), Vector(0,0,1), Vector(1,0,0) + Vector(0,1,0), and Vector(1,0,0) + Vector(0,1,0) + Vector(0,0,1).
- The slightly longer code Vector v(1,0,0); v += Vector(0,1,0); v += Vector(0,0,1); creates only two temporary Vectors: Vector(0,1,0) and Vector(0,0,1). This saves 6 function calls (3 constructors and 3 destructors).
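The temporaries can be counted with a small instrumented class. The Vector below is a hypothetical stand-in with a construction counter added purely for illustration. The exact count for the operator+ version depends on how much copy elision the compiler performs, so the sketch only relies on it being larger than the += version.

```cpp
struct Vector {
    static int constructed;  // counts every Vector that is constructed

    float x, y, z;

    Vector(float x_, float y_, float z_) : x(x_), y(y_), z(z_) { ++constructed; }
    Vector(const Vector& o) : x(o.x), y(o.y), z(o.z) { ++constructed; }

    Vector& operator+=(const Vector& o) {
        x += o.x; y += o.y; z += o.z;
        return *this;  // modifies in place, no new object
    }
};

int Vector::constructed = 0;

// Binary operator+: every use builds a new Vector for the result.
Vector operator+(const Vector& a, const Vector& b) {
    return Vector(a.x + b.x, a.y + b.y, a.z + b.z);
}

// Number of Vectors constructed by the operator+ version.
int count_with_plus() {
    Vector::constructed = 0;
    Vector v = Vector(1, 0, 0) + Vector(0, 1, 0) + Vector(0, 0, 1);
    (void)v;
    return Vector::constructed;
}

// Number of Vectors constructed by the += version: v plus two temporaries.
int count_with_plus_equals() {
    Vector::constructed = 0;
    Vector v(1, 0, 0);
    v += Vector(0, 1, 0);
    v += Vector(0, 0, 1);
    (void)v;
    return Vector::constructed;
}
```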
17. For basic data types, use +, -, *, and / instead of +=, -=, *=, and /=.
18. Delay the declaration of local variables.
- Declaring a class object costs a function call (the call to its constructor).
- If a variable is used only under a certain condition, declare it only inside that condition, where it is actually needed.
19. For objects, use the prefix operator (++obj) instead of the postfix operator (obj++); for built-in types the compiler optimizes the difference away.
- This is rarely a problem in ray tracing code.
- The postfix operator makes a copy of the object (which costs an extra constructor and destructor call), whereas the prefix operator needs no temporary copy.
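The extra copy is easy to see when both operators are written out. Counter is a hypothetical class, and its copies counter exists only for demonstration.

```cpp
struct Counter {
    static int copies;  // number of copy-constructor calls so far

    int value = 0;

    Counter() = default;
    Counter(const Counter& o) : value(o.value) { ++copies; }

    // Prefix: increment in place and return a reference, no copy.
    Counter& operator++() {
        ++value;
        return *this;
    }

    // Postfix: must save the old state, so it copies the object.
    Counter operator++(int) {
        Counter old(*this);
        ++value;
        return old;
    }
};

int Counter::copies = 0;
```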
20. Be careful when using templates.
- The optimizations appropriate for different instantiations may differ!
- The Standard Template Library (STL) is very well optimized, but I still recommend avoiding it.
- Why? Because when you implement an algorithm yourself, you know its details and how best to optimize it.
- More importantly, in my experience code that uses the STL runs very slowly when compiled in debug mode. This is normally not a problem unless you profile a debug build, in which case you will find that STL constructors, iterators, and so on take more than 15% of the run time, which makes the profile output confusing.
21. Avoid dynamic memory allocation during computation.
- Dynamic memory is well suited to storing the scene and other data that does not change during the computation.
- However, on most systems dynamic memory allocation uses locks to control access to the allocator. In a multithreaded application, adding more processors can then actually reduce performance, because threads end up waiting on each other's allocations and deallocations.
- Even in a single-threaded application, allocating memory on the heap is more expensive than allocating it on the stack, because the operating system has to do some work to find a memory block of a suitable size.
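One way to keep allocation out of the hot path is to let the caller own a scratch buffer that is reused across calls. The sketch below uses std::vector only for brevity; the function names are hypothetical, and a hand-written buffer would serve equally well.

```cpp
#include <cstddef>
#include <vector>

// Allocates and frees a scratch buffer on every call.
float sum_squares_allocating(const float* data, std::size_t n) {
    std::vector<float> scratch(n);  // heap allocation each time
    for (std::size_t i = 0; i < n; ++i) scratch[i] = data[i] * data[i];
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += scratch[i];
    return sum;
}

// Reuses a caller-owned buffer: it grows only when it is too small,
// so repeated calls with similar sizes perform no allocation.
float sum_squares_reusing(const float* data, std::size_t n,
                          std::vector<float>& scratch) {
    if (scratch.size() < n) scratch.resize(n);
    for (std::size_t i = 0; i < n; ++i) scratch[i] = data[i] * data[i];
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += scratch[i];
    return sum;
}
```

The caller creates the scratch vector once, before the main loop, and passes the same object to every call.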
22. Find out about your system's cache and make use of that information.
- If a data structure fits in a single cache line, a single fetch from main memory loads the entire object into the cache.
- Make sure your data structures are aligned to cache line boundaries (if both your structure and a cache line are 128 bytes, performance is still poor when 1 byte of the structure lies in one cache line and the other 127 bytes lie in the next).
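In C++11 and later, alignas can request cache line alignment directly. The sketch assumes a 64-byte cache line, which is common but not universal (the example above uses 128 bytes), so check your own hardware. The Triangle layout and the helper starts_on_cache_line are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// 48 bytes of data, padded and aligned so that each Triangle occupies
// exactly one cache line.
struct alignas(kCacheLine) Triangle {
    float v0[3];
    float v1[3];
    float v2[3];
    float normal[3];
};

static_assert(sizeof(Triangle) == kCacheLine, "Triangle should fill one cache line");
static_assert(alignof(Triangle) == kCacheLine, "Triangle should be cache line aligned");

// True if the address is a multiple of the cache line size.
bool starts_on_cache_line(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}
```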
23. Avoid unnecessary data initialization.
- If you must initialize a large chunk of memory, consider using memset().
24. Aim for early loop termination and early function returns.
- Consider intersecting a ray with a triangle. The common case is that the ray misses the triangle, so this is the case to optimize for.
- If you intersect the ray with the triangle's plane first, you can return immediately when the t value of the ray-plane intersection is negative. This lets you skip the barycentric coordinate computation in roughly half of the intersection tests.
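The early return can be sketched with a ray-plane test. This is not a complete ray-triangle routine: the barycentric step is left to the caller, and Vec3, dot, and intersect_plane are hypothetical names. The plane is assumed to be given as dot(n, p) = d.

```cpp
struct Vec3 {
    float x, y, z;
};

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Ray: origin + t * dir. Plane: dot(n, p) = d.
// Returns true and writes t_out only if the plane lies in front of the ray.
bool intersect_plane(const Vec3& origin, const Vec3& dir,
                     const Vec3& n, float d, float& t_out) {
    const float denom = dot(n, dir);
    if (denom == 0.0f) return false;  // ray parallel to the plane: exit early
    const float t = (d - dot(n, origin)) / denom;
    if (t < 0.0f) return false;       // plane behind the ray: exit early
    t_out = t;
    return true;  // only now is the barycentric test worth doing
}
```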
25. Simplify your equations on paper.
- In many equations, terms cancel out, either always or in certain special cases.
- The compiler cannot find these simplifications, but you can. Eliminating even a few expensive operations from an inner loop can speed up the program more than days of work on other parts.
26. The difference between integer math and floating-point math may not be as large as you think.
- On modern CPUs, floating-point operations have essentially the same throughput as integer operations. In compute-intensive applications such as ray tracing, the cost difference between the two is negligible, so there is no need to go out of your way to convert computations to integer operations.
- Double-precision floating-point operations are not necessarily slower than single-precision ones, especially on 64-bit machines, although the reverse can also be true.
27. Consider rephrasing your math to eliminate expensive operations.
- Avoid sqrt() wherever possible, especially in comparisons where comparing the squared values gives the same result.
- If you divide by x many times, consider computing 1/x once and multiplying by the result. This used to be a big win for vector normalization (3 divisions), but lately I have found it to be more of a toss-up. It should still pay off clearly when there are more than 3 divisions.
- If you are running a loop, move computations whose results do not change between iterations out of the loop.
- Consider whether you can compute a value incrementally (instead of recomputing it from scratch in each iteration).
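Two of these rewrites can be sketched in a few lines. Point, within_radius, and divide_all are hypothetical names. The squared comparison assumes a non-negative radius, and multiplying by a reciprocal can differ from true division in the last bit.

```cpp
#include <cstddef>

struct Point {
    float x, y, z;
};

// Compares the squared distance with the squared radius, so no sqrt()
// is needed. Assumes radius >= 0.
bool within_radius(const Point& a, const Point& b, float radius) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz <= radius * radius;
}

// Divides every element by x using one division and n multiplications.
// The reciprocal is loop-invariant, so it is computed once outside the loop.
void divide_all(float* v, std::size_t n, float x) {
    const float inv = 1.0f / x;
    for (std::size_t i = 0; i < n; ++i) {
        v[i] *= inv;
    }
}
```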
"Tips for optimizing C + + Code" translation