27 Recommendations for C/C++ Code Optimization

Last Update:2015-02-09 Source: Internet

Author: User


1. Remember Amdahl's law:

    speedup = 1 / ((1 - func_cost) + func_cost / func_speedup)

Here func_cost is the fraction of the program's run time spent in the function func, and func_speedup is the factor by which you speed that function up.
So, if you optimize the function TriangleIntersect(), which accounts for 40% of the run time, so that it runs twice as fast, your program will run 25% faster: 1 / ((1 - 0.4) + 0.4 / 2) = 1.25.
This means that infrequently used code needs little optimization (or none at all).
There is a saying: make the common case fast and the rare case correct.

2. Make sure the code is correct before you optimize it

This does not mean spending 8 weeks writing a full-featured ray tracer and then another 8 weeks optimizing it.
Optimize in multiple passes.
Write correct code first; then, when you realize that a function is likely to be called frequently, apply the obvious optimizations.
Then find the algorithm's bottlenecks and remove them (by optimizing the code or by improving the algorithm). Often an improved algorithm changes the bottleneck dramatically, perhaps by using a method you had not yet thought of. Every frequently called function is worth optimizing.

3. The people I know who write very efficient code say they spend at least twice as long optimizing code as they spend writing it.

4. Jumps and branches are expensive; use as few of them as possible.

A function call requires two jumps, plus stack memory manipulation.
Prefer iteration over recursion.
Use inline functions for short functions to eliminate function call overhead.
Move loops inside function calls (for example, change for (i = 0; i < 100; i++) dosomething(); into dosomething() { for (i = 0; i < 100; i++) { ... } }).
A long if...else if...else if...else if... chain takes many jumps to reach the cases near the end. If possible, convert it to a switch statement, which the compiler will sometimes turn into a table lookup with a single jump. If a switch statement is not possible, put the most common cases at the front of the if chain.
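
The last two points can be sketched as follows. All names here (scale_one, scale_all, Shape, corner_count) are hypothetical, and whether the compiler actually emits a jump table for the switch depends on the compiler and its settings.

```cpp
#include <cstddef>

// Before: the caller loops and pays for one call per element.
void scale_one(float& value, float factor) {
    value *= factor;
}

// After: one call in total; the loop runs inside the function.
void scale_all(float* values, std::size_t count, float factor) {
    for (std::size_t i = 0; i < count; ++i) {
        values[i] *= factor;
    }
}

enum class Shape { Sphere, Triangle, Quad, Box };

// A switch over a dense enum can be compiled to a jump table,
// whereas an if/else-if chain tests the cases one after another.
int corner_count(Shape shape) {
    switch (shape) {
        case Shape::Sphere:   return 0;
        case Shape::Triangle: return 3;
        case Shape::Quad:     return 4;
        case Shape::Box:      return 8;
    }
    return 0;  // not reached for valid enum values
}
```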

5. Think carefully about the order of array indices.

Arrays of two or more dimensions are still stored in one-dimensional memory, which means (for C/C++ arrays) that array[i][j] and array[i][j+1] are adjacent, but array[i][j] and array[i+1][j] may be far apart.
Accessing data in roughly the order in which it is stored in physical memory can speed up your code significantly (sometimes by an order of magnitude or more).
When a modern processor loads data from main memory into its cache, it fetches more than a single value: it fetches a whole block containing the requested data and its neighbors (one cache line). This means that once array[i][j] is in the processor cache, array[i][j+1] is probably there too, while array[i+1][j] is likely still in main memory.
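
A minimal sketch of the two traversal orders. The Grid type and its 64 x 64 size are assumptions chosen for illustration. Both functions return the same sum, but only the first walks memory sequentially.

```cpp
#include <cstddef>

constexpr std::size_t kRows = 64;
constexpr std::size_t kCols = 64;

struct Grid {
    int cell[kRows][kCols];
};

// Cache-friendly: the inner loop walks j, so consecutive accesses
// touch adjacent memory locations.
long sum_row_major(const Grid& g) {
    long total = 0;
    for (std::size_t i = 0; i < kRows; ++i) {
        for (std::size_t j = 0; j < kCols; ++j) {
            total += g.cell[i][j];
        }
    }
    return total;
}

// Cache-unfriendly: the inner loop walks i, so each access jumps
// kCols elements ahead in memory.
long sum_column_major(const Grid& g) {
    long total = 0;
    for (std::size_t j = 0; j < kCols; ++j) {
        for (std::size_t i = 0; i < kRows; ++i) {
            total += g.cell[i][j];
        }
    }
    return total;
}
```

On a grid this small the difference may not be measurable; it tends to show once the array is much larger than the cache.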

6. Use instruction-level parallelism

Although many programs still run on a single thread, modern processors offer a good deal of parallelism within a single core. For example, a single CPU can simultaneously execute 4 floating-point multiplications, wait on 4 memory requests, and evaluate the comparison for an upcoming branch.
To make the most of this parallelism, a block of code (the code between jumps) needs enough independent instructions to keep the processor fully busy.
Consider unrolling loops to achieve this.
This is also a good reason to use inline functions.
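
One way to picture loop unrolling is a sum with several independent accumulators. This is only a sketch: the unroll factor of 4 is an assumption, the name sum_unrolled is illustrative, and because floating-point addition is reordered the result can differ in the last bits from a plain left-to-right sum.

```cpp
#include <cstddef>

// Sums an array using four independent accumulators. The four additions
// in one iteration do not depend on each other, so the processor is free
// to overlap them.
float sum_unrolled(const float* data, std::size_t count) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < count; ++i) {  // leftover elements when count % 4 != 0
        sum += data[i];
    }
    return sum;
}
```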

7. Avoid or reduce the use of local variables.

Local variables are normally stored on the stack. However, if there are few enough of them, they can be kept in CPU registers instead. In that case the function not only gets faster access to data held in registers, but also avoids the overhead of setting up a stack frame.
Do not, however, switch wholesale to global variables.

8. Reduce the number of function parameters.

The reasoning is the same as for reducing local variables: parameters are also stored on the stack.

9. Pass structures by reference, not by value

I cannot find a single case in a ray tracer where a structure needs to be passed by value (even simple structures such as Vector, Point, and Color).
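
A minimal sketch of the difference, using a hypothetical three-component Vector and hypothetical function names:

```cpp
struct Vector {
    double x, y, z;
};

// Pass by value: both arguments are copied for the call
// unless the compiler inlines it.
double dot_by_value(Vector a, Vector b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Pass by const reference: no copies are made, and const
// guarantees the function cannot modify the caller's objects.
double dot(const Vector& a, const Vector& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}
```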

10. If your function does not need a return value, do not define one.

11. Avoid type conversions whenever possible.

Integer and floating-point instructions typically operate on different registers, so a conversion requires a copy.
Shorter integers (char and short) still occupy a full register; they must be padded to 32/64 bits and then narrowed again before being stored back to memory (although this cost must be weighed against the extra memory cost of a larger data type).

12. Take care when declaring C++ objects.

Use initialization instead of assignment (Color c(black); is faster than Color c; c = black;).

13. Make class constructors as lightweight as possible.

This matters most for simple, frequently used types (such as Color, Vector, and Point), which are constructed and copied very often.
Default constructors are often called implicitly, which may not be what you expect.
Use constructor initializer lists (write Color::Color() : r(0), g(0), b(0) {} rather than assigning in the constructor body, as in Color::Color() { r = g = b = 0; }).
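
The two recommendations about initialization can be sketched with a hypothetical Color class. The names kBlack, make_initialized, and make_assigned are illustrative. For plain float members a compiler will often generate identical code either way; the gain is more likely when members or classes have non-trivial constructors.

```cpp
class Color {
public:
    // Member initializer list: r, g and b are initialized directly
    // instead of being assigned in the constructor body.
    Color() : r(0.0f), g(0.0f), b(0.0f) {}
    Color(float red, float green, float blue) : r(red), g(green), b(blue) {}

    float r, g, b;
};

const Color kBlack(0.0f, 0.0f, 0.0f);

// Initialization: a single copy construction.
Color make_initialized() {
    Color c(kBlack);
    return c;
}

// Assignment: a default construction followed by a copy assignment.
Color make_assigned() {
    Color c;
    c = kBlack;
    return c;
}
```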

14. If possible, use the shift operators >> and << instead of integer multiplication and division.

15. Be careful with table lookup functions

Many people recommend replacing complex functions (such as trigonometric functions) with precomputed lookup tables. For ray tracing this is often unnecessary: memory lookups are expensive (and increasingly so), and recomputing a trigonometric function is often as fast as fetching its value from memory (especially once you consider that the table lookup pollutes the CPU cache).
In other cases, lookup tables can be useful. For GPU programming, table lookups are often preferred over complex functions.

16. For most classes, prefer the operators +=, -=, *=, and /= over +, -, *, and /.

The simple operators (+, -, *, /) have to create unnamed temporary objects to hold intermediate results.
For example, Vector v = Vector(1,0,0) + Vector(0,1,0) + Vector(0,0,1); creates five unnamed temporary Vectors: Vector(1,0,0), Vector(0,1,0), Vector(0,0,1), Vector(1,0,0) + Vector(0,1,0), and Vector(1,0,0) + Vector(0,1,0) + Vector(0,0,1).
Rewriting the code slightly as Vector v(1,0,0); v += Vector(0,1,0); v += Vector(0,0,1); creates only two temporary Vectors: Vector(0,1,0) and Vector(0,0,1). This saves 6 function calls (3 constructors and 3 destructors).
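
A minimal Vector with both operator styles, as a sketch. The class layout and the two helper functions are assumptions for illustration, and the exact number of temporaries a given compiler creates depends on copy elision and inlining.

```cpp
struct Vector {
    double x, y, z;

    Vector(double x_, double y_, double z_) : x(x_), y(y_), z(z_) {}

    // Compound assignment modifies this object in place: no temporary.
    Vector& operator+=(const Vector& rhs) {
        x += rhs.x;
        y += rhs.y;
        z += rhs.z;
        return *this;
    }
};

// Binary + has to produce a new object for its result.
Vector operator+(const Vector& a, const Vector& b) {
    return Vector(a.x + b.x, a.y + b.y, a.z + b.z);
}

Vector sum_with_plus() {
    return Vector(1, 0, 0) + Vector(0, 1, 0) + Vector(0, 0, 1);
}

Vector sum_with_compound() {
    Vector v(1, 0, 0);
    v += Vector(0, 1, 0);
    v += Vector(0, 0, 1);
    return v;
}
```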

17. For basic data types, prefer +, -, *, and / over +=, -=, *=, and /=.

18. Postpone declaring local variables

Declaring an object variable always involves a function call (the constructor).
If a variable is needed only in certain situations (for example, inside an if statement), declare it only where it is needed, so that the constructor runs only when the variable is actually used.
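
A sketch of a deferred declaration. SampleBuffer is a hypothetical type whose constructor allocates memory, and its constructed counter exists only to make the effect observable.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type whose constructor does real work.
struct SampleBuffer {
    static int constructed;  // counts constructor calls
    std::vector<float> samples;

    explicit SampleBuffer(std::size_t n) : samples(n, 0.0f) {
        ++constructed;
    }
};
int SampleBuffer::constructed = 0;

float shade(bool needs_sampling) {
    if (!needs_sampling) {
        return 0.0f;  // this path never constructs a SampleBuffer
    }
    SampleBuffer buffer(64);  // constructed only when it is needed
    buffer.samples[0] = 1.0f;
    return buffer.samples[0];
}
```

Had buffer been declared at the top of shade(), every call would pay for the allocation, including the calls that return immediately.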

19. For objects, use the prefix operator (++obj) instead of the postfix operator (obj++)

This will probably not be an issue in your ray tracer.
The postfix operator has to copy the object (which leads to an extra constructor and destructor call), whereas the prefix operator needs no temporary copy.
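
The extra copy is visible in how the two operators are conventionally written. Counter below is a hypothetical example type.

```cpp
struct Counter {
    int value;

    explicit Counter(int v) : value(v) {}

    // Prefix ++: increments in place and returns a reference to itself.
    Counter& operator++() {
        ++value;
        return *this;
    }

    // Postfix ++: must keep a copy of the old state to return it.
    Counter operator++(int) {
        Counter old(*this);
        ++value;
        return old;
    }
};
```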

20. Use templates carefully

Different instantiations may need to be optimized differently.
The Standard Template Library is well optimized, but I recommend avoiding it when implementing an interactive ray tracer.
If you write your own implementation, you know which algorithms it uses, so you know the most efficient way to use it.
Most importantly, in my experience debug builds of the STL are very slow. Usually this is not a problem, unless you use a debug build for profiling. You will then find that STL constructors, iterators, and similar operations take up 15% of your run time, which makes the profiler output harder to interpret.

21. Avoid dynamic memory allocation during computation

Dynamic memory is useful for storing the scene and other data at run time.
However, on many (most) systems, dynamic memory allocation requires acquiring a lock that controls access to the allocator. In a multithreaded application, using dynamic memory can therefore actually reduce performance as processors are added, because threads must wait for the allocator lock in order to allocate and free memory.
Even in a single-threaded application, allocating memory on the heap costs much more than allocating it on the stack. The operating system has to do some work to find a memory block of a suitable size.
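
One common way to keep allocations out of the inner computation is to allocate scratch storage once and reuse it. The sketch below assumes a std::vector as the scratch buffer; the function collect_even is a stand-in for real per-ray work.

```cpp
#include <cstddef>
#include <vector>

// Collects the even values of input into scratch and returns how many
// were found. The caller owns scratch and reuses it across calls, so
// once its capacity is large enough no further heap allocation occurs.
std::size_t collect_even(const std::vector<int>& input,
                         std::vector<int>& scratch) {
    scratch.clear();  // removes the elements but keeps the capacity
    for (int v : input) {
        if (v % 2 == 0) {
            scratch.push_back(v);
        }
    }
    return scratch.size();
}
```

A caller would typically call scratch.reserve() once before the main loop and then pass the same vector to every call.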

22. Find out about your system's memory cache and make use of it

If a data structure fits exactly in one cache line, the whole object can be loaded from memory with a single fetch.
Make sure all data structures are aligned to cache line boundaries. (If both your data structure and a cache line are 128 bytes, performance can still suffer if 1 byte of the structure lies in one cache line and the other 127 bytes in another.)
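
In C++11 and later, alignment can be requested with alignas. The sketch below assumes a 64-byte cache line, which is an assumption rather than a fact about your hardware, and BvhNode is a hypothetical structure.

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines. Check the real value for your CPU.
constexpr std::size_t kCacheLineSize = 64;

// alignas makes every BvhNode start on a cache line boundary and pads
// its size up to a multiple of the alignment.
struct alignas(kCacheLineSize) BvhNode {
    float bounds_min[3];
    float bounds_max[3];
    int   left_child;
    int   right_child;
    int   first_primitive;
    int   primitive_count;
};

static_assert(alignof(BvhNode) == kCacheLineSize,
              "BvhNode must start on a cache line boundary");
static_assert(sizeof(BvhNode) == kCacheLineSize,
              "BvhNode should occupy exactly one cache line");
```

The static_assert lines turn the layout assumption into a compile-time check, so adding a field that pushes the node past one line is caught immediately.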

23. Avoid unnecessary data initialization

If you need to initialize a large block of memory, consider using memset().

24. Exit loops early and return from functions as soon as possible

Consider intersecting a ray with a triangle. The common case is that the ray misses the triangle, so that is the case to optimize for.
Suppose you first intersect the ray with the plane of the triangle. If the t value of the ray-plane intersection is negative, you can return immediately. This lets you skip the barycentric coordinate computation in roughly half of all ray-triangle intersection tests. That is a big saving: as soon as you know that no intersection occurs, the intersection function should return.
Likewise, some loops can be terminated early. For example, when shooting shadow rays, the location of the nearest intersection is not needed; as soon as any occluding intersection is found, the intersection routine can return.
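
Both kinds of early exit can be sketched with spheres standing in for triangles, to keep the example short. All types and function names are hypothetical, the ray direction is assumed to be normalized, and the distance to the light is ignored for simplicity.

```cpp
#include <vector>

struct Vec3 {
    double x, y, z;
};

double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

Vec3 sub(const Vec3& a, const Vec3& b) {
    return Vec3{a.x - b.x, a.y - b.y, a.z - b.z};
}

struct Sphere {
    Vec3 center;
    double radius;
};

// True if the ray origin + t * dir (t >= 0, dir normalized) hits s.
bool hits(const Vec3& origin, const Vec3& dir, const Sphere& s) {
    const Vec3 to_center = sub(s.center, origin);
    const double dist2 = dot(to_center, to_center);
    const double r2 = s.radius * s.radius;
    const double t_closest = dot(to_center, dir);
    // Early return: the sphere lies entirely behind the ray origin.
    if (t_closest < 0.0 && dist2 > r2) {
        return false;
    }
    // Squared distance from the sphere center to the ray's line.
    const double miss2 = dist2 - t_closest * t_closest;
    return miss2 <= r2;
}

// Shadow ray: any occluder is enough, so stop at the first hit
// instead of searching for the nearest one.
bool in_shadow(const Vec3& origin, const Vec3& dir,
               const std::vector<Sphere>& occluders) {
    for (const Sphere& s : occluders) {
        if (hits(origin, dir, s)) {
            return true;
        }
    }
    return false;
}
```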

25. Simplify your equations on paper

In many equations, terms cancel out, either always or under certain conditions.
The compiler cannot find these simplifications, but you can. Eliminating a few expensive operations inside an inner loop can speed up your program more than several days of optimization work elsewhere.

26. The performance differences between integer, fixed-point, 32-bit float, and 64-bit double arithmetic are not as big as you might think.

On modern CPUs, floating-point and integer arithmetic are almost equally efficient. In compute-intensive applications such as ray tracers, this means the cost difference between integer and floating-point calculations is negligible, so you do not need to go out of your way to use integer arithmetic.
A double-precision floating-point operation is not necessarily slower than a single-precision one, especially on 64-bit machines. I have seen ray tracers run faster using all doubles than all floats on the same machine. I have also seen the reverse.

27. Keep rephrasing your math to eliminate expensive operations

sqrt() can often be avoided, especially in comparisons, where comparing the squared values gives the same result.
If you divide by x repeatedly, consider computing 1/x once and multiplying by it. This used to be a big win for vector normalization (3 divisions), but I have recently found the benefit hard to measure. It should still help, however, if you perform more than three divisions.
If you are in a loop, make sure that computations that do not change from one iteration to the next are moved outside the loop.
Consider whether values can be updated incrementally inside the loop (instead of being recomputed from scratch on every iteration).
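
The first two points can be sketched as follows. Vec3 and the function names are hypothetical, and normalize() assumes a vector of non-zero length.

```cpp
#include <cmath>

struct Vec3 {
    double x, y, z;
};

double length_squared(const Vec3& v) {
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

// Comparing squared lengths gives the same answer as comparing
// lengths, because lengths are non-negative. No sqrt() is needed.
bool is_shorter(const Vec3& a, const Vec3& b) {
    return length_squared(a) < length_squared(b);
}

// One division and three multiplications instead of three divisions.
// Assumes v is not the zero vector.
Vec3 normalize(const Vec3& v) {
    const double inv_len = 1.0 / std::sqrt(length_squared(v));
    return Vec3{v.x * inv_len, v.y * inv_len, v.z * inv_len};
}
```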

Original: http://blog.jobbole.com/67880/

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
