This article was written by kernelchina, a real expert. It is very good, so I would like to share it here.
Code-level optimization is the most direct and simple kind, but it requires that you be thoroughly familiar with the code and the system. After doing it many times, it comes down to one sentence: there is no secret, only practiced hands ^-^.
Before starting this topic, it is worth briefly introducing the cache-related background. If you are not familiar with this material, catch up on it first: doing performance optimization without understanding the cache is basically a blind man riding a blind horse.
Cache considerations
In general, the following aspects of the cache deserve attention:
1) cache hierarchy
The cache hierarchy. A cache generally has several levels: L1, L2, and L3 (L stands for level). Typically L1 and L2 are integrated into the CPU (on-chip cache), while L3 sits outside the CPU (off-chip cache). This is not absolute; different CPUs do it differently. Registers belong here too: although a register is not a cache, putting data in registers also improves performance.
2) cache size
The capacity of the cache determines how much code and data can fit in it. Only when things fit in the cache is there contention and replacement, and only then is there room for optimization. If a program's hotspot alone fills the entire cache, optimizing from the cache perspective is wasted effort. The goal is to fit the program into the cache as much as possible; in practice it is hard to write a code path large enough to occupy the whole cache, since the logic would have to be extraordinarily complex (basically impossible; at least I have never seen it).
3) cache line size
The CPU loads data from memory one cache line at a time, and writes data back to memory one cache line at a time. Unrelated data that is read and written independently should therefore be kept in separate cache lines; otherwise the accesses interfere with each other.
4) cache associative
Cache associativity. In a fully associative cache, any memory address can map to any cache line. An N-way set-associative cache works like a hash table in which N is the length of each conflict chain: once more than N lines map to the same set, one must be evicted.
5) cache type
Cache types include the I-Cache (instruction cache), the D-Cache (data cache), and the TLB (the MMU's cache). Among L1 and L2, some caches separate instructions from data and some do not.
For more information about cache, refer to the following link:
http://en.wikipedia.org/wiki/CPU_cache
The attached cache.pdf also gives a brief summary.
Code-level optimization
It mainly involves the following two aspects:
1) I-Cache-Related Optimization
For example, streamline the code path, simplify call relationships, reduce redundant code, and minimize unnecessary calls. Whether code is useful or useless depends on the application, so code-level optimization is mostly aimed at a particular application or performance indicator; targeted optimization makes it easier to obtain noticeable results.
2) D-Cache-Related Optimization
Reduce the number of D-Cache misses and increase the proportion of useful data in each access. This is harder than I-Cache optimization.
The following is a list of code optimization tips that need to be constantly supplemented, optimized, and filtered.
1) code adjacency (put the relevant code together), recommendation index: 5 stars
Putting related code together has two meanings: first, related source files should be kept together; second, related functions should be adjacent in the object file, so that when the executable is loaded into memory the functions sit next to each other. Adjacent functions have a lower probability of cache conflicts. Besides, keeping related functions together also satisfies the rules of modular programming: high cohesion, low coupling.
If the functions on a code path can be placed together by the compiler (this requires compiler support for grouping related functions), it will clearly raise the I-Cache hit rate and reduce conflicts. However, a system has many code paths, so they cannot all be arranged this way, and different performance indicators may conflict during optimization; one can only optimize case by case as far as possible.
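As an illustration, here is a minimal C sketch of grouping related functions, assuming GCC/Clang (the attributes exist in both compilers; the function names are my own, not from the original article). The hot attribute places functions in the .text.hot subsection so the linker keeps them adjacent.

#include <stdio.h>

/* Hot-path functions: GCC places them in .text.hot, next to each other. */
__attribute__((hot)) static int parse_packet(int x) { return x * 2; }
__attribute__((hot)) static int route_packet(int x) { return x + 1; }

/* Rarely executed code can be marked cold so it stays out of the hot region. */
__attribute__((cold)) static void dump_state(int v) { fprintf(stderr, "state=%d\n", v); }

int main(void)
{
    int v = route_packet(parse_packet(21));
    if (v < 0)
        dump_state(v);  /* cold path, rarely taken */
    printf("%d\n", v);
    return 0;
}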
2) cache line alignment (Cache alignment), recommendation index: 4 stars
When data spans two cache lines, accessing it takes two loads or two stores. If the data structure is aligned to the cache line, one read or write can potentially be saved. Aligning the start address of a data structure to a cache line boundary may waste memory (especially for contiguously allocated structures such as arrays), so space must be weighed against time.
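A minimal sketch of cache line alignment in C, assuming GCC/Clang attributes and a 64-byte line (the real line size is CPU-specific):

#include <stdio.h>

#define CACHE_LINE_SIZE 64

struct counter {
    unsigned long value;
    /* Pad so each array element occupies exactly one cache line. */
    char pad[CACHE_LINE_SIZE - sizeof(unsigned long)];
} __attribute__((aligned(CACHE_LINE_SIZE)));

int main(void)
{
    struct counter c[4];
    printf("element size: %zu bytes\n", sizeof(c[0])); /* 64 */
    printf("base address: %p\n", (void *)&c[0]);       /* 64-byte aligned */
    return 0;
}

Note the trade-off the text mentions: each counter now costs a full 64 bytes instead of 8.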
3) branch prediction (branch prediction), recommendation index: 3 stars
Code is laid out sequentially in memory. For branching code, if the code placed right after the branch instruction is the code with the higher execution probability, jumps are reduced. CPUs generally prefetch instructions, and this layout raises the prefetch hit rate. Static branch prediction is expressed with macros such as likely and unlikely, which require compiler support. Many modern CPUs additionally cache the outcomes of executed branch instructions (dynamic branch prediction), so static prediction often adds little: if a branch is meaningful at all, both sides of it will be taken sometimes, so under some circumstances the static prediction does not work well. Moreover, likely/unlikely intrudes heavily on the code (hurting readability), so this method is generally not recommended.
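For reference, this is roughly what those macros look like, assuming GCC/Clang's __builtin_expect (process() is a hypothetical example):

#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int process(int fd)
{
    if (unlikely(fd < 0)) {          /* hint: this branch is rarely taken */
        fprintf(stderr, "bad fd\n"); /* compiler may move this out of line */
        return -1;
    }
    return fd + 1;                   /* common path falls straight through */
}

int main(void) { printf("%d\n", process(3)); return 0; }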
4) Data prefetch (Data prefetch), recommendation index: 4 stars
Instruction prefetch is done automatically by the CPU, but data prefetch is a skilled job. Data prefetch rests on the assumption that the prefetched data will be used soon, which should follow spatial locality; whether the data will actually be used depends on context. In general, data prefetch is applied in loops, because loops are the code that best exhibits spatial locality.
However, data prefetch code intrudes on the program itself (hurting its appearance and readability), and the optimization effect is not guaranteed (it depends on the hit probability). When it works, data prefetch keeps the pipeline filled instead of stalling on memory accesses.
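A minimal sketch of data prefetch in a loop, assuming GCC/Clang's __builtin_prefetch; the prefetch distance of 8 elements is a tuning assumption, not a universal constant:

#include <stddef.h>

long sum_array(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Hint the element we will need a few iterations from now. */
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];
    }
    return sum;
}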
5) memory coloring (memory coloring), recommendation index: Not recommended
Memory coloring is a system-level optimization. It is too late to consider memory coloring during the code optimization phase. So this topic can be discussed in system-level optimization.
6) Register parameters (register parameter), recommendation index: 4 stars
Registers are the fastest storage units; failing to make full use of them is a waste. But how? Typically, when a function has fewer than a certain number of parameters, for example three, they are passed in registers (this depends on the ABI conventions), so do not give a function too many parameters. C also has the register keyword, but it usually achieves little (I have not tried measuring it and do not know the effect; you can disassemble and inspect the generated instructions, and it presumably depends on the compiler). Try to read data from registers rather than from memory.
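As a concrete illustration, assuming the System V x86-64 ABI (where the first six integer arguments travel in the registers rdi, rsi, rdx, rcx, r8, r9):

/* All three arguments arrive in registers: no stack traffic for them. */
int add3(int a, int b, int c) { return a + b + c; }

/* With eight parameters, the last two are passed on the stack instead. */
int add8(int a, int b, int c, int d, int e, int f, int g, int h)
{
    return a + b + c + d + e + f + g + h;   /* g and h come from memory */
}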
7) lazy computation (deferred computation), recommendation index: 5 stars
Lazy computation means that a variable that is not going to be used soon should not be initialized up front. A function often initializes a lot of data at its start that is never used on the path actually taken (for example, a branch test fails and the function returns early); all of that work is wasted.
Initializing variables is a good programming habit, but during performance optimization it can be a redundant action; look at the function's branches and decide accordingly.
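A minimal sketch of lazy computation (the function and its arguments are hypothetical): the buffer is set up only on the path that actually needs it.

#include <string.h>

int handle_request(int type, const char *payload)
{
    if (type != 1)
        return -1;                 /* early exit: buffer never needed */

    char buf[1024];
    memset(buf, 0, sizeof(buf));   /* initialized only when we get here */
    strncpy(buf, payload, sizeof(buf) - 1);
    return (int)strlen(buf);
}

Had buf been declared and zeroed at the top of the function, every early exit would pay for 1 KB of initialization for nothing.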
Lazy computation can also be a system-level optimization. For example, COW (copy-on-write): when forking a child process, the parent's pages are not copied up front; parent and child share the pages, and a page is copied only when one of them writes to it. This avoids unnecessary copying and speeds up process creation.
8) Early computation (computing in advance), recommendation index: 5 stars
Some values are computed once and used many times. It is best to compute them in advance, save the result, and reference it later instead of recomputing every time. When there are many functions, it is easy to lose track of what a given function actually does; the original author may not realize the repeated work, but during optimization you have no choice but to understand it. Where constants can be used, use constants: among addition, subtraction, multiplication, and division, there is no instruction that costs no CPU cycles.
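A minimal sketch of early computation (the function is hypothetical): the loop-invariant factor is computed once instead of on every iteration.

#include <math.h>

void scale(double *out, const double *in, int n, double deg)
{
    /* Computing cos(...) inside the loop would repeat the same work n times. */
    const double k = cos(deg * 3.141592653589793 / 180.0);
    for (int i = 0; i < n; i++)
        out[i] = in[i] * k;
}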
9) inline or not inline (inline functions), recommendation index: 5 stars
Inline or not inline: that is the question. Inlining removes function-call overhead (the entry and exit code), but it can also duplicate code in many places and make the binary larger. Inlining is also bad for debugging (the compiled code no longer corresponds line-for-line to the source). So use it carefully: for a small function (under about 10 lines), inlining is worth trying; for a function that is long or called in many places, try not to inline it.
10) Macro or not macro (macro definitions and macro functions), recommendation index: 5 stars
Macro functions share inline's drawbacks. A macro is fine for defining a constant, but writing a function as a macro carries many hidden risks. Macros should be short and simple; if a macro function can be avoided, avoid it.
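The classic pitfall, as a minimal sketch: the macro evaluates its argument twice, while the inline function evaluates it exactly once.

#include <stdio.h>

#define SQUARE_MACRO(x) ((x) * (x))

static inline int square(int x) { return x * x; }

int main(void)
{
    int i = 3, j = 3;
    /* Expands to ((i++) * (i++)): i is modified twice with no sequence
     * point in between, which is undefined behavior. */
    printf("%d\n", SQUARE_MACRO(i++));
    printf("%d\n", square(j++));   /* j is incremented exactly once */
    return 0;
}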
11) allocation on stack (local variable), recommendation index: 5 stars
If a 1 KB variable is allocated on the stack on every call, isn't the cost too high? And if the variable also needs initialization (since its contents are random), isn't that wasteful? Global variables have the advantage of not being repeatedly created and destroyed; local variables carry exactly that disadvantage. So avoid putting large arrays and similar variables on the stack.
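A minimal sketch of the alternative: a large scratch buffer in static storage, allocated once instead of carved out of the stack on every call. Note that the static version is not reentrant; in threaded code it would have to be made per-thread (see the per-CPU item below).

#include <string.h>

char *format_message(const char *src)
{
    static char buf[1024];              /* created once, reused every call */
    strncpy(buf, src, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    return buf;
}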
12) multiple conditions (multiple condition judgment statements), recommendation index: 3 stars
Evaluating multiple conditions is a process of progressively narrowing the scope, and the order of the conditions determines whether the earlier tests are redundant. Ordering conditions according to the code path and the probability of each branch can reduce the cost of the code path to some extent. This work is a little fiddly, though, so it is generally not recommended.
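A minimal sketch (expensive_check() is a hypothetical stand-in for a costly test): the cheap, most selective condition goes first so that short-circuit evaluation skips the expensive one most of the time.

#include <stdbool.h>

static bool expensive_check(int v) { return (v % 7) == 0; } /* stand-in */

bool should_accept(int v)
{
    /* (v > 0) is cheap and filters out most inputs, so test it first. */
    return (v > 0) && expensive_check(v);
}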
13) per-CPU data structure (non-shared data structure), recommendation index: 5 stars
A per-CPU data structure is a common technique in multi-core, multi-CPU, or multi-threaded programming. Its purpose is to avoid locking shared variables: each CPU accesses its own data independently, without regard to the other CPUs. The drawbacks are that it consumes more memory and that not every variable can be made per-CPU. Parallelism is the goal of multi-core programming, and serialization is its greatest harm; the parallel-versus-serial topic will come up again in system-level optimization.
Local variables are inherently thread-local, so in multi-core programming local variables have an extra advantage.
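A minimal sketch of the per-CPU idea using C11 thread-local storage (assuming a toolchain that provides <threads.h>): each thread counts into its own variable, so no lock is needed.

#include <stdio.h>
#include <threads.h>

static _Thread_local unsigned long packets_seen;   /* one copy per thread */

static int worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        packets_seen++;                /* no contention with other threads */
    printf("this thread saw %lu packets\n", packets_seen);
    return 0;
}

int main(void)
{
    thrd_t t1, t2;
    thrd_create(&t1, worker, NULL);
    thrd_create(&t2, worker, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    return 0;
}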
14) 64 bits counter in 32 bits environment (64-bit counter in 32-bit environment), recommendation index: 5 stars
Using a 64-bit counter in a 32-bit environment clearly hurts performance, so unless necessary, it is best not to use one. Counters are necessary, but choose them carefully and avoid counting the same thing repeatedly. Counters on the critical path can be made per-CPU; on non-critical paths (exception paths), a shared counter saves a little memory.
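A minimal sketch of the hazard: on a 32-bit target a 64-bit increment compiles to two instructions (an add plus an add-with-carry), so it is slower, and a concurrent reader can observe a torn, half-updated value.

#include <stdint.h>

static uint64_t big_counter;    /* two machine words on a 32-bit CPU     */
static uint32_t small_counter;  /* one word: updated in one instruction  */

void on_event(void)
{
    big_counter++;    /* multi-instruction, non-atomic on 32-bit targets */
    small_counter++;  /* a single aligned word write */
}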
15) Reduce call path or call trace (reduce the hierarchy of function calls), recommendation index: 4 stars
The more function calls, the more overhead that does no useful work (pushing and popping stack frames, and so on), so reducing the depth of function calls helps. But this should not damage the program's beauty or readability. Personally, I think the first criterion of a good program is appearance and readability; a hard-to-read program spoils the mood. So weigh the trade-off: don't turn every statement into a function, and don't turn the whole program into one function either.
16) Move exception path out (move exception handling into a separate function), recommendation index: 5 stars
When the exception path and the critical path are put together (their code interleaved), the cache behavior of the critical path suffers. The exception path is often long-winded and tends to crowd out the main logic. If the critical path and the exception path can be separated completely, it helps the I-Cache greatly.
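A minimal sketch, assuming GCC/Clang attributes (the functions are hypothetical): the verbose error handling moves into a separate noinline, cold function so the critical path stays compact in the I-Cache.

#include <stdio.h>

__attribute__((cold, noinline))
static int handle_bad_packet(int len)
{
    fprintf(stderr, "bad packet: len=%d\n", len);
    /* ... long, rarely executed logging/cleanup code lives here ... */
    return -1;
}

int process_packet(int len)
{
    if (len <= 0)
        return handle_bad_packet(len);  /* exception path: one out-of-line call */
    return len;                         /* critical path: short and cache-friendly */
}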
17) read, write split (read/write splitting), recommendation index: 5 stars
cache.pdf mentions false sharing: two unrelated variables, one read and one written, sitting in the same cache line, so that each write invalidates the line for the reader (typically in multi-core programming, with the two variables referenced on different cores). Read/write splitting is a difficult technique, especially when the code is complex; it requires continuous debugging and real effort (tools help, for example triggering CPU exception processing on a cache miss).
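A minimal sketch of read/write splitting, again assuming a 64-byte cache line: the read-mostly fields and the hot write field are pushed onto different cache lines, so a writer on one core does not keep invalidating the readers' line.

#define CACHE_LINE_SIZE 64

struct session {
    /* read-mostly, shared across cores */
    int config_flags;
    int route_id;
    char pad[CACHE_LINE_SIZE - 2 * sizeof(int)];  /* separate the lines */

    /* written constantly by one core */
    unsigned long bytes_sent;
};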
18) reduce duplicated code (reduce redundant code), recommendation index: 5 stars
Code bases contain plenty of redundant code and dead code, and reducing it reduces waste. Redundant code is sometimes hard to remove (copy-paste has spread too widely, and the mess is too entrenched to change), but on the critical path it is worth the effort.
Sometimes code-level optimization conflicts with programming rules, for example accessing a member variable directly versus through an interface: the rules require going through the interface, but direct access is more efficient. Code such as assert is similar: the more of it there is, the more performance it costs, yet without it debugging becomes troublesome. So weigh the trade-offs. Code-level optimization is fundamental work, but do not expect it to solve every problem; thinking at the system level and the algorithm level may produce better results.
Code-level optimization needs the cooperation of the right tools; without tools, you get half the result for twice the effort. So prepare your tools before optimizing. I will cover tools in another article.
Think about what else applies in your own case. These optimization techniques are tied to the C language and do not necessarily apply to other languages; every language has its own performance-related coding norms and conventions. The Effective series of books covers many of them; take a look.
The key to code-level optimization is to read the code more and think about it more.
References:
1) http://en.wikipedia.org/wiki/CPU_cache
2) Effective C++: 55 Specific Ways to Improve Your Programs and Designs (3rd Edition)
3) More Effective C++: 35 New Ways to Improve Your Programs and Designs
4) Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library
5) Effective Java (2nd Edition)
6) Effective C# (Covers C# 4.0): 50 Specific Ways to Improve Your C# (2nd Edition) (Effective Software Development Series)