Program Performance Optimization

Source: Internet
Author: User
Tags: mathematical functions, valgrind

In the era of expensive hardware, programmers paid great attention to program performance and tried to get as much as possible out of the available hardware. As technology developed, Moore's law made hardware cheaper and faster, and it seemed that performance was no longer a concern for programmers. However, in an era of fierce competition, software functions become more and more complex and the user experience becomes more and more important. Whoever delivers the better experience and handles the more complex tasks will win. Therefore, performance optimization will always be one of the focuses of the software field.
Although performance optimization runs through the whole process of design and coding, this article analyzes it at two levels: design and coding. It also describes how to analyze performance problems in four aspects: CPU, memory, disk, and network.

2. Performance by design
1) System architecture
Control flow and data flow: remove unnecessary modules
2) Program structure
Multi-threaded programs
Lock granularity; performance comparison of the various locks and semaphores
Shared-memory communication
Trade flexibility for higher performance
Reduce unnecessary duplicate checks (shttp/HTTP)
3) Interface design
Good interfaces give users full flexibility
4) Data structures and algorithms
Linux memory management; when to use linked lists

3. The art of coding
1) Memory access and files
Reduce new/delete or malloc/free operations to reduce page faults
Reduce file open and close operations
Reduce the number of file reads and writes (fewer system calls)
2) Reduce unnecessary operations
Eliminate repeated operations
Move computation out of loops
Put the busiest loop innermost
3) Use language and library function features
if and case statements
Structure and Structure
Macros and inline functions
Lazy computation
Reduce temporary variables
Cache string lengths
Avoid unnecessary memset
4) Use hardware features
Byte alignment
Shifts instead of multiplication and division by 2
Implement performance hotspots in assembly
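As one illustration of the coding items listed above (repeated operations inside a loop, caching a string length), the following is a minimal sketch. The function names are hypothetical and do not come from the original article. Both functions count the spaces in a string; the first recomputes the length in the loop condition, the second computes it once.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical example: strlen() is re-evaluated on every iteration
   unless the compiler can prove the string is not modified. */
size_t count_spaces_slow(const char *s)
{
    size_t n = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ')
            n++;
    return n;
}

/* Same result, but the length is computed once and cached. */
size_t count_spaces_fast(const char *s)
{
    size_t n = 0;
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ')
            n++;
    return n;
}
```

An optimizing compiler may hoist the strlen call by itself in simple cases like this one, but caching the length explicitly does not rely on that.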
4. Performance analysis tool: callgrind
Valgrind is one of the most commonly used tools on Linux, in part because it is free of charge. Callgrind is one of the Valgrind tools. It records the call graph and the cost of each function call, and it can simulate the CPU cache to compute hit rates and miss counts for multiple cache levels.
Callgrind's implementation mechanism (the program runs under Valgrind's instrumentation rather than directly on the CPU) gives it several shortcomings: the program may be slowed down severely, highly optimized programs are poorly supported, and the timing statistics may have a large error.
Other tools include OProfile, gprof, tprof, Rational Quantify, and Intel VTune.
5. Compiler parameter optimization
Remember that the compiler is much more powerful than you think. Most people who write compilers are scientists with ten or more years of coding experience. Any simple optimization you can think of, they have already thought of. A common compiler supports most known optimization strategies and the multimedia instruction sets. Asked which compiler is best, most people answer Intel's: Intel is regarded as the best CPU vendor, and its compiler takes many CPU features into account and produces faster code. However, the Intel compiler currently has an artificial limitation: it only recognizes Intel's own CPUs. Any other CPU is treated as the lowest-grade i386-i686 machine, so SSE is not used on AMD and other platforms. We write code on Linux and generally prefer popular compilers such as GCC. The advantages of GCC are that it is updated quickly, is open source, and has its bugs fixed quickly. Because it is updated quickly, it can support some C03 specifications.
5.1 Optimization techniques supported by GCC
1) Function inlining
A function call involves pushing the parameters onto the stack, saving the caller's context, executing the function code, and restoring the context. When a function is called very frequently, the call overhead becomes very large. Function inlining removes all of this overhead by placing the function's code directly at the call site. Inlined functions are harder to debug because the function no longer exists as a separate entity.
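A minimal sketch of a candidate for inlining, assuming C99 and hypothetical function names that are not from the original article. Whether GCC actually inlines the helper depends on the optimization level and its inlining heuristics.

```c
/* Hypothetical helper: small enough that GCC will usually inline it
   when optimization is enabled. */
static inline int square(int x)
{
    return x * x;
}

/* After inlining, the loop body is effectively "total += i * i",
   with no call, no argument passing, and no return. */
int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
        total += square(i);
    return total;
}
```

For example, sum_of_squares(3) adds 1 + 4 + 9.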
2) Constant folding
A = B + 1000 * 8
For this code, the compiler calculates 1000 * 8 in advance, and the code becomes:
A = B + 8000
3) Common subexpression elimination
A = (B + 1) * (B + 1)
Here, B + 1 would be calculated twice, but it only needs to be calculated once:
TMP = B + 1
A = TMP * TMP
4) Lifetime analysis
This is a relatively advanced technique. Suppose there is code:
A = B + 1
C = A + 1
Because the second statement depends on the first, the two statements must execute one after the other.
But the compiler knows that C is equal to B + 2, so the code becomes:
A = B + 1
C = B + 2
Now the two statements are independent, and the CPU can execute them in parallel.
5) Jump elimination
See the following code:
int func()
{
    int ret = 0;
    if (XXX)
        ret = 5;
    else if (YYY)
        ret = 6;
    return ret;
}
When condition XXX is met, the program assigns ret and then jumps down to the return statement, but that jump is not necessary. The compiler changes the code to:
int func()
{
    if (XXX)
        return 5;
    else if (YYY)
        return 6;
    return 0;
}
6) Loop unrolling
A loop consists of several parts: counter initialization, counter comparison, and a jump. The last two steps are executed on every iteration. Copying the loop body several times greatly reduces the number of iterations and saves the time spent on those two steps. For example:
for (int counter = 0; counter < 4; counter++)
    xxx;
can be changed to:
xxx;
xxx;
xxx;
xxx;
The compiler can unroll not only ordinary loops but also recursive functions. The principle is the same: recursion is essentially a loop of indeterminate length that uses the stack.
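When the trip count is not a small constant, unrolling is usually partial. The following is a minimal hand-written sketch of unrolling by four, with hypothetical function names; GCC can perform a similar transformation itself (for example with -funroll-loops), so this only shows what the result looks like.

```c
#include <stddef.h>

/* Straightforward loop: one compare and one jump per element. */
int sum_simple(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4: one compare and one jump per four elements,
   plus a short cleanup loop for the remaining 0-3 elements. */
int sum_unrolled(const int *a, size_t n)
{
    int sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++)
        sum += a[i];
    return sum;
}
```

Both functions return the same value for any input; only the number of loop-control steps differs.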
7) Move invariants out of the loop
for (int idx = 0; idx < 100; idx++)
    A[idx] = A[idx] * B * B;
Because the value of B * B is fixed (invariant) inside the loop body, the code can be changed to:
TMP = B * B;
for (int idx = 0; idx < 100; idx++)
    A[idx] = A[idx] * TMP;
8) Parallel computing
As we all know, modern CPUs are deeply pipelined and can execute multiple instructions at the same time. The restriction is that instructions executed simultaneously must not depend on each other. The compiler automatically rearranges single-threaded code so that more of it can be computed in parallel. For example:
D = A + B;
E = A + D + F;
can be changed to:
TMP = A + F;
D = A + B;
E = D + TMP;
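The same idea, breaking one dependency chain into independent ones, can be sketched by hand on a summation loop. This is an illustrative example with hypothetical function names, not a transformation the original article attributes to GCC. Integer arithmetic is used so that reordering the additions cannot change the result.

```c
#include <stddef.h>

/* One accumulator: every addition depends on the previous one. */
long sum_one_chain(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Two accumulators: the two chains do not depend on each other,
   so the CPU can overlap their additions. */
long sum_two_chains(const int *a, size_t n)
{
    long even = 0, odd = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        even += a[i];
        odd  += a[i + 1];
    }
    if (i < n)          /* odd element count: one element left over */
        even += a[i];
    return even + odd;
}
```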
9) Expression simplification
When I was studying discrete mathematics and digital circuits, I was always confused about how to simplify Boolean expressions. GCC finally gave me relief. For example:
!A && !B
takes three operations, but it can be rewritten as:
!(A || B)
which requires only two.
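The rewrite is an application of De Morgan's law. The following minimal sketch, with hypothetical function names, checks that the two forms agree for every combination of inputs.

```c
#include <stdbool.h>

/* Original form: two negations and one AND. */
bool neither_v1(bool a, bool b) { return !a && !b; }

/* Simplified form (De Morgan's law): one OR and one negation. */
bool neither_v2(bool a, bool b) { return !(a || b); }

/* Returns true if both forms agree on all four input combinations. */
bool forms_agree(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            if (neither_v1(a, b) != neither_v2(a, b))
                return false;
    return true;
}
```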
5.2 Important GCC optimization options
1) Inlining
-finline-small-functions
Inline smaller functions. Enabled by -O2.
-findirect-inlining
Indirect inlining: multiple levels of function calls can be inlined. Enabled by -O2.
-finline-functions
Inline all functions that can be inlined. Enabled by -O3.
-finline-limit=n
The maximum size of a function that can be inlined. Note: the size is counted in pseudo-instructions, not in lines of source code. Pseudo-instructions are the code as processed internally by the compiler. By default, functions marked inline can be inlined up to 300 pseudo-instructions, and unmarked functions up to 50. The related parameters are max-inline-insns-single and max-inline-insns-auto.
max-inline-insns-recursive-auto
The maximum size of an inlined recursive function.
large-function-insns, large-function-growth, large-unit-insns, etc.
A side effect of function inlining is more code and a larger program. These parameters limit the total code size so that the compiled program does not become huge, which would hurt performance and waste resources.
2) -fomit-frame-pointer
EBP is not used as the frame pointer, which frees a register and yields shorter code. However, it is said that this can cause problems when debugging on some machines. In actual tests with GCC 4.2.4, neither -O2 nor -O3 enabled this option.
3) -fwhole-program
Compile the code as a complete final program, that is, state explicitly that the code is not being compiled as a library. The compiler can then treat more functions and variables as static, which speeds up the program.
4) MMX/SSEx/AVX
Multimedia instructions mainly support vector computing. Generally, -march=i686, -mmmx, -msse, and -msse2 are supported by current machines.
In addition to basic multimedia support, GCC also supports -ftree-vectorize. This option tells the compiler to vectorize automatically, and it is enabled by -O3.
A few more words: in everyday code, multimedia instructions are not very common (except in games). If you have several bitsets that require a variety of bit operations, multimedia instructions are still effective.
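A minimal sketch of a loop that is a typical candidate for automatic vectorization, assuming C99 and a hypothetical function name. Whether GCC actually emits vector instructions depends on the target flags and optimization level (for example -O3, or -O2 with -ftree-vectorize).

```c
#include <stddef.h>

/* Element-wise addition. "restrict" promises the compiler that the
   arrays do not overlap, which makes auto-vectorization easier. */
void add_arrays(float *restrict dst,
                const float *restrict a,
                const float *restrict b,
                size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The loop has no dependencies between iterations, so several elements can be processed by one vector instruction.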
5.3 GCC's killer feature: profile-driven optimization
This is a technique that appeared relatively late. The basic principle is to shorten the hot path according to how the program actually runs. The compiler monitors the running program by inserting various counters, and then uses the counts to identify the hot path and shorten it. According to GCC developers, this technique can improve running efficiency by 20-30%.
The usage is as follows:
Compile the code with the -fprofile-generate option.
Run the program in the production environment for a while.
When the program exits, a profile data file is generated.
Recompile the program with -fprofile-use, using this profile data file.
For example:
A = B * 5;
If profiling shows that B is usually equal to 10, the code can be changed to:
A = 50;
if (B != 10)
    A = B * 5;
In most cases, the multiplication is avoided.
5.4 GCC-supported optimization attributes (__attribute__)
aligned
Sets the alignment, for example to 64 bytes to match the CPU cache line.
fastcall
If the first two parameters of a function are of integer type, this attribute passes them in registers rather than on the stack as usual.
pure
Declares that a function is pure: at any time, the same input produces the same output. This makes it easy for the compiler to optimize away repeated calls.
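A minimal sketch of how two of these attributes are written, assuming GCC and hypothetical names. The 64-byte cache-line size is an assumption about the target CPU. fastcall is left out because it applies only to 32-bit x86 targets.

```c
#include <stddef.h>
#include <stdint.h>

/* aligned: place the buffer on a 64-byte boundary
   (assumed cache-line size). */
static unsigned char buffer[256] __attribute__((aligned(64)));

/* pure: the result depends only on the arguments and on memory that
   is read but not written, so GCC may merge repeated calls. */
__attribute__((pure))
size_t count_byte(const unsigned char *p, size_t n, unsigned char value)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (p[i] == value)
            hits++;
    return hits;
}

/* Hypothetical helper: returns 1 if the buffer starts on a
   64-byte boundary. */
int buffer_is_aligned(void)
{
    return ((uintptr_t)buffer % 64) == 0;
}
```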
5.5 Other GCC optimization techniques
#pragma pack()
Aligns to one byte, saving memory.
__builtin_expect
Tells the compiler directly which result of an expression is most likely, so that it can optimize for it.
Compiling a small binary that keeps debug information
The following commands greatly reduce the size of the compiled program while retaining the debug information. The principle is that the stripped binary links to an external copy that still contains the debug information.
g++ tst.cpp -g -O2 -pipe
cp a.out a.gdb
strip --strip-debug a.out
objcopy --add-gnu-debuglink=a.gdb a.out
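The first two items in this section can be sketched as follows, assuming GCC. The struct names, the LIKELY/UNLIKELY macros, and the function are hypothetical and only illustrate the syntax.

```c
#include <stddef.h>

/* Default layout: the compiler typically pads after "tag" so that
   "value" is aligned. */
struct padded {
    char tag;
    int  value;
};

/* Packed layout: no padding, so the size is exactly 1 + sizeof(int). */
#pragma pack(push, 1)
struct packed {
    char tag;
    int  value;
};
#pragma pack(pop)

/* Hypothetical macros in the style commonly built on __builtin_expect. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Sums non-negative values; negative input is treated as a rare error. */
int checked_sum(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(a[i] < 0))
            return -1;      /* cold path */
        sum += a[i];        /* hot path */
    }
    return sum;
}
```

Packing saves memory but can make accesses to misaligned members slower on some CPUs, so it is a trade-off rather than a pure gain.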
6. Algorithms are the core
Algorithms are the core of a program, and the quality of a program depends largely on the quality of its algorithms. As ordinary programmers, we cannot change system calls or library functions and can only use them according to their rules, but we can change the algorithms in our own core code.
A better algorithm can improve program performance by tens or even hundreds of times. For example, comparing quicksort with bubble sort, the latter is thousands of times slower than the former at a scale of 10 million elements.
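A minimal sketch of the two approaches, with hypothetical function names. The library qsort() stands in for quicksort here; the C standard does not require any particular algorithm for it, so this is an assumption for illustration. Both functions produce the same sorted array, but the number of comparisons grows very differently with the input size.

```c
#include <stddef.h>
#include <stdlib.h>

/* Bubble sort: O(n^2) comparisons. */
void bubble_sort(int *a, size_t n)
{
    for (size_t pass = 0; pass + 1 < n; pass++)
        for (size_t i = 0; i + 1 < n - pass; i++)
            if (a[i] > a[i + 1]) {
                int tmp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = tmp;
            }
}

/* Comparison callback for qsort(); avoids overflow from "x - y". */
static int compare_ints(const void *lhs, const void *rhs)
{
    int x = *(const int *)lhs;
    int y = *(const int *)rhs;
    return (x > y) - (x < y);
}

/* Library sort: typically O(n log n). */
void library_sort(int *a, size_t n)
{
    qsort(a, n, sizeof a[0], compare_ints);
}
```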
Generally, algorithms are involved in the following areas:
A) Common data structures and algorithms
B) Input and output
C) Memory operations
D) String operations
E) Encryption, decryption, compression, and decompression
F) Mathematical functions
In general, performance problems show up in four areas: CPU, memory, disk, and network. The solution can be to modify the code or program structure to make full use of existing resources, or to add hardware to increase the supply of resources.
