Transferred from: http://www.jinbuguo.com/linux/optimize_guide.html
Copyright notice
The author is a staunch supporter of the open source concept, so although this article is not software, it is published in the spirit of open source.
- No warranty: The author does not guarantee that the contents of this work are accurate and accepts no responsibility for any loss caused by using this document.
- Free to use: Anyone is free to read, link to, or print this document without any additional conditions.
- Attribution: Anyone is free to reprint, quote, or adapt this document, but must retain the author's signature and indicate the source.
Other works
The author is very willing to share his work with others. If you are interested in my other translations or technical articles, you can find the existing documents here:
- Jin Bu Collection [http://www.jinbuguo.com/]
Contact information
Because the author's ability is limited, the quality of this work cannot be guaranteed. If you find a mistake (even a typo), please write to me; any suggestion that improves the quality of the work will be gladly accepted.
- Email (QQ): 70171448 (QQ mailbox)
Objective
There are many articles about compile optimization on the internet, but most are fragmented and unsystematic. This article attempts to give a complete and clear approach to optimization, and also to serve as a detailed reference for how to optimize in practice. However, before introducing any optimization knowledge, here is a warning quoted from the LFS book: "A small performance gain with compiler optimizations is negligible compared to the risk it poses". Do you still want to optimize?
%@&#=^%~*# ...
OK, crazy guy! Let's go!!
Before continuing, the author advises you: if you pursue extreme optimization, it will be time-consuming and troublesome, and you will be stuck in endless testing, testing, and more testing ... The Gentoo wiki also has this sentence: "GCC has well over a hundred individual optimization flags and it would be insane to try and describe them all." So this article does not cover every GCC optimization option. Finally, one more piece of advice from the author: optimization only needs to be good enough; saving your energy for other things will be more meaningful!
Prerequisite
This article is mainly aimed at LFS/Gentoo users, who tend to be fairly hardcore. If you have never used LFS or Gentoo, please first build an LFS system by following "Linux From Scratch 6.2 Chinese version"; reading this article afterwards will be more meaningful. In addition, this article builds on the article "In-depth understanding of the configuration, compilation and installation of software packages"; please read that first.
Basic principle
We can look at optimization from three angles:
- Runtime dependencies: the components with the greatest impact on performance are the kernel and glibc. Strictly speaking they are outside the scope of this article, but a carefully selected, carefully configured, well-compiled kernel and C library play a fundamental role in system speed.
- The package being compiled: each package's configure script provides many configuration options, many of which relate to performance. For example, for Apache-2.2.6 you can use --enable-module=static to compile modules statically into the core, --disable-module to disable unneeded modules, --with-mpm=MPM to choose an efficient multi-processing module, --disable-ipv6 to drop IPv6 support when it is not needed, --disable-threads to disable threading support when not using a threaded MPM, and so on. This article cannot possibly cover all of that; it can only describe the common optimization-related options. For a specific package, run configure --help to see all options and choose carefully before compiling.
- The compilation process itself: compiling source code into a binary is done by the make program, which invokes compile commands as directed by the Makefile. Turning source code into a binary takes four steps: preprocessing (cpp) → compilation (gcc or g++) → assembly (as) → linking (ld). The names in parentheses are the programs used in each phase; they belong to the GCC and Binutils packages. Obviously, optimization should start with the choice of the compilation tools themselves and with controlling their behavior.
In short, compile optimization is a "three-legged cat", and the rest of this article discusses the cat's latter two legs.
Selection of compilation tools
As for the choice of the tools themselves, assuming Binutils, GCC and Make are used, there is not much to say: a new version generally brings performance gains and supports new hardware better, so you should prefer a new version. But chasing the newest version can also make the system unstable, so strike a balance based on your actual situation. This article uses Binutils-2.18, gcc-4.2.2/gcc-4.3.0 and Make-3.81 as examples.
Configure options
Here we cover only the generic "architecture options"; "feature options" vary from package to package, so they cannot be explained here.
This part is simple and its meaning is self-evident. Only the common values are listed below:
- i586-pc-linux-gnu
- i686-pc-linux-gnu
- x86_64-pc-linux-gnu
- powerpc-unknown-linux-gnu
- powerpc64-unknown-linux-gnu
If you really do not know which one to use, simply leave these options out and let the config.guess script guess; it is quite accurate anyway.
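As a rough illustration of how one of these values is normally handed to a configure script, the sketch below assembles a command line using `--build`, the standard autoconf option naming the machine the package is compiled on. The triplet and the `--prefix` value are assumptions chosen for the example, not recommendations from this article.

```shell
# Hypothetical choice of triplet; substitute the value for your own machine.
BUILD_TRIPLET="i686-pc-linux-gnu"

# Assemble the configure command line. The --prefix value is only an example.
CONFIGURE_CMD="./configure --build=${BUILD_TRIPLET} --prefix=/usr"

# Print rather than run, so the sketch also works outside a source tree.
echo "${CONFIGURE_CMD}"
```

Inside a real source tree you would run the printed command instead of echoing it.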
Compilation options
Let's start by looking at how the compiler commands in the Makefile rule are usually written.
Most packages adhere to the following conventions:
#1: first generate the object file from the source code (preprocessing, compilation, assembly); the "-c" option means the link step is not performed.
$(CC) $(CPPFLAGS) $(CFLAGS) example.c -c -o example.o
#2: then link the object file into the final result (linking); the "-o" option specifies the name of the output file.
$(CC) $(LDFLAGS) example.o -o example
# Some packages complete all four steps at once:
$(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) example.c -o example
Of course, there are a few packages that do not comply with these conventions, such as:
#1: some omit expected Makefile variables from the command line (note: some omissions are intentional)
$(CC) $(CFLAGS) example.c -c -o example.o
$(CC) $(CPPFLAGS) example.c -c -o example.o
$(CC) example.o -o example
$(CC) example.c -o example
#2: some add unnecessary Makefile variables to the command line
$(CC) $(CFLAGS) $(LDFLAGS) example.o -o example
$(CC) $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) example.c -c -o example.o
And a very few packages are complete "nonsense": some both add unnecessary variables and drop required ones, some do not use $(CC) at all, and so on.
Although the four steps of compiling source code into a binary are done by different programs (cpp, gcc/g++, as, ld), in practice cpp, as and ld are all invoked indirectly by gcc/g++. In other words, controlling gcc/g++ amounts to controlling all four steps. From the compile commands in the Makefile rules, you can see that the behavior of the compilation tools is controlled by the variables CC/CXX, CPPFLAGS, CFLAGS/CXXFLAGS and LDFLAGS. In theory the variables AS, ASFLAGS and ARFLAGS should also control the tools' behavior, but in practice almost no package uses them.
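To make the four steps concrete, here is a minimal sketch that drives each stage through the gcc driver alone, using its -E, -S and -c options. The file names and the scratch directory are assumptions made for the example, and the build is skipped if gcc is not installed.

```shell
# Work in a scratch directory so nothing in the current tree is touched.
WORKDIR=$(mktemp -d)

# A tiny hypothetical source file used only for this demonstration.
cat > "$WORKDIR/hello.c" <<'EOF'
#include <stdio.h>
int main(void) { puts("hello"); return 0; }
EOF

if command -v gcc >/dev/null 2>&1; then
    gcc -E "$WORKDIR/hello.c" -o "$WORKDIR/hello.i"   # 1. preprocessing (cpp)
    gcc -S "$WORKDIR/hello.i" -o "$WORKDIR/hello.s"   # 2. compilation to assembly
    gcc -c "$WORKDIR/hello.s" -o "$WORKDIR/hello.o"   # 3. assembly (as)
    gcc "$WORKDIR/hello.o" -o "$WORKDIR/hello"        # 4. linking (ld)
    "$WORKDIR/hello"
else
    echo "gcc not found; skipping the four-stage build"
fi
```

Every stage is started by gcc itself, which is why setting the variables that feed the gcc command line is enough to influence all four steps.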
So how do we control these variables? A simple approach is to set environment variables with the same names as these Makefile variables, export them, and then run the configure script; most configure scripts will put the values of the same-named environment variables into the generated Makefile. A few configure scripts do not (for example, the scripts of GCC-3.4.6 and Binutils-2.16.1 do not pass LDFLAGS); in that case you must manually edit the generated Makefile files, find these variables and change their values. Many source packages have a Makefile in every subdirectory, so this is very tiring work!
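The export-then-configure approach might look like the sketch below. The flag values are purely illustrative assumptions (they presume a K8-class CPU), and the configure and make commands are shown only as comments because they need a real source tree.

```shell
# Illustrative values only; pick flags that suit your own CPU and package.
export CC="gcc"
export CXX="g++"
export CPPFLAGS="-DNDEBUG"
export CFLAGS="-O2 -march=k8 -pipe"
export CXXFLAGS="${CFLAGS}"
export LDFLAGS="-s -Wl,--as-needed"

# With the variables exported, a typical build would then be:
#   ./configure --prefix=/usr && make
# Afterwards, "grep '^CFLAGS' Makefile" shows whether the value was picked up.
echo "CFLAGS=${CFLAGS}"
echo "CXXFLAGS=${CXXFLAGS}"
echo "LDFLAGS=${LDFLAGS}"
```

Checking the generated Makefile is a quick way to spot the packages whose configure scripts ignore one of the variables.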
CC and CXX
These are the C and C++ compiler commands. The defaults are typically "gcc" and "g++". These variables have nothing to do with optimization, but some people, fearing that a package does not follow the conventions above and ignores CFLAGS/CXXFLAGS/LDFLAGS, simply stuff all the options that belong in other variables into CC or CXX, for example: CC="gcc -march=k8 -O2 -s". This is a weird usage; this article does not advocate it, and recommends using each variable according to its original meaning.
CPPFLAGS
These are options for the preprocessing phase. However, none of the options usable in this variable has any visible relation to optimization. If you really want to set something, use the following two:
-DNDEBUG
    "NDEBUG" is a standard ANSI macro; defining it means that debugging code (such as assertions) is not compiled.
-D_FILE_OFFSET_BITS=64
    Most packages use this to provide large file (>2G) support.
CFLAGS and CXXFLAGS
CFLAGS holds options for the C compiler and CXXFLAGS holds options for the C++ compiler. These two variables actually cover both the compilation and assembly steps. Most programs and libraries default to optimization level 2 (the "-O2" option) with debug symbols, that is, CFLAGS="-O2 -g" and CXXFLAGS=$CFLAGS. In fact, "-O2" already enables the vast majority of safe optimization options. Since most options can be used in both variables, the options that apply to only one of them are described at the end. [Reminder] The options listed below are non-default options; add them only as needed.
First, "-O3" adds a few options on top of "-O2":
-finline-functions
    Allows the compiler to choose some simple functions and expand them inline at their call sites. This is fairly safe and is recommended, especially when the CPU has a large L2 cache.
-funswitch-loops
    Moves branches whose conditions do not change inside the loop (loop-invariant conditions) out of the loop body.
-fgcse-after-reload
    Performs an additional redundant load elimination pass after reload, in order to clean up redundant spills.
In addition:
-fomit-frame-pointer
    Functions that do not need a frame pointer do not keep it in a register, so the code that saves and restores it can be omitted and an extra register becomes available in many functions. It is enabled at all "-O" levels, but only where the debugger can work without the frame pointer. It is on by default on the AMD64 platform but off by default on x86. It is recommended that you set it explicitly.
-falign-functions=n
-falign-jumps=n
-falign-loops=n
-falign-labels=n
    These four alignment options are enabled at "-O2", with a default n that depends on the platform. If you want an n different from the default, you can specify it individually. For example, -falign-functions=64 may perform better on CPUs with an L2 cache of 1M or more. When -march is specified, it is recommended not to set these values explicitly.
Debugging options:
-fprofile-arcs
    After you compile a program with this option and run it, a file recording how many times each block of code was executed is created. The program can then be compiled again with -fbranch-probabilities, and the information in the file is used to optimize the branches that are taken frequently. Without this information, GCC guesses which branches are taken frequently. The profile information is stored in a file named after the source file with the suffix ".da".
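The two-pass workflow described for -fprofile-arcs can be summarized as three commands. The sketch below only builds and prints them; the source file name, program name and base flags are hypothetical placeholders.

```shell
SRC="example.c"      # hypothetical source file
OUT="example"        # hypothetical program name
BASE_FLAGS="-O2"     # illustrative base optimization level

# Pass 1: build an instrumented binary that records execution counts.
PASS1="gcc ${BASE_FLAGS} -fprofile-arcs ${SRC} -o ${OUT}"
# Training run: execute the program on representative input to write the profile.
TRAIN="./${OUT}"
# Pass 2: rebuild, letting GCC read the profile to optimize hot branches.
PASS2="gcc ${BASE_FLAGS} -fbranch-probabilities ${SRC} -o ${OUT}"

printf '%s\n' "$PASS1" "$TRAIN" "$PASS2"
```

The quality of the result depends on how representative the training run is, since the second pass optimizes for exactly the branches that run exercised.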
Global options:
-pipe
    Uses pipes rather than temporary files to communicate between the stages of compilation, which can speed up compilation. Recommended.
Directory options:
--sysroot=dir
    Uses dir as the logical root directory. For example, the compiler normally searches for headers and libraries in /usr/include and /usr/lib; with this option it searches dir/usr/include and dir/usr/lib instead. If this option is used together with -isysroot, this option applies only to the library search path and -isysroot applies to the header search path. This option has nothing to do with optimization, but it works wonders in CLFS.
Code generation options:
-fno-bounds-check
    Disables bounds checking for all array accesses. This improves array indexing performance, but may cause unacceptable behavior when array bounds are exceeded.
-freg-struct-return
    Returns struct and union values in registers when they are small enough, which makes small structures more efficient. If a value does not fit in a register, it is returned in memory. Recommended only on systems compiled entirely with GCC.
-fpic
    Generates position-independent code suitable for shared libraries. All internal addressing goes through the global offset table. To determine an address, the memory location of the code itself is entered as an item in the table. This option produces object modules that can be stored in a shared library and loaded from it.
-fstack-check
    Performs the checks needed to prevent stack overflow. It is likely to be needed only when running in a multithreaded environment.
-fvisibility=hidden
    Sets the default visibility of symbols in the ELF image to hidden. Using this feature can substantially improve the linking and loading performance of shared libraries, produce more optimized code, provide a near-perfect API export and prevent symbol collisions. We strongly recommend using this option when compiling any shared library. See also the -fvisibility-inlines-hidden option.
Hardware architecture options [x86 and x86_64 only]:
-march=cpu-type
    Generates binary code for the specified cpu-type (the code cannot run on lower-level CPUs). For Intel: pentium2, pentium3 (=pentium3m), pentium4 (=pentium4m), pentium-m, prescott, nocona, core2 (new in GCC-4.3). For AMD: k6-2 (=k6-3), athlon (=athlon-tbird), athlon-xp (=athlon-mp), k8 (=opteron=athlon64=athlon-fx).
-mfpmath=sse
    CPUs at the P3 and Athlon-XP level and above support "sse" scalar floating-point instructions. This option is recommended only for processors at the P4 and K8 level or above.
-malign-double
    Aligns double, long double and long long on two-word boundaries. This helps generate faster code, but the program becomes larger and cannot work together with code compiled without this option.
-m128bit-long-double
    Makes long double 128 bits wide. This is preferred by CPUs of the Pentium level and above and complies with the x86-64 ABI, but does not comply with the i386 ABI.
-mregparm=n
    Specifies the number of registers used to pass integer arguments (by default no registers are used), where 0<=n<=3. Note: when n>0, all modules, including all libraries, must be rebuilt with the same value.
-msseregparm
    Uses SSE registers to pass float and double arguments and return values. Note: when you use this option, all modules, including all libraries, must be rebuilt with it.
-mmmx
-msse
-msse2
-msse3
-m3dnow
-mssse3 (not a typo! New in GCC-4.3)
-msse4.1 (new in GCC-4.3)
-msse4.2 (new in GCC-4.3)
-msse4 (includes 4.1 and 4.2; new in GCC-4.3)
    Enable use of the corresponding extended instruction sets and built-in functions; choose according to your own CPU!
-maccumulate-outgoing-args
    Computes the maximum space needed for outgoing arguments in the function prologue. This is faster on most modern CPUs; the drawback is a noticeable increase in binary size.
-mthreads
    Supports thread-safe exception handling on Mingw32. Programs that rely on thread-safe exception handling must enable this option. It defines "-D_MT" when compiling and, when linking, pulls in a special thread helper library with "-lmingwthrd", which cleans up the exception-handling data of each thread.
-minline-all-stringops
    By default GCC inlines string operations only when the destination is known to be aligned to at least a 4-byte boundary. This option enables more inlining and increases binary size, but can improve the performance of programs that depend on fast memcpy, strlen and memset operations.
-minline-stringops-dynamically
    New in GCC-4.3. Uses inline code for string operations on small blocks of unknown size while still calling library functions for large blocks, which is a smarter strategy than "-minline-all-stringops". The algorithm that decides the strategy can be controlled with "-mstringop-strategy".
-momit-leaf-frame-pointer
    Does not keep the frame pointer in a register for leaf functions, which saves a register but makes debugging difficult. Note: do not use it together with -fomit-frame-pointer, because that can make the code less efficient.
-m64
    Generates code that runs only in a 64-bit environment and cannot run in a 32-bit environment. For x86_64 (including EM64T) only.
-mcmodel=small
    [Default] The program and its symbols must be located in the address space below 2GB. Pointers are still 64 bits. Programs can be statically or dynamically linked. For x86_64 (including EM64T) only.
-mcmodel=kernel
    The kernel runs in the negative 2GB of the address space. This option must be used when compiling the Linux kernel! For x86_64 (including EM64T) only.
-mcmodel=medium
    The program must be located in the address space below 2GB, but its symbols can be anywhere in the address space. Programs can be statically or dynamically linked. Note: shared libraries cannot be compiled with this option! For x86_64 (including EM64T) only.
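When choosing among the instruction-set options listed above (-mmmx, -msse and so on), one way to see what your own CPU supports on Linux is to read the "flags" line of /proc/cpuinfo. The sketch below is only an assumption-laden helper: the mapping table is written for this example, and it relies on the kernel's naming, in which SSE3 appears as "pni" and SSE4.1/4.2 as "sse4_1"/"sse4_2".

```shell
# /proc/cpuinfo is Linux-specific; on other systems the list stays empty.
CPUINFO="/proc/cpuinfo"
SIMD_FLAGS=""

if [ -r "$CPUINFO" ]; then
    # Take the first "flags" line; "|| true" keeps going if none exists.
    CPU_FLAGS=$(grep -m1 '^flags' "$CPUINFO" || true)
    # Each pair is "<name in /proc/cpuinfo>:<GCC option>".
    for pair in mmx:-mmmx sse:-msse sse2:-msse2 pni:-msse3 ssse3:-mssse3 \
                sse4_1:-msse4.1 sse4_2:-msse4.2; do
        name=${pair%%:*}
        flag=${pair#*:}
        # Match whole words only by padding both sides with spaces.
        case " $CPU_FLAGS " in
            *" $name "*) SIMD_FLAGS="$SIMD_FLAGS $flag" ;;
        esac
    done
fi

echo "Candidate instruction-set options:${SIMD_FLAGS:- (none detected)}"
```

Note that an option being supported by the CPU does not mean your GCC version knows it; several of these were only added in GCC-4.3.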
Other optimization options:
-fforce-addr
    Requires addresses to be copied into registers before operations are performed on them. This can improve the code because the needed address is usually already loaded in a register.
-finline-limit=n
    Functions with more than n pseudo-instructions are not expanded inline by the compiler. The default is 600. Raising the value increases compile time, compile-time memory usage and the size of the resulting binary, so it should not be set too high.
-fmerge-all-constants
    Attempts to merge all constants and constant arrays across compilation units into a single copy. However, standard C/C++ requires every variable to have a distinct storage location, so this option may cause non-conforming behavior.
-fgcse-sm
    Runs a store motion pass after global common subexpression elimination, attempting to move stores out of loops. In gcc-3.4 this option was part of the "-O2" level.
-fgcse-las
    After global common subexpression elimination, eliminates redundant loads that follow a store to the same memory location. In gcc-3.4 this option was part of the "-O2" level.
-floop-optimize
    Obsolete (in GCC-4.1 it was included in "-O1").
-floop-optimize2
    Replaces the original "-floop-optimize" with an improved version of the loop optimizer. The optimizer uses separate options (-funroll-loops, -fpeel-loops, -funswitch-loops, -ftree-loop-im) to control different aspects of loop optimization. This new optimizer was still under development, and the code it generated was no better than that of the previous version. It has been removed and exists only in versions before GCC-4.1.
-funsafe-loop-optimizations
    Assumes that loop indices do not overflow and that loops with nontrivial exit conditions are not infinite. This allows a wider range of loop optimizations, even when the optimizer cannot itself determine whether the assumptions are correct.
-fsched-spec-load
    Allows speculative motion of some load instructions.
-ftree-loop-linear
    Performs linear loop transformations on trees. This improves cache performance and allows further loop optimizations.
-fivopts
    Performs induction variable optimizations on trees.
-ftree-vectorize
    Performs loop vectorization on trees.
-ftracer
    Performs tail duplication to enlarge superblocks, which simplifies the control flow of functions and lets other optimizations do a better job. It is said to be quite effective.
-funroll-loops
    Unrolls only those loops whose number of iterations can be determined at compile time or at run time. The generated code becomes larger and may run faster or slower.
-fprefetch-loop-arrays
    Generates prefetch instructions for arrays. This can speed up programs that work on large arrays, such as large database software. The actual effect depends on the code.
-fweb
    Builds the webs commonly used for register allocation, giving better register usage. In gcc-3.4 this option was part of the "-O3" level.
-ffast-math
    Speeds up floating-point computation by violating IEEE/ANSI standards. This is a risky option; consider it only for floating-point-intensive programs that do not require strict compliance with the IEEE specification.
-fsingle-precision-constant
    Treats floating-point constants as single precision instead of implicitly converting them to double precision.
-fbranch-probabilities
    After a program has been compiled with -fprofile-arcs and run to create a file recording how many times each block of code was executed, it can be compiled again with this option, and the information in the file is used to optimize the branches that are taken frequently. Without this information, GCC guesses which branches are taken frequently. The profile information is stored in a file named after the source file with the suffix ".da".
-frename-registers
    Tries to remove false dependencies in the code. This option works well on machines with a large number of registers. In gcc-3.4 this option was part of the "-O3" level.
-fbranch-target-load-optimize
-fbranch-target-load-optimize2
    Perform branch target register load optimization before (the first option) or after (the second option) prologue/epilogue threading.
-fstack-protector
    Places a guard value on the stack of critical functions. The guard value is verified before the return address and return value are used. If a buffer overflow has occurred, the guard value no longer matches and the program exits. The guard value is random on every run, so it cannot be guessed remotely.
-fstack-protector-all
    As above, but places the guard value on the stack of all functions.
--param max-gcse-memory=xxM
    The maximum amount of memory (xxM) that may be used to perform the GCSE optimization; if more would be needed, the optimization is not performed. The default is 50M.
--param max-gcse-passes=n
    The maximum number of passes of the GCSE optimization. The default is 1.
Options passed to the assembler:
-Wa,options
    options is a comma-separated list of one or more options to pass to the assembler. Each of them is passed to the assembler as a command-line option.
-Wa,--strip-local-absolute
    Removes local absolute symbols from the output symbol table.
-Wa,-R
    Merges the data section into the text section. Because no transfers between the data section and the code section are then needed, shorter address displacements may result.
-Wa,--64
    Sets the word size to 64 bits. For x86_64 only, and only valid for object files in ELF format. It also requires BFD support built with the "--enable-64-bit-bfd" option.
-Wa,-march=CPU
    Optimizes for a specific CPU: pentiumiii, pentium4, prescott, nocona, core, core2, athlon, sledgehammer, opteron, k8.
Options available for CFLAGS only:
-fhosted
    Compiles for a hosted environment, in which the complete standard library is available and the entry point must be a main() function returning int. This is the case for almost every program other than the kernel. This option implies -fbuiltin and is equivalent to -fno-freestanding.
-ffreestanding
    Compiles for a freestanding environment, which may lack a standard library and does not require a main() function. The most typical example is an operating system kernel. This option implies -fno-builtin and is equivalent to -fno-hosted.
Options available for CXXFLAGS only:
-fno-enforce-eh-specs
    The C++ standard requires checking for violations of exception specifications at run time. This option turns that check off, which reduces the size of the generated code. It is similar to defining the "NDEBUG" macro.
-fno-rtti
    If you do not use 'dynamic_cast' or 'typeid', you can save space by using this option to disable generation of run-time type information for classes with virtual methods. This option does not affect exception handling (RTTI code is still generated as needed).
-ftemplate-depth-n
    Sets the maximum template instantiation depth to 'n'. Standard-conforming programs must not exceed 17; the default is 500.
-fno-optional-diags
    Disables diagnostics that the C++ standard does not require.
-fno-threadsafe-statics
    GCC automatically adds locking to the code that initializes C++ local static variables in order to ensure thread safety. If you do not need thread safety, you can use this option.
-fvisibility-inlines-hidden
    Hides all inline functions by default, which reduces the size of the exported symbol table, makes files smaller and improves performance. We strongly recommend using this option when compiling any shared library. See also the -fvisibility=hidden option.
LDFLAGS
LDFLAGS holds the options passed to the linker. This variable is often overlooked, but its effect on optimization is actually quite noticeable.
-s
    Removes all symbol tables and all relocation information from the executable. The result is the same as running the strip command. This option is fairly safe.
-Wl,options
    options is a comma-separated list of one or more options to pass to the linker. Each of them is passed to the linker as a command-line option.
-Wl,-On
    When n>0 the output is optimized, at the cost of a significantly longer link time. This option is fairly safe.
-Wl,--exclude-libs=ALL
    Symbols in libraries are not exported automatically; that is, symbols in libraries are hidden by default.
-Wl,-m<emulation>
    Emulates the <emulation> linker. All emulations available in the current ld can be listed with the "ld -V" command. The default depends on how ld was configured at build time.
-Wl,--sort-common
    Sorts global common symbols by size before placing them in the appropriate output sections, to prevent gaps between symbols caused by alignment constraints.
-Wl,-x
    Deletes all local symbols.
-Wl,-X
    Deletes all temporary local symbols. On most target platforms, these are the local symbols whose names start with 'L'.
-Wl,-zcombreloc
    Combines multiple relocation sections and sorts them so that dynamic symbol lookups can be cached.
-Wl,--enable-new-dtags
    Creates the new "dynamic tags" in ELF, which are not recognized on old ELF systems.
-Wl,--as-needed
    Removes unnecessary symbol references and links a library only when it is actually needed, which can produce more efficient code.
-Wl,--no-define-common
    Restricts the assignment of addresses to common symbols. With this option, common symbols referenced from a shared library are assigned addresses only in the main program. This eliminates the space wasted on useless duplicates in the shared library and also prevents the confusion that can arise at run-time symbol resolution when dynamic modules are found through multiple search paths.
-Wl,--hash-style=gnu
    Uses the GNU-style symbol hash table format. Its dynamic linking performance is significantly better than the traditional sysv style (the default), but the resulting executables and libraries are incompatible with old versions of glibc and the dynamic linker.
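To illustrate how the comma-separated form of -Wl,options works, the sketch below splits one such argument into the separate options the linker would receive. It only mimics the splitting with shell tools; the particular combination of options is an arbitrary example, not a recommendation.

```shell
# One compiler argument carrying three linker options (illustrative choice).
WL_ARG="-Wl,-O1,--as-needed,--hash-style=gnu"

# Strip the "-Wl," prefix and turn the commas into spaces, which mirrors
# how the options arrive at the linker as separate command-line arguments.
LINKER_OPTS=$(printf '%s' "${WL_ARG#-Wl,}" | tr ',' ' ')
echo "${LINKER_OPTS}"

# The same argument can be combined with other link-time options in LDFLAGS.
LDFLAGS="-s ${WL_ARG}"
echo "LDFLAGS=${LDFLAGS}"
```

Writing several linker options behind a single -Wl, prefix is equivalent to repeating -Wl, for each option; the combined form just keeps LDFLAGS shorter.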
Finally, two system environment variables are worth mentioning. They are not related to optimization, but they affect how GCC compiles programs. The following two are the ones Chinese users care about most:
LANG
    Specifies the character set the compiler uses for wide-character literals, string literals and comments. The default is English. [Currently only the Japanese values "C-JIS, C-SJIS, C-EUCJP" are supported; Chinese is not supported]
LC_ALL
    Specifies the character classification for multibyte characters. It is mainly used to determine character boundaries in strings and the language in which the compiler emits diagnostic messages. The default is the same as LANG. Chinese-related values: "zh_CN.GB2312, zh_CN.GB18030, zh_CN.GBK, zh_CN.UTF-8, zh_TW.BIG5".
GCC Compilation Optimization Guide [Repost]