This is a study note on WWDC 2016 Session 406 (Optimizing App Startup Time), covering how to optimize app startup time from principle to practice.
How an App Runs: Theory
Crash Course: Mach-O Terminology
Mach-O is the file format used for the different kinds of run-time executables.
File types:
Executable: the main binary of the application
Dylib: a dynamic library (known as a DSO or DLL on other platforms)
Bundle: a dylib that cannot be linked against and can only be loaded at run time with dlopen(); used for plug-ins on macOS
Image: any executable, dylib, or bundle
Framework: a directory containing a dylib together with its resource files and header files
Mach-O Image File
A Mach-O image is divided into segments, and each segment is divided into sections.
Segment names are uppercase, and the size of each segment is a whole multiple of the page size. The page size depends on the hardware: one page is 16KB on the arm64 architecture and 4KB on the others.
Sections are not required to be a multiple of the page size, but sections never overlap.
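The page rounding described above is simple arithmetic. The following is a minimal sketch in Python; the function names and the 20,000-byte example size are made up for illustration and are not taken from the session.

```python
ARM64_PAGE = 16 * 1024  # page size on arm64, as stated above
OTHER_PAGE = 4 * 1024   # page size on the other architectures


def round_up_to_page(size, page_size):
    """Return the smallest multiple of page_size that is >= size."""
    return (size + page_size - 1) // page_size * page_size


def padding_bytes(size, page_size):
    """Return how many padding bytes fill out the last page."""
    return round_up_to_page(size, page_size) - size


if __name__ == "__main__":
    # A hypothetical segment holding 20,000 bytes of real content.
    for page in (ARM64_PAGE, OTHER_PAGE):
        print(page, round_up_to_page(20_000, page), padding_bytes(20_000, page))
```

With these numbers the same content occupies two 16KB pages on arm64 but five 4KB pages elsewhere, which is why larger pages tend to leave more padding.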
Almost every Mach-O file contains these three segments: __TEXT, __DATA, and __LINKEDIT:
__TEXT
Contains the Mach header, the executable code, and read-only constants (such as C strings). Read-only and executable (r-x).
__DATA
Contains global variables, static variables, and so on. Readable and writable (rw-).
__LINKEDIT
Contains "metadata" for the loader, such as the names and addresses of functions. Read-only (r--).
Mach-O Universal Files
A fat binary merges the Mach-O files for several architectures into one file. It starts with a fat header that records the offset of each architecture's Mach-O within the file, and the fat header occupies one page.
Laying out segments and headers on page boundaries wastes some space, but it makes them easy to map with virtual memory.
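To make the role of the fat header concrete, here is a small Python sketch that parses one. It assumes the standard layout (a big-endian magic number 0xCAFEBABE and a slice count, followed by one 20-byte record per architecture). The header it parses is synthetic, and the offsets and sizes in it are invented for the example.

```python
import struct

FAT_MAGIC = 0xCAFEBABE  # fat headers are stored big-endian
CPU_NAMES = {0x01000007: "x86_64", 0x0100000C: "arm64"}


def parse_fat_header(data):
    """Return (architecture, file offset, size) for each slice."""
    magic, count = struct.unpack_from(">II", data, 0)
    if magic != FAT_MAGIC:
        raise ValueError("not a fat binary")
    slices = []
    for i in range(count):
        # Each record: cputype, cpusubtype, offset, size, align (5 x 32 bits).
        cputype, _subtype, offset, size, _align = struct.unpack_from(
            ">5I", data, 8 + 20 * i
        )
        slices.append((CPU_NAMES.get(cputype, hex(cputype)), offset, size))
    return slices


def make_example_header():
    """Build a synthetic two-slice header (made-up offsets and sizes)."""
    header = struct.pack(">II", FAT_MAGIC, 2)
    header += struct.pack(">5I", 0x01000007, 3, 16384, 1000, 14)
    header += struct.pack(">5I", 0x0100000C, 0, 32768, 2000, 14)
    return header


if __name__ == "__main__":
    for arch, offset, size in parse_fat_header(make_example_header()):
        print(arch, offset, size)
```

In the example each slice starts at a multiple of 16KB, matching the point above that page-aligned slices can be mapped directly.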
Virtual memory
Virtual memory is a layer of indirection. There is a maxim in software engineering that any problem can be solved by adding a layer of indirection. The problem virtual memory solves is managing the physical RAM used by all processes. The indirection gives each process its own logical address space, whose pages can be mapped to physical pages of RAM. The mapping is not one-to-one: a logical address may not be mapped to RAM at all, and several logical addresses may be mapped to the same physical RAM. In the first case, a page fault is triggered when the process accesses the address; the second case is how memory is shared between processes.
A file can be read page by page with mmap() instead of being read in all at once. That is, a range of the file is mapped to a range of pages in the process's logical address space. When the process touches a page that is not yet in memory, a page fault is triggered and the kernel reads only that page, which gives lazy loading of the file.
This means the __TEXT segment of a Mach-O file can be mapped into multiple processes, loaded lazily, and shared between those processes. The __DATA segment is readable and writable, so it relies on copy-on-write (COW). When several processes share a page of memory and one of them writes to it, the kernel first copies the page and remaps that process's logical address to the new RAM page, so the writing process ends up with its own private copy. This brings in the notion of clean and dirty pages. A dirty page contains process-specific information, whereas a clean page can be regenerated by the kernel (by rereading it from disk). Dirty pages are therefore more expensive than clean pages.
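Copy-on-write mapping can be observed from user space. The sketch below uses Python's mmap module with ACCESS_COPY, which maps a file privately: writes land in the process's own copy of the page and never reach the file. The temporary file is only a stand-in for a segment on disk; nothing here is specific to Mach-O or dyld.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def demo_copy_on_write():
    """Map a two-page file copy-on-write, dirty page 1, and compare with disk."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"A" * PAGE + b"B" * PAGE)
        path = f.name
    try:
        with open(path, "r+b") as f:
            view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
            # Reading touches only the second page; it stays clean.
            second_page_byte = view[PAGE:PAGE + 1]
            # Writing makes the first page dirty (a private copy).
            view[0:5] = b"dirty"
            in_memory = view[0:5]
            view.close()
        with open(path, "rb") as f:
            on_disk = f.read(5)
        return second_page_byte, in_memory, on_disk
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo_copy_on_write())
```

The mapped view sees the modified bytes while the file on disk is unchanged, which is the same distinction as a dirty __DATA page versus a clean __TEXT page.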
Mach-O Image Loading
So when multiple processes load the same Mach-O image, __TEXT and __LINKEDIT can share memory because they are read-only, while __DATA produces dirty pages because it is writable. Once dyld has finished running, __LINKEDIT is no longer needed and its memory pages can be reclaimed.
Security
ASLR (address space layout randomization): images are loaded at random addresses. This technique is actually one or two decades old.
Code signing: you might assume that Xcode hashes the entire file and signs that hash. In fact, so that the signature of a Mach-O file can be verified at run time without reading the whole file every time, a separate cryptographic hash is generated for each page and stored in __LINKEDIT. This lets each page be checked for tampering at the moment it is paged in.
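The per-page hashing idea can be sketched in a few lines of Python. This is only an illustration of the principle: the 4KB page size, the choice of SHA-256, and the function names are assumptions made for the example, and the real code signature format is more involved.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for the example


def page_hashes(data, page_size=PAGE_SIZE):
    """Compute one hash per page, as would be done at signing time."""
    return [
        hashlib.sha256(data[i:i + page_size]).digest()
        for i in range(0, len(data), page_size)
    ]


def verify_page(data, index, expected, page_size=PAGE_SIZE):
    """Check a single page at page-in time without reading the other pages."""
    start = index * page_size
    page = data[start:start + page_size]
    return hashlib.sha256(page).digest() == expected[index]


if __name__ == "__main__":
    image = bytes(range(256)) * 48          # three pages of stand-in content
    hashes = page_hashes(image)
    tampered = bytearray(image)
    tampered[5000] ^= 0xFF                  # flip one byte inside page 1
    print([verify_page(tampered, i, hashes) for i in range(3)])
```

Only the page containing the flipped byte fails verification, so a page can be validated on its own when it is faulted in.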
From exec() to main()
exec() is a system call. The kernel maps the application into a new address space, and the start address is random on every launch (because of ASLR). The range from address 0x000000 up to that start address is marked as not readable, not writable, and not executable. For a 32-bit process this range is at least 4KB; for a 64-bit process it is at least 4GB. It catches both NULL pointer dereferences and pointer truncation errors.
dyld Loads the Dylibs
Unix had it easy for its first 20 years because dynamic libraries had not been invented yet. Once they existed, a helper was needed to load them. On Apple platforms this helper is dyld; on other Unix systems it is ld.so. When the kernel has finished mapping the process, it maps the Mach-O file named dyld into the process at a random address, sets the PC register to dyld's address, and runs it. dyld runs inside the app process with the same permissions as the app, and its job is to load all the dynamic libraries the app depends on and get everything ready to run.
The following steps make up the dyld timeline:
Load dylibs -> Rebase -> Bind -> ObjC -> Initializers
Load Dylibs
dyld reads the list of dependent dynamic libraries from the header of the main executable, which the kernel has already mapped. It then has to find each dylib, open the file, and read the start of the file to make sure it is a Mach-O file. Next it finds the code signature and registers it with the kernel, and then calls mmap() on each segment of the dylib. The dylibs an app depends on may depend on other dylibs in turn, so dyld has to collect the list of dynamic libraries recursively. A typical app loads 100 to 400 dylibs, but most of them are system dylibs, which are precomputed and cached and therefore load quickly.
Fix-ups
Once all the dynamic libraries are loaded, they are still independent of one another and need to be bound together; this is what fix-ups are for. Because of code signing, instructions cannot be modified to make one dylib call directly into another, so a lot of indirection is needed instead.
Modern code generation is called dynamic PIC (position-independent code), which means the code can be loaded at any address. When one dylib calls into another, the code generator creates a pointer to the callee in the __DATA segment, and the call loads that pointer and jumps through it.
So what dyld has to do is fix up pointers and data. There are two kinds of fix-up: rebasing and binding.
Rebasing and Binding
Rebasing: adjusting pointers that point inside the image
Binding: setting pointers that point outside the image
Information such as rebase and bind can be viewed from the command line:
xcrun dyldinfo -rebase -bind -lazy_bind myapp.app/myapp
This command lists all the fix-ups. The rebase, bind, weak_bind, and lazy_bind information is stored in the __LINKEDIT segment, and the offset and size of each kind can be found in the LC_DYLD_INFO_ONLY load command.
MachOView is recommended as a more convenient and intuitive way to look at this.
The following briefly walks through dyld's rebasing and binding process at the source level.
ImageLoader is the base class for loading executable files. It is responsible for linking images but does not care about the specific file format, which is left to subclasses. Each executable file corresponds to one ImageLoader instance. ImageLoaderMachO is the ImageLoader subclass used to load Mach-O files. ImageLoaderMachOClassic and ImageLoaderMachOCompressed both inherit from ImageLoaderMachO, and load Mach-O files whose __LINKEDIT segment is in the traditional format and the compressed format, respectively.
Because dylibs depend on one another, many ImageLoader operations recurse along the dependency chain. Rebasing and binding are no exception; they correspond to the recursiveRebase() and recursiveBind() methods. Because the recursion works bottom-up and only then calls doRebase() and doBind(), a dylib that is depended upon always completes rebasing and binding before the dylibs that depend on it. The arguments passed to doRebase() and doBind() include a LinkContext, which stores a set of state and related functions for the executable.
Both rebasing and binding first check whether prebinding has already been done. If prebinding is in place and still valid, the rebasing and binding fix-ups are unnecessary, because the image is already loaded at its prebound address.
An ImageLoaderMachO instance does not use prebinding for any of the following reasons:
The MH_PREBOUND flag in the Mach-O header is 0
The image is loaded at an address with a nonzero offset (described later)
A library it depends on has changed
The image uses a flat namespace, in which case part of the prebinding is ignored
An environment variable in the LinkContext disables prebinding
ImageLoaderMachO's doRebase() does the following:
- If prebinding is in use, increment the fgImagesWithUsedPrebinding counter and return; otherwise go to step two.
- If the MH_PREBOUND flag is 1 (that is, the image could be prebound but prebinding is not being used) and the image is not in the shared cache, reset all lazy pointers in the context. (If the image is in the shared cache, they are bound later during binding, so no reset is needed.)
- If the image's load-address offset is 0, no rebasing is needed, so return; otherwise go to step four.
- Call the rebase() method, which does the actual rebasing work. If the TEXT_RELOC_SUPPORT macro is enabled, rebase() is allowed to write to the __TEXT segment in order to fix it up. So the read-only property of __TEXT is not absolute.
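The four steps above can be restated as control flow. The following Python sketch is a paraphrase of this note's description, not dyld's actual code: the Image fields, the stats dictionary, and the action strings are all hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Image:
    name: str
    usable_prebinding: bool = False  # prebinding exists and is still valid
    prebound_flag: bool = False      # MH_PREBOUND is set in the header
    in_shared_cache: bool = False
    slide: int = 0                   # load-address offset


def do_rebase(image, stats):
    """Walk the four steps and return the actions that would be taken."""
    # Step 1: valid prebinding means there is nothing to rebase.
    if image.usable_prebinding:
        stats["used_prebinding"] = stats.get("used_prebinding", 0) + 1
        return ["skip: prebinding used"]
    actions = []
    # Step 2: prebound but unusable, and not in the shared cache.
    if image.prebound_flag and not image.in_shared_cache:
        actions.append("reset lazy pointers")
    # Step 3: loaded at the preferred address, so no offset to apply.
    if image.slide == 0:
        actions.append("skip: slide is 0")
        return actions
    # Step 4: the actual rebasing work.
    actions.append(f"rebase by {image.slide:#x}")
    return actions


if __name__ == "__main__":
    stats = {}
    print(do_rebase(Image("libA", prebound_flag=True, slide=0x4000), stats))
```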
ImageLoaderMachOClassic and ImageLoaderMachOCompressed each implement their own doBind() method. The logic is much the same: both check whether prebinding is in use, and when doing the actual binding both consult the TEXT_RELOC_SUPPORT macro to decide whether to write to the __TEXT segment. Finally, both call setupLazyPointerHandler to set dyld's entry point in the image; this call is made last so that the main executable already has __dyld or __program_vars set up.
Rebasing
In the past, a dylib was loaded at its specified address, so every pointer in its code and data was already correct and dyld had no fix-ups to do. With ASLR, a dylib is now loaded at a new random address (actual_address), which differs from the old address (preferred_address) that the code and data point to. dyld has to correct this difference (the slide) by adding the offset to the pointers inside the dylib. The offset is calculated as follows:
Slide = actual_address - preferred_address
dyld then adds this offset to every pointer in the __DATA segment that needs rebasing, one after another. This involves page faults and COW, which can create an I/O bottleneck. But because the rebase entries are ordered by address, the kernel sees sequential access, so it can read data ahead and reduce the I/O cost.
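As a worked example of the slide formula, the Python sketch below models __DATA as a list of integers and adds the slide to the slots marked for rebasing. The addresses and the list-based model are invented for the illustration; real rebase information is encoded in __LINKEDIT.

```python
def slide_for(actual_address, preferred_address):
    """Slide = actual_address - preferred_address."""
    return actual_address - preferred_address


def rebase(data_words, rebase_slots, slide):
    """Add the slide to every pointer slot listed in rebase_slots."""
    # Visiting slots in ascending order mirrors the address-ordered,
    # sequential access pattern described above.
    for slot in sorted(rebase_slots):
        data_words[slot] += slide
    return data_words


if __name__ == "__main__":
    slide = slide_for(0x104A3C000, 0x100000000)
    # Slots 0 and 2 hold internal pointers; slot 1 is a plain integer.
    data = [0x100004000, 42, 0x100008010]
    print(hex(slide), [hex(word) for word in rebase(data, [2, 0], slide)])
```

Only the slots listed for rebasing change; the plain value in slot 1 is left alone, which is why dyld needs the rebase information rather than scanning __DATA blindly.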
Binding
Binding handles pointers that point into other dylibs. They are bound by symbol name, which is a string. The __LINKEDIT segment mentioned earlier also stores the pointers that need binding and the symbols they need to point to. dyld has to find the implementation for each symbol, which takes a lot of computation because it means searching the symbol tables. Once found, the address is stored in the pointer in the __DATA segment. Binding looks more computationally expensive than rebasing, but it needs very little I/O, because rebasing has already paged in most of the data that binding touches.
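Binding by symbol name can be pictured as a table lookup followed by a store into a pointer slot. The Python sketch below is a simplified model under assumed data structures: the bind table, the exports dictionary, and the addresses are hypothetical, and real dyld walks compressed bind opcodes and export tries instead.

```python
def bind(data_slots, bind_table, exports):
    """Resolve each (slot, library, symbol) entry and store the address."""
    for slot, library, symbol in bind_table:
        try:
            address = exports[library][symbol]  # lookup by string name
        except KeyError:
            raise LookupError(f"symbol not found: {symbol} in {library}") from None
        data_slots[slot] = address              # write into the pointer slot
    return data_slots


if __name__ == "__main__":
    # Made-up export addresses for a made-up library.
    exports = {"libSystem": {"_malloc": 0x1800, "_free": 0x1900}}
    table = [(0, "libSystem", "_malloc"), (1, "libSystem", "_free")]
    print([hex(address) for address in bind([0, 0], table, exports)])
```

The string lookups are where the computation goes, while the stores touch pages that are typically already resident.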
ObjC Runtime
Many data structures in Objective-C are fixed up by rebasing and binding, such as the pointers in a Class to its superclass and to its methods.
ObjC is a dynamic language, so an object can be instantiated from the name of its class. This means the ObjC runtime has to maintain a global table that maps class names to classes. Whenever a dylib is loaded, all the classes it defines must be registered in that global table.
C++ has the fragile base class problem. ObjC does not, because the offsets of instance variables are adjusted dynamically by fix-ups at load time.
In ObjC, a category can change the methods of a class. Sometimes the class being extended lives in another dylib rather than in your own image (that is, you are modifying a system class or someone else's class), and that requires some fix-ups too.
Selectors in ObjC must be unique.
Initializers
C++ generates initializers for statically allocated objects. ObjC has a method called +load, but its use is discouraged and +initialize is now recommended instead. For the difference, see: http://stackoverflow.com/questions/13326435/nsobject-load-and-initialize-what-do-they-do
Now there is the main executable plus a pile of dylibs whose dependencies form a huge graph, so in what order do the initializers run? Bottom up! Following the dependency graph, the leaf nodes are initialized first, then the intermediate nodes, and finally the root. This order is safe: before a dylib's initializers run, the initializers of every dylib it depends on have already run.
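The bottom-up order described above is a post-order walk of the dependency graph. A minimal Python sketch follows; the graph and image names are hypothetical, and dyld's real traversal has more cases (for example upward links) than this shows.

```python
def initializer_order(root, direct_deps):
    """Return images in the order their initializers would run."""
    order = []
    visited = set()

    def visit(image):
        if image in visited:
            return
        visited.add(image)
        for dep in direct_deps.get(image, []):
            visit(dep)
        # An image is appended only after everything it depends on.
        order.append(image)

    visit(root)
    return order


if __name__ == "__main__":
    # Hypothetical dependency graph for an app.
    deps = {
        "MyApp": ["UIKit", "MyKit"],
        "UIKit": ["Foundation"],
        "MyKit": ["Foundation"],
        "Foundation": ["libSystem"],
    }
    print(initializer_order("MyApp", deps))
```

In this graph libSystem comes first and MyApp last, so every image finds its dependencies already initialized.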
Finally, dyld calls the main() function, and main() calls UIApplicationMain().
Improving Startup Time
There is an animation between tapping the app icon and the app's launch screen appearing, and we want the app to start faster than that animation. Startup speed varies between devices, but it is best to keep startup time within 400ms. Note that if startup takes more than 20s, the system assumes the app is stuck in an infinite loop and kills the process. Naturally, the startup time target should be met on the lowest-end device the app supports. Startup is considered finished once applicationWillFinishLaunching has been called.
Measuring Startup Time
Warm launch: the app and its data are already in memory
Cold launch: the app is not in the kernel's buffer cache
Cold launch time is the important figure to measure. To measure it accurately, reboot the device before measuring. The time spent before main() runs is hard to measure, but fortunately dyld has a built-in measurement: in Xcode, go to Edit Scheme -> Run -> Arguments and set the environment variable DYLD_PRINT_STATISTICS to 1. The console output looks like this:
Total pre-main time: 228.41 milliseconds (100.0%)
         dylib loading time:  82.35 milliseconds (36%)
        rebase/binding time:   6.12 milliseconds (2.6%)
            ObjC setup time:   7.82 milliseconds (3.4%)
           initializer time: 132.02 milliseconds (57.8%)
           slowest intializers :
               libSystem.B.dylib : 122.07 milliseconds (53.4%)
                CoreFoundation :   5.59 milliseconds (2.4%)
Optimizing Startup Time
You can optimize each of the steps that run before the app starts.
Load Dylibs
As mentioned earlier, system dylibs load quickly because they are optimized. Loading embedded dylibs, however, is time-consuming, so try to merge multiple embedded dylibs into one, or use static archives. Lazy loading at run time with dlopen() is not recommended: it can cause problems, and the total overhead ends up larger.
Rebase/Binding
As mentioned earlier, rebasing spends most of its time on I/O, while the binding that follows needs little I/O and spends its time on computation. So the time taken by these two steps is intermingled.
As also mentioned, both steps fix up pointers in the __DATA segment, so the way to reduce their cost is to reduce the number of pointers that need fixing up. For ObjC, that means reducing the amount of metadata: classes, selectors, and categories. Coding guidelines and design patterns encourage writing many small, focused classes and methods and splitting each group of methods into its own category, but this actually increases startup time. For C++, reduce the use of virtual methods, because virtual methods create vtables, which also create structures in the __DATA segment. C++ virtual methods cost less at startup than ObjC metadata, but the cost is still not negligible. Finally, using Swift structs is recommended, since they need fewer fix-ups.
ObjC Setup
There is little to do specifically for this step; almost everything comes from reducing the fix-ups needed in the rebasing and binding steps, because that earlier work also makes this step cheaper.
Initializers
Explicit initialization
Use +initialize instead of +load.
Do not use __attribute__((constructor)) to mark a function explicitly as an initializer. Instead, have the initialization run when it is first needed, for example with dispatch_once(), pthread_once(), or std::call_once(). That is, initialize on first use, which defers part of the work.
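The initialize-on-first-use pattern behind dispatch_once() and pthread_once() can be sketched in Python as follows. This is an analogy in a different language, not the API of either function; the Once class and its method names are made up for the illustration.

```python
import threading


class Once:
    """Run an initializer at most once, the first time the value is needed."""

    def __init__(self, initializer):
        self._initializer = initializer
        self._lock = threading.Lock()
        self._done = False
        self._value = None

    def get(self):
        if not self._done:                 # fast path after the first call
            with self._lock:
                if not self._done:         # re-check while holding the lock
                    self._value = self._initializer()
                    self._done = True
        return self._value


if __name__ == "__main__":
    def build_table():
        print("expensive setup runs now")  # deferred until first use
        return {"ready": True}

    table = Once(build_table)
    print("launch finished; nothing initialized yet")
    print(table.get(), table.get())
```

Nothing runs at construction time, so the cost moves from launch to the first call of get(), and later calls return the cached value.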
Implicit initialization
For C++ static variables with non-trivial constructors:
Move the initialization to the call site.
Use only simple value types (POD: plain old data), so that the static linker can precompute the contents of __DATA and no fix-up work is needed.
Use the compiler warning flag -Wglobal-constructors to find code that is initialized implicitly.
Rewrite the code in Swift, which already handles this for you; strongly recommended.
Do not call dlopen() in an initializer, as it hurts performance. dyld runs before the app starts, while the process is still single-threaded, so it can run without locks. dlopen() switches on multithreading support, which forces the system to take locks; this seriously affects performance and can lead to deadlocks and unexpected behavior. For the same reason, do not create threads in initializers.