For most developers, system memory allocation is a black box hidden behind a few APIs: ask for memory and it is handed over; if it cannot be, you find another way. That is what I thought before coming to UC. Once I dug into the field, I found it is fiercely competitive: there are memory allocators at the operating-system level and at the application level, allocators designed for real-time systems and allocators designed for server programs. They all pursue the same goal, though: balancing allocation performance against memory-usage efficiency.
From a browser developer's point of view, the growth of mobile phone memory still lags behind the growth of web content, and a big memory consumer like a browser on Android has to budget its memory carefully. Recently, for work reasons, I had a small brush with memory allocators, so I did some shallow study of them; I have not read the code in full or done thorough testing. I am writing down a summary and the related references here for future use.
Real-world problems with memory allocation
First, the memory allocator we usually use, that is, the malloc/free functions, is not provided by the operating system but by the C standard library; it is also called a dynamic memory allocator. The allocator obtains virtual memory from the operating system in units of pages (usually 4 KB) through sbrk or mmap, and then manages that memory itself.
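To make that boundary concrete, here is a minimal sketch, assuming a Linux-like system with mmap, of an allocator grabbing whole pages from the kernel and carving them up itself. The bump-pointer carving and the names (tiny_alloc, the single page pool) are hypothetical, and freeing/coalescing are ignored; it is not taken from any real allocator.

```c
/* Minimal sketch: an allocator obtains whole pages from the OS via mmap
 * and carves them up itself. Purely illustrative: no free, no coalescing,
 * no metadata, and only one page is kept around at a time. */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

static char  *pool;                    /* current page obtained from the OS */
static size_t pool_used = PAGE_SIZE;   /* forces a page request on first use */

static void *tiny_alloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;                 /* 8-byte alignment */
    if (n > PAGE_SIZE)
        return NULL;                          /* large objects not handled here */
    if (pool_used + n > PAGE_SIZE) {          /* need a fresh page from the kernel */
        pool = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (pool == MAP_FAILED)
            return NULL;
        pool_used = 0;
    }
    void *p = pool + pool_used;               /* bump-pointer carve */
    pool_used += n;
    return p;
}

int main(void)
{
    int *x = tiny_alloc(sizeof *x);
    if (x) { *x = 42; printf("%d\n", *x); }
    return 0;
}
```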
As mentioned above, a memory allocator faces two core problems: memory-usage efficiency and performance (throughput). The former ensures there is memory available whenever it is needed; the latter ensures that allocation and release are serviced quickly.
For a process, setting aside memory-usage bugs, an out-of-memory (OOM) condition has two causes:
1. The system really has no memory available.
2. Memory allocation wastes a lot of space: although there is plenty of scattered free space, it cannot be combined and used. The former is real OOM; the latter is the fragmentation problem.
When malloc in libc fails to allocate memory, the process aborts by default, i.e. it crashes. If the system supports mallopt, there is a chance to change this behavior; unfortunately, Android does not support it yet.
When loading, parsing, and rendering pages, a browser allocates a large number of small objects, as the figure below shows:
The horizontal axis is the object size and the vertical axis is the number of allocation requests. If memory were only ever requested in whole pages, things would be simple and no allocator would be needed. It is the small, frequently used objects that easily leave unusable gaps inside pages (internal fragmentation).
As for performance, memory allocation can become a bottleneck alongside swap I/O. An important indicator for an allocator is whether its response time is bounded (bounded limits): the average may look good, but can performance be guaranteed in the worst case? Especially in multi-threaded scenarios, the performance of allocation and release is often hurt by lock contention. Some allocators (such as ptmalloc) that weigh performance too heavily cannot share memory between threads; each thread holds on to its own portion, which lowers memory-usage efficiency.
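To make the bounded-response-time point concrete, here is a tiny, hypothetical harness that records the worst single malloc/free latency instead of only the average; a real benchmark would vary sizes, allocation patterns, and thread counts.

```c
/* Sketch: track the worst-case (not just average) malloc/free latency.
 * Illustrative only; sizes and iteration count are made up. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static long long ns_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    long long total = 0, worst = 0;
    enum { N = 100000 };

    for (int i = 0; i < N; i++) {
        long long t0 = ns_now();
        void *p = malloc(64);          /* one small allocation */
        free(p);
        long long dt = ns_now() - t0;
        total += dt;
        if (dt > worst)                /* keep the slowest single call */
            worst = dt;
    }
    printf("avg %lld ns, worst %lld ns\n", total / N, worst);
    return 0;
}
```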
These problems have always existed, and different people have designed different allocator algorithms for different scenarios (from the application's perspective these are called DSA, Dynamic Storage Allocation algorithms); in addition, almost every one claims to be better than the others. For example:
1. dlmalloc/ptmalloc/ptmallocX: the allocator provided by the C standard library, i.e. the default malloc/free used by applications.
2. tcmalloc: used within Google and in WebKit and Chrome.
3. bmalloc: Chrome and WebKit have drifted further and further apart, and Apple added a new allocator to the latest WebKit code (2014-04), claiming it far exceeds tcmalloc, at least in performance.
4. jemalloc: originally developed for FreeBSD, later adopted by the Firefox browser and by Facebook's servers, and greatly improved through those deployments.
5. hoard: an allocator optimized for multi-threading. Its author is a university professor, and it uses some distinctive techniques; the malloc in Mac OS X reportedly drew on its implementation.
* WebKit also provides so-called plain old data arena classes for render objects, which are another memory-pool implementation (PODIntervalTree, PODArena).
Core Ideas and Algorithms
The core ideas of the allocators are similar; the differences lie in the algorithms and in how metadata is stored. The paper in reference 13 has a fairly comprehensive summary and is worth reading.
The core ideas of a memory allocator can be summarized as follows:
1. Basic function: first define the smallest unit (chunk) of the memory region (memory pool), then manage memory separately according to object size. Small memory is divided into a number of size classes, each used to hand out fixed-size blocks and managed with its own free list, to reduce internal fragmentation. Large memory is managed in units of pages and kept apart from the pages that hold small objects, to reduce fragmentation. Design a good storage scheme for the metadata so that it costs little memory, and organize it so that access to each size class or large memory region performs well and is bounded (bounded limits).
For example, dlmalloc defines bins (equivalent to size classes) to hold memory blocks of different sizes:
2. Reclamation and prediction: when memory is freed, the allocator must be able to coalesce small blocks back into larger ones. Based on certain conditions it keeps some freed memory around so the next request can be served quickly; memory it does not need to keep is returned to the system to avoid occupying it for a long time.
3. Optimize for multiple threads: in a multi-threaded environment, each thread can own a memory region of its own, known as TLS (Thread Local Storage), so operations within a thread need no locks, which improves performance (a minimal code sketch of this idea follows this list). The figure below is the TLS schematic posted on MSDN (see also reference 14):
* In addition, tooling is essential, for example tcmalloc's heap profiler and jemalloc's integration with valgrind.
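As a rough illustration of point 3, the following is a minimal sketch of a thread-local free list. The names (cached_alloc, cached_free, BLOCK_SIZE) are hypothetical; a real allocator keeps per-size-class caches with limits and a way to give memory back to a shared pool.

```c
/* Sketch of the TLS idea: each thread keeps its own small free list,
 * so the hot path needs no lock. Simplified and hypothetical. */
#include <stdlib.h>

struct node { struct node *next; };

#define BLOCK_SIZE 64   /* single fixed size class for this sketch */

/* one cache per thread (__thread is a GCC/Clang extension; C11 _Thread_local also works) */
static __thread struct node *tls_free_list;

static void *cached_alloc(void)
{
    struct node *n = tls_free_list;
    if (n) {                        /* fast path: thread-private, no lock */
        tls_free_list = n->next;
        return n;
    }
    return malloc(BLOCK_SIZE);      /* slow path: fall back to the shared heap */
}

static void cached_free(void *p)
{
    struct node *n = p;             /* push back onto this thread's list */
    n->next = tls_free_list;
    tls_free_list = n;
}
```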
The ideas above are basically the same for every allocator; the differences are in the details. How should the size classes be organized? A fixed step size would produce a huge, inefficient table (see the first figure for why). How is the memory block for a given pointer located? How is false sharing of cache lines avoided across threads? Different allocators answer these questions with different algorithms and data structures, collectively referred to as DSA (Dynamic Storage Allocation) algorithms.
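To see why a fixed step size would blow up the table, here is a hypothetical size-class rounding function whose steps grow with the request size, loosely in the spirit of tcmalloc/jemalloc but not copied from either; the breakpoints are made up for illustration.

```c
/* Hypothetical size-class mapping: fine steps for small sizes, coarser
 * steps for larger ones, so the class table stays small while internal
 * fragmentation stays bounded. Not the actual table of any real allocator. */
#include <stddef.h>

static size_t round_to_class(size_t n)
{
    if (n <= 128)                         /* 8-byte steps: 8, 16, ..., 128 */
        return (n + 7) & ~(size_t)7;
    if (n <= 1024)                        /* 64-byte steps: 192, 256, ..., 1024 */
        return (n + 63) & ~(size_t)63;
    if (n <= 8192)                        /* 512-byte steps: 1536, ..., 8192 */
        return (n + 511) & ~(size_t)511;
    return 0;                             /* larger requests go to the page-level allocator */
}
```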
The specific algorithm implementations can be found in the reference list below; you can also read reference 16 first to get a general impression.
Some of the basic algorithms are also similar, for example organizing free lists with binary trees; the differences lie in details such as in-place metadata and Cartesian trees versus red-black trees. Across threads, different implementations lead to different memory usage and performance. For example, jemalloc does not require a block to be freed on the thread that allocated it; the block simply goes back to the allocating side's free list, whereas ptmalloc must hand the block back to the allocating thread to be freed, so its performance is weaker. tcmalloc's design even lets a thread take some space from its neighbors (through the transfer cache), making better use of memory across threads.
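The cross-thread free behaviour described above can be sketched roughly as follows: a block freed by a foreign thread goes onto a lock-protected "remote" list owned by the allocating cache, which the owner drains later. This is only a simplified illustration of the general idea, not how jemalloc or tcmalloc actually implement it, and all names here are hypothetical.

```c
/* Simplified illustration of cross-thread frees. The owner thread frees
 * lock-free into its private list; foreign threads push onto a small
 * lock-protected remote list that the owner drains later. */
#include <pthread.h>
#include <stddef.h>

struct block { struct block *next; };

struct cache {
    pthread_t       owner_tid;
    struct block   *local_free;     /* touched only by the owning thread */
    struct block   *remote_free;    /* touched by other threads, under lock */
    pthread_mutex_t remote_lock;    /* assumed initialized when the cache is created */
};

void cache_free(struct cache *c, struct block *b)
{
    if (pthread_equal(pthread_self(), c->owner_tid)) {
        b->next = c->local_free;             /* fast path: no lock */
        c->local_free = b;
    } else {
        pthread_mutex_lock(&c->remote_lock); /* slow path: foreign thread */
        b->next = c->remote_free;
        c->remote_free = b;
        pthread_mutex_unlock(&c->remote_lock);
    }
}

void cache_drain_remote(struct cache *c)     /* called by the owner thread */
{
    pthread_mutex_lock(&c->remote_lock);
    struct block *list = c->remote_free;
    c->remote_free = NULL;
    pthread_mutex_unlock(&c->remote_lock);
    while (list) {                           /* splice into the private list */
        struct block *next = list->next;
        list->next = c->local_free;
        c->local_free = list;
        list = next;
    }
}
```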
Advantages and disadvantages
ptmalloc. Disadvantages: performance and memory usage in multi-threaded programs (memory cannot be shared among threads), and the metadata overhead is high, wasting a lot of space on small allocations. Advantage: it is the standard implementation. tcmalloc. Disadvantage: because of its algorithm design, the memory it holds on to can be large. Advantage: multi-threaded performance; see reference 6. jemalloc. Advantages: low fragmentation rate and better multi-core performance than tcmalloc; see reference 17.
Time is limited, so I have not researched further. In practice there are still parameters worth tuning, provided you are familiar with the implementation, and in particular with how to evaluate performance.
Reference
This is the longest reference list in my column; earlier researchers have indeed done a lot of work. I have only read the items in the list, and not all of them are directly related; I list the ones whose content can corroborate one another.
1. Reflections on using red-black trees in jemalloc. [LINK]
The article was written in 2008; when jemalloc was applied at Facebook in 2009, the algorithm was further optimized.
2. On jemalloc: in 2011, after deploying jemalloc at Facebook, the author wrote an article about its core algorithms and its effect at Facebook. [LINK] [earlier paper with more details]
3. Measuring memory fragmentation on Android, done by modifying the ROM.
4. Hoard official site. [LINK]
5. How does malloc work on Mac OS [LINK]
6. A comparison made when WebKit adopted tcmalloc. [LINK]
7. How tcmalloc works [LINK] [Chinese translation]
8. tcmalloc source code analysis, very good information. The author's website is also worth reading. [LINK 1]
9. Early technical documentation of dlmalloc, describing its core algorithms. [LINK]
10. ptmalloc source code analysis, very systematic and worth reading. [CSDN download link]
11. An introduction to jemalloc: better memory management with jemalloc. [LINK]
12. Four methods to replace the system malloc [LINK]
13. Introduces TLSF, a memory allocation algorithm optimized for real-time systems, and surveys dynamic storage allocation (DSA) algorithms. [LINK]
14. Thread Local Storage on Wikipedia, to get a sense of how widely it is supported. [LINK]
15. A comparison of various allocation algorithms for real-time systems; can be read together with 13. [LINK]
16. Study on the memory allocation policies of ptmalloc, tcmalloc, and jemalloc. [LINK]
17. Firefox 3's adoption of jemalloc, which shows Firefox's optimization approach. [LINK] [source code used by Firefox]
18. The Chromium project's out-of-memory handling, which has some good ideas. [LINK]