Author: Sanjay Ghemawat, Paul Menage
Original
Translation: shiningray
TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs. On a GHz-class P4 machine, ptmalloc2 takes about 300 nanoseconds to execute a malloc/free pair (for small objects). The TCMalloc implementation takes about 50 nanoseconds for the same operation pair. Speed matters for a malloc implementation because, if malloc is not fast enough, application writers tend to write their own custom free lists on top of malloc. This can lead to extra code complexity and more memory usage, unless the application writer is very careful to size the free lists appropriately and to regularly scavenge idle objects out of them.
TCMalloc also reduces lock contention in multi-threaded programs. For small objects there is virtually zero contention. For large objects, TCMalloc tries to use fine-grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas, but there is a big problem with how ptmalloc2 uses them. In ptmalloc2, memory can never move from one arena to another. This can lead to a huge amount of wasted space. For example, in one Google application, the first phase may allocate a large amount of memory for its URL canonicalization data structures. When the first phase finishes, a second phase starts in the same address space. If the second phase is assigned a different arena from the one used by the first phase, it does not reuse any of the memory left over from the first phase, and another 300 MB is added to the address space. Similar memory blowup problems have been seen in other applications.
Another benefit of TCMalloc is its space-efficient representation of small objects. For example, N 8-byte objects can be allocated using approximately 8N * 1.01 bytes of space, that is, a 1% space overhead. In ptmalloc2, each object carries a four-byte header (I think) and the final size is rounded up to a multiple of 8 bytes, so the same objects end up using 16N bytes.
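The arithmetic behind that comparison can be worked through in a few lines. This is only a sketch of the calculation described above: the function names are made up for illustration, and the four-byte header is the author's own guess rather than a verified figure.

```python
def tcmalloc_bytes(n, object_size=8, overhead_percent=1):
    """Approximate space for n objects with a ~1% bookkeeping overhead."""
    return n * object_size * (100 + overhead_percent) // 100


def ptmalloc2_bytes(n, object_size=8, header=4, alignment=8):
    """Space for n objects when each carries a header and is rounded up.

    The 4-byte header is the assumption stated in the text ("I think").
    """
    padded = (object_size + header + alignment - 1) // alignment * alignment
    return n * padded


n = 1000
print(tcmalloc_bytes(n))   # 8080, i.e. about 8N * 1.01
print(ptmalloc2_bytes(n))  # 16000, i.e. 16N
```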
To use TCMalloc, just link it into your application with the "-ltcmalloc" linker flag.
You can also use TCMalloc in applications you did not compile yourself, by using LD_PRELOAD:
$ LD_PRELOAD="/usr/lib/libtcmalloc.so"
LD_PRELOAD is tricky, and we do not particularly recommend this mode of usage.
TCMalloc also includes a heap checker and a heap profiler.
If you would rather link in a version of TCMalloc that does not include the heap profiler and checker (for example, to reduce the size of a static binary), you can link in libtcmalloc_minimal instead.
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied directly from the thread-local cache. Objects are moved from the central data structures into a thread-local cache as needed, and periodic garbage collections migrate memory back from a thread-local cache into the central data structures.
TCMalloc treats objects of size <= 32K ("small" objects) differently from larger objects. Large objects are allocated directly from the central heap using a page-level allocator (a page is a 4K-aligned region of memory). That is, a large object is always page-aligned and occupies an integral number of pages.
A run of contiguous pages can be carved up into a sequence of small objects of equal size. For example, a run of one page (4K) can be carved up into 32 objects of 128 bytes each.
Allocation of small objects
The size of each small object maps to one of about 170 allocatable size classes. For example, all allocations in the range 961 to 1024 bytes are rounded up to 1024 bytes. The size classes are spaced as follows: the smallest sizes are separated by 8 bytes, larger sizes by 16 bytes, even larger sizes by 32 bytes, and so on. The maximum spacing (for sizes >= ~2K) is 256 bytes.
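To make the rounding concrete, here is a minimal sketch of a size-class table. The thresholds at which the spacing doubles (128, 256, 512, 1024 and 2048 bytes) are assumptions chosen so that the example from the text holds (961 to 1024 bytes all map to 1024); they are not taken from the TCMalloc source, and this table happens to contain 168 classes rather than exactly 170.

```python
import bisect

MAX_SMALL_SIZE = 32 * 1024  # objects up to 32K count as "small"


def build_size_classes():
    """Build an illustrative size-class table (spacing thresholds assumed)."""
    classes = []
    size = 0
    while size < MAX_SMALL_SIZE:
        if size < 128:
            step = 8
        elif size < 256:
            step = 16
        elif size < 512:
            step = 32
        elif size < 1024:
            step = 64
        elif size < 2048:
            step = 128
        else:
            step = 256  # maximum spacing, used for sizes >= 2K
        size += step
        classes.append(size)
    return classes


SIZE_CLASSES = build_size_classes()


def size_class(nbytes):
    """Round a request up to the smallest size class that can hold it."""
    if not 0 < nbytes <= MAX_SMALL_SIZE:
        raise ValueError("not a small object")
    return SIZE_CLASSES[bisect.bisect_left(SIZE_CLASSES, nbytes)]


print(size_class(961), size_class(1024))  # 1024 1024
```

A binary search over a sorted table keeps the lookup simple here; a production allocator would more likely use a precomputed lookup array indexed by size.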
A thread cache contains a singly linked list of free objects for each size class.
When a small object is allocated:
- We map its size to the corresponding size class.
- We look up the corresponding free list in the thread cache of the current thread.
- If the free list is not empty, we remove the first object from the list and return it. TCMalloc acquires no locks at all on this fast path. This speeds up allocation considerably, because a lock/unlock pair takes about 100 nanoseconds on a 2.8 GHz Xeon.
If the free list is empty:
- We fetch a batch of objects from the central free list for this size class (the central free list is shared by all threads).
- We place them in the thread-local free list.
- We return one of the newly fetched objects to the application.
If the central free list is also empty: (1) we allocate a run of pages from the central page allocator, (2) split the run into a set of objects of this size class, (3) place the new objects on the central free list, and (4) as before, move some of these objects to the thread-local free list.
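The small-object path described above can be modelled roughly as follows. This is a single-threaded sketch, not TCMalloc's code: "addresses" are plain integers, the class names and BATCH_SIZE are hypothetical, and the locking that a real central free list needs is left out.

```python
from collections import deque

PAGE_SIZE = 4096  # 4K pages, as in the text
BATCH_SIZE = 4    # hypothetical number of objects fetched per refill


class PageAllocator:
    """Stand-in for the central page allocator: hands out page numbers."""

    def __init__(self):
        self.next_page = 0

    def allocate_run(self, num_pages):
        start = self.next_page
        self.next_page += num_pages
        return start


class CentralFreeList:
    """Free objects of one size class, shared by all threads."""

    def __init__(self, object_size, pages):
        self.object_size = object_size
        self.pages = pages
        self.free = deque()

    def fetch_batch(self, count):
        # A real implementation would hold a lock here.
        if not self.free:
            # (1) get a run of pages, (2) split it into objects,
            # (3) put the new objects on the central free list.
            base = self.pages.allocate_run(1) * PAGE_SIZE
            for offset in range(0, PAGE_SIZE, self.object_size):
                self.free.append(base + offset)
        return [self.free.popleft() for _ in range(min(count, len(self.free)))]


class ThreadCache:
    """Per-thread free lists, one per size class."""

    def __init__(self, central_lists):
        self.central_lists = central_lists
        self.free_lists = {size: deque() for size in central_lists}

    def allocate(self, size_class):
        free_list = self.free_lists[size_class]
        if not free_list:
            # Slow path: refill from the shared central free list.
            batch = self.central_lists[size_class].fetch_batch(BATCH_SIZE)
            free_list.extend(batch)
        # Fast path: pop the first object, touching no shared state.
        return free_list.popleft()


pages = PageAllocator()
central = {128: CentralFreeList(128, pages)}
cache = ThreadCache(central)
print(cache.allocate(128))  # 0, the first 128-byte object of page 0
```

After the first call, one 4K page has been split into 32 objects, of which 4 were moved to the thread cache and 28 remain on the central list.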
Allocation of large objects
The size of a large object (> 32K) is rounded up to a whole number of pages (4K each) and is handled by the central page heap. The central page heap is again an array of free lists. For k < 256, the k-th entry is a free list of runs that consist of k pages. The 256th entry is a free list of runs whose length is >= 256 pages.
An allocation of k pages is satisfied by looking in the k-th free list. If that free list is empty, we look in the next free list, and so on. Eventually, if necessary, we look in the last free list. If that fails too, we fetch memory from the system (using sbrk, mmap, or /dev/mem).
If an allocation of k pages is satisfied by a run of pages of length > k, the remainder of the run is inserted back into the appropriate free list in the page heap.
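A rough model of the page heap search and split is sketched below. The class and method names are invented for illustration, runs are (start_page, length) tuples, and the `_grow` method merely stands in for obtaining memory from the system; how much to request at a time is an assumption.

```python
PAGE_SIZE = 4096
MAX_PAGES = 256  # free[256] holds every run of 256 pages or more


def pages_for(nbytes):
    """Round a byte size up to a whole number of 4K pages."""
    return (nbytes + PAGE_SIZE - 1) // PAGE_SIZE


class PageHeap:
    def __init__(self):
        # free[k] holds runs of exactly k pages for k < 256.
        self.free = [[] for _ in range(MAX_PAGES + 1)]
        self.next_system_page = 0

    def _insert(self, start, length):
        self.free[min(length, MAX_PAGES)].append((start, length))

    def _grow(self, k):
        # Stand-in for sbrk/mmap; requesting at least 256 pages is assumed.
        length = max(k, MAX_PAGES)
        start = self.next_system_page
        self.next_system_page += length
        return start, length

    def allocate(self, k):
        run = None
        # Look in the k-th list first, then in each following list.
        for index in range(min(k, MAX_PAGES), MAX_PAGES + 1):
            for i, (start, length) in enumerate(self.free[index]):
                if length >= k:
                    run = self.free[index].pop(i)
                    break
            if run is not None:
                break
        if run is None:
            run = self._grow(k)
        start, length = run
        if length > k:
            # Put the unused tail back on the appropriate free list.
            self._insert(start + k, length - k)
        return start, k


heap = PageHeap()
print(heap.allocate(pages_for(33 * 1024)))  # (0, 9): a 33K object needs 9 pages
```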
Spans
The heap managed by TCMalloc consists of a set of pages. A run of contiguous pages is represented by a Span object. A span can be either allocated or free. If it is free, the span is one of the entries in a page heap linked list. If it is allocated, it is either a large object that has been handed off to the application, or a run of pages that has been split up into a sequence of small objects. If it has been split into small objects, the size class of the objects is recorded in the span.
A central array indexed by page number can be used to find the span to which a page belongs. For example, span A occupies 2 pages, span B occupies 1 page, span C occupies 5 pages, and finally span D occupies 3 pages.
In a 32-bit address space, the central array is represented by a two-level radix tree. The root contains 32 entries and each leaf contains 2^15 entries (a 32-bit address space has 2^20 4K pages, and the first level of the tree divides the 2^20 pages by 2^5). As a result, the initial memory usage of the central array is 128 KB (2^15 * 4 bytes), which seems acceptable.
On 64-bit machines, we use a three-level radix tree.
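The two-level lookup for the 32-bit case might look like the sketch below. The PageMap class and its methods are hypothetical names, and allocating leaves lazily on first use is an assumption made to match the stated initial footprint of a single leaf.

```python
PAGE_SHIFT = 12                 # 4K pages
ROOT_BITS = 5                   # 32 root entries
LEAF_BITS = 15                  # 2^15 entries per leaf
LEAF_ENTRIES = 1 << LEAF_BITS


class PageMap:
    """Two-level radix tree from page number to span (32-bit layout)."""

    def __init__(self):
        self.root = [None] * (1 << ROOT_BITS)

    def set(self, page, span):
        hi, lo = page >> LEAF_BITS, page & (LEAF_ENTRIES - 1)
        if self.root[hi] is None:
            # Leaves are created on first use (an assumption of this sketch).
            self.root[hi] = [None] * LEAF_ENTRIES
        self.root[hi][lo] = span

    def set_range(self, start_page, num_pages, span):
        for page in range(start_page, start_page + num_pages):
            self.set(page, span)

    def get(self, page):
        leaf = self.root[page >> LEAF_BITS]
        return None if leaf is None else leaf[page & (LEAF_ENTRIES - 1)]


page_map = PageMap()
# Spans A, B, C and D from the example: 2, 1, 5 and 3 pages.
for name, start, length in [("A", 0, 2), ("B", 2, 1), ("C", 3, 5), ("D", 8, 3)]:
    page_map.set_range(start, length, name)

address = 0x4ABC                            # some address inside page 4
print(page_map.get(address >> PAGE_SHIFT))  # C
```

With 4-byte entries, one leaf costs 2^15 * 4 = 131072 bytes, which is the 128 KB figure quoted above.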
When an object is deallocated, we compute its page number and look up the corresponding span object in the central array. The span tells us whether the object is large or small, and its size class if it is small. If it is a small object, we insert it into the appropriate free list in the thread cache of the current thread. If the thread cache now exceeds a predetermined size (2 MB by default), we run a garbage collector that moves unused objects from the thread cache into the central free lists.
If the object is large, the span tells us the range of pages covered by the object. Suppose this range is [p,q]. We also look up the spans for pages p-1 and q+1. If either of these neighboring spans is free, we coalesce it with the [p,q] span. The resulting span is inserted into the appropriate free list in the page heap.
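The coalescing step can be illustrated with a small sketch. Instead of a page map, it tracks free spans in two dictionaries keyed by first and last page; that representation and the function name are assumptions made to keep the example short.

```python
def release_span(free_by_start, free_by_end, p, q):
    """Free the span covering pages [p, q], merging with free neighbours.

    free_by_start maps a free span's first page to its last page, and
    free_by_end maps its last page to its first page.
    """
    # Left neighbour: a free span that ends at page p-1.
    if p - 1 in free_by_end:
        left_start = free_by_end.pop(p - 1)
        del free_by_start[left_start]
        p = left_start
    # Right neighbour: a free span that starts at page q+1.
    if q + 1 in free_by_start:
        right_end = free_by_start.pop(q + 1)
        del free_by_end[right_end]
        q = right_end
    free_by_start[p] = q
    free_by_end[q] = p
    return p, q


starts, ends = {}, {}
release_span(starts, ends, 0, 1)         # pages 0-1 become free
release_span(starts, ends, 5, 7)         # pages 5-7 become free
print(release_span(starts, ends, 2, 4))  # (0, 7): merged with both neighbours
```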
Central Free List of small objects
As mentioned above, we keep a central free list for each size class. Each central free list is organized as a two-level data structure: a set of spans, and a linked list of free objects per span.
An object is allocated from a central free list by removing the first entry from the linked list of some span. (If all spans have empty linked lists, a suitably sized span is first allocated from the central page heap.)
An object is returned to a central free list by adding it to the linked list of its containing span. If the length of the linked list now equals the total number of small objects in the span, the span is completely free and is returned to the page heap.
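A minimal sketch of this two-level structure follows. The class names are hypothetical, objects are identified by their index within a span, and the list `returned_to_page_heap` stands in for actually handing the span back to the page heap.

```python
class Span:
    """A run of pages split into equally sized small objects."""

    def __init__(self, start_page, num_objects):
        self.start_page = start_page
        self.num_objects = num_objects
        self.free = list(range(num_objects))  # indices of free objects


class SpanFreeList:
    """Central free list for one size class: spans, each with free objects."""

    def __init__(self):
        self.spans = []
        self.returned_to_page_heap = []  # stand-in for the page heap

    def allocate(self):
        for span in self.spans:
            if span.free:
                return span, span.free.pop(0)
        return None  # caller would first fetch a new span from the page heap

    def deallocate(self, span, obj):
        span.free.append(obj)
        if len(span.free) == span.num_objects:
            # Every object is free again, so the whole span can go back.
            self.spans.remove(span)
            self.returned_to_page_heap.append(span)


central = SpanFreeList()
span = Span(start_page=0, num_objects=2)
central.spans.append(span)
_, first = central.allocate()
_, second = central.allocate()
central.deallocate(span, first)
central.deallocate(span, second)
print(len(central.returned_to_page_heap))  # 1
```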
Garbage collection of thread caches
A thread cache is garbage collected when the combined size of all objects in the cache exceeds 2 MB. The garbage collection threshold is automatically lowered as the number of threads increases, so that a program with many threads does not waste an excessive amount of memory.
We walk over all the free lists in the cache and move some number of objects from each free list to the corresponding central list.
The number of objects to be moved from a free list is determined using a per-list low-water mark L. L records the minimum length of the list since the last garbage collection. Note that we could have shortened the list by L objects at the last garbage collection without requiring any extra accesses to the central list. We use this past history as a predictor of future accesses and move L/2 objects from the thread cache free list to the corresponding central free list. This algorithm has the nice property that if a thread stops using a particular size, all objects of that size will quickly move from the thread cache to the central free list, where they can be used by other thread caches.
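The low-water-mark heuristic can be sketched as below. The class and method names are illustrative, and resetting L to the current list length after each collection is an assumption about how "since the last garbage collection" is tracked.

```python
class ThreadFreeList:
    """One thread-cache free list with a low-water mark L."""

    def __init__(self, objects):
        self.objects = list(objects)
        self.low_water = len(self.objects)  # L: shortest length since last GC

    def pop(self):
        obj = self.objects.pop()
        self.low_water = min(self.low_water, len(self.objects))
        return obj

    def push(self, obj):
        self.objects.append(obj)

    def scavenge(self, central):
        """Move L/2 objects to the central list; return how many moved."""
        moved = self.low_water // 2
        for _ in range(moved):
            central.append(self.objects.pop())
        # Start a new measurement interval (assumed reset policy).
        self.low_water = len(self.objects)
        return moved


central = []
free_list = ThreadFreeList(range(10))
a, b = free_list.pop(), free_list.pop()  # list shrinks to 8, so L = 8
free_list.push(a)
free_list.push(b)                        # back to 10 objects, L stays 8
print(free_list.scavenge(central))       # 4, i.e. L/2
```

If the thread then stops using this size, each later collection moves half of what remains, so the list drains geometrically.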
ptmalloc2 unit test
The ptmalloc2 package (now part of glibc) contains a unit test program, t-test1.c. It spawns a number of threads and performs a series of allocations and deallocations in each thread; the threads do not communicate with each other except through synchronization in the memory allocator.
t-test1 (placed in tests/tcmalloc/ and compiled as ptmalloc_unittest1) was run with varying numbers of threads (1 to 20) and maximum allocation sizes (64 bytes to 32 KB). These tests were run on a 2.4 GHz dual-core Xeon system with hyper-threading enabled, running RedHat 9 with Linux glibc-2.3.2, with 1 million operations per test. In each case, the test was run once normally and once with LD_PRELOAD=libtcmalloc.so.
The graphs below show the performance of TCMalloc compared with ptmalloc2 for several different metrics. First, total operations (millions) per elapsed second versus maximum allocation size, for varying numbers of threads. The raw data used to produce these graphs (the output of the time tool) is available in t-test1.times.txt.
- TCMalloc is much more consistently scalable than ptmalloc2: for all thread counts > 1, it achieves about 7 million or more operations per second for small allocations, falling to about 2 million operations per second for larger allocations. The single-threaded case is an obvious outlier, since it can only keep a single processor busy and therefore achieves fewer operations per second. ptmalloc2 has a much higher variance in operations per second: in some cases it peaks at around 4 million operations per second or more, while in others it falls to 1 million operations per second or less.
- TCMalloc is faster than ptmalloc2 in the vast majority of cases, and particularly for small allocations. Contention between threads is less of a problem in TCMalloc.
- TCMalloc's performance drops off as the allocation size increases. This is because each thread cache is garbage collected when it reaches a threshold (2 MB by default). With larger allocation sizes, fewer objects can be stored in the cache before it is garbage collected.
- TCMalloc's performance drops noticeably at a maximum allocation size of about 32K. This is because 32K is the maximum size of objects held in the per-thread caches; for objects larger than this, TCMalloc allocates from the central page heap.
The next graphs show operations (millions) per second of CPU time versus the number of threads, for maximum allocation sizes from 64 bytes to 128 KB.
Once again, we can see that TCMalloc is both more consistent and more efficient than ptmalloc2. For maximum allocation sizes < 32K, TCMalloc typically achieves about 0.5 to 1 million operations per second of CPU time with a large number of threads, while ptmalloc2 generally achieves about 0.5 to 1 million operations per second of CPU time, and in many cases far less than that. Above a maximum allocation size of 32K, TCMalloc drops to 1 to 1.5 million operations per second of CPU time, while ptmalloc2 drops almost to zero for large numbers of threads (that is, with ptmalloc2, a lot of CPU time is burned spinning while waiting for locks in the heavily multi-threaded case).
On some systems, TCMalloc may not work correctly with applications that are not linked against libpthread.so (or its equivalent on your system). It should work on Linux using glibc 2.3, but other OS/libc combinations have not been tested.
TCMalloc may be somewhat more memory-hungry than other mallocs (but it tends not to suffer from the huge blowups that can happen with other mallocs). In particular, at startup TCMalloc allocates a fixed amount of internal memory, on the order of kilobytes.
Do not try to load TCMalloc into a running binary (for example, using JNI in Java programs). The binary will already have allocated some objects using the system malloc, and may try to pass them to TCMalloc for deallocation. TCMalloc cannot handle such objects.