Nah Lock: A lock-free memory allocator


Overview

I implemented two lock-free memory allocators: __nalloc and nalloc. I ran a fairly thorough set of tests against them with a benchmarking tool and compared the numbers.

The first allocator benchmarked poorly compared to libc (glibc malloc), but I learned a lot from it. The second lock-free allocator scales linearly as the core count grows to 30, and continues to scale up to 60 cores, though at that point it is only a little better than tcmalloc.

To install: git clone ~apodolsk/repo/nalloc, then read the README.

Memory allocators matter because most programs use them, and many programs use them heavily. For plenty of otherwise good programs, a bad allocator becomes the central point of contention, while a good allocator can act as a drop-in replacement that transparently turns bad memory access patterns into hardware-friendly ones.

All the scalable memory allocators I know of, including existing lock-free allocators, try to turn allocation into a data-parallel problem by splitting the address space into per-CPU or per-thread subheaps. In the best case, this also improves locality, with the side benefits of reduced false sharing and better prefetching, because each thread accesses an adjacent set of thread-private cache lines.
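To make the subheap idea concrete, here is a tiny sketch of the structure (my own illustration, not code from either allocator): each thread owns its own free lists, and shared state is only touched when the local subheap runs dry.

    /* Illustrative sketch of per-thread subheaps (not actual nalloc code).
     * Each thread allocates from its own free lists; shared state is only
     * touched when the local subheap needs more pages. */
    #include <stddef.h>

    struct block;                        /* a free block, defined by the allocator */

    struct subheap {
        struct block *free_lists[16];    /* one list per size class (assumed count) */
        size_t        pages_owned;
    };

    /* One subheap per thread, so the common allocation path is contention-free. */
    static _Thread_local struct subheap my_subheap;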

 

 

Analysis and challenges

 

On the other hand, a parallel allocator is handed fuzzy workloads shaped by unknown dependencies among prior allocations, which take an unknown amount of communication to resolve. So it has to scale across varying degrees of parallelism.

 

I was excited to find that some of the design problems are very similar to those of web servers. Like a server, under bursty workloads the allocator has to hit latency and throughput targets, and it does so by balancing them against resource-usage goals such as limiting fragmentation and memory blowup. "Spin up more nodes, even more than you currently need" and "grab more pages from the global heap, even more than you currently need" have similar costs and benefits, and the same "startup overhead" issues come up.

What follows are the less abstract difficulties I ran into.

__nalloc

Introduction

__nalloc is "naïve" because it is, more or less, the same basic design I would have used in any given semester. I assumed that the bottleneck would be synchronization, so my plan was to drop a fast single-threaded algorithm into an efficient lock-free wrapper. If I had leaned toward analysis instead of "doing something that sounds elegant", or had even skimmed an existing allocator in rough outline, some of the mysterious, unnamed issues I ran into might have been obvious from the very beginning.

Both __nalloc and nalloc only target allocations of 1024 B or less, mainly to keep the task simpler.

The main ideas are as follows:

    • Run a single-threaded allocator on each thread-local subheap ("free lists with boundary tags, so blocks are easy to split and coalesce").

    • Allocate memory from the global heap via a lock-free stack of pages.

    • Use a second lock-free stack, one per thread, to return migrated ("fickle") blocks to their original thread.
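These lock-free stacks carry all of the cross-thread traffic. As a rough illustration (my sketch, not nalloc's code), a minimal Treiber-style stack in C11 atomics looks like the following; note that it ignores the ABA problem, which a real allocator stack has to handle (e.g. with tagged pointers).

    /* Minimal Treiber stack sketch using C11 atomics (illustration only). */
    #include <stdatomic.h>
    #include <stddef.h>

    struct node {
        struct node *next;
    };

    struct lfstack {
        _Atomic(struct node *) top;
    };

    static void lfstack_push(struct lfstack *s, struct node *n)
    {
        struct node *old = atomic_load(&s->top);
        do {
            n->next = old;      /* link on top of the current head */
        } while (!atomic_compare_exchange_weak(&s->top, &old, n));
    }

    static struct node *lfstack_pop(struct lfstack *s)
    {
        struct node *old = atomic_load(&s->top);
        /* Ignores ABA: old->next may be stale if old was popped and re-pushed. */
        while (old && !atomic_compare_exchange_weak(&s->top, &old, old->next))
            ;
        return old;
    }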

The algorithm in more detail follows. If the sketch above made sense, you can skip ahead, although the last point about "fickle" blocks is interesting:

  • Each thread keeps doubly linked free lists of free memory blocks of various sizes.

  • Each block lives in an "arena", a page-sized chunk of memory whose address and size are "naturally aligned". A thread refills its free lists by fetching more arenas from a global lock-free stack of pages.

      • When the stack of pages is empty, the thread calls mmap() to allocate a batch of pages from the OS.

      • Each new arena starts out holding a single block of the maximum size.

  • In malloc(), a thread pops a block off the free list matching the requested size.

      • It splits off any excess space into a new block.

      • If that list is empty, it tries the next-largest size, and so on, until it either succeeds or has to acquire a new arena.

  • In free(), a thread coalesces the freed block with its neighbors as far as possible.

      • To make this work, each block B carries 4 B of header information storing an "is_free" flag, B's size, and the size of B's left neighbor (see the sketch after this list).

      • Any neighbors that get merged are removed from their free lists.

  • If thread F frees a block B that was allocated by thread M, F pushes B onto the lock-free "fickle" stack associated with M.
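Putting the header and ownership pieces together, here is roughly how I would sketch the per-block bookkeeping and the "naturally aligned arena" trick (my reading of the design above, not nalloc's source; the 4096 B arena size and bit-field widths are assumptions):

    /* Sketch of the per-block header and arena lookup (illustration only). */
    #include <stdint.h>

    #define ARENA_SIZE 4096u                 /* page-sized arenas (assumed) */

    /* 4-byte header per block: an is_free flag, this block's size, and the size
     * of the block to its left, so free() can locate and coalesce both neighbors. */
    struct block_header {
        uint32_t is_free   : 1;
        uint32_t size      : 15;             /* allocations are <= 1024 B, so this fits */
        uint32_t left_size : 16;
    };

    /* Because an arena's address is naturally aligned to its size, the arena that
     * owns any block can be recovered by masking the block's address. free() can
     * then check whether the arena belongs to the calling thread, or whether the
     * block must instead be pushed onto its owner's lock-free "fickle" stack. */
    static inline void *arena_of(const void *block)
    {
        return (void *)((uintptr_t)block & ~(uintptr_t)(ARENA_SIZE - 1));
    }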

 

Benchmark tests

In the third test, a single thread allocates into a global pool, and the other threads free memory from that pool.
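For concreteness, here is my guess at the shape of that workload (not the actual benchmark code): a producer thread mallocs blocks into a shared pool of slots, while the other threads claim and free them, so every free is a cross-thread ("fickle") free.

    /* Rough sketch of the third test's workload (illustration only). */
    #include <stdatomic.h>
    #include <stdlib.h>

    #define POOL_SLOTS 1024
    #define BLOCKS     1000000

    static void *_Atomic pool[POOL_SLOTS];

    static void producer(void)
    {
        for (int i = 0; i < BLOCKS; i++) {
            void *blk = malloc(64);          /* size is arbitrary (<= 1024 B) */
            void *expected = NULL;
            /* Publish the block; spin while the slot is still occupied. */
            while (!atomic_compare_exchange_weak(&pool[i % POOL_SLOTS],
                                                 &expected, blk))
                expected = NULL;
        }
    }

    static void consumer(void)
    {
        for (;;) {                           /* termination elided for brevity */
            for (int i = 0; i < POOL_SLOTS; i++) {
                void *blk = atomic_exchange(&pool[i], NULL);
                if (blk)
                    free(blk);               /* frees memory another thread allocated */
            }
        }
    }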

 

 

Arena initialization consumes 15% of the elapsed time. In the relevant assembly, the most time-consuming instruction is the first mov to memory. That would be puzzling, but I happen to know that Linux overcommits memory. That is, mmap() reserves virtual addresses, but it assumes you don't actually need the memory yet, and you don't get physical frames until you first touch the pages. I think arena_init is eating the page faults needed to fill in those new VM mappings. I was told about an mmap flag that turns this off.
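A minimal sketch of what that fix could look like (my guess at the flag is Linux's MAP_POPULATE, which asks the kernel to populate the page tables up front; ARENA_SIZE and ARENAS_PER_MMAP are made-up constants for illustration):

    /* Pre-faulting an arena batch so the faults don't land inside arena_init(). */
    #include <stddef.h>
    #include <sys/mman.h>

    #define ARENA_SIZE      4096
    #define ARENAS_PER_MMAP 64

    static void *fetch_arenas_from_os(void)
    {
        size_t len = (size_t)ARENA_SIZE * ARENAS_PER_MMAP;
        /* Without MAP_POPULATE, the first write to each page takes a minor fault;
         * with it, the faults are paid here instead of on first use. */
        void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        return mem == MAP_FAILED ? NULL : mem;
    }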

 

 

Right after sending Kayvon an email claiming correctness, I realized there was a very bad and obvious race condition:

    • A thread exits while arenas it owns, whose metadata still points at its "fickle" block stack, contain blocks held by other threads.

    • Meanwhile, another thread reads that stale pointer and pushes a block onto the "fickle" stack of the thread that just died.

    • Page fault or crash.

Nalloc's design is fragile here too, and the one fix I found (described later) doesn't apply. You could count the allocated blocks in order to know when it's safe to shut down a dead thread's "fickle" stack. Is there a saner (lock-free) way to handle this rare corner case than falling back to ordinary locking?
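For what it's worth, one shape the block-counting idea could take (purely my illustration; the post leaves the question open) is an owner-biased reference count: the owning thread holds one reference for itself plus one per live block, and whoever drops the count to zero reclaims the dead owner's fickle stack and arenas.

    /* Sketch of the owner-biased refcount idea (illustration only, not from
     * the post). refs starts at 1 for the owning thread, and is incremented
     * once per block it allocates. */
    #include <stdatomic.h>
    #include <stdbool.h>

    struct owner_state {
        atomic_long refs;   /* 1 for the owner itself + 1 per live block */
    };

    /* Called by free() (local or remote) once per block, and once by the owner
     * when it exits. Whoever drops the count to zero is the last user and can
     * safely reclaim the owner's fickle stack and arenas. */
    static bool drop_ref(struct owner_state *o)
    {
        return atomic_fetch_sub(&o->refs, 1) == 1;
    }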

Results

After a long stretch of bug fixing, here is the performance on a 64-core machine for a theoretically perfectly parallel workload with no migrating blocks:


