Lock-Free Data Structures with Hazard Pointers
By Andrei Alexandrescu and Maged Michael
Translated by Liu weipeng (pp_liu@msn.com)
Andrei Alexandrescu is a graduate student in the Computer Science Department at the University of Washington and the author of Modern C++ Design. His email address is andrei@metalanguage.com.
Maged Michael is a researcher at IBM's Thomas J. Watson Research Center.
To begin with, I am honored to introduce Maged Michael as coauthor of this installment of Generic<programming>. Maged is an authority on lock-free algorithms, and he has devised strikingly original solutions to some of their hardest problems [3]. In the article below, we present one such exciting technique: the algorithms described here manage the limited resources at hand in an almost magical way.
If you think of the Generic<programming> column as a soap opera, then in this episode the villain gets clobbered. To remind you who that "villain" is, let's review the previous installment [1]. (A brief recap follows, but you should have read the previous article at least once and have a rough idea of the problem being solved.)
In the previous article, after introducing the concepts of lock-free programming, we implemented a lock-free WRRM (Write-Rarely-Read-Many) map. Such data structures commonly appear as global tables, factory objects, and caches. Our nemesis was memory reclamation: given that our data structure must be lock-free, how do we reclaim its memory? As discussed there, this is a tricky problem, because lock-freedom implies an unrestricted opportunity for any thread to operate on any object at any time. To solve it, the previous article made the following attempts:
- Reference-count the map. This strategy is doomed to fail because it requires updating the pointer and the reference count atomically, as a unit, even though they sit at different locations in memory. Although some papers assume DCAS (double compare-and-swap, an instruction that does exactly what is needed here), DCAS never became popular: a lot can be done without it, and it is not even powerful enough to implement transactions efficiently, as one might hope. See [2] for a discussion of DCAS's utility.
- Wait and delete. Once a "cleanup" thread knows a pointer is to be deleted, it could wait "long enough" and then delete it. But how long is "long enough"? This scheme sounds as lame as "allocate a large enough buffer" — and everybody knows how well that works.
- Store the reference count next to the pointer. This solution relies on the less-demanding CAS2 primitive, which can be implemented reasonably: CAS2 atomically compare-and-swaps two adjacent words in memory. Most 32-bit machines support it, though not that many 64-bit machines do (on the latter, however, we can play tricks with the bits inside the pointer).
That last solution works, but it turns our pretty WRRM map into a crude WRRMBNTM (Write-Rarely-Read-Many-But-Not-Too-Many) map: a write cannot proceed until there is absolutely no read in progress. So as long as reads keep overlapping — each new read starting before all the others have finished — writes can be delayed indefinitely. That can hardly be called lock-free.
Inspector Bullet walks into his debtor's office, sits down, lights a cigar, and says, in a perfectly calm voice that freezes the whole bustling office: "I'm not leaving until I get my money back."
A workable reference-counting solution keeps the reference count separate from the pointer. Writes are then no longer blocked by reads, but an old map can be deleted only once its reference count drops to zero [5]. Moreover, this scheme needs only single-word CAS, which makes the technique portable across hardware architectures, including those without CAS2. There is still a problem, though: we haven't eliminated the waiting, we've merely deferred it — which increases the opportunity for memory consumption problems.
Inspector Bullet knocks on his debtor's office door, swallows hard, and asks: "Could you pay me the $19,990 now? No? Oh well, all right. Here's the last $10 I have on me — I'll drop by later to see whether you have any of my $20,000."
Under this scheme, a writing thread may leave behind any number of old maps whose reference counts have yet to drop to zero. In other words, even a single delayed reader can keep an arbitrary amount of memory from being reclaimed — and the longer that reader is delayed, the worse things get.
What we really need is a mechanism by which readers can tell writers not to reclaim certain old maps out from under them, while at the same time readers cannot force writers to hold on to an unbounded number of old maps. There is indeed such a solution, and it is not merely lock-free but wait-free. (To recall the definitions from the previous article: lock-free means that at any time some thread in the system can make progress; wait-free is the stronger condition that all threads can make progress.) Better yet, this solution needs no special operations such as DCAS or CAS2 — only the trusty single-word CAS. Getting interested? Read on.
Hazard Pointers
Let's revisit the code from the previous article. We have a WRRMMap template that holds a pointer to a classic single-threaded map object (such as std::map) and provides a multithreaded, lock-free access interface to the world:
template <class K, class V>
class WRRMMap {
   Map<K, V>* pMap_;
   ...
};
Whenever WRRMMap needs updating, the updating thread first makes a full copy of the map, updates the copy, points pMap_ to the copy, and finally deletes the old map that pMap_ used to point to. We don't consider this inefficient, because WRRMMap is read often and updated only rarely. The trouble is disposing of the old pMap_ properly, since other threads could be reading it at any moment.
Hazard pointers are our technique for this. With hazard pointers, threads can inform all other threads of their memory usage safely and efficiently. Each reading thread owns a single-writer/multi-reader shared pointer called a "hazard pointer." When a reader assigns a map's address to its hazard pointer, it is announcing to the other (writing) threads: "I am reading this map. Replace it if you want, but don't change its contents — and by no means delete it."
On the writers' side, before deleting any replaced old map, a writer must check the readers' hazard pointers to see whether the old map is still in use. Specifically, if one or more readers set their hazard pointers to a map before a writer replaced it, the writer may delete that old map only after those hazard pointers have been released.
Whenever a writer replaces a map, it pushes the replaced old map onto a private list. When the number of old maps on that list reaches a threshold (we'll discuss how to choose it), the writer scans the readers' hazard pointers, matching them against its list of old maps. If an old map matches no reader's hazard pointer, it is safe to destroy; otherwise the writer keeps it on the list until the next scan.
Here are the data structures we'll use. The main shared structure is a singly linked list of hazard pointer records (HPRecType), pointed to by pHead_. Each record contains the hazard pointer itself (pHazard_), a flag indicating whether the record is in use (active_), and a pointer to the next record (pNext_).
HPRecType offers two primitives: Acquire and Release. HPRecType::Acquire returns a pointer to an HPRecType record — call it p — so that the acquiring thread can set p->pHazard_ and be sure that other threads will treat that pointer with care. When the thread no longer needs the record, it calls HPRecType::Release(p).
// Hazard pointer record
class HPRecType {
   HPRecType* pNext_;
   int active_;
   // Global header of the HP list
   static HPRecType* pHead_;
   // The length of the list
   static int listLen_;
public:
   // Can be used by the thread
   // that acquired it
   void* pHazard_;
   static HPRecType* Head() {
      return pHead_;
   }
   // Acquires one hazard pointer
   static HPRecType* Acquire() {
      // Try to reuse a retired HP record
      HPRecType* p = pHead_;
      for (; p; p = p->pNext_) {
         if (p->active_ ||
             !CAS(&p->active_, 0, 1))
            continue;
         // Got one!
         return p;
      }
      // Increment the list length
      int oldLen;
      do {
         oldLen = listLen_;
      } while (!CAS(&listLen_, oldLen,
                    oldLen + 1));
      // Allocate a new one
      p = new HPRecType;
      p->active_ = 1;
      p->pHazard_ = 0;
      // Push it to the front
      HPRecType* old;
      do {
         old = pHead_;
         p->pNext_ = old;
      } while (!CAS(&pHead_, old, p));
      return p;
   }
   // Releases a hazard pointer
   static void Release(HPRecType* p) {
      p->pHazard_ = 0;
      p->active_ = 0;
   }
};
// Per-thread private variable
__per_thread__ vector<Map<K, V>*> rlist;
Each thread owns a "retired list" (in our implementation, really a vector<Map<K, V>*>) — a container holding the pointers that this thread no longer needs and that can be deleted as soon as no other thread uses them. Access to the retired list needs no synchronization: it lives in per-thread storage, and only one thread (its owner) ever touches it. We assume a __per_thread__ qualifier that spares us the details of allocating thread-local storage. Given this, whenever a thread wants to dispose of an old pMap_, it simply calls the Retire function. (Note: as in the previous article, we omit the necessary memory barriers from the code for simplicity's sake.)
template <class K, class V>
class WRRMMap {
   Map<K, V>* pMap_;
   ...
private:
   static void Retire(Map<K, V>* pOld) {
      // Put it in the retired list
      rlist.push_back(pOld);
      if (rlist.size() >= R) {
         Scan(HPRecType::Head());
      }
   }
};
No hidden magic — that's all! Our Scan function now performs a set-difference computation between two sets: the current thread's retired list on one side, and all threads' hazard pointers on the other. What does that difference mean? Stop and think: the first set holds the old map pointers the current thread considers useless; the second represents the old maps that one or more threads are actively using (the maps their hazard pointers point to). By the definitions of the retired list and of hazard pointers, if a pointer has been retired and is not referenced by any thread's hazard pointer, the old map it points to can be safely destroyed. In other words, the difference of the two sets is precisely the set of old maps that may be safely destroyed.
The Main Algorithm
OK, let's now see how Scan is implemented and what guarantees it provides. Scan computes the set difference between rlist (the retired list) and the hazard-pointer list hanging off pHead_. In other words: "For each pointer in the retired list, check whether it also appears among the hazard pointers; if not, it belongs to the difference and can be safely destroyed." As an optimization, we first sort the collected hazard pointers and then binary-search them for each retired pointer. Here is the implementation of Scan:
void Scan(HPRecType* head) {
   // Stage 1: Scan hazard pointers list
   // collecting all non-null ptrs
   vector<void*> hp;
   while (head) {
      void* p = head->pHazard_;
      if (p) hp.push_back(p);
      head = head->pNext_;
   }
   // Stage 2: sort the hazard pointers
   sort(hp.begin(), hp.end(),
        less<void*>());
   // Stage 3: Search for 'em!
   vector<Map<K, V>*>::iterator i =
      rlist.begin();
   while (i != rlist.end()) {
      if (!binary_search(hp.begin(),
                         hp.end(),
                         *i)) {
         // Aha!
         delete *i;
         if (&*i != &rlist.back()) {
            *i = rlist.back();
         }
         rlist.pop_back();
      } else {
         ++i;
      }
   }
}
The last loop does the actual work. It uses a little trick to avoid shuffling the rlist vector: after deleting a disposable pointer, it overwrites that slot with rlist's last element and then pops the last element off. This is allowed because the elements of rlist need not stay in any particular order. We use the C++ standard library algorithms sort and binary_search, but you can replace the vector with your favorite easily searched container, such as a hash table. A well-balanced hash table has constant expected lookup time, and it is easy to build one here: the table is completely private, and all its values are known before it is organized.
So how efficient is Scan? First, note that the entire algorithm is wait-free (as advertised earlier): each thread's execution time does not depend on the behavior of any other thread.
Second, call the typical size of rlist R; we use that value as the threshold that triggers Scan (see the WRRMMap<K, V>::Retire function above). If we keep the hazard pointers in a hash table instead of a sorted vector, Scan's expected complexity drops to O(R). Finally, the number of old maps that are retired but not yet deleted is at most N*R, where N is the number of writer threads: only writers replace old maps, and each writer has a private retired list of length at most R. As for the choice of R itself, a good value is (1+k)H, where H is the total number of hazard pointers (listLen_ in our code — in our example, the number of reader threads, since each reader holds one hazard pointer) and k is a small positive constant, say 1/4. So R is larger than, and proportional to, H. With this choice each Scan is guaranteed to delete R-H — that is, O(R) — old maps, and with a hash table a Scan takes O(R) time, so the amortized cost of determining that a node can be safely destroyed is constant.
Hooking Up WRRMMap
Let's now plug hazard pointers into WRRMMap's primitives, Lookup and Update. For writers (threads executing Update), all that changes is that wherever they would normally call delete pMap_, they call WRRMMap<K, V>::Retire instead.
void Update(const K& k, const V& v) {
   Map<K, V>* pNew = 0;
   Map<K, V>* pOld;
   do {
      pOld = pMap_;
      delete pNew;
      pNew = new Map<K, V>(*pOld);
      (*pNew)[k] = v;
   } while (!CAS(&pMap_, pOld, pNew));
   Retire(pOld);
}
Readers first grab a hazard-pointer record via HPRecType::Acquire and point its hazard pointer at pMap_ so that writers can see it. When a reader is done with the pointer, it releases the record with HPRecType::Release.
V Lookup(const K& k) {
   HPRecType* pRec = HPRecType::Acquire();
   Map<K, V>* ptr;
   do {
      ptr = pMap_;
      pRec->pHazard_ = ptr;
   } while (pMap_ != ptr);
   // Save Willy
   V result = (*ptr)[k];
   // pRec can be released now
   // because it's not used anymore
   HPRecType::Release(pRec);
   return result;
}
In the code above, why must the reader recheck that pMap_ still equals ptr (the while (pMap_ != ptr) condition)? Consider this scenario: pMap_ points to map m; the reader loads pMap_ into ptr (so ptr == &m) and then goes to sleep before it manages to point its hazard pointer at m. While it sleeps, a writer sneaks in, replaces pMap_, pushes the map m that pMap_ used to point to onto its retired list, and then scans the hazard pointers. Since the sleeping reader never got to publish its hazard pointer, m looks like a disposable old map and gets destroyed. The reader then wakes up and sets its hazard pointer to ptr (that is, &m) — but if it went on to dereference ptr, it would read corrupted data or access unmapped memory. That is why the reader must check pMap_ again: if pMap_ no longer equals &m, the reader cannot tell whether the writer that retired m saw its hazard pointer set to &m (in other words, whether m has already been destroyed), so it is not safe to go on reading m, and the reader must start over.
If, on the other hand, the reader finds that pMap_ still points to m, it is safe to read m. Does that mean pMap_ never changed between the two reads of pMap_? Not necessarily: m might have been unlinked from and relinked to pMap_ one or more times in the interim — but that doesn't matter. What matters is that m is linked to pMap_ at the moment of the second read, and that the reader's hazard pointer already points to it. From that point on, until the hazard pointer changes, access to m is safe.
With this in place, Lookup and Update are lock-free (though still not wait-free): readers don't block writers, and multiple readers never trample one another (something no reference-counting solution can achieve). We thus get our dream WRRM map: reads are very fast and don't interfere with each other, updates are still fast, and overall progress is guaranteed.
If we want Lookup to be wait-free, we can use the "trap" technique [4], itself built on hazard pointers. In the code above, when a reader sets its hazard pointer to ptr, it is really trying to capture a particular map (*ptr). With traps, the reader can set a trap that is guaranteed to capture some valid map, which makes Lookup a wait-free algorithm. Similarly, under automatic garbage collection Lookup is wait-free as well (see the previous installment of this column). For details of the trap technique, see [4].
Generalization
By now we have pretty much completed a solid map design, but a few loose ends deserve mention so that you get a full picture of the technique. For a complete treatment, see [3].
We now know how to share one map. What if we want to share many objects? Happily, the answer is simple: the algorithm generalizes naturally to multiple hazard pointers per thread, and the number of pointers a thread must protect at any given time is typically very small. Moreover, hazard pointers can be "overloaded" (their type is void*, after all): a thread may use the same hazard pointer with any number of data structures, as long as it uses it for only one of them at any one time. In most cases, one or two hazard pointers per thread suffice for an entire program.
Note that Lookup calls HPRecType::Release as soon as it is done. In performance-conscious applications, we could instead acquire the hazard-pointer record only once, use it across many Lookup calls, and release it just once at the end.
Conclusion
For a long time, people sought a satisfactory solution to the memory-reclamation problem of lock-free algorithms — at times it even seemed there was none. Yet, as it turns out, with only a little bookkeeping, and by carefully shuttling data between thread-private and shared state, we can build a powerful and satisfying algorithm that guarantees both speed and bounded memory usage. Moreover, although we used WRRMMap as our example throughout, the hazard-pointer technique applies to far more intricate data structures. Memory reclamation matters even more for dynamic structures that can grow and shrink arbitrarily: imagine a program with thousands of linked lists, each of which may grow to millions of nodes. That is where hazard pointers really show their strength.
The worst a reader thread can do is to die without setting its hazard pointers to null: that keeps the objects those hazard pointers refer to — at most one object per hazard pointer — from ever being released.
Inspector Bullet bursts into his debtor's office, realizes at once that he won't get his money back today, and says without missing a beat: "Pal, you're on my blacklist. I'll be visiting you again and again — unless you drop dead. And even in that unfortunate case, what you owe me will never grow beyond $100. Good day!"
References
[1] Alexandrescu, Andrei. "Generic<Programming>: Lock-Free Data Structures," C/C++ Users Journal, October 2004.
[2] Doherty, Simon, David L. Detlefs, Lindsay Groves, Christine H. Flood, Victor Luchangco, Paul A. Martin, Mark Moir, Nir Shavit, and Guy L. Steele, Jr. "DCAS Is Not a Silver Bullet for Nonblocking Algorithm Design." Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 216-224. ACM Press, 2004. ISBN 1-58113-840-7.
[3] Michael, Maged M. "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects," IEEE Transactions on Parallel and Distributed Systems, pages 491-504. IEEE, 2004.
[4] Michael, Maged M. "Practical Lock-Free and Wait-Free LL/SC/VL Implementations Using 64-Bit CAS," Proceedings of the 18th International Conference on Distributed Computing, LNCS volume 3274, pages 144-158, October 2004.
[5] Tang, Hong, Kai Shen, and Tao Yang. "Program Transformation and Runtime Support for Threaded MPI Execution on Shared-Memory Machines," ACM Transactions on Programming Languages and Systems, 22(4):673-700, 2000 (citeseer.ist.psu.edu/tang99program.html).