Lock-free programming in the Linux kernel


Introduction to non-blocking synchronization

How to correctly and efficiently protect shared data is a major challenge in writing parallel programs, and the answer is synchronization. Synchronization can be divided into blocking synchronization and non-blocking synchronization.

Blocking synchronization means that when a thread reaches a critical section whose lock is already held by another thread, it is blocked and cannot obtain the lock until the other thread releases it. Common blocking primitives are the mutex and the semaphore. If such a synchronization scheme is used improperly, deadlock, livelock, priority inversion, and poor efficiency can result.

To reduce these risks and improve efficiency, the industry has proposed synchronization schemes that do not use locks: non-blocking algorithms. The essential property of a non-blocking algorithm is that stopping or suspending the execution of any one thread does not impede the progress of the other execution entities in the system.

There are three popular classes of non-blocking synchronization:

    1. Wait-free

      Wait-free means that any operation of any thread completes within a bounded number of steps, regardless of how fast or slow other threads run. Wait-free is a per-thread guarantee and can therefore be regarded as starvation-free. Unfortunately, reality is not that rosy: programs claimed to be wait-free do not actually guarantee starvation-freedom, and their memory consumption grows linearly with the number of threads. Only a very few non-blocking algorithms achieve wait-freedom today.

    2. Lock-free

      Lock-free guarantees that, among all threads executing the algorithm, at least one can always make progress. Individual threads are not starvation-free, that is, some threads may be delayed arbitrarily; but since at least one thread makes progress at every step, the system as a whole keeps moving forward, so lock-free can be seen as a system-wide guarantee. All wait-free algorithms are lock-free.

    3. Obstruction-free

      Obstruction-free means that at any point in time, a thread that runs in isolation (with every other thread suspended) completes its operation within a bounded number of steps. As long as there is no contention, a thread keeps running; once it finds that the shared data has been modified under it, it must abandon the partially completed operation and roll back. All lock-free algorithms are obstruction-free.

In summary, obstruction-free is the weakest of the non-blocking guarantees and wait-free the strongest, but wait-free is also the hardest to implement. The lock-free level is the practical middle ground, and it is widely used in real-world software such as the Linux kernel.

Lock-free algorithms are generally built on atomic read-modify-write primitives. LL/SC (load-linked/store-conditional) are the ideal primitives in lock-free theory, but they require CPU instruction support, and unfortunately no CPU directly implements the SC primitive as theory describes it. On this basis the industry settled on the well-known CAS (compare-and-swap) operation to implement lock-free algorithms with atomic operations; Intel provides a comparable instruction, cmpxchg8b.

The CAS primitive compares the value at a memory address (typically one machine word) with an expected value; if they are equal, the value at that address is replaced with a new one. Pseudo-code for the CAS operation is as follows:

Listing 1. CAS pseudo-code

bool CAS(T *addr, T expected, T newValue)
{
        if (*addr == expected) {
                *addr = newValue;
                return true;
        }
        return false;
}

 

In actual development, CAS is used for synchronization with code like the following:


Listing 2. CAS operation

do {
        back up the old data;
        build the new data from the old data;
} while (!CAS(memory address, backed-up old data, new data));

 

In other words, if the comparison finds the two values equal, the shared data has not been modified in the meantime, so it is replaced with the new value and execution continues; if they differ, the shared data has been modified by someone else, so the work just done is discarded and the whole operation is retried. It is easy to see that CAS rests on the assumption that the shared data is usually not modified concurrently, and it follows a commit-retry pattern similar to that of databases. When synchronization conflicts are rare, this assumption yields a large performance gain.
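
To make the commit-retry loop concrete, here is a minimal user-space sketch (not kernel code) of a lock-free counter increment. It assumes GCC's __sync_bool_compare_and_swap builtin; the function name lockfree_increment is purely illustrative.

#include <stdint.h>

/* Minimal sketch: atomically increment *counter with the CAS commit-retry
 * pattern. Back up the old value, build the new value from it, and retry
 * if another thread modified the counter in the meantime. */
static void lockfree_increment(volatile uint64_t *counter)
{
        uint64_t oldval, newval;

        do {
                oldval = *counter;          /* back up the old data       */
                newval = oldval + 1;        /* build new data from it     */
        } while (!__sync_bool_compare_and_swap(counter, oldval, newval));
}

With no contention the loop runs exactly once; under contention it simply retries, which is cheap as long as conflicts stay rare.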


Lock levels

Based on complexity, lock granularity, and running speed, locking schemes can be arranged into the following levels:


Figure 1. Lock levels
 

From blocking synchronization down to non-blocking synchronization, the difference between lock-based and lock-free schemes is essentially one of lock granularity. The scheme at the bottom of the figure is the familiar mutex and semaphore: its code complexity is the lowest, but so is its running efficiency.


Lock-free analysis of the Linux kernel

The Linux kernel is probably one of the largest and most complex parallel programs in existence today; its parallelism comes mainly from interrupts, kernel preemption, and SMP. To keep improving efficiency, kernel designers have, at the global level, gradually abandoned the big kernel lock in order to reduce lock granularity, and, at the level of individual subsystems, they keep optimizing local code: replacing lock-based schemes with lock-free programming such as seqlock and RCU, and reducing lock contention and waiting time with techniques such as double-checked locking and atomic operations.

Kernel lock-free, level 1: fewer locks

Whenever the code in a critical section needs to run under the lock only once, while the check and acquisition of the lock must remain thread-safe, the double-checked locking pattern can be used to reduce lock contention and locking overhead. Double-checked locking is widely used to implement singletons; following the same idea, kernel designers apply it skillfully in kernel code.

When a process has died, that is, it is in the TASK_ZOMBIE state, and the parent process calls the waitpid() system call, the parent has to clean up the child process. The code is as follows:


Listing 3. Double-checked locking in wait_task_zombie()

 984 static int wait_task_zombie(task_t *p, int noreap,
 985                 struct siginfo __user *infop,
 986                 int __user *stat_addr, struct rusage __user *ru)
 987 {
             ......
1103         if (p->real_parent != p->parent) {
1104                 write_lock_irq(&tasklist_lock);
1105                 /* Double-check with lock held. */
1106                 if (p->real_parent != p->parent) {
1107                         __ptrace_unlink(p);
1108                         // TODO: is this safe?
1109                         p->exit_state = EXIT_ZOMBIE;
             ......
1120                 }
1121                 write_unlock_irq(&tasklist_lock);
1122         }
             ......
1127 }

 

If write_lock_irq were placed before line 1103, the scope of the lock would grow, the locking overhead would increase, and efficiency would suffer. If the lock were simply taken inside the check at line 1103, but without the second check at line 1106, would the program still be correct? It would be on a single-core machine, but a problem appears with two cores. Suppose a child process runs on one CPU and is about to call exit(), while the parent runs on the other CPU and executes the code above. Before the child calls release_task(), it first takes the read/write lock tasklist_lock in exit_notify() and calls forget_original_parent(). The parent reaches line 1104 and, because the child already holds the lock, has to wait. Inside forget_original_parent(), if that child itself has children, reparent_thread() is called and executes p->parent = p->real_parent;, which makes the two pointers equal. When the exiting process releases tasklist_lock, the parent on the other CPU wakes up; without the re-check at line 1106 it would simply continue past the now-stale condition and execute the body anyway, causing a bug.

Strictly speaking, double-checked locking does not belong to lock-free programming, but compared with taking the lock on every access it is already great progress. It also offers a glimpse of how kernel developers keep refining the code to lower the lock-conflict rate, shorten waiting time, and raise running efficiency.
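
The same pattern appears in user space, typically for lazy one-time initialization. Below is a minimal sketch assuming a C11 compiler with <stdatomic.h>; the names get_resource and resource are hypothetical, and malloc stands in for whatever expensive initialization the lock protects.

#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(void *) resource;                 /* lazily created object   */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

void *get_resource(void)
{
        /* First check without the lock: the common, already-initialized case. */
        void *r = atomic_load_explicit(&resource, memory_order_acquire);
        if (r == NULL) {
                pthread_mutex_lock(&init_lock);
                /* Double-check with the lock held: another thread may have
                 * finished the initialization while we waited for the mutex. */
                r = atomic_load_explicit(&resource, memory_order_relaxed);
                if (r == NULL) {
                        r = malloc(128);         /* stand-in for real init  */
                        atomic_store_explicit(&resource, r,
                                              memory_order_release);
                }
                pthread_mutex_unlock(&init_lock);
        }
        return r;
}

Once the resource exists, every later call takes only the first, lock-free branch, which is exactly the saving the kernel code above is after.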

Kernel lock-free, level 2: atomic operations

Atomic operations guarantee that an operation executes as a single, uninterruptible unit. The kernel provides two sets of atomic interfaces: one for integer operations and one for individual bit operations. Atomic operations in the kernel are usually inline functions implemented with inline assembly. Simple needs such as global statistics and reference counting boil down to atomic integer arithmetic.
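
As a simple illustration, a reference count kept with the kernel's atomic integer interface might look like the following sketch. The struct my_object and its helper functions are hypothetical; atomic_inc() and atomic_dec_and_test() are the kernel's standard atomic_t operations.

#include <linux/atomic.h>       /* <asm/atomic.h> in 2.6-era kernels */
#include <linux/slab.h>

struct my_object {
        atomic_t refcount;              /* starts at 1 for the creator      */
        /* ... payload ... */
};

static void my_object_free(struct my_object *obj)
{
        kfree(obj);
}

static void my_object_get(struct my_object *obj)
{
        atomic_inc(&obj->refcount);     /* atomic increment, no lock needed */
}

static void my_object_put(struct my_object *obj)
{
        /* Decrement and test for zero in one atomic step; the last caller
         * to drop its reference frees the object, even on SMP. */
        if (atomic_dec_and_test(&obj->refcount))
                my_object_free(obj);
}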

Kernel lock-free, level 3: lock-free

1. Lock-free application scenario 1: the spin lock

The spin lock is a lightweight synchronization method, a kind of non-blocking lock. When a thread fails to acquire the lock, instead of putting itself on a wait queue it spins in a busy loop on the CPU, waiting for the other thread to release the lock. The spin lock implementation is as follows:


Listing 4. Spin lock implementation code

static inline void __preempt_spin_lock(spinlock_t *lock)
{
        ......
        do {
                preempt_enable();
                while (spin_is_locked(lock))
                        cpu_relax();
                preempt_disable();
        } while (!_raw_spin_trylock(lock));
}

static inline int _raw_spin_trylock(spinlock_t *lock)
{
        char oldval;
        __asm__ __volatile__(
                "xchgb %b0,%1"
                :"=q" (oldval), "=m" (lock->lock)
                :"0" (0) : "memory");
        return oldval > 0;
}

 

The assembly instruction xchgb atomically exchanges the 8-bit value in oldval (initialized to 0) with lock->lock. If the value read back into oldval is 1 (the lock is initialized to 1), the lock has been acquired; otherwise the outer loop continues, relaxing the CPU for a while with cpu_relax() and retrying until it succeeds.

From the caller's point of view, acquiring the lock means expecting lock->lock to be 1 and atomically setting it to 0; expressed with the CAS primitive, _raw_spin_trylock(lock) is simply CAS(lock->lock, 1, 0).
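
Expressed with a compiler CAS builtin instead of the xchgb instruction, a trylock of this kind might look like the sketch below. The name my_spin_trylock and the plain int lock word are illustrative, not the kernel's actual implementation; only the convention (1 means free, 0 means held) is taken from the description above.

/* Sketch: succeeds only if the lock is currently free (== 1) and
 * atomically replaces 1 with 0, i.e. CAS(lock, 1, 0). */
static inline int my_spin_trylock(volatile int *lock)
{
        return __sync_bool_compare_and_swap(lock, 1, 0);
}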

If the synchronized operation always completes within a few instructions, a spin lock is an order of magnitude faster than a traditional mutex. Spin locks are mostly used on multi-core systems and suit cases where the lock hold time is shorter than the time it would take to block the thread and wake it up again.

The pthread library also supports spin locks, so user-space programs can easily use them by including pthread.h; in some scenarios pthread_spin_lock is more efficient than pthread_mutex_lock. Regrettably, while the kernel implements a read/write spin lock, pthread does not.
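
A minimal user-space usage sketch of the pthread spin lock API follows; the shared counter and the iteration count are just examples (build with -pthread).

#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t slock;
static long counter;

static void *worker(void *arg)
{
        for (int i = 0; i < 1000000; i++) {
                pthread_spin_lock(&slock);      /* busy-waits, never sleeps    */
                counter++;                      /* very short critical section */
                pthread_spin_unlock(&slock);
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_spin_init(&slock, PTHREAD_PROCESS_PRIVATE);
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);
        pthread_spin_destroy(&slock);
        return 0;
}

Because the critical section is only one increment, the spin lock's busy-wait is cheaper here than putting a thread to sleep and waking it up again.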

2. Lock-free application scenario 2: the seqlock

The most used function of a watch is reading the time, not setting it; if setting the time were the most frequent operation, nobody would buy the watch. A computer's clock is the same: setting the time is a rare event, while reading it is a frequent one. The following code is taken from the 2.4.34 kernel:


Listing 5. Time-of-day code in the 2.4.34 kernel (read/write lock)

void do_gettimeofday(struct timeval *tv)
{
        ......
        read_lock_irqsave(&xtime_lock, flags);
        ......
        sec = xtime.tv_sec;
        usec += xtime.tv_usec;
        read_unlock_irqrestore(&xtime_lock, flags);
        ......
}

void do_settimeofday(struct timeval *tv)
{
        write_lock_irq(&xtime_lock);
        ......
        write_unlock_irq(&xtime_lock);
}

 

It is easy to see that reading the time and setting the time are protected by a read/write spin lock, in which readers and writers have equal priority: as long as a reader holds the lock the writer must wait, and vice versa.

Linux 2.6 introduced a new kind of lock, the seqlock. It is very similar to a read/write spin lock, except that it gives writers higher priority: a writer is allowed to proceed even while readers are reading. The seqlock is useful when there are many readers and only a few writers, because it favors the writer, who always succeeds in taking the lock as long as no other writer is active. Following the lock-free idea and the clock analogy above, kernel developers replaced the read/write lock with a sequence lock (seqlock) in the 2.6 kernel. The code is as follows:


Listing 6. 2.6.10 seqlock implementation code

static inline unsigned read_seqbegin(const seqlock_t *sl)
{
        unsigned ret = sl->sequence;
        smp_rmb();
        return ret;
}

static inline int read_seqretry(const seqlock_t *sl, unsigned iv)
{
        smp_rmb();
        return (iv & 1) | (sl->sequence ^ iv);
}

static inline void write_seqlock(seqlock_t *sl)
{
        spin_lock(&sl->lock);
        ++sl->sequence;
        smp_wmb();
}

void do_gettimeofday(struct timeval *tv)
{
        unsigned long seq;
        unsigned long usec, sec;
        unsigned long max_ntp_tick;
        ......
        do {
                unsigned long lost;
                seq = read_seqbegin(&xtime_lock);
                ......
                sec = xtime.tv_sec;
                usec += (xtime.tv_nsec / 1000);
        } while (read_seqretry(&xtime_lock, seq));
        ......
        tv->tv_sec = sec;
        tv->tv_usec = usec;
}

int do_settimeofday(struct timespec *tv)
{
        ......
        write_seqlock_irq(&xtime_lock);
        ......
        write_sequnlock_irq(&xtime_lock);
        clock_was_set();
        return 0;
}

 

The principle of the seqlock is a sequence counter. When a writer writes the data, it takes a lock and increments the sequence value by 1. A reader reads the sequence number before and after reading the data: if the two values are equal, no write happened in between; otherwise a write did happen, and the reader discards what it read and loops again until it succeeds. It is easy to see that the do...while loop in do_gettimeofday, together with the two assignments that follow it, is exactly the CAS-style commit-retry pattern.

The advantage of the sequence lock is that the writer never waits; the drawback is that readers sometimes have to read the same data several times before they obtain a valid copy. The seqlock is appropriate when the protected critical section is small and simple, writes are rare while reads are frequent (WRRM: write rarely, read mostly), and access has to be fast. However, a seqlock cannot protect data structures that contain pointers, because while the writer is modifying the structure a reader could follow an invalid pointer.
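
The same idea can be sketched in user space. The following illustrative seqlock protects a small, pointer-free structure; it assumes GCC's __atomic_thread_fence builtins as rough stand-ins for smp_wmb()/smp_rmb(), glosses over some strict C11 data-race rules, and is not the kernel's implementation.

#include <pthread.h>

struct sample { long a; long b; };               /* small, pointer-free payload */

static struct sample data;
static volatile unsigned seq;                    /* odd while a write is in flight */
static pthread_mutex_t writer_lock = PTHREAD_MUTEX_INITIALIZER;

void sample_write(long a, long b)
{
        pthread_mutex_lock(&writer_lock);        /* serialize writers only      */
        seq++;                                   /* sequence becomes odd        */
        __atomic_thread_fence(__ATOMIC_RELEASE); /* roughly smp_wmb()           */
        data.a = a;
        data.b = b;
        __atomic_thread_fence(__ATOMIC_RELEASE);
        seq++;                                   /* sequence becomes even again */
        pthread_mutex_unlock(&writer_lock);
}

struct sample sample_read(void)
{
        struct sample copy;
        unsigned start;

        do {
                start = seq;
                __atomic_thread_fence(__ATOMIC_ACQUIRE); /* roughly smp_rmb()   */
                copy = data;
                __atomic_thread_fence(__ATOMIC_ACQUIRE);
        } while ((start & 1) || start != seq);   /* retry if a write intervened */
        return copy;
}

The reader retries while the sequence number is odd (a write is in progress) or changed during the copy, which is exactly what read_seqbegin()/read_seqretry() do in the kernel.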

3. Lock-free application scenario 3: RCU

In the 2.6 kernel, developers also introduced a new lock-free mechanism that lets multiple readers and writers run concurrently: RCU (read-copy-update). The core of RCU is that a write is split into two steps, write and update, which lets read operations proceed unimpeded at any time; in other words, synchronization performance is improved by deferring the publication of the write. RCU is mainly used in WRRM scenarios, but it restricts what it can protect: RCU only protects dynamically allocated data structures that are referenced through pointers, and the read-side critical section must not sleep. The following code, which grows an array dynamically, is taken from the 2.4.34 kernel:

Listing 7. 2.4.34 ipc_lock/grow_ary implementation code (spin lock version)

ipc_lock is the reader and grow_ary is the writer; in 2.4.34 both must take a spin lock to access the protected data structure. Resizing the array is a rare event while reading it is very frequent, and the protected structure is referenced through a pointer, so this is a good fit for RCU. The following code is taken from the 2.6.10 kernel:


Listing 8. 2.6.10 RCU implementation code

#define rcu_read_lock()         preempt_disable()
#define rcu_read_unlock()       preempt_enable()
#define rcu_assign_pointer(p, v)        ({ \
                smp_wmb(); \
                (p) = (v); \
        })

struct kern_ipc_perm *ipc_lock(struct ipc_ids *ids, int id)
{
        ......
        rcu_read_lock();
        entries = rcu_dereference(ids->entries);
        if (lid >= entries->size) {
                rcu_read_unlock();
                return NULL;
        }
        out = entries->p[lid];
        if (out == NULL) {
                rcu_read_unlock();
                return NULL;
        }
        ......
        return out;
}

static int grow_ary(struct ipc_ids *ids, int newsize)
{
        struct ipc_id_ary *new;
        struct ipc_id_ary *old;
        ......
        new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *) * newsize +
                            sizeof(struct ipc_id_ary));
        if (new == NULL)
                return size;
        new->size = newsize;
        memcpy(new->p, ids->entries->p,
               sizeof(struct kern_ipc_perm *) * size +
               sizeof(struct ipc_id_ary));
        for (i = size; i < newsize; i++) {
                new->p[i] = NULL;
        }
        old = ids->entries;
        /*
         * Use rcu_assign_pointer() to make sure the memcpyed contents
         * of the new array are visible before the new array becomes
         * visible.
         */
        rcu_assign_pointer(ids->entries, new);
        ipc_rcu_putref(old);
        return newsize;
}

 

Throughout the process, the writer takes almost no locks apart from the memory barriers. When the writer needs to update the data structure, it first allocates new memory for a copy, then modifies that copy: memcpy copies the contents of the original array into new, and new values are filled into the extended part. After the modification, the writer calls rcu_assign_pointer to switch the relevant pointer so that it points to the new copy; this pointer update is a single atomic operation, so the whole write is published in one step. After modifying the data structure, the writer needs the memory barrier smp_wmb so that other CPUs see the updated pointer value; otherwise a bug would appear in an SMP environment. The old copy is released, via call_rcu, only after all potential readers have finished with it. Like the spin lock, RCU synchronization is mainly intended for SMP environments.
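
Stripped of the ipc details, the pattern in Listing 8 boils down to the following sketch. The struct config and its functions are illustrative names; rcu_read_lock(), rcu_dereference(), rcu_assign_pointer(), and call_rcu() are the kernel's standard RCU primitives, and writers are assumed to be serialized by some other means.

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct config {
        struct rcu_head rcu;
        int value;
};

static struct config *cur_config;       /* always dereferenced via RCU       */

static void config_free_rcu(struct rcu_head *head)
{
        kfree(container_of(head, struct config, rcu));
}

int config_read(void)                    /* reader: lock-free, must not sleep */
{
        struct config *c;
        int v;

        rcu_read_lock();
        c = rcu_dereference(cur_config);
        v = c->value;
        rcu_read_unlock();
        return v;
}

void config_update(int value)            /* writer: copy, modify, publish     */
{
        struct config *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
        struct config *oldc;

        if (newc == NULL)
                return;
        newc->value = value;             /* modify the private copy           */
        oldc = cur_config;               /* writers serialized elsewhere      */
        rcu_assign_pointer(cur_config, newc);   /* atomic pointer publish     */
        call_rcu(&oldc->rcu, config_free_rcu);  /* free after readers drain   */
}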

Kernel lock-free, level 4: no locking at all

The circular buffer (ring buffer) is a common data structure in producer-consumer models. The producer puts data in at one end of the array and the consumer takes it out at the other end; when the end of the array is reached, the index wraps around to the beginning.

If there is only one producer and one consumer, the ring buffer can be accessed without any locks. The write index is accessed and modified only by the producer; as long as the producer stores the new data into the buffer before updating the index, the reader always sees a consistent data structure. Similarly, the read index is accessed and modified only by the consumer.


Figure 2. Implementation principle of the ring buffer
 

When the read index and the write index are equal, the buffer is empty; when the write index has wrapped around and sits just behind the read index, the buffer is full.


Listing 9. 2.6.10 circular buffer implementation code

/*
 * __kfifo_put - puts some data into the FIFO, no locking version
 * Note that with only one concurrent reader and one concurrent
 * writer, you don't need extra locking to use these functions.
 */
unsigned int __kfifo_put(struct kfifo *fifo,
                         unsigned char *buffer, unsigned int len)
{
        unsigned int l;

        len = min(len, fifo->size - fifo->in + fifo->out);

        /* first put the data starting from fifo->in to buffer end */
        l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
        memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);

        /* then put the rest (if any) at the beginning of the buffer */
        memcpy(fifo->buffer, buffer + l, len - l);

        fifo->in += len;

        return len;
}

/*
 * __kfifo_get - gets some data from the FIFO, no locking version
 * Note that with only one concurrent reader and one concurrent
 * writer, you don't need extra locking to use these functions.
 */
unsigned int __kfifo_get(struct kfifo *fifo,
                         unsigned char *buffer, unsigned int len)
{
        unsigned int l;

        len = min(len, fifo->in - fifo->out);

        /* first get the data from fifo->out until the end of the buffer */
        l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
        memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);

        /* then get the rest (if any) from the beginning of the buffer */
        memcpy(buffer + l, fifo->buffer, len - l);

        fifo->out += len;

        return len;
}

 

The code above is taken from the 2.6.10 kernel. As its comments point out, when there is only one consumer and one producer, the shared data can be accessed without any additional locking.
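
The same single-producer/single-consumer idea can be sketched in user space. The following illustrative ring buffer stores single bytes, requires a power-of-two size, and uses GCC fences for the minimum ordering a sketch needs; it is not the kernel's kfifo, and a production version would need the same careful barrier placement.

#define RB_SIZE 1024                     /* must be a power of two            */

struct ring_buffer {
        unsigned char buf[RB_SIZE];
        volatile unsigned int in;        /* modified only by the producer     */
        volatile unsigned int out;       /* modified only by the consumer     */
};

/* Producer side: returns 1 if the byte was stored, 0 if the buffer is full. */
int rb_put(struct ring_buffer *rb, unsigned char c)
{
        if (rb->in - rb->out == RB_SIZE)
                return 0;                                /* full              */
        rb->buf[rb->in & (RB_SIZE - 1)] = c;             /* store data first  */
        __atomic_thread_fence(__ATOMIC_RELEASE);         /* then publish index */
        rb->in++;
        return 1;
}

/* Consumer side: returns 1 if a byte was read, 0 if the buffer is empty. */
int rb_get(struct ring_buffer *rb, unsigned char *c)
{
        if (rb->in == rb->out)
                return 0;                                /* empty             */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);         /* see producer data */
        *c = rb->buf[rb->out & (RB_SIZE - 1)];
        rb->out++;
        return 1;
}

Because in is written only by the producer and out only by the consumer, each free-running index has a single writer, which is what makes the lock-free access safe.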


Summary

Comparing the 2.4 and 2.6 kernel code, one cannot help admiring the kernel developers' ingenuity: to improve kernel performance they keep optimizing the kernel and keep bringing the latest lock-free ideas into it.

In actual development, a lock-free design starts with an analysis of the scenario, because every lock-free solution has its own specific application scenario. A preliminary data-structure design is then made on the basis of that analysis, the concurrency model is built from the same results, and finally the data structures are adjusted until the design is optimal.

 


About the author

Yang Xiaohua is currently engaged in Linux kernel research and is familiar with the Linux interrupt subsystem. You can reach him at normalnotebook@126.com.

Original article: http://www.ibm.com/developerworks/cn/linux/l-cn-lockfree/