English original: Distributed Read-Write Mutex in Go
Go's standard sync.RWMutex does not perform well on multi-core machines, because all readers contend on the same memory address when they atomically increment the reader count. This article explores an n-way RWMutex, also known as a "big reader" lock, which allocates an independent RWMutex for each CPU core. A reader only takes the read lock belonging to its own core, while a writer must take every per-core lock in turn.
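To make the scheme concrete, here is a minimal sketch of such an n-way lock. The names DRWMutex, New, and the explicit core parameter are illustrative choices for this sketch; the benchmark code later in the article uses an equivalent slice type called RWMutex2.

// Illustrative sketch of an n-way ("big reader") lock: one sync.RWMutex per core.
// A production version would likely pad each element to its own cache line to
// avoid false sharing, which is not shown here.
package drwmutex

import "sync"

// DRWMutex holds one RWMutex per CPU core.
type DRWMutex []sync.RWMutex

// New allocates one lock per core.
func New(cores int) DRWMutex {
	return make(DRWMutex, cores)
}

// RLock takes only the read lock belonging to the caller's core.
func (mx DRWMutex) RLock(core int) { mx[core].RLock() }

// RUnlock releases the read lock that was taken on the given core.
func (mx DRWMutex) RUnlock(core int) { mx[core].RUnlock() }

// Lock takes every per-core lock in turn, so a writer excludes all readers.
func (mx DRWMutex) Lock() {
	for core := range mx {
		mx[core].Lock()
	}
}

// Unlock releases every per-core lock.
func (mx DRWMutex) Unlock() {
	for core := range mx {
		mx[core].Unlock()
	}
}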
Finding the Current CPU
A reader uses the CPUID instruction to decide which lock to take. CPUID returns the APIC ID of the currently active CPU without issuing a system call or modifying the runtime. This works on Intel and AMD processors; ARM processors would need to use the CPU ID register instead. On systems with more than 256 processors, x2APIC must be used, and the APIC ID has to be read from the EDX register after CPUID with EAX=0xb. When the program starts, a mapping from APIC ID to CPU index is built (using the CPU affinity system calls), and this mapping is assumed to stay static for the lifetime of the process. Because the CPUID instruction can be fairly expensive, each goroutine only periodically updates its estimate of which core it is running on. More frequent updates reduce contention on the per-core locks, but also increase the share of time spent executing CPUID relative to the actual locking.
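As a sketch of how that periodic update might look, building on the DRWMutex sketch above: the reader function, its checkEvery parameter, and the currentCore callback are hypothetical names for this illustration, with currentCore assumed to wrap the cpus[cpu()] lookup from the benchmark code below.

// Sketch: amortizing the cost of the CPUID-based core lookup by refreshing a
// goroutine's cached core index only every checkEvery lock acquisitions.
func reader(mx DRWMutex, iterations, checkEvery uint64, currentCore func() int) {
	core := currentCore() // initial estimate of which core we run on
	for n := uint64(0); n < iterations; n++ {
		if checkEvery != 0 && n%checkEvery == 0 {
			core = currentCore() // periodically refresh the estimate
		}
		mx.RLock(core)
		// ... read-side critical section ...
		mx.RUnlock(core)
	}
}

This would be called, for example, as reader(mx, 10000, 100, func() int { return cpus[cpu()] }).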
Stale CPU information. The CPU information a goroutine uses when taking a lock may be out of date (the goroutine may have been migrated to another core). Because the reader remembers which lock it took, this only affects performance, not correctness. Such migrations are also unlikely in practice, since the operating system kernel tries to keep threads on the same core to improve cache hit rates.
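One way to make that "remember which lock was taken" guarantee explicit (a design sketch building on the DRWMutex type above, not the article's prescribed interface) is to have the read-lock call return the specific per-core lock, so the matching unlock always hits the same mutex even if the goroutine has since migrated.

// Sketch: RLockCore returns the exact per-core read lock that was taken, so
// the caller releases the same lock regardless of where the goroutine runs
// afterwards.
func (mx DRWMutex) RLockCore(core int) sync.Locker {
	l := mx[core].RLocker() // sync.RWMutex.RLocker: Lock/Unlock map to RLock/RUnlock
	l.Lock()                // takes the read lock on this core's mutex
	return l
}

// Usage:
//   l := mx.RLockCore(cpus[cpu()])
//   ... read-side critical section ...
//   l.Unlock() // always releases the read lock that was actually taken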
Performance
The performance characteristics of this scheme depend on a large number of parameters. In particular, the frequency of CPUID checks, the number of readers, the ratio of readers to writers, and how long readers hold the lock are all important factors. When only one writer is active at a time, the time that writer holds the lock does not affect the performance difference between sync.RWMutex and DRWMutex.
Experiments show that DRWMutex performs better on multi-core systems, especially when writers make up less than 1% of the load and CPUID is called at most once per 10 lock acquisitions (this varies with how long locks are held). Even with few cores, DRWMutex outperforms sync.RWMutex for the kinds of applications where sync.RWMutex would normally be chosen over sync.Mutex.
The plot below shows the average performance over 10 runs as the number of cores increases:
drwmutex -i 5000 -p 0.0001 -w 1 -r 100 -c 100
Error bars indicate the 25th and 75th percentiles. Note the drop at every 10th core: 10 cores make up one NUMA node on the benchmark machine, so as soon as another NUMA node comes into play, cross-core traffic becomes more expensive. Performance still keeps increasing for DRWMutex because, compared to sync.RWMutex, more readers can work in parallel.
See the go-nuts thread for further discussion.
cpu_amd64.s
#include "textflag.h" // func cpu() uint64TEXT 路cpu(SB),NOSPLIT,$0-8 MOVL $0x01, AX // version information MOVL $0x00, BX // any leaf will do MOVL $0x00, CX // any subleaf will do // call CPUID BYTE $0x0f BYTE $0xa2 SHRQ $24, BX // logical cpu id is put in EBX[31-24] MOVQ BX, ret+0(FP) RET
main.go
package main

import (
	"flag"
	"fmt"
	"math/rand"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
	"syscall"
	"time"
	"unsafe"
)

func cpu() uint64 // implemented in cpu_amd64.s

var cpus map[uint64]int

// Determine the mapping from APIC ID to CPU index by pinning the entire
// process to one core at a time and seeing what its APIC ID is.
func init() {
	cpus = make(map[uint64]int)

	var aff uint64
	syscall.Syscall(syscall.SYS_SCHED_GETAFFINITY, uintptr(0), unsafe.Sizeof(aff), uintptr(unsafe.Pointer(&aff)))

	n := 0
	start := time.Now()
	var mask uint64 = 1
outer:
	for {
		for (aff & mask) == 0 {
			mask <<= 1
			if mask == 0 || mask > aff {
				break outer
			}
		}

		ret, _, err := syscall.Syscall(syscall.SYS_SCHED_SETAFFINITY, uintptr(0), unsafe.Sizeof(mask), uintptr(unsafe.Pointer(&mask)))
		if ret != 0 {
			panic(err.Error())
		}

		// what CPU do we have?
		<-time.After(1 * time.Millisecond)
		c := cpu()

		if oldn, ok := cpus[c]; ok {
			fmt.Println("cpu", n, "==", oldn, "-- both have CPUID", c)
		}

		cpus[c] = n
		mask <<= 1
		n++
	}

	fmt.Printf("%d/%d cpus found in %v: %v\n", len(cpus), runtime.NumCPU(), time.Now().Sub(start), cpus)

	// restore the original affinity mask
	ret, _, err := syscall.Syscall(syscall.SYS_SCHED_SETAFFINITY, uintptr(0), unsafe.Sizeof(aff), uintptr(unsafe.Pointer(&aff)))
	if ret != 0 {
		panic(err.Error())
	}
}

// RWMutex2 is the n-way lock: one sync.RWMutex per CPU core.
type RWMutex2 []sync.RWMutex

// Lock takes every per-core lock in turn.
func (mx RWMutex2) Lock() {
	for core := range mx {
		mx[core].Lock()
	}
}

// Unlock releases every per-core lock.
func (mx RWMutex2) Unlock() {
	for core := range mx {
		mx[core].Unlock()
	}
}

func main() {
	cpuprofile := flag.Bool("cpuprofile", false, "enable CPU profiling")
	locks := flag.Uint64("i", 10000, "Number of iterations to perform")
	write := flag.Float64("p", 0.0001, "Probability of write locks")
	wwork := flag.Int("w", 1, "Amount of work for each writer")
	rwork := flag.Int("r", 100, "Amount of work for each reader")
	readers := flag.Int("n", runtime.GOMAXPROCS(0), "Total number of readers")
	checkcpu := flag.Uint64("c", 100, "Update CPU estimate every n iterations")
	flag.Parse()

	var o *os.File
	if *cpuprofile {
		o, _ = os.Create("rw.out")
		pprof.StartCPUProfile(o)
	}

	readers_per_core := *readers / runtime.GOMAXPROCS(0)

	var wg sync.WaitGroup

	// benchmark 1: a single sync.RWMutex shared by all readers and writers
	var mx1 sync.RWMutex
	start1 := time.Now()
	for n := 0; n < runtime.GOMAXPROCS(0); n++ {
		for r := 0; r < readers_per_core; r++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				r := rand.New(rand.NewSource(rand.Int63()))
				for n := uint64(0); n < *locks; n++ {
					if r.Float64() < *write {
						mx1.Lock()
						x := 0
						for i := 0; i < *wwork; i++ {
							x++
						}
						_ = x
						mx1.Unlock()
					} else {
						mx1.RLock()
						x := 0
						for i := 0; i < *rwork; i++ {
							x++
						}
						_ = x
						mx1.RUnlock()
					}
				}
			}()
		}
	}
	wg.Wait()
	end1 := time.Now()

	t1 := end1.Sub(start1)
	fmt.Println("mx1", runtime.GOMAXPROCS(0), *readers, *locks, *write, *wwork, *rwork, *checkcpu, t1.Seconds(), t1)

	if *cpuprofile {
		pprof.StopCPUProfile()
		o.Close()
		o, _ = os.Create("rw2.out")
		pprof.StartCPUProfile(o)
	}

	// benchmark 2: the n-way RWMutex2 with one lock per core
	mx2 := make(RWMutex2, len(cpus))
	start2 := time.Now()
	for n := 0; n < runtime.GOMAXPROCS(0); n++ {
		for r := 0; r < readers_per_core; r++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				c := cpus[cpu()]
				r := rand.New(rand.NewSource(rand.Int63()))
				for n := uint64(0); n < *locks; n++ {
					if *checkcpu != 0 && n%*checkcpu == 0 {
						c = cpus[cpu()]
					}

					if r.Float64() < *write {
						mx2.Lock()
						x := 0
						for i := 0; i < *wwork; i++ {
							x++
						}
						_ = x
						mx2.Unlock()
					} else {
						mx2[c].RLock()
						x := 0
						for i := 0; i < *rwork; i++ {
							x++
						}
						_ = x
						mx2[c].RUnlock()
					}
				}
			}()
		}
	}
	wg.Wait()
	end2 := time.Now()

	if *cpuprofile {
		pprof.StopCPUProfile()
		o.Close()
	}

	t2 := end2.Sub(start2)
	fmt.Println("mx2", runtime.GOMAXPROCS(0), *readers, *locks, *write, *wwork, *rwork, *checkcpu, t2.Seconds(), t2)
}