The Go language once again took the TIOBE "programming language of the year" title in 2016, which amply demonstrates Go's popularity around the world in recent years. If you ran a "What do you like about Go?" survey among gophers worldwide, I believe many of them would mention: goroutines.
Goroutines are Go's native support for concurrency, and all of your Go code runs in goroutines, without exception. You can start tens of thousands of goroutines, and Go's runtime is responsible for managing them. "Managing" here means scheduling: roughly speaking, scheduling decides when which goroutine gets resources and starts executing, which goroutine should stop executing and yield its resources, which goroutine should be woken up to resume execution, and so on. Goroutine scheduling is the Go team's concern, and most gophers do not need to worry about it. But I personally feel that a proper understanding of the goroutine scheduler's model and principles helps you write better Go code. So in this article I'd like to explore with you how the goroutine scheduler evolved, and its model and principles.
Note: this is not a source-code analysis of the goroutine scheduler. In the "Source Analysis" volume of his book "Go Language Study Notes", Yuhen has already given a detailed, high-quality analysis of the Go 1.5.1 scheduler, and gophers especially interested in the scheduler's implementation can turn to that book. The introduction to the goroutine scheduler here mainly draws on the Go team's various scheduler design docs and material about the scheduler published by gophers abroad; of course, Yuhen's book also gave me a lot of inspiration.
One. The Goroutine Scheduler
Referring to "scheduling", the first thing we think about is the operating system on the process, thread scheduling. The operating system scheduler dispatches multiple threads in the system to the physical CPU to run on a certain algorithm. Traditional programming languages such as C, C + + and other concurrent implementations are actually based on the operating system scheduling, that is, the program is responsible for creating the thread (generally through the pthread and other lib call implementation), the operating system is responsible for scheduling. This traditional way of supporting concurrency has many drawbacks:
Complexity:

- Threads are easy to create but hard to exit cleanly: anyone who has done C/C++ programming knows that creating a thread (e.g., with pthreads) involves a lot of parameters, but is at least acceptable. When it comes to a thread's exit, however, you must consider whether the thread should be detached or joined by its parent, and whether cancellation points need to be set inside the thread to make sure the join returns smoothly;
- Communication between concurrent units is difficult and error-prone: although many mechanisms exist for communicating between threads, they are quite complex to use; once shared memory is involved, locks come into play, and deadlocks become commonplace;
- Thread stack size: should it be the default, larger, or smaller?

Difficult to scale:

- Although a thread costs far less than a process, we still cannot create very many threads, because besides the resources each thread occupies, the cost of the OS switching between threads during scheduling is not small;
- For many network service programs, the inability to create large numbers of threads forces network multiplexing onto a small number of threads, i.e., using mechanisms such as epoll/kqueue/IoCompletionPort. Even with third-party libraries like libevent/libev to help, writing such programs is still difficult: they are full of callbacks, which puts no small mental burden on the programmer.
To address this, Go adopts user-level lightweight threads, something like coroutines, and calls them "goroutines". A goroutine consumes very few resources (Go 1.4 sets each goroutine's default stack size to 2KB), and switching between goroutines does not require trapping into the OS kernel, so it is very cheap. As a result, a Go program can create tens of thousands of concurrent goroutines. All Go code runs in goroutines; even the Go runtime is no exception. The component that places these goroutines onto "CPUs" to run, according to some algorithm, is called the goroutine scheduler.
However, to the operating system a Go program is just an ordinary user-level program: the OS only sees threads, and it does not even know that something called a goroutine exists. Goroutine scheduling is entirely Go's own business, and the task of "fairly" distributing "CPU" resources among the goroutines in a Go program falls on the Go runtime's head. Bear in mind that in a Go program, everything that is not user code is the Go runtime.
So the goroutine scheduling problem becomes: how does the Go runtime place the many goroutines in a program onto "CPU" resources to run, according to some algorithm? At the operating system level, the "CPU" resources threads compete for are the real physical CPUs; but at the Go program level, what are the "CPU" resources that goroutines compete for? A Go program is a user-level program that itself runs on one or more operating system threads, so the "CPU" resources goroutines compete for are those OS threads. The Go scheduler's task is now clear: place the goroutines onto different OS threads according to some algorithm. A scheduler built into the language like this is what we call native support for concurrency.
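To make this division of labor concrete, here is a minimal sketch of my own (not from any runtime internals): it simply launches 10,000 goroutines and lets the scheduler decide which OS threads they run on.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// The "CPUs" that goroutines compete for are OS threads; the runtime
	// multiplexes all of the goroutines below onto a handful of them.
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0))

	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func(id int) { // each go statement creates a new G
			defer wg.Done()
			_ = id * id // trivial work; the scheduler decides when and on which thread it runs
		}(i)
	}
	wg.Wait()
	fmt.Println("all 10000 goroutines finished")
}

Even with 10,000 goroutines, the OS only ever sees a small, mostly constant number of threads.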
Two. The Go Scheduler Model and Its Evolution
1. G-M model
Go 1.0 was officially released on March 28, 2012. In that release, the Go team implemented a simple scheduler. In this scheduler, each goroutine corresponds to an abstract structure in the runtime, G, and the OS thread is abstracted as the "physical CPU" structure M (machine). Although this structure is simple, it has many problems. Dmitry Vyukov, a former Intel Blackbelt engineer and now a Google engineer, pointed out an important deficiency of this scheduler model in his "Scalable Go Scheduler Design Doc": it limits the scalability of concurrent Go programs, especially service programs with high throughput or parallel computation needs. This mainly shows up in the following aspects:
- A single global mutex (Sched.Lock) and centralized state storage mean that all goroutine-related operations, such as creation and rescheduling, must take the lock;
- Goroutine hand-off problem: Ms frequently pass "runnable" goroutines between one another, which increases scheduling latency and adds extra performance overhead;
- Each M keeps its own memory cache, resulting in high memory consumption and poor data locality;
- Aggressive blocking and unblocking of worker threads due to syscalls causes extra performance loss.
2. The G-P-M model
So Dmitry Vyukov personally improved the Go scheduler, implementing the G-P-M scheduling model and the work-stealing algorithm in Go 1.1; this model is still in use today:
As the famous saying goes, "any problem in computer science can be solved by adding another level of indirection", and I think Dmitry Vyukov's G-P-M model is simply a practitioner of that principle. By adding a P to the G-M model, Dmitry Vyukov made the Go scheduler scalable.
P is a "logic proccessor", each G want to really run up, first need to be assigned a P (into the local runq of P, here temporarily ignore the global runq that link). For G, P is the "CPU" that runs it, so to speak:G's eye is only P. But from the GO Scheduler point of view, the real "CPU" is M, only the p and M binding to let P's runq in the real run up. This relationship between P and M is like the corresponding relationship (N x m) between the user thread of the Linux operating system scheduling plane and the core thread (kernel thread).
3. Preemptive scheduling
The G-P-M model was a big step forward for the Go scheduler, but the scheduler still had a headache: it did not support preemptive scheduling. As a result, once the code in some G entered a dead loop or an endless loop, that G would permanently occupy the P and M assigned to it, and the other Gs on the same P would never be scheduled and would "starve". Worse, when there is only one P (GOMAXPROCS=1), all the other Gs in the Go program "starve". So Dmitry Vyukov also put forward the "Go Preemptive Scheduler Design" and implemented "preemptive" scheduling in Go 1.2.
The principle of this preemption is to add an extra bit of code at the entry of every function or method, giving the runtime a chance to check whether a preemption is needed. This can only be said to partially solve the "starvation" problem: for a G running a pure-computation loop with no function calls, the scheduler still cannot preempt it.
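The following sketch of mine illustrates the limitation of the cooperative preemption described above (Go 1.2 through Go 1.13): with a single P, a loop containing no function calls can monopolize the processor, while a loop that calls a function offers the runtime a preemption point at each call. (Since Go 1.14 the runtime adds signal-based asynchronous preemption, so even the first loop no longer starves the others.)

package main

import (
	"fmt"
	"runtime"
	"time"
)

//go:noinline
func work() {} // a real call: its function entry is a cooperative preemption point

func main() {
	runtime.GOMAXPROCS(1) // a single P makes starvation easy to observe

	go func() {
		for {
			// Pure computation with no function calls: under purely cooperative
			// preemption this loop can never be interrupted, so on Go 1.13 and
			// earlier this program would hang after main goes to sleep.
		}
	}()

	go func() {
		for {
			work() // each call gives the runtime a chance to preempt this G
		}
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("main got the P back; goroutines alive:", runtime.NumGoroutine())
}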
4. NUMA scheduling model
Since Go 1.2, the Go team seems to have focused on optimizing the GC for low latency, and has been less enthusiastic about scheduler optimizations and improvements, making only minor changes alongside the GC work. In September 2014, Dmitry Vyukov proposed a new design doc, "NUMA-aware scheduler for Go", as a proposal for the future evolution of the Go scheduler, but so far that proposal does not seem to have been put on the development schedule.
5. Other optimizations
The Go runtime has implemented a netpoller, so that a network I/O operation initiated by a G blocks only that G, not the M, and therefore does not cause a large number of Ms to be created. However, I/O on regular files is still blocking: the M goes to sleep waiting for the I/O to return and wake it up, in which case the P detaches from the sleeping M and picks an idle M. If there is no idle M, a new M is created; this is why a large number of file I/O operations can cause a large number of threads to be created.
During the Go 1.9 development cycle, Ian Lance Taylor added a poller for the os package. Like the netpoller, when a G operates on a pollable fd, only the G is blocked, not the M. However, this still does not help with regular files, which are not pollable. Still, for the scheduler it is a step forward.
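To see the netpoller's effect from user code, here is a rough sketch of my own (the numbers and the local-listener trick are purely for illustration): hundreds of goroutines block on network reads, yet the thread count stays small, because only the Gs are parked while the Ms keep running other work.

package main

import (
	"fmt"
	"net"
	"runtime/pprof"
	"time"
)

func main() {
	// A local listener that accepts connections and then stays silent,
	// so every client read below blocks "forever".
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	var held []net.Conn
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			held = append(held, c) // keep connections open, never write to them
		}
	}()

	// 200 goroutines all blocked in a network read: the netpoller parks the Gs
	// and the Ms are released, so few OS threads are needed.
	for i := 0; i < 200; i++ {
		go func() {
			conn, err := net.Dial("tcp", ln.Addr().String())
			if err != nil {
				return
			}
			buf := make([]byte, 1)
			conn.Read(buf) // blocks only this G, not its M
		}()
	}

	time.Sleep(2 * time.Second)
	fmt.Println("threads created so far:", pprof.Lookup("threadcreate").Count())
}

Swap the network reads for blocking reads on regular files (or raw syscalls) and the thread count climbs instead, which is exactly the difference described above.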
Three. A Closer Look at How the Go Scheduler Works
1. G, P, M
For the definitions of G, P, and M, see the source file $GOROOT/src/runtime/runtime2.go. All three structs are fairly hefty, each containing dozens of fields. Core code like a scheduler is always complex, with many factors to consider and code that ends up "coupled" into a lump. Still, from the complex code we can make out the approximate roles of G, P, and M (though of course not to the depth of Yuhen's source analysis); briefly:
- G: represents a goroutine. It stores the goroutine's execution stack information, state, task function, and so on; G objects are reusable.
- P: represents a logical processor. The number of Ps determines the maximum number of Gs that can run in parallel in the system (provided the machine has at least that many physical CPU cores). P's most important job is holding its various queues and lists of Gs, some caches, and state.
- M: represents a real execution resource. After binding a valid P, an M enters a schedule loop; the loop roughly fetches a G from the various queues and from P's local runq, switches to G's execution stack and runs G's function, calls goexit to clean up and return to M, and then repeats. M keeps no G state, which is the basis on which a G can be rescheduled across Ms.
Here are code snippets of the G, P, and M definitions:

//src/runtime/runtime2.go
type g struct {
	stack   stack   // offset known to runtime/cgo
	sched   gobuf
	goid    int64
	gopc    uintptr // pc of go statement that created this goroutine
	startpc uintptr // pc of goroutine function
	... ...
}

type p struct {
	lock    mutex
	id      int32
	status  uint32 // one of pidle/prunning/...
	mcache  *mcache
	racectx uintptr

	// Queue of runnable goroutines. Accessed without lock.
	runqhead uint32
	runqtail uint32
	runq     [256]guintptr

	runnext guintptr

	// Available G's (status == Gdead)
	gfree    *g
	gfreecnt int32
	... ...
}

type m struct {
	g0       *g // goroutine with scheduling stack
	mstartfn func()
	curg     *g // current running goroutine
	... ...
}
2. How a G gets preempted
Unlike an operating system, which schedules threads based on time slices, Go has no notion of a time slice. If a G neither makes system calls, nor performs I/O, nor blocks on a channel operation, how does M get that G to stop and schedule the next runnable G? The answer: the G is preempted.
As mentioned earlier, as long as a G calls functions, the Go runtime has a chance to preempt it, barring extreme cases such as an infinite loop that never makes a call. When a Go program starts, the runtime launches an M called sysmon (generally referred to as the monitoring thread), which runs without binding a P and is critical to the whole program:
//$GOROOT/src/runtime/proc.go

// The main goroutine.
func main() {
    ... ...
    systemstack(func() {
        newm(sysmon, nil)
    })
    ... ...
}

// Always runs without a P, so write barriers are not allowed.
//
//go:nowritebarrierrec
func sysmon() {
    // If a heap span goes unused for 5 minutes after a garbage collection,
    // we hand it back to the operating system.
    scavengelimit := int64(5 * 60 * 1e9)
    ... ...

    if ... {
        ... ...
        // retake P's blocked in syscalls
        // and preempt long running G's
        if retake(now) != 0 {
            idle = 0
        } else {
            idle++
        }
        ... ...
    }
}
Sysmon wakes up at intervals ranging from 20us to 10ms. According to the summary in "Go Language Study Notes", sysmon mainly does the following work:
- Releases span physical memory that has been idle for more than 5 minutes;
- Forces a garbage collection if none has run for more than 2 minutes;
- Adds netpoll results that have gone unhandled for a long time to the task queue;
- Issues preemption for Gs that have been running for too long;
- Reclaims Ps that have been blocked in syscalls for a long time.
We can see that sysmon "issues preemption for Gs that have been running for too long"; this is implemented by the retake function:
// forcePreemptNS is the time slice given to a G before it is
// preempted.
const forcePreemptNS = 10 * 1000 * 1000 // 10ms

func retake(now int64) uint32 {
    ... ...
    // Preempt G if it's running for too long.
    t := int64(_p_.schedtick)
    if int64(pd.schedtick) != t {
        pd.schedtick = uint32(t)
        pd.schedwhen = now
        continue
    }
    if pd.schedwhen+forcePreemptNS > now {
        continue
    }
    preemptone(_p_)
    ... ...
}
As you can see, if a G has been running for more than 10ms, sysmon considers it to have run too long and issues a preemption request against it. Once the G's preemption flag is set to true, then the next time this G calls a function or method, the runtime can preempt it, move it out of the running state, and put it back on the P's local runq to wait for its next turn to be scheduled.
3. Scheduling when blocked on a channel or network I/O
If a G blocks on a channel operation or a network I/O operation, the G is placed on a wait queue and M tries to run the next runnable G. If there is no runnable G for M to run, M unbinds from P and goes to sleep. When the I/O becomes available or the channel operation completes, the G on the wait queue is woken up, marked runnable, placed on some P's queue, and bound to an M to continue execution.
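Here is a tiny sketch (mine) of what this looks like from user code: the receiving goroutine parks on the channel's wait queue, its M is freed to run other Gs, and the receiver becomes runnable again the moment the value arrives.

package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan int) // unbuffered: a receive blocks until someone sends

	go func() {
		// This G parks on the channel's wait queue; the M that was running it
		// is released to execute other runnable Gs in the meantime.
		v := <-ch
		fmt.Println("receiver woke up with", v)
	}()

	time.Sleep(50 * time.Millisecond) // let the receiver block first
	ch <- 42                          // wakes the receiver: it is marked runnable and queued on a P
	time.Sleep(50 * time.Millisecond) // give it a chance to print before main exits
}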
4. Scheduling when blocked in a system call
If a G blocks in a system call, then not only does the G block, but the M executing it also unbinds from its P (essentially, the P is taken away by sysmon) and goes to sleep together with the G. If there is an idle M at that point, the P binds to it and continues executing its other Gs; if there is no idle M but there are still other Gs waiting to run, a new M is created.
When the G blocked in the syscall completes the call, it tries to acquire an available P; if no P is available, the G is marked runnable and queued, and the M it was on goes back to sleep.
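As a hedged, Unix-only sketch of my own: each goroutine below blocks in a raw read(2) on a pipe that never receives data, a syscall the netpoller cannot help with, so each one ties up an M and the thread count grows with the number of blocked goroutines.

package main

import (
	"fmt"
	"runtime/pprof"
	"syscall"
	"time"
)

// blockInSyscall parks an M inside a blocking read(2) syscall.
func blockInSyscall() {
	var fds [2]int
	if err := syscall.Pipe(fds[:]); err != nil {
		return
	}
	buf := make([]byte, 1)
	syscall.Read(fds[0], buf) // nothing is ever written: the M blocks here
}

func main() {
	before := pprof.Lookup("threadcreate").Count()
	for i := 0; i < 20; i++ {
		go blockInSyscall()
	}
	time.Sleep(2 * time.Second) // give sysmon time to retake the Ps and spin up new Ms
	after := pprof.Lookup("threadcreate").Count()
	fmt.Printf("threads created: %d -> %d\n", before, after)
}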
Four. How to View the Scheduler's State
Go provides a way to look at the scheduler's current state: the Go runtime environment variable GODEBUG.
$ GODEBUG=schedtrace=1000 godoc -http=:6060

SCHED 0ms: gomaxprocs=4 idleprocs=3 threads=3 spinningthreads=0 idlethreads=0 runqueue=0 [0 0 0 0]
SCHED 1001ms: gomaxprocs=4 idleprocs=0 threads=9 spinningthreads=0 idlethreads=3 runqueue=2 [8 14 5 2]
SCHED 2006ms: gomaxprocs=4 idleprocs=0 threads=25 spinningthreads=0 idlethreads=19 runqueue=12 [0 0 4 0]
SCHED 3006ms: gomaxprocs=4 idleprocs=0 threads=26 spinningthreads=0 idlethreads=8 runqueue=2 [0 1 1 0]
SCHED 4010ms: gomaxprocs=4 idleprocs=0 threads=26 spinningthreads=0 idlethreads=20 runqueue=12 [6 3 1 0]
SCHED 5010ms: gomaxprocs=4 idleprocs=0 threads=26 spinningthreads=1 idlethreads=20 runqueue=17 [0 0 0 0]
SCHED 6016ms: gomaxprocs=4 idleprocs=0 threads=26 spinningthreads=0 idlethreads=20 runqueue=1 [3 4 0 10]
... ...
The GODEBUG runtime environment variable is very powerful: by passing it different key1=value1,key2=value2,... combinations, you get different debugging output from the Go runtime. Here we pass GODEBUG the value "schedtrace=1000", which means: every 1000ms, print one line summarizing the goroutine scheduler's state. The fields in each line have the following meanings:
Taking the last line of the example above:

SCHED 6016ms: gomaxprocs=4 idleprocs=0 threads=26 spinningthreads=0 idlethreads=20 runqueue=1 [3 4 0 10]

- SCHED: the marker string indicating this line is goroutine scheduler output;
- 6016ms: the time from program start to when this line was printed;
- gomaxprocs: the number of Ps;
- idleprocs: the number of Ps in the idle state; from the difference between gomaxprocs and idleprocs we know how many Ps are executing Go code;
- threads: the number of OS threads, including the Ms used by the scheduler plus threads the runtime uses for itself, such as sysmon;
- spinningthreads: the number of OS threads in the spinning state;
- idlethreads: the number of OS threads in the idle state;
- runqueue=1: the number of Gs in the scheduler's global queue;
- [3 4 0 10]: the number of Gs in each of the 4 Ps' local queues.
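If you want to reproduce output like the above with something simpler than godoc, the throwaway program below (my own, purely for illustration) creates enough runnable goroutines that schedtrace has something interesting to show; build it and run the binary with GODEBUG=schedtrace=1000.

package main

import "time"

// spin busy-loops for d so that Ps always have runnable Gs in their queues.
func spin(d time.Duration) {
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
	}
}

func main() {
	for i := 0; i < 64; i++ {
		go spin(3 * time.Second)
	}
	time.Sleep(5 * time.Second)
}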
We can also have it output detailed scheduling information for every goroutine, M, and P, although for most Go users this level of detail is rarely needed:
$ GODEBUG=schedtrace=1000,scheddetail=1 godoc -http=:6060

SCHED 0ms: gomaxprocs=4 idleprocs=3 threads=3 spinningthreads=0 idlethreads=0 runqueue=0 gcwaiting=0 nmidlelocked=0 stopwait=0 sysmonwait=0
  P0: status=1 schedtick=0 syscalltick=0 m=0 runqsize=0 gfreecnt=0
  P1: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  P2: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  P3: status=0 schedtick=0 syscalltick=0 m=-1 runqsize=0 gfreecnt=0
  M2: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
  M1: p=-1 curg=17 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=false lockedg=17
  M0: p=0 curg=1 mallocing=0 throwing=0 preemptoff= locks=1 dying=0 helpgc=0 spinning=false blocked=false lockedg=1
  G1: status=8() m=0 lockedm=0
  G17: status=3() m=1 lockedm=1

SCHED 1002ms: gomaxprocs=4 idleprocs=0 threads=13 spinningthreads=0 idlethreads=7 runqueue=6 gcwaiting=0 nmidlelocked=0 stopwait=0 sysmonwait=0
  P0: status=2 schedtick=2293 syscalltick=18928 m=-1 runqsize=12 gfreecnt=2
  P1: status=1 schedtick=2356 syscalltick=19060 m=11 runqsize=11 gfreecnt=0
  P2: status=2 schedtick=2482 syscalltick=18316 m=-1 runqsize=37 gfreecnt=1
  P3: status=2 schedtick=2816 syscalltick=18907 m=-1 runqsize=2 gfreecnt=4
  M12: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M11: p=1 curg=6160 mallocing=0 throwing=0 preemptoff= locks=2 dying=0 helpgc=0 spinning=false blocked=false lockedg=-1
  M10: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  ... ...

SCHED 2002ms: gomaxprocs=4 idleprocs=0 threads=23 spinningthreads=0 idlethreads=5 runqueue=4 gcwaiting=0 nmidlelocked=0 stopwait=0 sysmonwait=0
  P0: status=0 schedtick=2972 syscalltick=29458 m=-1 runqsize=0 gfreecnt=6
  P1: status=2 schedtick=2964 syscalltick=33464 m=-1 runqsize=0 gfreecnt=39
  P2: status=1 schedtick=3415 syscalltick=33283 m=18 runqsize=0 gfreecnt=12
  P3: status=2 schedtick=3736 syscalltick=33701 m=-1 runqsize=1 gfreecnt=6
  M22: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  M21: p=-1 curg=-1 mallocing=0 throwing=0 preemptoff= locks=0 dying=0 helpgc=0 spinning=false blocked=true lockedg=-1
  ... ...
For more about the Go scheduler's debug output, refer to Dmitry Vyukov's article "Debugging performance issues in Go programs", which should also be on every gopher's must-read list. For even more detail, see the schedtrace function in $GOROOT/src/runtime/proc.go.
Weibo: @tonybai_cn
WeChat public account: iamtonybai
GitHub: https://github.com/bigwhite
© Bigwhite. All rights reserved.