Two days ago a reader sent me the following question via a Weibo private message:
Sorry to bother you with a Go question. I understand the concept of goroutines, but goroutine scheduling puzzles me. According to the book "Go Language Programming": "When a task is executing, there is no way to terminate it from the outside; to switch tasks, the task itself must actively give up the CPU by calling yield()." So suppose my goroutine is an infinite loop — do other goroutines never get a chance to run? Yet my test shows these goroutines are executed in turn. So apart from syscalls actively giving up CPU time, how do my infinite-loop goroutines get switched?
I replied right away. Since I did not know the specifics, I assumed in my reply that the reader's infinite loop did not call any code that could give up execution. The reader later told me the real reason his infinite-loop goroutines were being switched: he was calling fmt.Println inside the loop.
How should this question be written up afterwards? I decided to follow the reader's line of thinking and use a series of examples to show, step by step, how to analyze Go scheduling. If you know nothing about goroutine scheduling, read the earlier article "Also Talking About the Goroutine Scheduler" first.
I. Why do multiple goroutines still get executed in turn even with a dead-loop goroutine present?
Let's look at case 1 first. Following the reader's line of thinking, we construct the first example to answer the question in this section's title: "Why, with a dead-loop goroutine present, do multiple goroutines still get executed in turn?" Here is the source code of case1:
// github.com/bigwhite/experiments/go-sched-examples/case1.go
package main

import (
    "fmt"
    "time"
)

func deadloop() {
    for {
    }
}

func main() {
    go deadloop()
    for {
        time.Sleep(time.Second * 1)
        fmt.Println("I got scheduled!")
    }
}
In case1.go we launch two goroutines: the main goroutine and the deadloop goroutine. The deadloop goroutine, as its name implies, runs an infinite loop; for demonstration purposes the main goroutine also uses an "infinite loop", printing a message every second. I ran this example on my MacBook Air (the machine has two cores and four hardware threads; the runtime's NumCPU function returns 4):
$ go run case1.go
I got scheduled!
I got scheduled!
I got scheduled!
... ...
From the output we can see that the main goroutine still gets scheduled despite the presence of the deadloop goroutine. The root cause is that the machine is multi-core and multi-threaded (hardware threads, not OS threads). Starting with Go 1.5, the default number of Ps equals the number of CPU cores (more precisely, the number of cores multiplied by the number of hardware threads per core), so case1 creates more than one P at startup. A picture explains it:
Suppose the deadloop goroutine is scheduled onto P1, which runs on M1 (corresponding to an OS kernel thread), while the main goroutine is scheduled onto P2, which runs on M2, corresponding to another OS kernel thread. OS kernel threads are in turn scheduled onto physical CPU cores by the operating system scheduler, and we have multiple cores: even if deadloop saturates one core, the main goroutine can still run on P2 on another core. That is why the main goroutine gets scheduled.
Tip: to see the hardware CPU core count and total hardware thread count on macOS:
$ sysctl -n machdep.cpu.core_count
2
$ sysctl -n machdep.cpu.thread_count
4
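You can query the same information from within Go itself. Below is a minimal sketch (the file name is hypothetical; it is not one of the case files in the repo) that prints the logical CPU count the runtime sees and the current GOMAXPROCS value:

// cpuinfo.go (hypothetical helper, not part of the original examples)
package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Logical CPUs (cores x hardware threads) visible to the Go runtime.
    fmt.Println("NumCPU:", runtime.NumCPU())
    // Passing 0 to GOMAXPROCS reports the current value without changing it.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}

On the MacBook Air described above, both values should print 4.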
II. How do we make goroutines other than the deadloop goroutine unable to get scheduled?
What should we do if we want goroutines other than the deadloop goroutine to be unable to get scheduled? One idea: make the Go runtime start fewer Ps, so that all user-level goroutines are scheduled on a single P.
Three ways:
- Call runtime.GOMAXPROCS(1) at the very beginning of the main function;
- Set the environment variable GOMAXPROCS=1 (export GOMAXPROCS=1) before running the program;
- Find a single-core, single-hardware-thread machine ^0^ (such machines are hard to come by these days, except perhaps as cloud server instances).
Let's take the first approach as an example:
// github.com/bigwhite/experiments/go-sched-examples/case2.go
package main

import (
    "fmt"
    "runtime"
    "time"
)

func deadloop() {
    for {
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go deadloop()
    for {
        time.Sleep(time.Second * 1)
        fmt.Println("I got scheduled!")
    }
}
After running this program, you will find that the main goroutine's "I got scheduled!" is never printed again. The scheduling here can be illustrated as follows:
The deadloop goroutine is scheduled on P1. Because deadloop's internal logic gives the scheduler no preemption opportunity (for example, by entering runtime.morestack_noctxt), even the sysmon monitoring goroutine can do no more than set the deadloop goroutine's preemption flag to true. Since there is no chance of entering the scheduler code from inside deadloop, rescheduling never happens, and the main goroutine can only sit in P1's local run queue.
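As a side note, cooperative scheduling also works in the other direction: if the dead loop voluntarily yields the processor, for example by calling runtime.Gosched(), the main goroutine gets scheduled again even with GOMAXPROCS=1. Here is a minimal variation of case2 (the file name is hypothetical; it is not in the original repo):

// case2-gosched.go (hypothetical variation of case2.go)
package main

import (
    "fmt"
    "runtime"
    "time"
)

func deadloop() {
    for {
        // Gosched puts the current goroutine back on the run queue
        // and lets the scheduler pick another runnable goroutine.
        runtime.Gosched()
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go deadloop()
    for {
        time.Sleep(time.Second * 1)
        fmt.Println("I got scheduled!")
    }
}

With this change, "I got scheduled!" shows up every second again, because deadloop now enters the scheduler on every iteration.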
III. Inversion: with GOMAXPROCS=1, how can the main goroutine get scheduled?
Let's do a reversal: how can the main goroutine get scheduled when GOMAXPROCS=1? We have heard that in Go, "wherever there is a function call, there is an opportunity to enter the scheduler code." Let's test whether that is true by adding a function call to the for loop of the deadloop goroutine:
// github.com/bigwhite/experiments/go-sched-examples/case3.go
package main

import (
    "fmt"
    "runtime"
    "time"
)

func add(a, b int) int {
    return a + b
}

func deadloop() {
    for {
        add(3, 5)
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go deadloop()
    for {
        time.Sleep(time.Second * 1)
        fmt.Println("I got scheduled!")
    }
}
We added a call to the add function inside the deadloop goroutine's for loop. Let's run this program and see whether we achieve our goal:
$ go run case3.go
"I got scheduled!" The words still do not appear in front of us! That is to say, main goroutine not get dispatched! Why is it? In fact, the so-called "have a function call, there is the opportunity to enter the scheduler code", is actually go compiler at the entrance of the function to insert a runtime function call: Runtime.morestack_noctxt. This function checks to see if a continuous stack is scaled and enters the logic of the preemption schedule. Once the goroutine is placed to be preempted, the preemption code deprives the goroutine of execution and gives it to other goroutine. But why didn't the above code achieve this? We need to look at the assembly level to see what the go compiler generated code looks like.
There are several ways to view the assembly code of a Go program:
- Use the objdump tool: objdump -S go-binary
- Use gdb's disassemble command
- Generate the assembly while building the Go program: go build -gcflags '-S' xx.go > xx.s 2>&1
- Compile the Go code to assembly: go tool compile -S xx.go > xx.s
- Disassemble the Go binary: go tool objdump -S go-binary > xx.s
Here we use the last method, disassembling with go tool objdump (later we will also look at the assembly form produced by go build -gcflags '-S'):
$ go build -o case3 case3.go
$ go tool objdump -S case3 > case3.s
Open case3.s and search for main.add: the assembly code for this function cannot be found, and main.deadloop is defined as follows:
TEXT main.deadloop(SB) github.com/bigwhite/experiments/go-sched-examples/case3.go
        for {
  0x1093a10    ebfe    JMP main.deadloop(SB)
  0x1093a12    cc      INT $0x3
  0x1093a13    cc      INT $0x3
  0x1093a14    cc      INT $0x3
  0x1093a15    cc      INT $0x3
  ... ...
  0x1093a1f    cc      INT $0x3
The call to add inside deadloop has also disappeared. Clearly the Go compiler's optimizations eliminated it, since the call to add has no effect on deadloop's behavior. Let's turn off optimizations and try again:
$ go build -gcflags '-N -l' -o case3-unoptimized case3.go
$ go tool objdump -S case3-unoptimized > case3-unoptimized.s
Open case3-unoptimized.s and search for main.add; this time we find it:
TEXT main.add(SB) github.com/bigwhite/experiments/go-sched-examples/case3.go
func add(a, b int) int {
  0x1093a10    48c744241800000000    MOVQ $0x0, 0x18(SP)
        return a + b
  0x1093a19    488b442408            MOVQ 0x8(SP), AX
  0x1093a1e    4803442410            ADDQ 0x10(SP), AX
  0x1093a23    4889442418            MOVQ AX, 0x18(SP)
  0x1093a28    c3                    RET
  0x1093a29    cc                    INT $0x3
  ... ...
  0x1093a2f    cc                    INT $0x3
And inside deadloop there is now an explicit call to add:
TEXT main.deadloop(SB) github.com/bigwhite/experiments/go-sched-examples/case3.go
  ... ...
  0x1093a51    48c7042403000000      MOVQ $0x3, 0(SP)
  0x1093a59    48c744240805000000    MOVQ $0x5, 0x8(SP)
  0x1093a62    e8a9ffffff            CALL main.add(SB)
        for {
  0x1093a67    eb00                  JMP 0x1093a69
  0x1093a69    ebe4                  JMP 0x1093a4f
  ... ...
But the main goroutine in our program still does not get scheduled, because in main.add's code we find no trace of a morestack call. In other words, even though add is called, deadloop still has no chance to enter the runtime's scheduling logic.
But why doesn't the Go compiler insert a morestack call into main.add? Because add sits at a leaf position of the call tree, the compiler can guarantee it creates no new stack frame and will neither cause a stack split nor exceed the existing stack boundary, so it omits morestack. As a result, the scheduler's preemption check inside morestack never gets executed. Here is the assembly of case3.go produced by go build -gcflags '-S':
"".add STEXT nosplit size=19 args=0x18 locals=0x0 TEXT "".add(SB), NOSPLIT, $0-24 FUNCDATA $0, gclocals·54241e171da8af6ae173d69da0236748(SB) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB) MOVQ "".b+16(SP), AX MOVQ "".a+8(SP), CX ADDQ CX, AX MOVQ AX, "".~r2+24(SP) RET
Note the NOSPLIT marker: it means the stack used by add is of fixed size and will never be split; $0-24 indicates a 0-byte local frame with 24 bytes of arguments and results.
There is still some controversy over whether a morestack check should be inserted for leaf functions called inside a for loop; this situation may receive special treatment in the future.
Now that we understand the principle, let's add a dummy function between deadloop and add; see the case4.go code below:
// github.com/bigwhite/experiments/go-sched-examples/case4.go
package main

import (
    "fmt"
    "runtime"
    "time"
)

//go:noinline
func add(a, b int) int {
    return a + b
}

func dummy() {
    add(3, 5)
}

func deadloop() {
    for {
        dummy()
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go deadloop()
    for {
        time.Sleep(time.Second * 1)
        fmt.Println("I got scheduled!")
    }
}
Execute the code:
$ go run case4.go
I got scheduled!
I got scheduled!
I got scheduled!
Wow! The main goroutine really did get scheduled. Let's look at the assembly code the Go compiler generated for this program:
$ go build -gcflags '-N -l' -o case4 case4.go
$ go tool objdump -S case4 > case4.s

TEXT main.add(SB) github.com/bigwhite/experiments/go-sched-examples/case4.go
func add(a, b int) int {
  0x1093a10    48c744241800000000    MOVQ $0x0, 0x18(SP)
        return a + b
  0x1093a19    488b442408            MOVQ 0x8(SP), AX
  0x1093a1e    4803442410            ADDQ 0x10(SP), AX
  0x1093a23    4889442418            MOVQ AX, 0x18(SP)
  0x1093a28    c3                    RET
  0x1093a29    cc                    INT $0x3
  0x1093a2a    cc                    INT $0x3
  ... ...

TEXT main.dummy(SB) github.com/bigwhite/experiments/go-sched-examples/case4.go
func dummy() {
  0x1093a30    65488b0c25a0080000    MOVQ GS:0x8a0, CX
  0x1093a39    483b6110              CMPQ 0x10(CX), SP
  0x1093a3d    762e                  JBE 0x1093a6d
  0x1093a3f    4883ec20              SUBQ $0x20, SP
  0x1093a43    48896c2418            MOVQ BP, 0x18(SP)
  0x1093a48    488d6c2418            LEAQ 0x18(SP), BP
        add(3, 5)
  0x1093a4d    48c7042403000000      MOVQ $0x3, 0(SP)
  0x1093a55    48c744240805000000    MOVQ $0x5, 0x8(SP)
  0x1093a5e    e8adffffff            CALL main.add(SB)
}
  0x1093a63    488b6c2418            MOVQ 0x18(SP), BP
  0x1093a68    4883c420              ADDQ $0x20, SP
  0x1093a6c    c3                    RET
  0x1093a6d    e86eacfbff            CALL runtime.morestack_noctxt(SB)
  0x1093a72    ebbc                  JMP main.dummy(SB)
  0x1093a74    cc                    INT $0x3
  0x1093a75    cc                    INT $0x3
  0x1093a76    cc                    INT $0x3
  ... ...
We can see that main.add is still a leaf function with no morestack inserted, but in the new dummy function the call to runtime.morestack_noctxt(SB) now appears.
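This also explains the reader's original observation from the beginning of the article. fmt.Println is not a leaf function: it goes through several layers of function calls (each carrying the stack-growth/preemption check) and ultimately performs a write system call, so a dead loop that calls it gives the scheduler plenty of opportunities to switch goroutines. Below is a minimal sketch of that scenario (the file name is hypothetical; it is not in the original repo):

// case5-println.go (hypothetical reconstruction of the reader's scenario)
package main

import (
    "fmt"
    "runtime"
    "time"
)

func deadloop() {
    for {
        // fmt.Println is a non-leaf call chain that ends in a write syscall,
        // so it provides scheduling/preemption points inside the "dead" loop.
        fmt.Println("deadloop is running")
    }
}

func main() {
    runtime.GOMAXPROCS(1)
    go deadloop()
    for {
        time.Sleep(time.Second * 1)
        fmt.Println("I got scheduled!")
    }
}

Running this, the output of the two goroutines is interleaved, just as the reader observed.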
IV. Why is runtime.morestack_noctxt(SB) placed after the RET?
Traditionally, morestack is placed at the function entry. But in the compiled assembly above (see the assembly of the dummy function), runtime.morestack_noctxt(SB) is placed after the RET. To explain this, it is best to look at another form of assembly output (the go build -gcflags '-S' output format):
"".dummy STEXT size=68 args=0x0 locals=0x20 0x0000 00000 TEXT "".dummy(SB), $32-0 0x0000 00000 MOVQ (TLS), CX 0x0009 00009 CMPQ SP, 16(CX) 0x000d 00013 JLS 61 0x000f 00015 SUBQ $32, SP 0x0013 00019 MOVQ BP, 24(SP) 0x0018 00024 LEAQ 24(SP), BP ... ... 0x001d 00029 MOVQ $3, (SP) 0x0025 00037 MOVQ $5, 8(SP) 0x002e 00046 PCDATA $0, $0 0x002e 00046 CALL "".add(SB) 0x0033 00051 MOVQ 24(SP), BP 0x0038 00056 ADDQ $32, SP 0x003c 00060 RET 0x003d 00061 NOP 0x003d 00061 PCDATA $0, $-1 0x003d 00061 CALL runtime.morestack_noctxt(SB) 0x0042 00066 JMP 0
We can see that at the function entry the compiler inserts three instructions:
        0x0000 00000    MOVQ    (TLS), CX    // load the TLS value (GS:0x8a0) into the CX register
        0x0009 00009    CMPQ    SP, 16(CX)   // compare SP with the value at CX+16
        0x000d 00013    JLS     61           // if SP <= the value at CX+16, jump to offset 61
This output is in the standard Plan 9 assembly syntax; documentation on it is scarce (for example, on the exact meaning of the JLS jump instruction), so the comments above are partly educated guesses. If the jump is taken, execution goes to runtime.morestack_noctxt; after runtime.morestack_noctxt returns, the JMP at the end jumps back to offset 0, i.e. to the beginning of the function, and the whole sequence runs again.
Why arrange it this way? According to the Go team, it makes better use of modern CPUs' static branch prediction and improves execution performance.
V. References
- A Quick Guide to Go's Assembler
- Go's Work-Stealing Scheduler
The code in this article can be downloaded by clicking here.
Weibo: @tonybai_cn
WeChat public account: iamtonybai
GitHub: https://github.com/bigwhite
© Bigwhite. All rights reserved.