A Chat About the Goroutine Stack

Source: Internet
Author: User

Push messaging plays an important role in our ordering flow: it provides the basic data channel for real-time business orders and rider dispatch. Push was originally provided by a third-party service. As business complexity increased and order volume and user numbers kept growing, that system could no longer meet demand, and building a high-performance, highly available push system became imperative. In the first half of this year we used Go to develop a hybrid push service: online users receive messages over a long-lived connection, while offline users are reached through vendor or third-party channels. During construction we ran into several issues related to the goroutine stack, which I would like to talk through here.

Reading with questions in mind is more efficient, so let's start with the questions:

    1. How big is a goroutine stack? Is its size fixed or dynamic?
    2. If the stack size is dynamic, when does it grow and shrink? How is that implemented?
    3. What is the impact on a service? How do you troubleshoot problems caused by stack growth and shrinking?

With the questions clear, let's dig in.

Stack size

Before we get to know stacks, we'll look at the traditional Linux process memory layout:

The user stack has a fixed size, 8192 KB by default on Linux. If stack usage exceeds that limit at run time, the program crashes with a segmentation fault. To fix this, we can either raise the system-wide stack size limit or explicitly pass a memory block of the required size when creating the thread. Each approach has advantages and disadvantages: the former is simple but affects every thread in the system, while the latter requires developers to calculate the stack size of each thread accurately, which is a heavy burden.

Is there a way that neither affects every thread nor puts too much burden on the developer? Of course there is. For example, a check can be inserted at each function call to test whether the current stack has enough space for the new function. If it does, the function runs directly; otherwise a new stack is allocated, the old stack is copied into it, and then the function runs. The idea sounds elegant and simple, but the current Linux thread model does not support it. It can only be implemented in user space, and doing so is far from trivial.

Go, a modern 21st-century language positioned as simple and efficient, designed to take full advantage of multiple cores and to free up engineers, naturally includes this feature. Its built-in runtime solves the problem gracefully: every goroutine (except g0) starts with a 2 KB stack, which is adjusted dynamically at run time as needed.
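To make the effect of a dynamically sized stack concrete, here is a minimal sketch. The function name sumTo and the recursion depth are illustrative choices, not taken from the push service. The recursion needs far more than 2 KB of stack, yet the program requests no stack size anywhere.

```go
package main

import "fmt"

// sumTo adds 1..n recursively. Go does not eliminate tail calls, so every
// level keeps its own stack frame alive until the recursion unwinds.
func sumTo(n int64) int64 {
	if n == 0 {
		return 0
	}
	return n + sumTo(n-1)
}

func main() {
	result := make(chan int64)
	// The new goroutine starts with a small stack; the runtime grows it
	// transparently as the recursion deepens.
	go func() { result <- sumTo(100000) }()
	fmt.Println(<-result) // 5000050000
}
```

With a fixed 2 KB stack this recursion would overflow almost immediately; here it simply completes.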

Stack growth and shrinking

Before introducing specific stack processing details, let's look at the memory layout of the stacks and some important terms:

    • stack.lo: low address of the stack space
    • stack.hi: high address of the stack space
    • stackguard0: stack.lo + StackGuard, used for stack overflow detection
    • StackGuard: size of the guard area, a constant of 880 bytes on Linux
    • StackSmall: a constant of 128 bytes, used to optimize calls to small functions

When deciding whether the stack needs to grow, there are two cases, depending on the stack frame size of the called function:

    • Frame size no larger than StackSmall

      If SP is less than or equal to stackguard0, the stack is grown; otherwise the function runs directly.

    • Frame size larger than StackSmall

      If SP - frame size + StackSmall is less than or equal to stackguard0, the stack is grown; otherwise the function runs directly.

The runtime also has a StackBig constant, 4096 by default. Functions whose stack frame is larger than StackBig go through a separate check, which is not covered here.
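The two cases can be written down as a small model. This is only a sketch of the comparison, not the compiler's code: needsMoreStack and the addresses in main are hypothetical, the constants are the values quoted in this article, and frames larger than StackBig are deliberately left out.

```go
package main

import "fmt"

const (
	stackGuard = 880 // guard area size quoted above (Linux)
	stackSmall = 128 // small-frame threshold quoted above
)

// needsMoreStack models the prologue comparison for frames up to StackBig.
// sp and stackguard0 are plain numbers here, not real addresses.
func needsMoreStack(sp, stackguard0, frameSize uint64) bool {
	if frameSize <= stackSmall {
		// Small frame: compare SP with the guard directly.
		return sp <= stackguard0
	}
	// Larger frame: account for the part of the frame beyond stackSmall.
	return sp-(frameSize-stackSmall) <= stackguard0
}

func main() {
	const lo = 0x10000            // hypothetical stack.lo
	const guard = lo + stackGuard // stackguard0

	fmt.Println(needsMoreStack(lo+2048, guard, 48))    // false: plenty of room
	fmt.Println(needsMoreStack(guard+100, guard, 208)) // false: guard+20 is above the guard
	fmt.Println(needsMoreStack(guard+50, guard, 208))  // true: guard-30 is below the guard
}
```

For small frames, the part of the guard area below stackguard0 absorbs the frame, which is why SP alone can be compared.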

The following simple program lets us observe how the stack is handled:

    package main

    func main() {
        a, b := 1, 2
        _ = add1(a, b)
        _ = add2(a, b)
    }

    func add1(x, y int) int {
        return x + y
    }

    func add2(x, y int) int {
        _ = make([]byte, 200)
        return x + y
    }

Compile with optimizations and inlining disabled (go tool compile -N -l -S stack.go > stack.s). Part of the generated assembly is as follows:

    "".main t=1 size=112 args=0x0 locals=0x30
        // frame size is 48 bytes, no arguments
        0x0000 00000 (stack.go:3)   TEXT      "".main(SB), $48-0
        // load the current g (goroutine) structure from thread-local storage
        0x0000 00000 (stack.go:3)   MOVQ      (TLS), CX
        // compare SP with g.stackguard0
        0x0009 00009 (stack.go:3)   CMPQ      SP, 16(CX)
        // at or below g.stackguard0: jump to 105 to grow the stack
        0x000d 00013 (stack.go:3)   JLS       105
        // otherwise continue
        0x000f 00015 (stack.go:3)   SUBQ      $48, SP
        0x0013 00019 (stack.go:3)   MOVQ      BP, 40(SP)
        0x0018 00024 (stack.go:3)   LEAQ      40(SP), BP
        // used by the garbage collector
        0x001d 00029 (stack.go:3)   FUNCDATA  $0, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x001d 00029 (stack.go:3)   FUNCDATA  $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x001d 00029 (stack.go:4)   MOVQ      $1, "".a+32(SP)
        0x0026 00038 (stack.go:4)   MOVQ      $2, "".b+24(SP)
        // load a into AX
        0x002f 00047 (stack.go:5)   MOVQ      "".a+32(SP), AX
        // push argument a onto the stack
        0x0034 00052 (stack.go:5)   MOVQ      AX, (SP)
        // load b into AX
        0x0038 00056 (stack.go:5)   MOVQ      "".b+24(SP), AX
        // push argument b onto the stack
        0x003d 00061 (stack.go:5)   MOVQ      AX, 8(SP)
        0x0042 00066 (stack.go:5)   PCDATA    $0, $0
        // call add1
        0x0042 00066 (stack.go:5)   CALL      "".add1(SB)
        // load a into AX
        0x0047 00071 (stack.go:6)   MOVQ      "".a+32(SP), AX
        // push argument a onto the stack
        0x004c 00076 (stack.go:6)   MOVQ      AX, (SP)
        // load b into AX
        0x0050 00080 (stack.go:6)   MOVQ      "".b+24(SP), AX
        // push argument b onto the stack
        0x0055 00085 (stack.go:6)   MOVQ      AX, 8(SP)
        0x005a 00090 (stack.go:6)   PCDATA    $0, $0
        // call add2
        0x005a 00090 (stack.go:6)   CALL      "".add2(SB)
        0x005f 00095 (stack.go:7)   MOVQ      40(SP), BP
        0x0064 00100 (stack.go:7)   ADDQ      $48, SP
        0x0068 00104 (stack.go:7)   RET
        0x0069 00105 (stack.go:7)   NOP
        0x0069 00105 (stack.go:3)   PCDATA    $0, $-1
        // call runtime.morestack_noctxt to grow the stack
        0x0069 00105 (stack.go:3)   CALL      runtime.morestack_noctxt(SB)
        // jump back to the start of the function and continue
        0x006e 00110 (stack.go:3)   JMP       0
        ...

    "".add1 t=1 size=28 args=0x18 locals=0x0
        // frame size is 0, arguments take 24 bytes; a small leaf function
        // (frame below StackSmall, no calls) gets no stack check at all
        0x0000 00000 (stack.go:9)   TEXT      "".add1(SB), $0-24
        0x0000 00000 (stack.go:9)   FUNCDATA  $0, gclocals·54241e171da8af6ae173d69da0236748(SB)
        0x0000 00000 (stack.go:9)   FUNCDATA  $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x0000 00000 (stack.go:9)   MOVQ      $0, "".~r2+24(FP)
        0x0009 00009 (stack.go:10)  MOVQ      "".x+8(FP), AX
        0x000e 00014 (stack.go:10)  MOVQ      "".y+16(FP), CX
        0x0013 00019 (stack.go:10)  ADDQ      CX, AX
        0x0016 00022 (stack.go:10)  MOVQ      AX, "".~r2+24(FP)
        0x001b 00027 (stack.go:10)  RET

    "".add2 t=1 size=151 args=0x18 locals=0xd0
        // frame size is 208 bytes, arguments take 24 bytes
        0x0000 00000 (stack.go:13)  TEXT      "".add2(SB), $208-24
        // load the current g
        0x0000 00000 (stack.go:13)  MOVQ      (TLS), CX
        // frame is larger than StackSmall: compute SP - framesize + StackSmall into AX
        0x0009 00009 (stack.go:13)  LEAQ      -80(SP), AX
        // compare the computed value with g.stackguard0
        0x000e 00014 (stack.go:13)  CMPQ      AX, 16(CX)
        // at or below g.stackguard0: jump to 141 to grow the stack
        0x0012 00018 (stack.go:13)  JLS       141
        // otherwise continue
        0x0014 00020 (stack.go:13)  SUBQ      $208, SP
        0x001b 00027 (stack.go:13)  MOVQ      BP, 200(SP)
        0x0023 00035 (stack.go:13)  LEAQ      200(SP), BP
        0x002b 00043 (stack.go:13)  FUNCDATA  $0, gclocals·54241e171da8af6ae173d69da0236748(SB)
        0x002b 00043 (stack.go:13)  FUNCDATA  $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
        0x002b 00043 (stack.go:13)  MOVQ      $0, "".~r2+232(FP)
        0x0037 00055 (stack.go:14)  MOVQ      $0, "".autotmp_0(SP)
        0x003f 00063 (stack.go:14)  LEAQ      "".autotmp_0+8(SP), DI
        0x0044 00068 (stack.go:14)  XORPS     X0, X0
        0x0047 00071 (stack.go:14)  DUFFZERO  $247
        0x005a 00090 (stack.go:14)  LEAQ      "".autotmp_0(SP), AX
        0x005e 00094 (stack.go:14)  TESTB     AL, (AX)
        0x0060 00096 (stack.go:14)  JMP       98
        0x0062 00098 (stack.go:15)  MOVQ      "".x+216(FP), AX
        0x006a 00106 (stack.go:15)  MOVQ      "".y+224(FP), CX
        0x0072 00114 (stack.go:15)  ADDQ      CX, AX
        0x0075 00117 (stack.go:15)  MOVQ      AX, "".~r2+232(FP)
        0x007d 00125 (stack.go:15)  MOVQ      200(SP), BP
        0x0085 00133 (stack.go:15)  ADDQ      $208, SP
        0x008c 00140 (stack.go:15)  RET
        0x008d 00141 (stack.go:15)  NOP
        0x008d 00141 (stack.go:13)  PCDATA    $0, $-1
        // call runtime.morestack_noctxt to grow the stack
        0x008d 00141 (stack.go:13)  CALL      runtime.morestack_noctxt(SB)
        // jump back to the start of the function and continue
        0x0092 00146 (stack.go:13)  JMP       0
        ...

From the assembly above we can see three behaviours. add1, a small leaf function whose frame is below StackSmall and which calls nothing else, has no stack check at all, which makes calls to small functions cheaper. main, whose frame is also below StackSmall but which makes calls, compares SP directly with stackguard0. add2, whose frame is larger than StackSmall, compares SP - frame size + StackSmall with stackguard0. When the stack space is insufficient, runtime.morestack_noctxt is called to grow the stack, and the function then starts executing again from the beginning.

Before Go 1.3, stack growth used a segmented stack. When stack space ran out, a new stack segment was allocated for the called function; when the call returned, the segment was destroyed and execution went back to the old stack. A function called frequently at the segment boundary (in a loop or in recursion) could then trigger a "hot split", allocating and freeing a segment over and over. To avoid hot splits, Go 1.3 and later use a contiguous stack: when stack space runs out, a new stack twice the current size is allocated, all data is copied to it, and all subsequent calls run on the new stack.
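The copying behaviour of a contiguous stack can be observed from ordinary Go code. The sketch below is an illustration only: burn and observe are made-up names, and whether the address actually changes depends on the toolchain and on how much stack the goroutine already has. It typically prints moved=true, while the value of x is always preserved.

```go
package main

import (
	"fmt"
	"unsafe"
)

// burn recurses with roughly 1 KB of frame per level to force stack growth.
//
//go:noinline
func burn(n int) byte {
	var pad [1024]byte
	pad[n%len(pad)] = byte(n)
	if n == 0 {
		return pad[0]
	}
	return burn(n-1) + pad[n%len(pad)]
}

// observe records the address of a stack-allocated local before and after
// the stack is forced to grow. Converting to uintptr keeps x on the stack.
func observe() (before, after uintptr, value int) {
	x := 42
	before = uintptr(unsafe.Pointer(&x))
	burn(64) // needs roughly 64 KB of stack
	after = uintptr(unsafe.Pointer(&x))
	return before, after, x
}

func main() {
	type result struct {
		before, after uintptr
		value         int
	}
	ch := make(chan result, 1)
	// Run on a fresh goroutine so the stack starts small.
	go func() {
		b, a, v := observe()
		ch <- result{b, a, v}
	}()
	r := <-ch
	fmt.Printf("before=%#x after=%#x moved=%v x=%d\n",
		r.before, r.after, r.before != r.after, r.value)
}
```

Because the whole stack is relocated, the runtime has to adjust every pointer into the old stack, which is part of what makes the copy non-trivial.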

Growing and copying a stack is not easy and involves many details. Only the basic flow and the intent of the algorithm are described here, not every detail.

runtime.morestack_noctxt is implemented in assembly. The following is part of the code for the AMD64 architecture (runtime/asm_amd64.s):

    // Called during function prolog when more stack is needed.
    //
    // The traceback routines see morestack on a g0 as being
    // the top of a stack (for example, morestack calling newstack
    // calling the scheduler calling newm calling gc), so we must
    // record an argument size. For that purpose, it has no arguments.
    TEXT runtime·morestack(SB),NOSPLIT,$0-0
        // Cannot grow scheduler stack (m->g0).
        get_tls(CX)
        MOVQ    g(CX), BX
        MOVQ    g_m(BX), BX
        MOVQ    m_g0(BX), SI
        CMPQ    g(CX), SI
        JNE     3(PC)
        CALL    runtime·badmorestackg0(SB)
        INT     $3

        // signal stack, morebuf and sched handling omitted
        ...

        // Call newstack on m->g0's stack.
        MOVQ    m_g0(BX), BX
        MOVQ    BX, g(CX)
        MOVQ    (g_sched+gobuf_sp)(BX), SP
        PUSHQ   DX      // ctxt argument
        // call runtime.newstack to grow the stack
        CALL    runtime·newstack(SB)
        MOVQ    $0, 0x1003      // crash if newstack returns
        POPQ    DX      // keep balance check happy
        RET

    // morestack but not preserving ctxt.
    TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0
        MOVL    $0, DX
        // call morestack
        JMP     runtime·morestack(SB)

newstack is implemented in Go and is quite readable; it is worth reading when you have time. The basic flow is to allocate a new stack twice the size, copy the data to the new stack, and replace the old stack with the new one. Part of the code (runtime/stack.go) follows:

    // Called from runtime·morestack when more stack is needed.
    // Allocate larger stack and relocate to new stack.
    // Stack growth is multiplicative, for constant amortized cost.
    //
    // g->atomicstatus will be Grunning or Gscanrunning upon entry.
    // If the GC is trying to stop this g then it will set preemptscan to true.
    //
    // ctxt is the value of the context register on morestack. newstack
    // will write it to g.sched.ctxt.
    func newstack(ctxt unsafe.Pointer) {
        thisg := getg()
        gp := thisg.m.curg

        // Grow to twice the current size.
        oldsize := int(gp.stackAlloc)
        newsize := oldsize * 2

        // The goroutine must be executing in order to call newstack,
        // so it must be Grunning (or Gscanrunning).
        casgstatus(gp, _Grunning, _Gcopystack)

        // The concurrent GC will not scan the stack while we are doing the copy since
        // the gp is in a Gcopystack status.
        // Copy the stack data and switch to the new stack.
        copystack(gp, uintptr(newsize), true)
        if stackDebug >= 1 {
            print("stack grow done\n")
        }
        // Resume execution.
        casgstatus(gp, _Gcopystack, _Grunning)
        gogo(&gp.sched)
    }

    // Copies gp's stack to a new stack of a different size.
    // Caller must have changed gp status to Gcopystack.
    //
    // If sync is true, this is a self-triggered stack growth and, in
    // particular, no other G may be writing to gp's stack (e.g., via a
    // channel operation). If sync is false, copystack protects against
    // concurrent channel operations.
    func copystack(gp *g, newsize uintptr, sync bool) {
        if gp.syscallsp != 0 {
            throw("stack growth not allowed in system call")
        }
        old := gp.stack
        if old.lo == 0 {
            throw("nil stackbase")
        }
        used := old.hi - gp.sched.sp

        // Allocate the new stack from the cache or the heap.
        new, newstkbar := stackalloc(uint32(newsize))
        if stackPoisonCopy != 0 {
            fillstack(new, 0xfd)
        }

        // ... (adjinfo setup and sudog adjustment omitted; ncopy is the
        // number of bytes in use that still have to be copied)

        // Copy the data to the new stack.
        memmove(unsafe.Pointer(new.hi-ncopy), unsafe.Pointer(old.hi-ncopy), ncopy)

        // Switch to the new stack.
        gp.stack = new
        gp.stackguard0 = new.lo + _StackGuard // NOTE: might clobber a preempt request
        gp.sched.sp = new.hi - used
        oldsize := gp.stackAlloc
        gp.stackAlloc = newsize
        gp.stkbar = newstkbar
        gp.stktopsp += adjinfo.delta

        // Adjust pointers in the new stack.
        gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, adjustframe, noescape(unsafe.Pointer(&adjinfo)), 0)

        gcUnlockStackBarriers(gp)

        // Free the old stack.
        if stackPoisonCopy != 0 {
            fillstack(old, 0xfc)
        }
        stackfree(old, oldsize)
    }

Now let's look at shrinking. A long-running goroutine may have its stack grown by a single function call, and after that call returns a large part of the space goes unused. To address this, the stack must be able to shrink so that memory is saved and utilization improves.

Stack shrinking does not happen at function calls; it is initiated by the garbage collector during a collection. The basic flow is to compute how much of the stack is currently in use. If that is less than 1/4 of the stack space, the stack is shrunk to 1/2 of its current size; otherwise the function returns without doing anything. Here is part of the code for stack shrinking (runtime/stack.go):

    func shrinkstack(gp *g) {
        gstatus := readgstatus(gp)
        if gstatus&^_Gscan == _Gdead {
            if gp.stack.lo != 0 {
                // Free whole stack - it will get reallocated
                // if G is used again.
                stackfree(gp.stack, gp.stackAlloc)
                gp.stack.lo = 0
                gp.stack.hi = 0
                gp.stkbar = nil
                gp.stkbarPos = 0
            }
            return
        }

        // The shrink target is half the current size.
        oldsize := gp.stackAlloc
        newsize := oldsize / 2
        // Don't shrink the allocation below the minimum-sized stack
        // allocation.
        if newsize < _FixedStack {
            return
        }

        // If 1/4 or more of the space is in use, do not shrink.
        avail := gp.stack.hi - gp.stack.lo
        if used := gp.stack.hi - gp.sched.sp + _StackLimit; used >= avail/4 {
            return
        }

        // Replace the current stack with a smaller one.
        copystack(gp, newsize, false)
    }
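The shrink decision itself is just arithmetic, which the following sketch isolates. shrinkTarget is a hypothetical helper, not a runtime function, and the value used for stackLimit (StackGuard minus StackSmall, 752 bytes) is an assumption derived from the constants quoted earlier for Linux x86_64.

```go
package main

import "fmt"

const (
	fixedStack = 2048      // minimum stack size (2 KB, as stated earlier)
	stackLimit = 880 - 128 // assumption: StackGuard - StackSmall on Linux x86_64
)

// shrinkTarget returns the stack size after a shrink attempt and whether a
// shrink would happen, for a stack of size bytes with inUse bytes between
// SP and stack.hi.
func shrinkTarget(size, inUse uint64) (uint64, bool) {
	newSize := size / 2
	if newSize < fixedStack {
		return size, false // never go below the minimum stack
	}
	if used := inUse + stackLimit; used >= size/4 {
		return size, false // a quarter or more is in use
	}
	return newSize, true
}

func main() {
	fmt.Println(shrinkTarget(2048, 100))  // 2048 false
	fmt.Println(shrinkTarget(8192, 1000)) // 4096 true
	fmt.Println(shrinkTarget(8192, 1500)) // 8192 false
}
```

Growing by doubling and shrinking only below a quarter leaves a gap between the two thresholds, so a goroutine hovering around one size does not bounce between growing and shrinking.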

Impact of stack growth and shrinking

In ordinary HTTP or RPC services, the impact of stack growth and shrinking is almost negligible, and you can usually skip it when troubleshooting. In memory-intensive or latency-sensitive services it deserves special attention; otherwise you are likely to face high memory consumption and an unstable service.

We built the hybrid push service in Go. Each connection is full-duplex and is handled by two goroutines, one for reading and one for writing. In the first load tests before going online, memory consumption was very high, to the point of running out of memory (OOM). We initially suspected the heap, but the runtime statistics and pprof showed heap usage in line with expectations and nothing obviously wrong, which left us puzzled for a while. Later, looking at MemStats with curl -s http://localhost:port/debug/pprof/heap?debug=1 | grep -A 20 runtime.MemStats, we found that stack usage was very high, up to 20 GB. That essentially confirmed the stack as the cause, and tools could then be used to locate the specific reason.
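The same stack figures can also be read from inside the process with runtime.ReadMemStats. The sketch below parks a number of goroutines and prints the StackInuse field before and after; the goroutine count is arbitrary, and stackInuse is a small helper written for this example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stackInuse returns the bytes currently held in goroutine stack spans.
func stackInuse() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse
}

func main() {
	before := stackInuse()

	release := make(chan struct{})
	var started, done sync.WaitGroup
	for i := 0; i < 10000; i++ {
		started.Add(1)
		done.Add(1)
		go func() {
			defer done.Done()
			started.Done()
			<-release // park, keeping the goroutine's stack allocated
		}()
	}
	started.Wait()

	after := stackInuse()
	fmt.Printf("StackInuse before=%d KB after=%d KB\n", before/1024, after/1024)

	close(release)
	done.Wait()
}
```

Exporting StackInuse next to HeapInuse in service metrics makes a stack-dominated footprint visible without needing the pprof endpoint.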

We used perf and FlameGraph to trace function calls; here is the relevant part:

You can see that an RPC call (gRPC invoke) triggers stack growth (runtime.morestack). In other words, any RPC call made from the read or write goroutine grows its stack to twice the original size: a 4 KB stack becomes 8 KB, and the memory footprint of 1 million connections goes from 8 GB to 16 GB (full-duplex, ignoring other overhead), which is a nightmare.
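The figures in this estimate follow from a simple multiplication, spelled out below. stackFootprint is a helper written for this illustration; the inputs are the numbers from the paragraph above, and results are in decimal gigabytes, so 8 GB and 16 GB appear as 8.2 and 16.4.

```go
package main

import "fmt"

// stackFootprint returns the total stack memory in bytes for conns
// connections, each served by goroutinesPerConn goroutines whose stacks
// are stackBytes in size.
func stackFootprint(conns, goroutinesPerConn, stackBytes uint64) uint64 {
	return conns * goroutinesPerConn * stackBytes
}

func main() {
	const conns = 1000000 // 1 million connections
	const perConn = 2     // one reader and one writer per connection

	before := stackFootprint(conns, perConn, 4*1024)
	after := stackFootprint(conns, perConn, 8*1024)
	fmt.Printf("%.1f GB -> %.1f GB\n", float64(before)/1e9, float64(after)/1e9)
	// prints: 8.2 GB -> 16.4 GB
}
```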

There are many ways to solve this problem. We chose channels plus a worker group: the read and write goroutines are responsible only for traffic and connection handling, and the business logic is handed over entirely to the workers. After this optimization, each read and write goroutine occupies 4 KB of memory and its stack no longer grows during operation. A single machine (24 cores, 32 GB of memory) can hold 1 million connections and deliver 20,000 to 30,000 messages per second (up to a byte).
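A minimal sketch of the channel-plus-worker-group idea follows. It is not the push service's code: message, process, runWorkers and the pool size are hypothetical, and the real service's protocol handling is not shown. The point is only that the deep call chains run on a few worker goroutines, so the per-connection goroutines stay small.

```go
package main

import (
	"fmt"
	"sync"
)

// message is a hypothetical unit of work read from a connection.
type message struct {
	connID  int
	payload string
}

// process stands in for the business logic (RPC calls and so on) that would
// otherwise run, and grow the stack, on the per-connection goroutine.
func process(m message) string {
	return fmt.Sprintf("conn=%d len=%d", m.connID, len(m.payload))
}

// runWorkers starts a fixed pool that drains in and sends results to the
// returned channel, which is closed once every worker has exited.
func runWorkers(n int, in <-chan message) <-chan string {
	out := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range in {
				out <- process(m)
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	in := make(chan message)
	out := runWorkers(4, in)

	// Stand-in for the read goroutines: they only forward messages.
	go func() {
		for id := 0; id < 10; id++ {
			in <- message{connID: id, payload: "ping"}
		}
		close(in)
	}()

	count := 0
	for range out {
		count++
	}
	fmt.Println("processed", count) // processed 10
}
```

With this split, only the workers' stacks grow, and their number is bounded by the pool size rather than by the number of connections.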

The goal of stack shrinking, as described above, is to improve memory utilization, but shrinking involves a stack copy and a write barrier, which may affect some near-real-time applications. Fortunately Go provides a setting for this: stack shrinking can be turned off with the environment variable GODEBUG=gcshrinkstackoff=1. With shrinking turned off, you take on the risk that stacks only ever grow, so consider this carefully before disabling it.
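Because the setting lives in the environment, it is easy to lose track of which instances run with it. One option, sketched here under the assumption that the process is launched with something like GODEBUG=gcshrinkstackoff=1 ./server (where ./server is a placeholder binary name), is to report the setting at startup. shrinkDisabled is a hypothetical helper that only parses the string; it does not query the runtime.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// shrinkDisabled reports whether a GODEBUG-style string (comma-separated
// name=value pairs) contains gcshrinkstackoff=1.
func shrinkDisabled(godebug string) bool {
	for _, kv := range strings.Split(godebug, ",") {
		if strings.TrimSpace(kv) == "gcshrinkstackoff=1" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("stack shrinking disabled:", shrinkDisabled(os.Getenv("GODEBUG")))
}
```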

If you want to see the details of stack allocation, growth, copying and shrinking while the program runs, set the stackDebug constant (runtime/stack.go) to a non-zero value and recompile the program (remember to recompile the runtime as well by adding the -a flag at build time). All stack operations are then printed in detail. So far I have not found a more convenient switch, such as a GODEBUG option; if you know a better way, please let me know.

Note

All constants and code above are based on Go 1.8.3 on the Linux x86_64 architecture.
