System Knowledge Behind the Goroutine

In the three years since its birth, the Go language has become popular mostly among developers with a web-development background, and a number of introductory books have appeared. Readers with a systems background often come away from these books with a vague feeling, and some articles on the Internet contain technical descriptions that are more or less inconsistent with the facts. Hopefully this article will introduce the system knowledge behind the goroutine for web developers who lack a systems-programming background.

1. Operating Systems and Runtime Libraries
2. Concurrency and Parallelism
3. Thread Scheduling
4. Concurrent Programming Frameworks
5. Goroutine

1. Operating Systems and Runtime Libraries

For ordinary computer users, it is enough to know that applications run on top of the operating system. Developers, however, also need to understand how their programs run on top of the operating system and how the operating system serves applications, so that they can tell which services are provided by the operating system and which are provided by the runtime of the language they use.

Besides internal modules such as memory management, file management, process management, and peripheral management, the operating system provides many external interfaces for applications to use; these are the so-called "system calls". Since the DOS era, system calls have been provided in the form of soft interrupts, the famous int 21h: the program puts the function number into the AH register, puts the parameters into the other designated registers, and then issues int 21h; after the interrupt returns, the program reads the return value from a designated register (usually AL). This approach persisted until the Pentium II (the P6 microarchitecture) arrived: Windows, for example, delivered system calls through int 2Eh and Linux through int 80h, just with wider registers than before and possibly one more level of jump-table lookup. Later, Intel and AMD provided the more efficient sysenter/sysexit and syscall/sysret instructions to replace the earlier interrupts, skipping the time-consuming privilege checks and register stack operations and switching directly from the ring 3 code segment to the ring 0 one.

What functionality do system calls provide? Googling the operating system name plus the corresponding interrupt number yields a complete list (Windows, Linux). This list is the protocol by which the operating system and applications communicate; anything beyond this protocol we must implement in our own code. For memory management, for example, the operating system only provides process-level management of memory segments, such as the VirtualMemory series of APIs on Windows or brk on Linux; the operating system does not care how an application allocates memory for new objects, or how it performs garbage collection — these must be implemented by the application itself. If functionality beyond this protocol cannot be implemented on our own, we say the operating system does not support that feature. For example, Linux before 2.6 did not support multithreading: no matter how we simulated it inside a program, we could not create multiple schedulable units that operate concurrently and satisfy the semantics of the POSIX 1003.1c standard.

However, the programs we write rarely invoke the interrupt or the syscall instruction directly, because the operating system provides a layer of encapsulation. On Windows this is NTDLL.DLL, the so-called native API; we do not need to — more precisely, we cannot — call int 2Eh or syscall directly, because Windows does not publish its invocation specification, and using int 2Eh or syscall directly offers no guarantee of future compatibility. On Linux this is not a problem: the list of system calls is public, Linus values compatibility highly and does not change it, and glibc even provides syscall(2) so users can invoke a system call directly by number. Nevertheless, to address the compatibility hassles between different versions of glibc and the kernel, and to speed up certain calls (such as __NR_gettimeofday), Linux wraps some system calls in a layer of encapsulation, the vDSO (formerly linux-gate.so).
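As an aside, a system call really can be issued directly by number; here is a minimal Go sketch on Linux using the syscall package, which exposes the same raw numbered interface as glibc's syscall(2) (getpid is chosen only because it takes no arguments):

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        // Issue a Linux system call by number, the same idea as glibc's
        // syscall(2). SYS_GETPID takes no arguments, so a1..a3 are zero.
        pid, _, errno := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
        if errno != 0 {
            fmt.Println("syscall failed:", errno)
            return
        }
        fmt.Println("pid from raw system call:", pid)
    }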

Still, the programs we write rarely call ntdll or the vDSO directly; instead there is one more layer of encapsulation, which handles parameter preparation, return-value format conversion, error handling, and error-code conversion. This is the runtime of the language we use: for the C language it is glibc on Linux and kernel32 (or msvcrt) on Windows; for other languages, such as Java, it is the JRE. These "other languages" typically end up calling glibc or kernel32 in the end.

The term "runtime" actually includes not only the library files used to link with the compiled target execution program, but also the operating environment of the scripting language or bytecode-interpreted language, such as the python,c# Clr,java JRE.

Encapsulating system calls is only a small part of what a runtime does. A runtime typically also provides features that need no operating-system support at all, such as string processing, mathematical computation, and common data-structure containers, and it provides higher-level encapsulation of features the operating system does support, such as buffered and formatted IO, or thread pools.

So when we say "language XXX has added a certain feature", it usually means one of the following:
1. It supports new semantics or syntax that make problems easier to describe and solve, such as Java's generics, annotations, and lambda expressions.
2. It provides a new tool or class library that reduces the amount of code we have to write, such as argparse in Python 2.7.
3. It wraps system calls better and more comprehensively, making things that used to be impossible or difficult in that language environment easy, such as Java NIO.

But no language, including its runtime and operating environment, can create features the operating system does not support, and the same is true of Go: however flashy its features may look, they are all things other languages can also do. What Go provides is more convenient and cleaner semantics and support, which improves development efficiency.

2. Concurrency and Parallelism

Concurrency is about the logical structure of a program. A non-concurrent program is a single bamboo pole from end to end: it has only one logical control flow, that is, it is a sequential program, and at any moment the program is at exactly one point in that control flow. If a program has multiple independent logical control flows, i.e., it can deal with several things at once, we say the program is concurrent. "At once" here does not have to mean truly simultaneous at some instant of wall-clock time (that is a running state, not a logical structure); it means: if each logical control flow is drawn as a sequential timeline, those timelines can overlap.

Parallelism is about the running state of a program. If a program is being processed by more than one CPU pipeline at the same instant, we say the program is running in parallel. (Strictly speaking we cannot call a program "parallel", because "parallel" describes not the program itself but how it runs; this text will not be that pedantic, and "parallel" below means "running in parallel".) Obviously, parallelism requires hardware support.

And it is not hard to see that:

1. Concurrency is a necessary condition for parallelism. If a program is not itself concurrent, that is, if it has only one logical control flow, we cannot have it processed in parallel.

2. Concurrency is not a sufficient condition for parallelism. A concurrent program is not parallel if it is processed by only one CPU pipeline (through time slicing).

3. Concurrency is merely a more natural way to express the actual problem being solved. The original purpose of concurrency is to simplify code logic, not to make programs run faster.

These paragraphs are a little abstract, so let us make the concepts concrete with the simplest possible example: the simplest Hello World written in C is non-concurrent; if we create several threads, each printing one Hello World, the program becomes concurrent; if that concurrent program runs on an old-fashioned single-core CPU, it is still not parallel; and if we run it on a multi-core, multi-CPU machine under a multitasking operating system, the concurrent program is parallel.
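As a sketch, here is that concurrent Hello World in Go, with goroutines in place of raw threads; whether it actually runs in parallel depends on the hardware and the runtime, not on this code:

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 4; i++ { // four independent logical control flows
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                fmt.Println("Hello World from flow", id)
            }(i)
        }
        wg.Wait()
    }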

A slightly more complex example shows even better that concurrency is not necessarily parallel and that concurrency is not about efficiency: the Go language's prime-sieve example, sieve.go. We start one code fragment for each prime found so far; if the number currently being checked is divisible by the current prime, it is not a prime; if not, the number is passed on to the next prime's code fragment; and if it reaches the last prime without being divisible, it is itself a prime, and we start a new code fragment for it to check still larger numbers. This matches our logic for computing primes, and the code fragment for every prime is the same, so the program is very concise; but it cannot be parallelized, because each fragment depends on the processing results and output of the previous one.
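For reference, a version close to the classic sieve.go that ships with the Go documentation:

    package main

    import "fmt"

    // generate sends the sequence 2, 3, 4, ... into channel ch.
    func generate(ch chan<- int) {
        for i := 2; ; i++ {
            ch <- i
        }
    }

    // filter copies values from in to out, dropping those divisible by prime.
    func filter(in <-chan int, out chan<- int, prime int) {
        for {
            if i := <-in; i%prime != 0 {
                out <- i
            }
        }
    }

    func main() {
        ch := make(chan int)
        go generate(ch)
        for i := 0; i < 10; i++ {
            prime := <-ch // the first number to survive all filters is prime
            fmt.Println(prime)
            ch1 := make(chan int)
            go filter(ch, ch1, prime) // chain a new filter for this prime
            ch = ch1
        }
    }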

Concurrency can be achieved in the following ways:

1. Explicitly define multiple code fragments and trigger their execution. These are logical control flows scheduled by the application itself or by the operating system. They can be fully independent, or interdependent like the prime-number example above; another classic case is the producer/consumer problem: two logical control flows A and B, where A produces output and B, when that output becomes available, takes A's output and processes it. Threads are only one way to implement concurrency; beyond them, the runtime or the application itself has several other means, which is the main topic of the next section.

2. Implicitly place multiple code fragments and trigger execution of the corresponding fragment when a system event occurs, i.e., the event-driven approach: for example, when a port or pipe receives data (multiplexed IO), or when a process receives a signal.

Parallelism can happen at four levels:

1. Multiple machines. We naturally have multiple CPU pipelines, as with the MapReduce tasks of a Hadoop cluster.

2. Multiple CPUs. Whether real multiple CPUs, multiple cores, or hyper-threads, we have multiple CPU pipelines.

3. Instruction-level parallelism (ILP) within a single CPU core. Thanks to complex manufacturing processes, instruction decoding, branch prediction, and out-of-order execution, today's CPUs can execute several instructions in a single clock cycle, so even non-concurrent programs may execute in parallel.

4. Single Instruction, Multiple Data (SIMD). For multimedia data processing, the instruction sets of today's CPUs let one instruction operate on several pieces of data at once.

Of these, the first involves distributed processing, including data distribution and task synchronization, and is based on networking; the third and fourth are usually what compiler and CPU developers need to consider. The parallelism we discuss here is mainly the second kind: multi-core CPU parallelism within a single machine.

With regard to concurrency and parallelism, Rob Pike, one of the authors of the Go language, has a slide deck devoted to the topic: http://talks.golang.org/2012/waza.slide

The figure on this topic in CMU's famous "Computer Systems: A Programmer's Perspective" is also very intuitive.

3. Thread Scheduling

The previous section mainly covered the concepts of concurrency and parallelism, and threads are the most intuitive implementation of concurrency. This section mainly describes how the operating system lets multiple threads execute concurrently — and, of course, on multiple CPUs, in parallel. We do not discuss processes; the meaning of a process is an "isolated execution environment", not an "independent execution sequence".

To understand how to switch between multiple instruction sequences (that is, logical control flows), we first need to understand instruction control on the IA-32 CPU. The CPU determines the position of the next instruction from the value of the CS:EIP registers, but it does not allow the EIP to be changed directly with a MOV instruction; code can jump only via the JMP family of instructions, the CALL/RET instructions, or the INT interrupt instructions. When switching between instruction sequences, besides changing the EIP we must also ensure that the values of all registers the code may use — especially the stack pointer SS:ESP and the EFLAGS flags — can be restored to the state the target instruction sequence had when it was last executing at that position.

Threads are a service the operating system provides to the outside world: an application can start a thread through a system call, and the operating system is then responsible for subsequent thread scheduling and switching. Consider a single-core CPU first. The operating system kernel and the application actually share the same CPU. When the EIP is inside an application code segment, the kernel has no control: the kernel is not a process or a thread; the kernel exists only as a program resident in memory whose code segments run at ring 0. Control transfers to the kernel only when an interrupt occurs or when the application invokes a system call. In the kernel, all code lives in the same address space, and to serve different threads, the kernel builds a kernel stack for each thread; this is the key to thread switching. Usually the kernel schedules the whole system's threads on a clock interrupt or before a system call returns (for performance reasons, usually only before infrequent system calls): it computes the remaining time slice of the current thread and, if a switch is needed, computes priorities over the "runnable" thread queue, selects a target thread, saves the current thread's running environment, and restores the target thread's running environment — most importantly, switching the stack pointer ESP — and then points the EIP to where the target thread was when it was last moved off the CPU.

The Linux kernel plays a trick when implementing thread switching: rather than JMPing directly, it first switches ESP to the kernel stack of the target thread, pushes the code address of the target thread onto that stack, and then JMPs to __switch_to(), which is equivalent to forging a call to __switch_to(). The RET instruction at the end of __switch_to() then pops the target thread's code address off the stack into the EIP, and the CPU starts executing the target thread's code — which is, in fact, the place where it last stopped inside the expansion of the switch_to macro.

A few points to add: (1) Although IA-32 provides the TSS (Task State Segment) in an attempt to simplify thread scheduling for operating systems, it is inefficient and not a universal standard, which hurts portability, so mainstream operating systems make little use of it. Strictly speaking, the TSS is still used, because only through the TSS can the stack be switched to the kernel stack pointer SS0:ESP0; apart from that, its functionality goes entirely unused. (2) When a thread enters the kernel from user mode, the relevant registers and the EIP of the user-mode code have already been saved once, so the kernel-mode thread switch described above has little state left to save and restore. (3) What was described above is preemptive scheduling; the kernel and hardware drivers also actively call schedule() when waiting for external resources, and user-mode code can initiate scheduling and give up the CPU through the sched_yield() system call.

Nowadays an ordinary PC or server usually has multiple CPUs (physical packages), each CPU has multiple cores (processor cores), and each core may support hyper-threading (two logical processors per core) — hence "logical processors". Each logical processor has its own complete set of registers, including CS:EIP and SS:ESP, so from the viewpoint of the operating system and applications, each logical processor is an independent pipeline. In a multiprocessor setting, the principle of thread switching is basically the same as with a single processor: there is only one copy of the kernel code; when a clock interrupt or a system call occurs on some CPU, that CPU's CS:EIP and control return to the kernel, and the kernel switches threads according to the results of its scheduling policy. At this point, if our program implements concurrency with threads, the operating system can make our program run in parallel on multiple CPUs.

Two more points to add: (1) In multi-core scenarios the cores are not completely equal: for example, two hyper-threads on the same core share the L1/L2 caches, and on NUMA-capable systems the latency of each core's access to different regions of memory differs. Thread scheduling in multi-core scenarios therefore introduces the concept of "scheduling domains", but this does not affect our understanding of the thread-switching mechanism. (2) In multi-core scenarios, which CPU does an interrupt go to? Soft interrupts (including division by zero, page faults, and the INT instruction) are naturally raised on the CPU that triggered them. Hard interrupts fall into two cases: those each CPU has for itself, such as the clock, which each CPU handles on its own; and external interrupts such as IO, which can be directed to a specified CPU through the APIC. Because the scheduler can only control the current CPU, if IO interrupts are not distributed evenly, IO-related threads can only run on certain CPUs, causing uneven CPU load and affecting the efficiency of the whole system.

4. Concurrent Programming Frameworks

The above described how a concurrent program implemented with multiple threads is scheduled by the operating system and executed in parallel (when there are multiple logical processors). Along the way, you can see that the scheduling and switching of code fragments — logical control flows — is not mysterious. In theory, we need not rely on the operating system and the threads it provides: we can define multiple fragments within our own program's code and then schedule and switch among them inside the program itself.

For the sake of convenience, from here on we will refer to these "code fragments" as "tasks".

The implementation is similar to the kernel's, except that we need not worry about interrupts and system calls. Our program is then essentially a loop; the loop itself is the scheduler, schedule(). We maintain a task list and, according to a policy of our choice (FIFO, priority, and so on), pick a task from the list each time, restore the values of the registers, and JMP to where that task was last suspended; all the information that needs saving can be kept as attributes of the task in the task list.
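As a toy sketch of that loop in Go, with closures standing in for the saved register and stack state that a real framework would switch by hand (all names here are invented for illustration):

    package main

    import "fmt"

    // task runs one slice of work and reports whether it wants to run again.
    type task func() bool

    // schedule is the whole "scheduler": a loop over a FIFO run queue.
    func schedule(tasks []task) {
        for len(tasks) > 0 {
            var next []task
            for _, t := range tasks {
                if t() { // give the task the CPU for one step
                    next = append(next, t) // not finished: keep it queued
                }
            }
            tasks = next
        }
    }

    // counter builds a task whose "saved state" (i) lives in the closure.
    func counter(name string, n int) task {
        i := 0
        return func() bool {
            fmt.Println(name, i)
            i++
            return i < n
        }
    }

    func main() {
        // The two tasks' steps interleave: a 0, b 0, a 1, b 1, a 2.
        schedule([]task{counter("a", 3), counter("b", 2)})
    }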

It looks simple, but there are still a few problems we need to solve:

(1) We run in user mode, where there is no mechanism like interrupts or system calls to take control away from running code. Once our schedule() hands control to a task's code, when does our next scheduling event happen? The answer: it doesn't. Only if the task actively calls schedule() do we get a chance to schedule. So tasks here cannot, like threads, rely on kernel scheduling and execute without a care; our tasks must explicitly call schedule(). This is called cooperative scheduling. (Although we can register a signal handler to simulate the kernel's clock interrupt and gain control, the problem is that the signal handler is called by the kernel; when it ends, the kernel regains control and then returns to user mode, continuing along the code path that was interrupted by the signal, so we cannot switch tasks inside a signal handler.)

(2) Stacks. Like the kernel scheduling threads, we need to allocate a separate stack for each task, keep its stack information in the task's attributes, and save or restore the current SS:ESP on each task switch. Task stack space can be allocated either on the current thread's stack or on the heap, but the heap is usually better: there is then effectively no limit on the size or total number of tasks, the stack size can be grown dynamically (GCC has split stacks, but they are too complex), and a task can be migrated to another thread.

At this point, we roughly know how to construct a concurrent programming framework. But how can tasks be executed in parallel on multiple logical processors? Only the kernel can dispatch CPUs, so we still have to create threads through system calls to achieve parallelism. When multiple tasks run on multiple threads, a few more problems need consideration:

(1) If a task initiates a blocking system call, such as a long wait for IO, the current thread is placed in the kernel's wait queue for scheduling. Wouldn't the other tasks then lose their chance to execute?

In the single-threaded case we have only one solution: use non-blocking IO system calls to give up the CPU, then poll everything in one place inside schedule() and, when an fd has data, switch back to the corresponding task. A less efficient approach skips the unified polling and lets each task, when its turn to execute comes, retry its IO in a non-blocking way until data becomes available.

If we construct the whole program with multiple threads, we can wrap the system-call interfaces: when a task enters a system call, we leave the current thread to it (temporarily) and start a new thread to handle the other tasks.

(2) Task synchronization. Take the producer and consumer example from the previous section: how do we make the consumer wait until data has been produced, and trigger it to continue executing once data becomes available?

In the single-threaded case, we can define a structure whose fields hold the interaction data itself, the data's current availability state, and the IDs of the two tasks responsible for reading and writing the data. Our concurrent programming framework then provides read and write methods for tasks to call. In the read method, we loop checking whether the data is available; if not, we call schedule() to give up the CPU and wait. In the write method, we write the data into the structure, change its availability state, and return. In schedule(), we check the data's availability state and, if the data is available, activate the task waiting to read it; that task resumes its loop, finds the data available, reads it, changes the state back to unavailable, and returns. A simple sketch of the logic:

    #include <stdbool.h>

    void schedule(void);          /* provided by the framework */

    struct Chan {
        bool ready;
        int  data;
    };

    int read(struct Chan *c) {
        while (1) {
            if (c->ready) {
                c->ready = false; /* consume: mark unavailable again */
                return c->data;
            } else {
                schedule();       /* nothing yet: give up the CPU */
            }
        }
    }

    void write(struct Chan *c, int i) {
        while (1) {
            if (c->ready) {
                schedule();       /* previous value not consumed yet */
            } else {
                c->data = i;
                c->ready = true;
                schedule();       /* optional */
                return;
            }
        }
    }

Obviously, if this is multithreaded, access to this structure must be protected through the synchronization mechanisms provided by the thread library or by system calls.
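For comparison, what the hand-rolled struct Chan above provides is essentially what an unbuffered Go channel gives you, with the Go runtime doing the scheduling and, when needed, the lock protection. A minimal sketch:

    package main

    import "fmt"

    func main() {
        c := make(chan int) // unbuffered: a send blocks until a receive is ready

        go func() {
            c <- 42 // "write": parks this goroutine until main receives
        }()

        fmt.Println(<-c) // "read": prints 42
    }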

These are the most minimal design considerations for a concurrency framework. The concurrency frameworks we meet in actual development work may differ with the language and the runtime, and may make different trade-offs in functionality and usability, but the underlying principles are the same.

For example, the getcontext/setcontext/swapcontext family of library functions in glibc can conveniently save and restore task execution state, and Windows provides the Fiber series of SDK APIs; neither is a system call. Although the man pages of getcontext and setcontext sit in section 2, that is just an SVR4 historical legacy; their implementation code is in glibc, not the kernel. Likewise, CreateFiber is provided by kernel32; there is no corresponding NtCreateFiber in ntdll.

In other languages, what we have been calling "tasks" are more often called "coroutines". In C++, the most commonly used is Boost.Coroutine; Java is more troublesome because of its bytecode-interpretation layer, but there are JVM patches that support coroutines, as well as projects that rewrite bytecode dynamically to support them; PHP's and Python's generators and yield are in fact coroutine support, on top of which more general coroutine interfaces and scheduling can be built; and there are languages with native coroutine support, such as Erlang, which I am not familiar with and will not discuss. For details, see the Wikipedia page: http://en.wikipedia.org/wiki/Coroutine

Because saving and restoring task execution state requires access to CPU registers, the runtimes concerned also list the CPUs they support.

Support at the operating-system level for coroutines and their parallel scheduling seems limited to Grand Central Dispatch on OS X and iOS, and most of its functionality, too, is implemented in the runtime.

5. Goroutine

Through the goroutine, the Go language provides the clearest and most direct support for concurrent programming of any language I know of so far, and the Go documentation describes goroutine features comprehensively and in depth. Building on the system knowledge above, let us enumerate the characteristics of goroutines, by way of a summary:

(1) Goroutines are a feature of the Go runtime, not functionality provided by the operating system, and they are not implemented with threads. See pkg/runtime/proc.c in the Go source.

(2) A goroutine is a piece of code with a function entry and a stack allocated for it on the heap. Goroutines are therefore very cheap, and we can easily create tens of thousands of them; but they are not scheduled by the operating system.

(3) Apart from threads that are blocked in system calls, the Go runtime starts at most $GOMAXPROCS threads to run goroutines.

(4) Goroutines are cooperatively scheduled: if a goroutine runs for a long time, and does not synchronize by waiting to read or write channel data, it needs to actively call Gosched() to give up the CPU (see the sketch after this list).

(5) As in all other concurrency frameworks, the much-advertised "lock-free" advantage of goroutines holds only on a single thread; if $GOMAXPROCS > 1 and the coroutines need to communicate, the Go runtime must protect the data with locks. This is why an example like sieve.go runs slower with multiple CPUs and multiple threads.

(6) Handling requests in web and other server programs is essentially a parallel-processing problem: the requests are basically independent, with no dependencies and almost no data interaction between them. That is not a model of concurrent programming. A concurrent programming framework merely reduces the complexity of expressing such logic; it does not fundamentally improve processing efficiency. Perhaps because "concurrent connections" and "concurrent programming" share the word "concurrent", it is easy to fall into the misconception that concurrent programming frameworks and coroutines can handle large numbers of concurrent connections efficiently.

(7) The Go runtime encapsulates asynchronous IO, so it is easy to write servers that look highly concurrent; but even if we tune $GOMAXPROCS to exploit multi-core CPU parallelism, they will not be as efficient as a design of our own that uses IO event-driven code with thread pools proportioned by transaction type. In response time, cooperative scheduling is the weak point.

(8) The greatest value of the goroutine is that it implements the mapping of concurrent coroutines onto actually parallel thread execution, along with dynamic stack expansion. As its runtime library develops and improves, its performance will keep getting better; and especially in a future with ever more CPU cores, one day we will give up that difference in performance in exchange for the simplicity and maintainability of our code.
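A minimal sketch of points (3) and (4): with $GOMAXPROCS forced to 1, two logical flows share one thread and hand the CPU to each other by yielding cooperatively (this reflects the Go releases this article was written against; later releases also preempt long-running goroutines on their own):

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        runtime.GOMAXPROCS(1) // a single thread runs all goroutines

        done := make(chan bool)
        go func() {
            for i := 0; i < 3; i++ {
                fmt.Println("worker", i)
                runtime.Gosched() // cooperatively give up the CPU
            }
            done <- true
        }()

        for i := 0; i < 3; i++ {
            fmt.Println("main", i)
            runtime.Gosched()
        }
        <-done // channel operations also synchronize, as point (4) notes
    }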
