Many experienced engineers have seen this error when working with JVM-based languages:
[error] (run-main-0) java.lang.OutOfMemoryError: unable to create native thread:
[error] java.lang.OutOfMemoryError: unable to create native thread:
[error] at java.base/java.lang.Thread.start0(Native Method)
[error] at java.base/java.lang.Thread.start(Thread.java:813)
...
[error] at java.base/java.lang.Thread.run(Thread.java:844)
In other words, it is an out-of-memory error caused by threads. On my laptop running Linux, this error occurs after only 11,500 threads have been created.
If you do the same thing in Go and start goroutines that sleep forever, you will see very different results. On my laptop, I was able to create 70 million goroutines before I got really bored. So why can we have so many more goroutines than threads? To answer that question, we need to take a round trip down into the operating system. This is not just an academic issue; it has a real impact on how you design software. In production, I have run into JVM thread limits multiple times, sometimes because bad code leaked threads and sometimes because engineers were simply unaware of the JVM's threading limitations.
So what exactly is a thread?
The term "thread" can mean a lot of different things. In this article, I'll use it to mean a logical thread: a series of operations in a linear order, a logical path of execution. Each CPU core can only truly execute one logical thread at a time. This poses an inherent problem: if there are more threads than cores, some threads must be paused so that others can run, and then resumed when it is their turn to execute again. To support pausing and resuming, a thread needs at least the following two things:
1. Some kind of instruction pointer. That is, which line of code was I executing when I was paused?
2. A stack. That is, what is my current state? The stack contains local variables as well as pointers to heap-allocated variables. All threads in the same process share the same heap.
Given these two things, the system has enough information to suspend a thread, let other threads run, and later resume the original thread when it is scheduled onto a CPU again. This operation is generally completely transparent to the thread. From the thread's point of view, it runs continuously. The only way a thread can notice that it was descheduled is by measuring the time between successive operations.
Back to our original question: why can we have so many goroutines?
The JVM uses operating system threads
Although the specification does not require it, as far as I know all modern, general-purpose JVMs delegate threads to the platform's operating system threads. In the following sections, I will use "user-space thread" to mean a thread that is scheduled by the language rather than by the kernel/OS. Threads implemented by the operating system have two properties that drastically limit how many of them can exist, so any solution that maps language threads 1:1 onto operating system threads cannot support massive concurrency.
In the JVM: fixed-size stacks
Using operating system threads incurs a fixed, large memory cost per thread
The first major problem with operating system threads is that each OS thread has its own fixed-size stack. Although the size is configurable, in a 64-bit environment the JVM allocates a 1 MB stack for each thread by default. You can make the default stack smaller, but you trade memory usage against an increased risk of stack overflow: the more recursion you have in your code, the more likely you are to overflow the stack. If you keep the default, 1,000 threads will use about 1 GB of RAM. RAM is much cheaper these days, but few people have the terabyte of RAM needed to run a million threads.
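The arithmetic behind these figures can be written down as a short Go sketch. It assumes every thread's full stack counts against memory, as the paragraph above does; the function name stackMemory is hypothetical.

```go
package main

import "fmt"

const mib = 1 << 20 // bytes in one mebibyte

// stackMemory returns the bytes taken by the stacks of n threads when
// each thread gets a fixed stack of stackBytes.
func stackMemory(n, stackBytes int64) int64 {
	return n * stackBytes
}

func main() {
	for _, n := range []int64{1_000, 1_000_000} {
		total := stackMemory(n, 1*mib)
		fmt.Printf("%d threads x 1 MiB = %d MiB\n", n, total/mib)
	}
}
```

A thousand threads come to roughly a gigabyte and a million threads to roughly a terabyte, which is why a 1:1 mapping cannot reach millions of threads.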
How Go behaves differently: dynamically sized stacks
Go uses a clever trick to keep the system from running out of memory on large, mostly unused stacks: a goroutine's stack is dynamically sized, growing and shrinking with the amount of data stored on it. This is not a trivial thing to do, and the design has gone through several iterations. I won't go into the internal details (plenty of blog posts and other material cover them thoroughly), but the upshot is that a new goroutine starts with a stack of only about 4 KB. At 4 KB per stack, 1 GB of RAM holds roughly 250,000 goroutine stacks, a huge improvement over Java's 1 MB per thread.
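The growable stack can be observed from ordinary Go code. In the sketch below, the function name depth and the depth of 100,000 are choices made for this illustration. The recursion goes far deeper than a stack of a few kilobytes could hold, so it only completes because the runtime grows the goroutine's stack on demand.

```go
package main

import "fmt"

// depth recurses n levels and returns n. Each frame holds a small buffer,
// so the goroutine's stack has to grow well beyond its initial size.
func depth(n int) int {
	var pad [64]byte
	pad[0] = byte(n)
	if n == 0 {
		return int(pad[0])
	}
	return 1 + depth(n-1)
}

func main() {
	done := make(chan int)
	go func() { done <- depth(100_000) }() // run the recursion in a fresh goroutine
	fmt.Println("recursion depth reached:", <-done)
}
```

The goroutine starts small and pays for a large stack only when the recursion actually needs it.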
In the JVM: latency for context switching
Because of context-switching latency, operating system threads only scale to tens of thousands of threads
Because the JVM uses operating system threads, it relies on the operating system kernel to schedule them. The operating system keeps a list of all running processes and threads, and tries to give each of them a "fair" share of CPU time. When the kernel switches from one thread to another, it has a lot of work to do. The newly running thread or process must be given a view of the world that hides the fact that other threads are running on the same CPU. I won't go into the details here, but if you are curious, there is plenty of material to read. The important point is that a context switch takes on the order of 1 to 100 microseconds. That may not sound like much, but at a fairly realistic 10 microseconds per switch, if you want to schedule each thread at least once per second, you can only run about 100,000 threads per core. And that does not actually leave the threads any time to do useful work.
How Go behaves differently: running multiple goroutines on one operating system thread
Go implements its own scheduler, which allows many goroutines to run on the same OS thread. Even if Go performed the same context switch as the kernel, it would avoid switching into ring 0 to run the kernel and then switching back, which saves a significant amount of time. However, this is only a back-of-the-envelope analysis. To actually support millions of goroutines, Go needs to do something more sophisticated.
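A rough way to see many goroutines sharing one OS thread is to limit the runtime with runtime.GOMAXPROCS(1), which allows only one OS thread to execute Go code at a time. The sketch below, with the hypothetical helper sumOnOneThread, runs thousands of goroutines under that limit and checks that all of them still complete.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumOnOneThread runs n goroutines while Go code is limited to one OS
// thread at a time, and returns the sum 1 + 2 + ... + n they compute.
func sumOnOneThread(n int) int {
	prev := runtime.GOMAXPROCS(1) // at most one thread executes Go code
	defer runtime.GOMAXPROCS(prev)

	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			runtime.Gosched() // yield so the goroutines interleave
			mu.Lock()
			total += i
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println("sum computed by 10,000 goroutines:", sumOnOneThread(10_000))
}
```

Switching between these goroutines is done by the Go scheduler in user space rather than by the kernel.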
Even if the JVM moved threads into user space, it still could not support millions of threads. Suppose that in such a redesigned system, switching between threads takes only 100 nanoseconds. Even if all you did was context switch, you could only run about a million threads if you wanted to schedule each thread 10 times per second. More importantly, you would be maxing out the CPU just to do so. Supporting truly massive concurrency requires another optimization: only schedule a thread when you know it can do useful work. If you are running that many threads, only a small fraction of them can be doing useful work at any moment. Go achieves this by integrating channels with the scheduler. If a goroutine is waiting on an empty channel, the scheduler can see that and will not run the goroutine. Go goes one step further and puts the mostly idle goroutines on their own operating system thread. This way, the active goroutines (of which there are hopefully far fewer) can be scheduled on one thread, while the millions of mostly sleeping goroutines are tended to separately. This helps keep latency down.
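The channel behavior described above can be sketched in a few lines of Go. The helper name parkAndWake and the counts are hypothetical. Each goroutine blocks on an empty channel and does no work until a value is sent to it.

```go
package main

import (
	"fmt"
	"sync"
)

// parkAndWake starts n goroutines that block receiving from an empty
// channel, then sends one value per goroutine and returns how many woke.
func parkAndWake(n int) int {
	work := make(chan int) // unbuffered: receivers block until a send
	results := make(chan int, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v := <-work // blocked here; not runnable until a value arrives
			results <- v
		}()
	}
	for i := 0; i < n; i++ {
		work <- 1 // each send hands one value to one waiting goroutine
	}
	wg.Wait()
	close(results)
	woken := 0
	for v := range results {
		woken += v
	}
	return woken
}

func main() {
	fmt.Println("goroutines woken:", parkAndWake(100_000))
}
```

While the goroutines wait on the channel they cost memory but no CPU time, which is what lets a program hold very large numbers of mostly idle goroutines.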
Unless Java adds language features that the scheduler can observe, this kind of intelligent scheduling is not possible. However, you can build a runtime scheduler in user space that knows when a thread can do work. This is the basis of frameworks like Akka, which can support millions of actors.
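To illustrate the idea of a user-space scheduler that only runs work that is ready, here is a toy sketch in Go. It is not Akka's API or implementation: the actor type and the runOnce function are hypothetical, and a real scheduler would keep a queue of ready actors instead of scanning all of them.

```go
package main

import "fmt"

// actor is a hypothetical stackless actor: some state plus an inbox.
type actor struct {
	inbox []int
	sum   int
}

// runOnce performs one scheduling pass. Actors with an empty inbox are
// skipped; every other actor handles exactly one message to completion.
// It returns the number of actors that ran.
func runOnce(actors []*actor) int {
	ran := 0
	for _, a := range actors {
		if len(a.inbox) == 0 {
			continue // idle actor: no work is done for it in this pass
		}
		msg := a.inbox[0]
		a.inbox = a.inbox[1:]
		a.sum += msg // the "message handler": add the message to the state
		ran++
	}
	return ran
}

func main() {
	actors := make([]*actor, 1000)
	for i := range actors {
		actors[i] = &actor{}
	}
	actors[3].inbox = append(actors[3].inbox, 5)
	actors[7].inbox = append(actors[7].inbox, 2, 4)
	fmt.Println("actors run in pass 1:", runOnce(actors))
	fmt.Println("actors run in pass 2:", runOnce(actors))
}
```

Because a handler runs to completion inside the scheduling loop, a slow handler would stall every other actor in this sketch.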
The transition from the operating system threading model to a lightweight, user-space threading model keeps happening and will probably continue to happen. For highly concurrent use cases, it is the only option. However, it comes with considerable complexity. If Go had chosen OS threads instead of its own scheduler and growable stacks, its runtime could have been thousands of lines of code smaller. For many use cases, the user-space model really is the better one. The complexity can be abstracted away by language and library writers, so that software engineers can write massively concurrent programs.
Thanks to Leah Alpert for reading the first draft of this article.
1. Hyper-threading doubles the effective number of cores. Instruction pipelining can also increase the parallelism a CPU achieves. For now, though, it is still O(numCores).
2. There may be special cases where this statement is not true; I am sure someone will point them out to me.
4. Go initially used a segmented stack model, in which the stack could expand into separate areas of memory that were tracked with some very clever bookkeeping. A later implementation improved performance in certain cases by replacing segmented stacks with contiguous stacks. This works much like resizing a hash table: a new, larger stack is allocated and, with some very tricky pointer manipulation, everything is carefully copied into it.
5. Threads can signal their priority by calling nice (see man nice), which gives them more control over how often they are scheduled.
6. Actors serve the same purpose for Scala/Java as goroutines do for Go by supporting massive concurrency. As with goroutines, the actor scheduler can see which actors have messages in their inbox and only runs actors that are ready to do useful work. You can actually have even more actors than goroutines, because actors do not need a stack. However, this also means that if an actor does not process a message quickly, the scheduler is blocked (because the actor has no stack of its own, it cannot be paused in the middle of processing a message). A blocked scheduler means that messages are not processed and the system will quickly run into trouble. This is a trade-off.
7. In Apache, each request is handled by an OS thread, which limits Apache to only thousands of concurrent connections. Nginx chose a different model, in which one OS thread handles hundreds or even thousands of concurrent connections, allowing a much higher degree of concurrency. Erlang uses a similar model, which allows millions of actors to run concurrently. gevent brings greenlets (user-space threads) to Python, enabling a much higher degree of concurrency than would otherwise be possible (Python threads are OS threads).
Article Source: http://www.infoq.com/cn/articles/a-million-go-routines-but-only-1000-java-threads?useSponsorshipSuggestions=true