The Pain of Concurrency: Thread, Goroutine, Actor


This article is based on my talk at the February 27 Gopher Beijing meetup, with some additions and adjustments. It was first published in the "High Availability Architecture" public account.

Before diving into the topic, let's sort out two concepts that almost every article on concurrency starts with:

    • Concurrency is about task decomposition. Say you are the CEO of a startup and also its only employee. You wear many hats: one moment you do product planning, the next you write code, then you meet a customer. You cannot write code while meeting the customer, but because you have split the work into tasks and allocate time slices among them, it looks as if multiple tasks are in progress.
    • Parallelism is about simultaneous execution. Continuing the example: you find yourself too busy, there is not enough time to go around, so you hire an engineer, a product manager, a marketing director, and a secretary. Now multiple tasks really do execute at the same time.

To sum up: concurrency does not require parallelism and can be simulated with time slices (a multi-tasking OS on a single-core CPU, for example); what concurrency requires is that work can be cut into independently executable fragments. Parallelism is about simultaneous execution and requires multiple CPUs (or cores); a program must support concurrency before it can run in parallel. The two concepts are rarely distinguished strictly, and in most contexts "concurrency" means concurrency on top of a parallel mechanism.

Why are concurrent programs so difficult?

We believe that writing correct concurrent, fault-tolerant and scalable applications is too hard. Most of the time it's because we are using the wrong tools and the wrong level of abstraction. — Akka

The opening sentence of the Akka documentation puts it well: writing correct concurrent, fault-tolerant, scalable programs is so hard because we use the wrong tools and the wrong abstractions. (The document of course goes on to argue that Akka is the right tool, but the observation stands on its own.)

So let's review the abstractions our programs are built on. We started with procedural programming: data structures + functions. Later came object orientation, where an object bundles structures and functions, trying to model the real world as objects with state and behavior. But whether procedural or object-oriented, a function is essentially just a unit for organizing blocks of code; it says nothing about the concurrency policy of the code it contains. The concept of the thread was introduced to meet the need for concurrency.

Threads

    1. Kernel-level entities, lighter-weight than processes
    2. Scheduled by the operating system kernel
    3. Threads of the same process can share resources

Threads solved two problems. The first was responsiveness: after GUIs appeared, a concurrency mechanism was needed to keep the user interface responsive. The second was the multi-user problem brought by the growth of the Internet. The earliest CGI programs were simple: a formerly standalone program was wrapped in a process, one process per user. Obviously this could not carry many users, and processes had to share resources through inter-process communication mechanisms; threads alleviated both problems.

Threads are easy to use: if you think some code needs to run concurrently, put it in a separate thread and let the system schedule it. When to use threads, and how many, is decided by the caller — but the author of the code being called cannot know how callers will use it, and many concurrency bugs come from misuse. For example, Go's map and Java's HashMap are not concurrency-safe, and using them from multiple threads causes problems. Threads also bring complexity of their own:

    1. Race conditions. If every task is independent and shares nothing, threads are simple. But the world is complex and there is always something to share. In the earlier example, when both the developer and the marketing director need to discuss a plan with the CEO, the CEO becomes the contended resource.
    2. Dependencies and execution order. If tasks on different threads depend on each other, you need wait/notify mechanisms to coordinate them. Continuing the example: if the plan the product manager discusses with the CEO depends on the plan marketing discussed with the CEO first, you need a coordination mechanism to enforce that order.

To solve these problems, we have introduced a number of complex mechanisms:

    • Mutex (locks) (Go's sync package, Java's concurrent package): protect data through mutual exclusion, but locks noticeably reduce concurrency.
    • Semaphore: control the degree of concurrency, or notify threads via signals.
    • Volatile: Java introduced the volatile keyword specifically to reduce the use of locks in read-only scenarios.
    • Compare-and-swap: use the CAS mechanism provided by the hardware to guarantee atomicity; another way to lower the cost of locks.
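Two of these mechanisms, sketched in Go (the type and function names are illustrative): a sync.Mutex protecting a map, which is not concurrency-safe on its own, and a retry loop built on the CAS primitive from sync/atomic.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// SafeCounter protects a plain map with a mutex, since Go maps are not
// safe for concurrent writes.
type SafeCounter struct {
	mu sync.Mutex
	m  map[string]int
}

func (c *SafeCounter) Inc(key string) {
	c.mu.Lock() // mutual exclusion: reduces concurrency but keeps the map safe
	defer c.mu.Unlock()
	c.m[key]++
}

// casInc increments a counter via hardware compare-and-swap, avoiding a lock.
func casInc(addr *int64) {
	for {
		old := atomic.LoadInt64(addr)
		if atomic.CompareAndSwapInt64(addr, old, old+1) {
			return // succeeded; otherwise another thread won the race, retry
		}
	}
}

func main() {
	c := SafeCounter{m: make(map[string]int)}
	var n int64
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Inc("hits")
			casInc(&n)
		}()
	}
	wg.Wait()
	fmt.Println(c.m["hits"], n) // 100 100
}
```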

If the two problems above only added complexity, we could solve them to some degree with deeper study, rigorous code review, and thorough concurrency testing (such as running Go unit tests with the -race flag). This is controversial, of course: there is a paper arguing that most current concurrent programs only appear correct because their level of concurrency is not high enough, and that as CPU core counts keep growing and programs run longer, it becomes hard to guarantee they stay correct. But the real headache is the next question:

How many threads do you need in the system?

Let's start with hardware resources and look at the cost of a thread:

    • Memory (the thread's stack space)
      Every thread needs stack space to hold its state when it is suspended. On a 64-bit JVM the default stack size is 1024 KB — counting nothing else, just stacks, 1024 threads cost 1 GB of memory. You can shrink the stack with the -Xss parameter, but because a thread is essentially a lightweight process, the system assumes it is long-running, and a stack that is too small will overflow on a slightly deeper recursion (a complex regular-expression match, say). So tuning the parameter treats the symptom, not the cause.
    • Scheduling cost (context switch)
      I ran a non-rigorous test on my PC, simulating two threads waking each other in turn; a thread switch cost about 6000 nanoseconds. That does not account for the effect of stack size. A paper that analyzes the cost of thread switching in detail concludes that switching cost is directly correlated with stack size.

    • CPU usage
      A main goal of concurrency is to exploit our multiple cores and raise CPU utilization, squeezing the most out of the hardware. From that angle, how many threads should we use?

      We can estimate it with a formula. Suppose, say, each task spends 15 ms waiting on the network and 5 ms on the CPU, and we have 4 cores; then 100 / (15 + 5) × 4 = 20, so about 20 threads is most suitable. But network time is not fixed, and what about other bottleneck resources — locks, database connection pools? Then it gets much more complicated.

As the father of a one-year-old, I find this problem harder than the one a child-feeding program would face: "how much food is the right amount?" That question has the following answers and strategies:

    • Feed until the child stops eating (but a playful child may stop because she wants to play, not because she is full)
    • Feed until the child is full (nonsense — how do you know she is full? She can't tell you)
    • Increase the amount gradually, observe over the long term, then compute an average (this is roughly the strategy we use to tune thread counts — but how big should each increment be?)
    • Don't feed a child who has eaten herself sick (with the gradual-increment strategy you may only discover this boundary by external observation: if system performance drops as threads increase, stop adding threads)
    • No limit at all, until the child is fed to harm (a terrifying parent — but when tuning threads, it is all too easy to accidentally bring the system down this way)

The example shows how hard it is to get this right by external observation or empirical calculation. So the conclusion is:

Let the child speak for itself: teach it to eat and to say when it is full. Self-management is the best strategy.

The trouble is, a computer cannot speak for itself — so how can it self-manage?

Still, from the discussion above we can draw two conclusions:

    • Threads are expensive (memory, scheduling) and cannot be created at scale
    • The problem should be solved dynamically by the language or framework

The thread pool approach

Since Java 1.5, Doug Lea's Executor framework has shipped with the JDK; it is the canonical thread pool solution.

A thread pool controls the number of threads to some extent, reuses threads, and lowers the cost of using them. But it still does not solve the how-many problem: when the pool is initialized you must set minimum and maximum thread counts and a task-queue length, and the pool only "self-manages" within those bounds. Moreover, different tasks may have different concurrency needs, so to isolate them you may need multiple pools — and the typical Java system ends up flooded with thread pools.

A new way of thinking

From the analysis above we can see that if threads were always busy running, we would only need as many threads as CPU cores: maximum CPU utilization, minimal switching cost and memory use. But how do we get there?

Don't hog the seat if you're not using it

That is: a code fragment runs on a thread, and as soon as it cannot make progress (it needs to wait, it is blocked), it gets off. Colloquially: don't occupy the latrine if you're not going — and if you need a moment to get ready, step out first and let someone else in, because the latrine (the thread) is the scarce resource.

There are two common approaches to achieving this:

    1. Asynchronous callbacks, as in Node.js. On a blocking operation such as a network call, register a callback (which in fact carries some context data) with the I/O scheduler (on Linux, libev; the scheduler runs on a separate thread) and release the current thread to do other work. When the data is ready, the scheduler passes the result to the callback and runs it — not necessarily on the thread that issued the request, though the user does not perceive this. The problem with this approach is that it easily leads to callback hell: every blocking operation must be asynchronous, or the whole system stalls. And the asynchronous style runs against human habit; we still think most naturally in synchronous terms.

    2. Green threads / coroutines / fibers. This is not fundamentally different from the first approach; the key difference lies in how the callback context is kept and resumed. To escape callback hell, the idea is to keep writing sequential code, but when a blocking call such as I/O is hit, pause the current code fragment, save its context, and yield the current thread. When the I/O event arrives, find a thread and restore the fragment's context on it to continue. The code reads like synchronous code, as if it all ran on one thread — the system may in fact switch threads underneath, but the program never notices.

Green threads

    • Live in user space, avoiding the cost of switching between kernel mode and user mode
    • Are scheduled by the language or framework layer
    • Have much smaller stacks, allowing huge numbers of instances (millions)

A few concepts

    • Continuation: readers without an FP background may find this unfamiliar, but here the name says it all — a mechanism that lets a program pause and then, on the next call, continue from where it last paused. Effectively, a routine with more than one entry point.
    • Coroutine: an implementation of continuations, usually exposed as a language feature or library, providing mainly yield and resume.
    • Fiber: two sides of the same coin as the coroutine, but described from the system's perspective — a coroutine, once running, is a fiber.

Goroutine

The goroutine is essentially an evolution and implementation of the green-thread lineage above.

    • First, a coroutine mechanism is built in: user-space scheduling requires a way to pause and resume code fragments.
    • Second, a scheduler is built in, multiplexing coroutines over multiple threads in parallel; and by wrapping the network and other libraries, the scheduling details are hidden from the user.
    • Finally, the channel mechanism is provided for communication between goroutines, implementing the CSP concurrency model (Communicating Sequential Processes). Because channels come as language keywords, many details are hidden from the user. A Go channel is in fact the same mechanism as Java's SynchronousQueue — and with a buffer, it is essentially an ArrayBlockingQueue.
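The synchronous-versus-buffered distinction in that last point, sketched in Go (`handoff` is an illustrative helper): an unbuffered channel blocks the sender until a receiver is ready (the SynchronousQueue analogy), while a buffered channel allows a bounded amount of asynchrony (the ArrayBlockingQueue analogy).

```go
package main

import "fmt"

// handoff sends v on an unbuffered channel from a goroutine and receives it:
// sender and receiver rendezvous, like Java's SynchronousQueue.
func handoff(v string) string {
	ch := make(chan string) // capacity 0: the send blocks until the receive
	go func() { ch <- v }()
	return <-ch
}

func main() {
	// Buffered channel: bounded asynchrony, like ArrayBlockingQueue.
	buf := make(chan int, 2)
	buf <- 1 // does not block while the buffer has room
	buf <- 2
	fmt.Println(<-buf, <-buf, handoff("hello")) // 1 2 hello
}
```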

Goroutine Scheduler

This is where the widely referenced goroutine scheduler diagram usually appears; see the original blog post for the details. Just a few points worth noting here:

    1. M stands for a system thread, P for a processor (core), and G for a goroutine. Go implements M:N scheduling: a many-to-many relationship between threads and goroutines. Many green-thread/coroutine schedulers do not do this. For example, threads before Java 1.1 were actually green threads (the term originated in Java), but without many-to-many scheduling there was no real parallelism and no benefit from multiple cores, so they were later replaced with an implementation based on kernel threads.
    2. If a system thread blocks, the goroutines queued on it are migrated. There are other mechanisms too: an idle M whose global queue has no tasks may steal tasks from other Ms, a kind of rebalancing. I won't go into more detail here; there are dedicated analysis articles if you need them.
    3. The concrete strategy matches the mechanism we analyzed earlier. At system startup, a separate background thread (not part of goroutine scheduling) is started to run the netpoll loop. When a goroutine makes a network request, the network library associates the fd (file descriptor) with a pollDesc (a struct describing the netpoll state, including the goroutines blocked reading or writing that fd), then calls runtime.gopark to suspend the current goroutine. When the background netpoll loop gets an event from epoll (in a Linux environment), it takes the pollDesc out of the event, finds the associated blocked goroutines, and resumes them.

Is the goroutine a silver bullet?

Goroutines greatly reduce the cost of concurrent development — so can we just write `go func` everywhere we need concurrency and be done with it?

Go solves CPU utilization by scheduling goroutines. But what about other bottleneck resources — shared resources guarded by locks, database connections? In an Internet-facing application, if every request is thrown into its own goroutine, then when a resource becomes the bottleneck, large numbers of goroutines pile up blocked and user requests eventually time out. At that point we need a goroutine pool for flow control — and the question returns: how many goroutines should the pool have?

So the problem has not fundamentally gone away.

The Actor model

If you have never met the concept: an actor is an abstraction, similar to the object in OO. Object-oriented programming abstracts the real world as object = state + behavior (methods), but when a caller invokes an object's method, the method actually runs on the caller's CPU time slice, and the caller also decides whether it runs concurrently. That is not how the real world works. The real world is more like actors communicating by asynchronous messages: you say hi to a pretty girl; whether and how she responds is decided by her, runs in her own brain, and does not occupy yours.

So an actor has these characteristics:

    • Processing — an actor can compute, without occupying the caller's CPU time slice; its concurrency policy is its own decision.
    • Storage — an actor can hold state.
    • Communication — actors communicate by sending messages to each other.

An actor follows these rules:

    • Send messages to other actors
    • Create other actors
    • Accept and process messages, modifying its own state

The goals of the actor model:

    • Independent updates, enabling hot upgrades. Because actors are not directly coupled to each other, they are relatively independent entities that can potentially be upgraded in place.
    • Seamless bridging of local and remote calls. Because actors communicate by messages, interacting with a local actor looks the same as interacting with a remote one, erasing the local/remote distinction.
    • Fault tolerance. Communication between actors is asynchronous: the sender just sends, without worrying about timeouts or errors — those are taken over by the framework layer and an independent error-handling mechanism.
    • Easy scaling, naturally distributed. Because the communication mechanism bridges local and remote calls, when local actors cannot handle the load, actors can be started on remote nodes and messages forwarded to them.

Actor implementations:

    • Erlang/OTP: the reference implementation of the actor model; other implementations largely follow Erlang's pattern. Both hot upgrades and distribution are implemented.
    • Akka (Scala, Java): built on threads and asynchronous callbacks. Since Java has no fibers, it is thread-based, and to keep threads from blocking, every blocking operation in Akka must be made asynchronous — either through the asynchronous frameworks Akka provides or through Future/callback mechanisms. Distribution is supported; hot upgrades are not yet.
    • Quasar (Java): to solve Akka's blocking/callback problem, Quasar implements coroutines/fibers on the JVM through bytecode instrumentation, and achieves hot upgrades via the ClassLoader mechanism. The drawback is that the system must start with a javaagent to perform the bytecode instrumentation.

Golang CSP vs Actor

Both share the same maxim:

Don't communicate by sharing memory; share memory by communicating.

Message passing avoids race conditions, but the two models differ in their concrete abstractions and implementations.

    • In the CSP model, the message and the channel are the subjects; the processors are anonymous.
      That is, the sender must care about the message type and which channel to write it to, but not about who consumes it or how many consumers there are. A channel is usually type-bound — one channel carries one message type — so CSP needs an alt/select mechanism to listen on several channels at once. Channels are synchronous (Go's channels support buffers, which allow a bounded amount of asynchrony); the logic behind this is that the sender cares very much whether the message gets handled, and CSP must ensure every message is properly handled, blocking the sender until it is.
    • In the actor model, the actor is the subject; the mailbox (analogous to the CSP channel) is transparent.
      That is, the sender cares about which actor consumes the message, but not about the message type or the channel. The mailbox is therefore asynchronous, and a sender cannot assume a sent message is necessarily received and processed. An actor model must support a powerful pattern-matching mechanism, because messages of any type arrive through the same channel and must be dispatched by matching on them. The logic behind this is that the real world is inherently asynchronous and non-deterministic, so programs should adapt to programming under uncertainty. Once we went parallel, the old mental model of programming was already under challenge; the actor model bakes this into the model directly.
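The alt/select mechanism mentioned above, in Go (the function name is illustrative): a consumer listens on several typed channels at once, and the senders never know, or care, who receives.

```go
package main

import "fmt"

// consume drains two typed channels with select until both are closed,
// returning how many values of each type it saw.
func consume(ints <-chan int, strs <-chan string) (nInts, nStrs int) {
	for ints != nil || strs != nil {
		select {
		case _, ok := <-ints:
			if !ok {
				ints = nil // closed: a nil channel is never selected again
				continue
			}
			nInts++
		case _, ok := <-strs:
			if !ok {
				strs = nil
				continue
			}
			nStrs++
		}
	}
	return
}

func main() {
	ints := make(chan int, 3)
	strs := make(chan string, 2)
	for i := 0; i < 3; i++ {
		ints <- i
	}
	strs <- "a"
	strs <- "b"
	close(ints)
	close(strs)
	fmt.Println(consume(ints, strs)) // 3 2
}
```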

Seen this way, the CSP model suits Boss-Worker style task distribution: it is less invasive, and you can apply it to a specific problem inside an existing system. It does not try to solve timeouts or fault tolerance in communication — the initiator still handles those. And since the channel is explicit, although remote channels can be implemented (the netchan mechanism Go originally shipped was abandoned as too complex, and a new netchan is under discussion), it is hard to make them transparent to the user. The actor, by contrast, is a brand-new abstraction; adopting it changes the whole application's architecture and way of thinking, and it tries to solve a broader set of problems, such as fault tolerance and distribution. The actor's problem is that with current scheduling efficiency, even with goroutine-like mechanisms, it is hard to match the efficiency of a direct method call. A language that goes "everything is an actor" the way OO went "everything is an object" is bound to have efficiency problems today. So the compromise is to stay on an OO foundation and abstract the system's components as actors at some higher level.

A digression: Rust

Rust's approach to concurrency starts from admitting that real-world resources are finite and that sharing is hard to avoid entirely, so it does not try to eliminate resource sharing. Its view is that the concurrency problem lies not in sharing resources but in using shared resources incorrectly. As mentioned earlier, most languages define types without restricting how callers use them; concurrency safety is conveyed only through documentation or annotations (Java's @ThreadSafe and @NotThreadSafe, for example), which can hint but cannot prevent misuse. Go's -race mechanism can catch race conditions while running unit tests, but if your tests do not exercise enough concurrency, the races simply go undetected. So Rust's solution is:

    • When defining a type, explicitly state whether it is concurrency-safe.
    • Introduce the concept of variable ownership. Passing a non-concurrency-safe data structure between threads does not necessarily cause problems; what causes problems is multiple threads operating on it at the same time — in other words, unclear ownership. With ownership, a variable can only be manipulated by the scope that owns it, and passing a variable transfers ownership, ruling out race conditions at the language level.

With this mechanism, Rust checks and prevents race conditions at compile time rather than at run time. It raises the mental cost of development, but it is a distinctive solution that lowers the mental cost for callers and for debugging concurrency problems.

Conclusion

The revolution is not yet won; comrades, we must keep working.

This article has walked through the pain points of concurrency and the solutions to them. Each approach has its strengths and its scenarios, but the problem of concurrency is far from solved — so keep at it; everyone still has a chance.

Finally, a rough idea thrown out to attract better ones: what about implementing actors on top of goroutines?

    • Distribution. Goroutines solve the problem of single-machine efficiency; could actors on top of them attack the problem of distributed efficiency?
    • Integration with container clusters. Current auto-scaling schemes mostly monitor servers or load balancers and act on thresholds — like the feeding example earlier, an experience-based approach. If the system itself cooperated with an external cluster, scaling could be done more precisely and intelligently.
    • Self-management. The ultimate goal of the two points above is a self-managing system. Anyone who runs systems in production knows we look after them like children: monitoring every state, taking every alarm, diagnosing problems, firefighting. Children grow up one day — could systems also grow up and manage themselves? The goal looks far off now, but I think it is within sight.

References and further reading

    1. Video of this talk
    2. Slides (PDF) of this talk
    3. The CSP model paper
    4. The actor model paper
    5. Quantifying the Cost of Context Switch
    6. JCSP, a library implementing the CSP model in Java
    7. Overview of Modern Concurrency and Parallelism Concepts
    8. Discussion of Golang's netchan
    9. Quasar vs Akka
    10. Golang official blog: Concurrency is not Parallelism
    11. Go Scheduler (source of the scheduler diagram in this article)
    12. Handling 1 Million Requests per Minute with Golang, a flow-control practice using goroutines

FAQ:

From "break", a reader of the High Availability Architecture public account: You said 1024 threads need 1 GB of stack space. But a thread's address space is virtual: as long as you never touch a virtual address, no physical memory page is mapped to it. In practice each thread's stack won't be that deep, and not all of the stack space is resident in memory — so 1024 threads don't actually consume that much memory, do they?

A: You're right — the Java heap and stacks are virtual memory, and starting a thread does not immediately take that much physical memory. But threads run for a long time, and once a stack has grown, the space is not reclaimed, so usage creeps up toward the -Xss limit; the figure was meant to illustrate the cost of a thread. Besides, even with empty threads (start, then sleep), in my test a 1-core/1 GB server hangs at around 30,000 threads (after first raising the system limit in /proc/sys/kernel/threads-max). That is a long way from the millions we would like.
