Pulling High Concurrency Off Its Pedestal

Source: Internet
Author: User
Tags: epoll

High concurrency has been a buzzword for the past few years, especially in Internet circles: open a conversation without a high-concurrency war story and you're almost embarrassed to show your face.

Is high concurrency really that fearsome, though? Tens of thousands of concurrent connections, billions of requests of traffic — it certainly sounds scary. But think about it: doesn't all of that concurrency and traffic pass through a router?

It All Starts at the NIC

High-concurrency traffic rides the humble router into our system, and the first gate it hits is the network card (NIC). How does a NIC withstand high concurrency?

The question doesn't really arise: to a NIC, everything is electrical signals, and it cannot tell tens of thousands of concurrent connections from one big torrent of data. That is why a NIC's prowess is measured in bandwidth; nobody ever talks about a NIC's concurrency.

The NIC sits at the physical and link layers. Data eventually reaches the network layer (the IP layer), where IP addresses appear and your tens of thousands of concurrent connections can finally be told apart.

So the network layer can proudly claim to have solved high concurrency and go out bragging. And who owns the network layer? The star here is the router, a device that lives mainly at the network layer.

Easily Confused

Unless we are specialists, we tend to lump the network layer (IP) and the transport layer (TCP) together. The operating system provides both, and to us they are transparent, low-key, and so reliable that we can safely ignore them.

The bragging begins at the application layer, and everything at the application layer starts with the socket: the traffic eventually crosses the transport layer and fans out into tens of thousands of sockets, and the real boast is how quickly you can service them all. So what is the difference between processing IP-layer packets and processing sockets?

No Connections, No Waiting

The key difference is that the IP layer is connectionless while sockets are connection-oriented. The IP layer has no concept of a connection: it processes one packet, moves on to the next, and never has to look after what came before.

Handling a socket is a different matter; there you must be attentive. A socket is connection-oriented and carries context. If you read the sentence "I love you" and get excited for half a day without looking at what came before and after it, that is just blind excitement.

Looking backwards and forwards takes context, which costs more memory and more waiting, and different connections must be properly isolated from one another, which means allocating separate threads (or coroutines) to them. Solving all of this well turns out to be a bit difficult.

Thanks to the Operating System

The operating system is a wonderful thing. On Linux, all IO is abstracted as files, and network IO is no exception: it is abstracted as the socket.

But the socket is more than an IO abstraction; the OS also abstracts how sockets are handled, most famously through select and epoll.

The well-known Nginx, Netty, and Redis are all built on epoll, and these three are practically standard equipment anywhere tens of thousands of concurrent connections are involved.

Many years ago, though, Linux offered only select, which can handle only a small number of connections; epoll was designed specifically for high concurrency. Thank you, operating system.

The operating system does not solve every high-concurrency problem, however. It only moves data quickly from the NIC into our application; what to do with it after that is the protracted war.

One of the operating system's missions is to squeeze the most out of the hardware, and that is the most direct and effective way to attack high concurrency; distributed computing comes second.

Nginx, Netty, and Redis, mentioned above, are all examples of squeezing the most out of the hardware. So how is that done?

The Core Contradiction

To get the most out of the hardware, first find the core contradiction. In my view that contradiction has barely changed since the birth of the computer: it is the mismatch between the CPU and IO.

The CPU has advanced brutally fast along Moore's law, while IO devices (disks, NICs) have crawled. Turtle-speed IO becomes the performance bottleneck and inevitably drags CPU utilization down, so raising CPU utilization is almost a synonym for exploiting the hardware.

Interrupts and Caches

CPU and IO devices cooperate mostly through interrupts. Take a disk read: the CPU merely hands the disk driver an instruction to read from disk into memory, then returns immediately.

The CPU is then free to do other work while the time-consuming disk-to-memory transfer runs. When the disk controller finishes executing the instruction, it sends an interrupt request to the CPU to announce that the task is complete; the CPU services the interrupt, and from then on it can operate directly on the data now sitting in memory.

The interrupt mechanism lets the CPU handle IO at minimal cost. So how do we raise the utilization of the devices themselves? The answer is caching.

The operating system internally maintains caches for IO device data, both read caches and write caches. Read caching is easy to understand; at the application layer, too, we routinely use caches to avoid issuing read IO wherever possible.

Write caching is used less at the application layer; the operating system's write cache exists to make IO writes more efficient.

When flushing write IO from the cache, the operating system merges and schedules the operations; disk writes, for example, use an elevator scheduling algorithm.

Efficient Use of the Network Card

The first problem to solve in high concurrency is how to use the NIC efficiently. Like the disk, the NIC has an internal cache: received network data lands in the NIC cache first, is then written into operating-system kernel space (memory), and our application finally reads it from memory and processes it.

Besides the NIC cache, the TCP/IP stack also has a send buffer and a receive buffer, plus the SYN backlog queue and the accept backlog queue.

If these caches and queues are not configured properly, all sorts of problems follow. For example, during TCP connection establishment, if concurrency is high and the backlog Nginx sets on its listening socket is too small, large numbers of connection requests will fail. A minimal illustration follows.
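To make the accept backlog concrete, here is a minimal Java sketch (the port and backlog value are arbitrary choices of mine, not anything from Nginx): the backlog is simply the second argument when binding a listening socket, capping how many fully established connections may queue up waiting to be accepted.

```java
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class BacklogDemo {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        // Second argument is the accept backlog: the maximum number of fully
        // established connections waiting to be accepted. Under a connection
        // storm, a value that is too small means refused connections.
        server.bind(new InetSocketAddress(8080), 1024);
        System.out.println("listening on 8080 with backlog 1024");
    }
}
```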

If the NIC cache is too small, then once it fills up the NIC simply discards newly arrived data, and packets are lost.

And of course, if our application is slow at reading network IO, data only piles up in the NIC cache faster. So how do we read network data efficiently? On Linux, the widely used answer today is epoll.

The operating system abstracts IO devices as files and the network as sockets; a socket is itself a file, so the read/write calls can be used to receive and send network data. In a high-concurrency scenario, how do we use sockets to read and send data quickly?

To use IO efficiently, you must understand the IO models at the operating-system level. The classic UNIX Network Programming summarizes five of them:

Blocking IO

Non-blocking IO

Multiplexed IO

Signal-driven IO

Asynchronous IO

Blocking IO

For example, when we call read on a socket whose receive buffer happens to be empty (the peer has not sent anything), the operating system suspends the thread that called read; only when data arrives in the receive buffer does the operating system wake the thread up.

Naturally, read returns the data as the thread wakes. As I understand it, "blocking" means exactly this: the operating system suspends the thread.
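A minimal sketch of blocking IO in Java (the host, port, and request are placeholders of mine): the read call below parks the calling thread until bytes arrive.

```java
import java.io.InputStream;
import java.net.Socket;

public class BlockingReadDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80)) {
            socket.getOutputStream()
                  .write("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n".getBytes());
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4096];
            int n;
            // read() blocks: the OS suspends this thread until the socket's
            // receive buffer has data (or the peer closes the connection).
            while ((n = in.read(buf)) != -1) {
                System.out.write(buf, 0, n);
            }
        }
    }
}
```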

Non-blocking IO

With non-blocking IO, if the socket's receive buffer is empty, the operating system does not suspend the thread that called read; it returns an EAGAIN error code immediately instead.

In this model you can call read in a polling loop until data appears in the socket's receive buffer. The drawback of the approach is that it burns a great deal of CPU.
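The same idea in Java NIO (a sketch; host, port, and request are mine): with the channel switched to non-blocking mode, read returns 0 instead of parking the thread, which is Java's analogue of EAGAIN.

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class NonBlockingReadDemo {
    public static void main(String[] args) throws Exception {
        SocketChannel ch = SocketChannel.open(new InetSocketAddress("example.com", 80));
        ch.write(ByteBuffer.wrap("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n".getBytes()));
        ch.configureBlocking(false);                // switch to non-blocking mode
        ByteBuffer buf = ByteBuffer.allocate(4096);
        int n;
        // Busy-poll: read() returns 0 immediately when no data is available
        // (the EAGAIN case), so this loop spins and burns CPU -- exactly the
        // drawback described above.
        while ((n = ch.read(buf)) == 0) {
            // spin
        }
        System.out.println("read " + n + " bytes");
        ch.close();
    }
}
```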

Multiplexed IO

With blocking IO, the operating system suspends the calling thread, so to handle multiple sockets at the same time you must create a matching number of threads.

Threads consume memory and add thread-switching load on the operating system, so this model does not suit high-concurrency scenarios. Is there a way to keep the thread count down?

Non-blocking IO looks promising: polling multiple sockets from a single thread seems to solve the thread-count problem. In practice, though, the scheme does not work.

The reason is that read is a system call, implemented via a soft interrupt that switches between user mode and kernel mode, and therefore slow; polling many sockets means paying that cost over and over.

But the idea is right. Is there a way to avoid all those system calls? Yes: multiplexed IO.

On Linux, the select and epoll system APIs support IO multiplexing: through either of them, a single system call can monitor many sockets, and the call returns as soon as any one socket has data in its receive buffer.

You then read from whichever sockets are readable. If every monitored socket's receive buffer is empty, the call blocks; that is, the thread calling select/epoll is suspended.

So select/epoll is in essence still blocking IO; the difference is that one call can watch many sockets at once.
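Java's NIO Selector is a thin wrapper over this mechanism (backed by epoll on Linux). A minimal sketch of one thread watching many already-connected channels (the method and surrounding names are mine):

```java
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class MultiplexDemo {
    // One thread, one system call, many sockets: the essence of multiplexed IO.
    static void watch(Iterable<SocketChannel> channels) throws Exception {
        Selector selector = Selector.open();
        for (SocketChannel ch : channels) {
            ch.configureBlocking(false);
            ch.register(selector, SelectionKey.OP_READ);
        }
        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                     // blocks until some socket is readable
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SocketChannel ready = (SocketChannel) it.next().channel();
                it.remove();
                buf.clear();
                int n = ready.read(buf);           // guaranteed readable: returns at once
                if (n == -1) { ready.close(); continue; }
                System.out.println("read " + n + " bytes");
            }
        }
    }
}
```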

The Difference Between select and epoll

Why does the multiplexed IO model need two system APIs? My reading is that select is defined in the POSIX standard but its performance is not good enough, so the various operating systems introduced faster APIs of their own: epoll on Linux, IOCP on Windows.

As for why select is slow, two reasons are commonly accepted:

One is that after select returns, you must scan all of the monitored sockets, not just the ones that changed.

The other is that every call to select copies the file-descriptor bitmaps between user space and kernel space (three copy_from_user calls, for the read, write, and exception bitmaps).

epoll avoids both of these costs.

The Reactor Multithreading Model

On Linux the most reliable and stable IO model is multiplexing, so how can our applications put it to good use?

Years of practice distilled into the Reactor pattern, which is now used very widely; the famous Netty and Tomcat's NIO connector are both based on it.

At the Reactor's core sit an event dispatcher and event handlers. The dispatcher is the hub connecting multiplexed IO to the data-processing code: it listens for socket events (via select/epoll_wait) and dispatches them to the event handlers. Both the dispatcher and the handlers can be backed by thread pools.

One detail worth mentioning: socket events come in two main kinds, connection requests and read/write requests. Handling a connection request successfully creates a new socket, and read/write requests then arrive on that newly created socket.

So in network processing, a Reactor implementation ends up a little roundabout, but the principle is unchanged.

For an implementation walkthrough, see Doug Lea's "Scalable IO in Java" (http://gee.cs.oswego.edu/dl/cpjslides/nio.pdf).

[Figure: Reactor schematic diagram]
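Below is a condensed single-threaded Reactor in the spirit of Doug Lea's slides; this is my own sketch rather than his verbatim code, and the port number is arbitrary. The dispatcher attaches a handler to each selection key, so the event loop does nothing but dispatch: the acceptor handles connection requests, and each accepted socket gets its own read/write handler.

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class Reactor implements Runnable {
    final Selector selector;
    final ServerSocketChannel server;

    Reactor(int port) throws Exception {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(port));
        server.configureBlocking(false);
        // The acceptor is the handler for connection-request events.
        server.register(selector, SelectionKey.OP_ACCEPT).attach((Runnable) this::accept);
    }

    public void run() {                                // the event dispatch loop
        try {
            while (true) {
                selector.select();
                for (SelectionKey key : selector.selectedKeys()) {
                    ((Runnable) key.attachment()).run();   // dispatch to handler
                }
                selector.selectedKeys().clear();
            }
        } catch (Exception e) { e.printStackTrace(); }
    }

    void accept() {                                    // connection-request handler
        try {
            SocketChannel ch = server.accept();
            if (ch == null) return;
            ch.configureBlocking(false);
            SelectionKey key = ch.register(selector, SelectionKey.OP_READ);
            key.attach((Runnable) () -> echo(ch));     // read/write handler
        } catch (Exception e) { e.printStackTrace(); }
    }

    void echo(SocketChannel ch) {                      // read/write handler: echoes input
        try {
            ByteBuffer buf = ByteBuffer.allocate(4096);
            if (ch.read(buf) == -1) { ch.close(); return; }
            buf.flip();
            ch.write(buf);
        } catch (Exception e) { try { ch.close(); } catch (Exception ignored) {} }
    }

    public static void main(String[] args) throws Exception {
        new Reactor(9090).run();
    }
}
```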

Nginx Multi-process model

Nginx defaults to a multi-process model, split into a master process and worker processes.

Only the worker processes actually listen for and handle network requests. All of the workers listen on the same port (80 by default), but each request is handled by exactly one worker.

The trick is that before accepting a request, every worker process must compete for a lock, and only the process holding the lock may handle the current connection.

Each worker process has a single main thread, and the beauty of a single thread is handling requests without locks; lock-free handling of concurrent requests is about the highest state attainable in a high-concurrency scenario. (See http://www.dre.vanderbilt.edu/~schmidt/PDF/reactor-siemens.pdf)

Data passes through the NIC, the operating system, and the networking middleware (Tomcat, Netty, and so on), and finally lands in the hands of us application developers. So how do we handle these highly concurrent requests?

Breaking the Bucket Theory

We should consider this question from the standpoint of raising a single machine's processing power, and in real application scenarios the problem becomes how to raise CPU utilization (the CPU being the component that has developed the fastest).

The bucket theory says the shortest plank sets the water level, so why raise the utilization of the CPU rather than of the short plank, IO?

The answer is that in practical applications, raising CPU utilization tends to raise IO utilization along with it.

Of course, once IO utilization approaches its limit, raising CPU utilization further is pointless. Let's first look at how to raise CPU utilization, and afterwards at how to raise IO utilization.

Parallelism and concurrency

The main route to higher CPU utilization is parallel computation on multicore CPUs, and note that concurrency and parallelism are different things.

On a single-core CPU we can listen to MP3s while coding. That is concurrency, but not parallelism: from the single core's point of view, listening and coding never truly happen at the same instant.

Parallel computing only became possible in the multicore era. Parallel computing in the abstract is a lofty subject; in industrial practice two models dominate: the shared-memory model and the message-passing model.

Multithreaded Design Patterns

The shared-memory model traces back to a paper the master Dijkstra wrote half a century ago (1965), "Cooperating Sequential Processes".

That paper introduced the famous concept of the semaphore; the wait/notify used for thread synchronization in Java is likewise an implementation of the semaphore idea.

If you cannot digest the master's original, there is no shame in learning it secondhand; after all, the master's true disciples are few.

In Japan, Hiroshi Yuki distilled his multithreaded-programming experience into a book called "Java Multithreaded Design Patterns", which is pleasantly down to earth (that is, actually readable). A brief tour of its patterns follows.

Single Threaded Execution

This pattern turns multithreaded access into single-threaded access. When multiple threads touch the same variable at the same time, all sorts of inexplicable problems appear; this design pattern simply forces the threads through one at a time, which makes things safe, and of course slower.

The simplest implementation is to guard the unsafe code blocks (or methods) with synchronized.

Concurrency theory has the concept of a critical section, and I feel this pattern and the critical section are the same thing.
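A minimal sketch (the class and field names are mine): without the lock, two threads doing count++ can lose updates; serializing the methods with synchronized removes the race.

```java
public class Counter {
    private long count = 0;

    // Only one thread may execute this at a time: Single Threaded
    // Execution via synchronized.
    public synchronized void increment() {
        count++;               // read-modify-write, unsafe without the lock
    }

    public synchronized long get() {
        return count;
    }
}
```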

Immutable Pattern

If a shared variable never changes, then any number of threads can access it without trouble; it is safe forever. The pattern is simple, but a good one, and it solves a great many problems.
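A sketch of an immutable value class (the names are mine): fields are final and set once in the constructor, there are no setters, so instances can be shared freely across threads without locks.

```java
public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // No setters: state never changes after construction, so any
    // number of threads may read a Point safely.
    public int getX() { return x; }
    public int getY() { return y; }
}
```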

Guarded Suspension Pattern

This pattern is really a wait-notify model: when a thread's execution condition is not met, the thread is suspended (wait); when the condition is met, the waiting threads are woken (notify). In Java, synchronized plus wait/notifyAll implements a wait-notify model very quickly.

Hiroshi Yuki likens this pattern to a multithreaded version of if, which I find very apt.
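A minimal sketch of a guarded queue (names are mine): take suspends until the guard condition, a non-empty queue, holds.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class GuardedQueue<T> {
    private final Queue<T> items = new ArrayDeque<>();

    public synchronized void put(T item) {
        items.add(item);
        notifyAll();               // the condition may now hold: wake the waiters
    }

    public synchronized T take() throws InterruptedException {
        while (items.isEmpty()) {  // the guard: a "multithreaded if" made a while
            wait();                // suspend until notified
        }
        return items.remove();
    }
}
```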

Balking

This pattern is similar to the previous one; the difference is that when the execution condition is not met, the thread exits immediately rather than suspending as in the previous pattern.

The most widely used scenario is the multithreaded version of the singleton pattern: if the object has not been created, create it; if it already exists (the condition for creating is no longer met), balk and exit without creating another.
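A sketch of balking during one-time initialization (names are mine): if init has already run, the call returns at once instead of waiting.

```java
public class OneShotInit {
    private boolean initialized = false;

    public synchronized void init() {
        if (initialized) {
            return;            // condition no longer met: balk, don't wait
        }
        initialized = true;
        // ... perform the expensive one-time setup here ...
    }
}
```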

Producer-consumer

The producer-consumer pattern is known the world over. The version I meet most often has one thread doing IO (querying a database, say) and one or more threads processing the IO results, so that IO and CPU are both kept fully busy.

If both the producer and the consumer are CPU-intensive, then setting up producer-consumer machinery is just making trouble for yourself.
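In Java the standard plumbing is a BlockingQueue (capacity and item names are mine): the producer blocks when the queue is full and the consumer blocks when it is empty, so the two sides pace each other automatically.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerDemo {
    public static void main(String[] args) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 1000; i++) {
                    queue.put("row-" + i);     // blocks if the queue is full
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String row = queue.take(); // blocks if the queue is empty
                    System.out.println("processing " + row);
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        producer.start();
        consumer.start();
    }
}
```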

Read-write Lock

Read-write locks address performance in read-mostly, write-rarely scenarios: reads may proceed in parallel, but a write is performed by only one thread at a time.

If writes are very rare and read concurrency is very high, consider copy-on-write; personally I think copy-on-write deserves to be treated as a pattern in its own right.
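A sketch with java.util.concurrent (class and method names are mine): many readers may hold the read lock together, while the write lock is exclusive. For the write-rarely extreme, the JDK's CopyOnWriteArrayList packages the copy-on-write idea ready-made.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CachedConfig {
    private final Map<String, String> map = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public String get(String key) {
        lock.readLock().lock();        // many readers may enter together
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    public void put(String key, String value) {
        lock.writeLock().lock();       // writers are exclusive
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```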

Thread-per-message

This is what we usually call thread-per-request: each incoming message gets a fresh thread (the sketch after the next pattern contrasts it with the worker-thread approach).

Worker Thread

An upgraded version of thread-per-request: a thread pool solves the performance problems caused by the frequent creation and destruction of threads. BIO-era Tomcat used exactly this model.
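A sketch contrasting the two patterns (the request handler is a placeholder): thread-per-message spawns a new thread for every request, while the worker-thread pattern reuses a fixed pool of workers.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerThreadDemo {
    public static void main(String[] args) {
        Runnable request = () ->
                System.out.println("handled by " + Thread.currentThread().getName());

        // Thread-per-message: one new thread per request (expensive at scale).
        new Thread(request).start();

        // Worker thread: a fixed pool of workers pulls requests off an
        // internal queue, so threads are created once and reused.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 100; i++) {
            pool.execute(request);
        }
        pool.shutdown();
    }
}
```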

Future

When you have to call a time-consuming synchronous method but would rather get on with other work in the meantime, consider this pattern; it is essentially a synchronous-to-asynchronous converter.

Turning synchronous into asynchronous essentially means starting another thread, so this pattern is somewhat related to thread-per-message.
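A sketch (the slow call is simulated with a sleep): submit returns a Future immediately, we do other work, and get blocks only for whatever time remains.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Run the slow synchronous call on another thread; we immediately
        // get back a Future, an IOU for the eventual result.
        Future<String> future = pool.submit(() -> {
            Thread.sleep(2000);        // stand-in for a slow query or RPC
            return "result";
        });

        System.out.println("doing other work...");
        System.out.println("got: " + future.get());  // block here only if needed
        pool.shutdown();
    }
}
```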

Two-Phase Termination

This pattern addresses the need to terminate a thread gracefully: first request termination, then let the thread finish its current work and clean up before exiting.
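The idiomatic Java sketch uses interruption as the termination request, with cleanup as the second phase (names are mine).

```java
public class GracefulWorker extends Thread {
    @Override
    public void run() {
        try {
            while (!isInterrupted()) { // phase 1: work until asked to stop
                doWork();
            }
        } finally {
            cleanup();                 // phase 2: release resources before exiting
        }
    }

    private void doWork() { /* one unit of work */ }

    private void cleanup() { System.out.println("cleaned up"); }
}
// Elsewhere: worker.interrupt() politely requests termination.
```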

Thread-specific Storage

Thread-local storage is a weapon for avoiding the cost of locking and unlocking altogether. C# has a concurrent container, ConcurrentBag, built on this pattern.

HikariCP, the fastest database connection pool on the planet, borrowed ConcurrentBag's implementation and built a Java version; interested readers can look it up.
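In Java the building block is ThreadLocal. The SimpleDateFormat example below is my own choice, a classic one since that class is not thread-safe: giving each thread its own copy removes the need for locks.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class ThreadLocalDemo {
    // One SimpleDateFormat per thread: no sharing, hence no locks, even
    // though SimpleDateFormat itself is not thread-safe.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd"));

    public static String today() {
        return FORMAT.get().format(new Date());
    }

    public static void main(String[] args) {
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + ": " + today());
        new Thread(task).start();
        new Thread(task).start();
    }
}
```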

Active Object (not covered in detail here)

This pattern is the final palm of the Eighteen Dragon-Subduing Palms: it combines the patterns that came before and is a bit complex; personally I feel its value lies more in reference than in implementation.

Several related books by Chinese authors have appeared recently, but on the whole Hiroshi Yuki's book is still the one that best withstands scrutiny. When solving concurrency problems on the shared-memory model, the crux is using locks well.

But using locks well remains difficult, which is why the message-passing model was invented later.

The Message-Passing Model

The difficulty of the shared-memory model is real, and you cannot in general prove such a program correct; we all accidentally write the occasional deadlock. And whenever there is a pain point, some master eventually steps forward.

So the message-passing model was born (in the 1970s). It has two important branches: the Actor model and the CSP model.

The Actor Model

The Actor model became famous through Erlang, and later Akka appeared. In the Actor model there is no notion of the operating system's processes or threads, only actors; you can think of an actor as a more versatile, easier-to-use thread.

Inside an actor, processing is linear (single-threaded), and actors interact only through messages; that is, actors are not allowed to share data. With no sharing there is no need for locks, which avoids all of the locks' side effects.

Creating an actor is no different from newing up an object: fast and small, nothing like the slow, expensive business of creating a thread.

Scheduling an actor also does not cause an operating-system context switch (mainly the saving and restoring of registers) the way scheduling a thread does, so scheduling overhead is tiny as well.

Actors have one further, somewhat controversial, advantage: the Actor model is closer to the real world. The real world is also distributed, asynchronous, and message-based; in particular, the Actor approach to failure handling, self-healing, and supervision fits real-world logic well.

But this advantage demands a change in our programming habits, and most of our current habits of thought are in fact quite different from the real world. Generally speaking, when our habits of thought must change, the resistance always exceeds our imagination.

The CSP Model

Golang supports the CSP model at the language level. One felt difference between the CSP and Actor models is that in CSP the producer (message sender) and the consumer (message receiver) are completely decoupled; the producer does not even know the consumer exists.

In the Actor model, by contrast, the producer must know the consumer, otherwise it has nowhere to send the message.

The CSP model resembles the producer-consumer pattern we met in the multithreading section; the core difference, as I feel it, is that CSP comes with something like green threads.

Golang's green threads are called goroutines (coroutines): very lightweight scheduling units that can be created quickly and occupy very few resources.

Actors require a certain change in how we think, while CSP does not seem so demanding and is more easily accepted by today's developers. People call Golang an engineering language, and its choice of CSP over Actor is one place where that shows.

Diverse World

Besides the message-passing model there are also the event-driven model and the functional model. The event-driven model resembles the observer pattern, with the dependency reversed: in the message model the producer must know the consumer in order to send a message, while in the event-driven model the consumer of an event must know the producer in order to register its event-handling logic.

Akka's consumers can sit across the network, and in concrete event-driven implementations such as Vert.x, consumers can likewise subscribe to events across the network; seen from this angle, the approaches complement one another.
