1. Background

1.1. Amazing performance data
Recently a friend in the industry told me in a private message that, using Netty4 plus Thrift's compact binary codec, his team had implemented cross-node remote service invocation at 10W (100,000) TPS with complex POJO payloads. Compared with their traditional communication framework based on Java serialization + BIO (synchronous blocking I/O), performance improved more than 8-fold.
In fact, this data does not surprise me. Based on my five years of NIO programming experience, it is entirely possible to reach such performance figures by choosing a suitable NIO framework, combining it with a high-performance compact binary codec, and carefully designing the Reactor threading model.
Let's take a look at how Netty supports cross-node remote service invocation at 10W TPS. Before we begin, a brief introduction to Netty.
1.2. Netty basics
Netty is a high-performance, asynchronous, event-driven NIO framework that supports TCP, UDP, and file transfer. As an asynchronous NIO framework, all of Netty's I/O operations are asynchronous and non-blocking; through the Future-Listener mechanism, users can obtain I/O results proactively or be informed of them through notifications.
As the most popular NIO framework, Netty has been widely adopted in the Internet, big data, distributed computing, gaming, and telecommunications fields, and a number of well-known open source components are also built on top of the Netty NIO framework.
2. The Road to Netty High Performance

2.1. Performance model analysis for RPC calls

2.1.1. Three deadly sins behind the poor performance of traditional RPC calls
Network transport problem: traditional RPC frameworks and RMI-based remote service (procedure) calls use synchronous blocking I/O. When client concurrency pressure rises or network latency grows, synchronous blocking I/O causes the I/O threads to block frequently on waits; since the threads cannot work efficiently, I/O processing capacity naturally declines.
Below, we look at the drawbacks of BIO communication through the BIO communication model diagram:
Figure 2-1 BIO Communication Model Diagram
On the server side of the BIO communication model, a separate acceptor thread is usually responsible for listening for client connections. After a connection is accepted, a new thread is created to process that client's request message; once processing completes and the response has been returned to the client, the thread is destroyed. This is the typical one-request-one-reply model. The biggest problem with this architecture is the lack of elastic scalability: as concurrent traffic increases, the number of server threads grows linearly with the number of concurrent connections. Since threads are a very precious resource of the Java virtual machine, once the thread count balloons the system's performance drops sharply, and as concurrency keeps rising, problems such as handle exhaustion and thread stack overflow occur, eventually causing the server to crash.
Serialization problem: Java serialization has several typical issues:
1) Java serialization is an object codec technology internal to Java and cannot be used across languages. For example, when interfacing with heterogeneous systems, a Java-serialized stream would need to be deserialized back into the original object (a copy) by another language, which is currently hard to support;
2) Compared with other open source serialization frameworks, the Java-serialized stream is too large; whether transmitted over the network or persisted to disk, it causes extra resource consumption;
3) Poor serialization performance (high CPU resource consumption).
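To make problem 2) concrete, here is a minimal sketch that measures the size of a JDK-serialized stream. The UserInfo class is a hypothetical POJO for illustration, not from any benchmark in this article:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationSizeDemo {
    // Hypothetical POJO used only for this illustration.
    static class UserInfo implements Serializable {
        private static final long serialVersionUID = 1L;
        String userName = "netty";
        int userId = 100;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new UserInfo());
        }
        // The JDK stream carries class metadata and field names, which is why
        // it is much larger than a compact binary encoding of the same fields.
        System.out.println("JDK serialized size: " + bos.size() + " bytes");
    }
}
```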
Threading model problem: because of synchronous blocking I/O, each TCP connection occupies one thread. Since threads are a very precious JVM resource, when blocked reads and writes prevent threads from being released in time, system performance degrades drastically, and eventually the virtual machine may even fail to create new threads.
2.1.2. Three key elements of high performance
1) Transport: with what kind of channel do the two sides exchange data, BIO, NIO, or AIO? The I/O model largely determines the performance of the framework.
2) Protocol: which communication protocol is used, HTTP or an internal private protocol? Different protocol choices yield different performance models; an internal private protocol can usually be designed to perform better than a public protocol.
3) Threading: how is a datagram read? In which thread is the codec executed after reading? How are the decoded messages dispatched? The Reactor threading model answers these questions, and its impact on performance is also very large.
Figure 2-2 RPC invocation performance three elements
2.2. The Road to Netty High Performance

2.2.1. Asynchronous non-blocking communication
In I/O programming, multi-threading or I/O multiplexing can be used to handle multiple client requests at the same time. I/O multiplexing allows multiple client requests to be processed by a single thread by consolidating the blocking of many I/O operations into blocking on a single select. Compared with the traditional multi-thread/multi-process model, the biggest advantage of I/O multiplexing is low system overhead: the system does not need to create extra processes or threads, nor maintain their execution, which reduces maintenance work and saves system resources.
JDK 1.4 introduced support for non-blocking I/O (NIO), and JDK 1.5_update10 replaced the traditional select/poll with epoll, greatly improving NIO communication performance.
The JDK NIO communication model is as follows:
Figure 2-3 The multiplexed model diagram of NIO
Corresponding to the Socket and ServerSocket classes, NIO provides two socket channel implementations, SocketChannel and ServerSocketChannel. Both of these new channels support blocking and non-blocking modes. Blocking mode is very simple to use, but its performance and reliability are poor; non-blocking mode is exactly the opposite. Developers can generally choose the mode that suits their needs: low-load, low-concurrency applications can use synchronous blocking I/O to reduce programming complexity, while high-load, high-concurrency network applications need to be developed with NIO's non-blocking mode.
Netty's architecture is designed and implemented according to the Reactor pattern. Its server-side communication sequence diagram is as follows:
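As a minimal illustration of the non-blocking mode described above (the port and select timeout are arbitrary choices, not values from this article), a bare JDK NIO accept loop looks roughly like this:

```java
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioAcceptLoop {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));   // port is illustrative
        server.configureBlocking(false);            // non-blocking mode
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select(1000);                  // wait for ready channels
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                }
                // OP_READ handling (decode, dispatch) omitted for brevity
            }
        }
    }
}
```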
Figure 2-3 NIO service-side communication sequence diagram
The client communication sequence diagram is as follows:
Figure 2-4 NIO client communication sequence diagram
Because Netty's I/O thread NioEventLoop aggregates a multiplexer (Selector), it can concurrently handle hundreds or even thousands of client Channels. Since read and write operations are non-blocking, this fully utilizes the I/O thread and avoids thread suspension caused by frequent I/O blocking. In addition, because Netty uses an asynchronous communication mode, one I/O thread can concurrently handle N client connections and their reads and writes, which fundamentally solves the one-connection-one-thread model of traditional synchronous blocking I/O; the architecture's performance, elasticity, and reliability are all greatly improved.
2.2.2. Zero-copy
Many users have heard that Netty has a "zero-copy" feature, but cannot say exactly where it shows up. This section explains Netty's "zero-copy" in detail.
Netty's "zero-copy" is mainly embodied in the following three aspects:
1) Netty's receive and send ByteBuffers use direct buffers: the Socket is read and written using off-heap direct memory, with no need for a second copy of the byte buffer. If traditional heap buffers were used for Socket reads and writes, the JVM would first copy the heap buffer into direct memory and then write it to the Socket; compared with off-heap direct memory, the message goes through one extra buffer copy while being sent.
2) Netty provides a composite buffer object that aggregates multiple ByteBuf objects; the user can operate on the composite buffer as conveniently as on a single buffer, avoiding the traditional way of merging several small buffers into one large buffer through memory copies.
3) Netty's file transfer uses the transferTo method, which sends the data in the file channel directly to the target Channel, avoiding the memory copies incurred by the traditional write-in-a-loop approach.
Below we walk through these three kinds of "zero-copy". First, the creation of Netty's receive buffer:
Figure 2-5 "Zero-copy" in asynchronous message reading
Each time a message is read in the loop, a ByteBuf object is obtained through the ByteBufAllocator's ioBuffer method. Let's look at its interface definition:
Figure 2-6 ByteBufAllocator allocating off-heap memory through ioBuffer
For Socket I/O reads and writes, Netty's ByteBuf allocator creates off-heap memory directly, avoiding the copy from heap memory to direct memory: the second buffer copy is skipped, and read/write performance improves through this "zero-copy".
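Since the code figures are not reproduced here, the following hedged sketch shows the user-visible side of this allocation (the buffer size is arbitrary):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

import java.nio.charset.StandardCharsets;

public class IoBufferDemo {
    public static void main(String[] args) {
        // ioBuffer() prefers a direct (off-heap) buffer when the platform
        // supports it, so socket reads/writes skip the heap-to-direct copy.
        ByteBuf buf = ByteBufAllocator.DEFAULT.ioBuffer(1024);
        try {
            buf.writeBytes("hello".getBytes(StandardCharsets.UTF_8));
            System.out.println("direct buffer: " + buf.isDirect());
        } finally {
            buf.release(); // ByteBuf is reference-counted
        }
    }
}
```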
Next we look at the second "zero-copy" implementation, CompositeByteBuf, which encapsulates multiple ByteBufs into one and exposes a unified ByteBuf interface. Its class definition is as follows:
Figure 2-7 CompositeByteBuf class inheritance relationship
From the inheritance relationship we can see that CompositeByteBuf is actually a ByteBuf wrapper: it combines multiple ByteBufs into one collection and presents them through the unified ByteBuf interface. Its definition is as follows:
Figure 2-8 CompositeByteBuf class definition
Adding a ByteBuf requires no memory copy; the relevant code is as follows:
Figure 2-9 "Zero-copy" addition of a ByteBuf
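A small usage sketch, assuming Netty 4.1's addComponents(boolean, ...) overload, shows that components are aggregated by reference rather than copied:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

import java.nio.charset.StandardCharsets;

public class CompositeByteBufDemo {
    public static void main(String[] args) {
        ByteBuf header = Unpooled.copiedBuffer("HEADER", StandardCharsets.UTF_8);
        ByteBuf body = Unpooled.copiedBuffer("BODY", StandardCharsets.UTF_8);

        // addComponents(true, ...) advances the writer index; the two buffers
        // are aggregated by reference, with no copy into a larger buffer.
        CompositeByteBuf composite = Unpooled.compositeBuffer();
        composite.addComponents(true, header, body);

        System.out.println(composite.toString(StandardCharsets.UTF_8)); // HEADERBODY
        composite.release(); // releases the underlying components as well
    }
}
```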
Finally, let's look at "zero-copy" in file transfer:
Figure 2-10 File transfer "zero-copy"
Netty's file transfer class DefaultFileRegion sends the file to the target Channel through the transferTo method. The key here is FileChannel's transferTo method, whose API doc describes it as follows:
Figure 2-11 File transfer "zero-copy"
On many operating systems it sends the contents of the file buffer directly to the target Channel without an intermediate copy; this is "zero-copy", a more efficient way of transferring files.
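A hedged sketch of the underlying JDK call (the file path and target channel are placeholders supplied by the caller):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

public class TransferToDemo {
    // Sends a whole file to the target channel; the kernel can move the bytes
    // directly, without copying them through a user-space buffer.
    static void sendFile(String path, WritableByteChannel target) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            FileChannel fc = raf.getChannel();
            long position = 0;
            long size = fc.size();
            while (position < size) {
                // transferTo may send fewer bytes than requested, so loop
                position += fc.transferTo(position, size - position, target);
            }
        }
    }
}
```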
2.2.3. Memory Pool
With the evolution of the JVM and JIT compilation technology, object allocation and collection has become a very lightweight job. For buffers the situation is slightly different, however, especially for the allocation and reclamation of off-heap direct memory, which is a time-consuming operation. To reuse buffers as much as possible, Netty provides a buffer reuse mechanism based on a memory pool. Let's look at the implementation of Netty's ByteBuf:
Figure 2-12 Memory-pooled ByteBuf
Netty provides a variety of memory management policies; differentiated customization is possible by configuring the relevant parameters in the bootstrap helper class.
Below, through a performance test, we compare the write performance of a pooled, recycled ByteBuf against an ordinary ByteBuf.
Use case one: create a direct memory buffer using the pooled allocator:
Figure 2-13 Test case for the pooled direct memory buffer
Use case two: create a direct memory buffer using a non-pooled allocator:
Figure 2-14 Test case for the non-pooled direct memory buffer
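The two use cases appear only as figures in the original; a minimal reconstruction, assuming they looked roughly like the following (the 1K buffer size is an assumption; only the 3-million loop count comes from the text below), might be:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.Unpooled;

public class PooledVsUnpooledDemo {
    static final int LOOP = 3_000_000;            // iteration count from the article
    static final byte[] CONTENT = new byte[1024]; // 1K payload is an assumption

    public static void main(String[] args) {
        PooledByteBufAllocator pooled = PooledByteBufAllocator.DEFAULT;
        long start = System.currentTimeMillis();
        for (int i = 0; i < LOOP; i++) {
            ByteBuf buf = pooled.directBuffer(1024);
            buf.writeBytes(CONTENT);
            buf.release(); // returns the buffer to the pool for reuse
        }
        System.out.println("pooled:   " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        for (int i = 0; i < LOOP; i++) {
            ByteBuf buf = Unpooled.directBuffer(1024);
            buf.writeBytes(CONTENT);
            buf.release(); // frees the direct memory every iteration
        }
        System.out.println("unpooled: " + (System.currentTimeMillis() - start) + " ms");
    }
}
```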
With 3 million executions of each, the performance comparison results are as follows:
Figure 2-15 Write performance comparison of pooled and non-pooled buffers
The performance test shows that the pooled ByteBuf performs about 23 times better than the non-pooled ByteBuf (performance data is strongly correlated with the usage scenario).
Let's briefly analyze how the Netty memory pool allocates memory:
Figure 2-16 Buffer allocation in AbstractByteBufAllocator
Continuing to the newDirectBuffer method, we find that it is an abstract method implemented by the subclasses of AbstractByteBufAllocator, with the following code:
Figure 2-17 Different implementations of newDirectBuffer
The code jumps to PooledByteBufAllocator's newDirectBuffer method, which obtains the memory arena PoolArena from the cache and calls its allocate method to allocate memory:
Figure 2-18 Memory allocation in PooledByteBufAllocator
PoolArena's allocate method is as follows:
Figure 2-18 Buffer allocation in PoolArena
We focus on the implementation of newByteBuf, which is likewise an abstract method; its subclasses DirectArena and HeapArena implement allocation for the different buffer types. Since the test case uses off-heap memory, we follow DirectArena.
Figure 2-19 PoolArena's newByteBuf abstract method
So we look at DirectArena's implementation. If use of Sun's Unsafe is not enabled:
Figure 2-20 DirectArena's newByteBuf method implementation
it executes PooledDirectByteBuf's newInstance method, whose code is as follows:
Figure 2-21 PooledDirectByteBuf's newInstance method implementation
It obtains a ByteBuf object through the RECYCLER's get method (a non-pooled implementation would simply create a new ByteBuf object). After the ByteBuf is taken from the buffer pool, AbstractReferenceCountedByteBuf's setRefCnt method is called to set the reference counter, which is used for object reference counting and memory reclamation (similar to the JVM garbage collection mechanism).
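A small sketch of the reference-counting contract as seen from user code:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class RefCountDemo {
    public static void main(String[] args) {
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256);
        System.out.println(buf.refCnt()); // 1: the allocator hands it out with count 1
        buf.retain();                     // 2: a second logical holder
        buf.release();                    // 1: first holder done
        buf.release();                    // 0: buffer is returned to the pool for reuse
    }
}
```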
2.2.4. Efficient Reactor Threading Model
There are three commonly used Reactor threading models, as follows:
1) Reactor single-threading model;
2) Reactor multithreading model;
3) Master-slave Reactor multithreading model.
In the Reactor single-threaded model, all I/O operations are done on the same NIO thread. The responsibilities of that NIO thread are as follows:
1) as a NIO server, receive TCP connections from clients;
2) as a NIO client, initiate a TCP connection to the server;
3) Read request or response messages from the communication peer;
4) Send request or response messages to the communication peer.
The Reactor single-threaded model is as follows:
Figure 2-22 Reactor single-threaded model
Since the Reactor pattern uses asynchronous non-blocking I/O, no I/O operation causes blocking, so in theory one thread can independently handle all I/O-related operations. From an architectural point of view, one NIO thread can indeed do the job: for example, the Acceptor receives the client's TCP connection request; once the link is established, the corresponding ByteBuffer is dispatched to the designated Handler for message decoding; and the user's Handler can send messages back to the client through the same NIO thread.
The single-threaded model can work for some small-capacity scenarios, but it is unsuitable for high-load, high-concurrency applications, mainly for the following reasons:
1) One NIO thread handling hundreds or thousands of links at the same time cannot keep up: even if the NIO thread's CPU load reaches 100%, it cannot satisfy the encoding, decoding, reading, and sending of massive volumes of messages;
2) When the NIO thread is overloaded, processing slows down, causing a large number of client connections to time out; timed-out requests tend to be re-sent, which further increases the NIO thread's load and eventually leads to large message backlogs and processing timeouts, making the NIO thread the system's performance bottleneck;
3) Reliability: once the NIO thread unexpectedly dies or enters an infinite loop, the whole communication module becomes unavailable and can no longer receive or process external messages, causing node failure.
To solve these problems, the Reactor multithreaded model evolved. Let's look at it next.
The biggest difference between the Reactor multithreaded model and the single-threaded model is that a pool of NIO threads handles the I/O operations. Its schematic diagram is as follows:
Figure 2-23 Reactor multithreaded model
Features of the Reactor multithreaded model:
1) There is one dedicated NIO thread, the acceptor thread, that listens on the server and receives clients' TCP connection requests;
2) Network I/O operations (read, write, and so on) are handled by an NIO thread pool, which can be implemented with a standard JDK thread pool consisting of a task queue and N available threads; these threads handle message reading, decoding, encoding, and sending;
3) One NIO thread can handle N links at the same time, but one link corresponds to only one NIO thread, which prevents concurrent-operation problems.
In most scenarios the Reactor multithreaded model meets the performance requirements. In a few special scenarios, however, one NIO thread being responsible for listening for and handling all client connections can become a performance problem: for example, with a million concurrent client connections, or when the server must authenticate clients' handshake messages and the authentication itself is very expensive. In such scenarios a single acceptor thread may fall short, and to solve this problem a third model was produced: the master-slave Reactor multithreaded model.
The main feature of the master-slave Reactor threading model is that the server no longer uses a single NIO thread to accept client connections, but an independent NIO thread pool. After the Acceptor finishes processing a client's TCP connection request (possibly including access authentication and the like), the newly created SocketChannel is registered to an I/O thread of the I/O thread pool (the sub-reactor pool), which takes charge of the SocketChannel's reads, writes, and codec work. The Acceptor thread pool is used only for client login, handshake, and security authentication; once the link is established, it is registered onto an I/O thread of the back-end sub-reactor pool, which performs all subsequent I/O operations.
Its threading model is as follows:
Figure 2-24 Reactor Master-Slave multithreading model
The master-slave NIO threading model solves the problem of one server-side listener thread being unable to effectively handle all client connections. Netty's official demos therefore recommend this threading model.
In fact, Netty's threading model is not fixed: by creating different EventLoopGroup instances in the bootstrap helper class and configuring the appropriate parameters, all three Reactor threading models above can be supported. It is precisely because Netty provides this flexible customization of the Reactor threading model that it can meet the performance demands of different business scenarios.
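For example, here is a hedged sketch of the master-slave configuration through ServerBootstrap (the port and thread counts are illustrative). Passing a single one-thread group for both roles approximates the single-threaded model; one shared multithreaded group gives the multithreaded model:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class MasterSlaveReactorServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);  // main reactor: accepts connections
        EventLoopGroup workerGroup = new NioEventLoopGroup(); // sub reactor: handles I/O
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(bossGroup, workerGroup)
             .channel(NioServerSocketChannel.class)
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     // business codec and handlers would be added here
                 }
             });
            ChannelFuture f = b.bind(8080).sync(); // port is illustrative
            f.channel().closeFuture().sync();
        } finally {
            bossGroup.shutdownGracefully();
            workerGroup.shutdownGracefully();
        }
    }
}
```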
2.2.5. Lock-free serial design
In most scenarios, parallel multithreading improves a system's concurrent performance. However, if concurrent access to shared resources is handled improperly, severe lock contention arises, which ultimately degrades performance. To avoid the cost of lock contention as much as possible, one can use a serial design: the processing of a message is completed within the same thread whenever possible, with no thread switching, so multithreaded contention and synchronization locks are avoided.
To maximize performance, Netty adopts a serial lock-free design, performing serial operations inside the I/O thread to avoid the performance degradation caused by multithreaded contention. On the surface, this serial design appears to have low CPU utilization and insufficient concurrency. However, by adjusting the thread parameters of the NIO thread pool, multiple serialized threads can run in parallel at the same time; this locally lock-free serial thread design performs better than the one-queue-many-workers model.
The working principle of Netty's serial design is as follows:
Figure 2-25 Working principle of Netty's serial design
After Netty's NioEventLoop reads a message, it directly calls ChannelPipeline's fireChannelRead(Object msg). As long as the user does not actively switch threads, the NioEventLoop keeps calling straight through to the user's Handler, with no thread switching in between. This serialized approach avoids the lock contention brought by multithreaded operation and is optimal from a performance standpoint.
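A sketch of the idiom this relies on (the channel and msg names are placeholders): work originating outside the I/O thread is queued onto the channel's own EventLoop rather than synchronized with locks:

```java
import io.netty.channel.Channel;

public class SerializedWriteDemo {
    static void write(Channel channel, Object msg) {
        if (channel.eventLoop().inEventLoop()) {
            // Already on this channel's I/O thread: run inline, no locks needed
            channel.writeAndFlush(msg);
        } else {
            // Hop onto the channel's EventLoop so handler code stays single-threaded
            channel.eventLoop().execute(() -> channel.writeAndFlush(msg));
        }
    }
}
```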
2.2.6. Efficient concurrent Programming
Netty's efficient concurrent programming is mainly reflected in the following points:
1) Extensive and correct use of volatile;
2) wide use of CAS and atomic classes;
3) Use of thread-safe containers;
4) Improve concurrency performance through read-write locks.
If you want to know the details of Netty's efficient concurrent programming, you can read "Analysis of multithreaded concurrent programming in Netty", which I shared earlier on Weibo; it introduces and analyzes Netty's multithreading techniques and their application in detail.
2.2.7. High-performance serialization framework
The key factors that affect serialization performance are summarized below:
1) The size of the code stream after serialization (network bandwidth occupancy);
2) Serialization & Deserialization performance (CPU resource consumption);
3) Whether cross-language is supported (for heterogeneous-system integration and development-language migration).
Netty provides support for Google Protobuf by default; by extending Netty's codec interface, users can plug in other high-performance serialization frameworks, such as Thrift's compact binary codec framework.
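As a sketch of wiring the built-in Protobuf codecs into a pipeline (the caller would pass the default instance of a protobuf-generated message class, for example a hypothetical MyRequest.getDefaultInstance()):

```java
import com.google.protobuf.MessageLite;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.protobuf.ProtobufDecoder;
import io.netty.handler.codec.protobuf.ProtobufEncoder;
import io.netty.handler.codec.protobuf.ProtobufVarint32FrameDecoder;
import io.netty.handler.codec.protobuf.ProtobufVarint32LengthFieldPrepender;

// Wires Netty's built-in Protobuf codecs into a channel pipeline. The prototype
// comes from a protobuf-generated message class, e.g. MyRequest.getDefaultInstance()
// (MyRequest is hypothetical, not a class from this article).
public class ProtobufPipelineInitializer extends ChannelInitializer<SocketChannel> {
    private final MessageLite prototype;

    public ProtobufPipelineInitializer(MessageLite prototype) {
        this.prototype = prototype;
    }

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
          .addLast(new ProtobufVarint32FrameDecoder())         // inbound: split frames by varint length
          .addLast(new ProtobufDecoder(prototype))             // inbound: bytes -> message
          .addLast(new ProtobufVarint32LengthFieldPrepender()) // outbound: prepend length
          .addLast(new ProtobufEncoder());                     // outbound: message -> bytes
        // business handlers would be added after the codecs
    }
}
```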
Let's compare the sizes of the byte arrays produced when different serialization frameworks serialize the same object:
Figure 2-26 Comparison of serialized stream sizes across serialization frameworks
As can be seen, the Protobuf-serialized stream is only about 1/4 the size of Java serialization's. It is precisely because native Java serialization performs so poorly that a variety of high-performance open source serialization technologies and frameworks have emerged (poor performance is only one of the reasons; cross-language support, IDL definitions, and other factors also matter).
2.2.8. Flexible TCP parameter configuration capabilities
Setting TCP parameters reasonably can noticeably improve performance in certain scenarios, for example SO_RCVBUF and SO_SNDBUF; set incorrectly, their impact on performance can be very large. Below we summarize several configuration items with a large performance impact:
1) SO_RCVBUF and SO_SNDBUF: 128K or 256K are commonly recommended values;
2) TCP_NODELAY: the Nagle algorithm improves network efficiency by automatically coalescing small packets in the buffer into larger packets, reducing the congestion caused by sending large numbers of small packets, but for latency-sensitive application scenarios this optimization algorithm needs to be turned off;
3) Soft interrupts: if the Linux kernel supports RPS (version 2.6.35 or above), enabling RPS can balance soft interrupts and improve network throughput. RPS computes a hash from a packet's source address, destination address, and source and destination ports, then chooses the CPU that runs the soft interrupt according to this hash; viewed from the upper layers, each connection is bound to a CPU, and the hash balances soft interrupts across multiple CPUs, improving parallel network processing performance.
Netty can flexibly configure TCP parameters in the bootstrap helper class to satisfy different user scenarios. The relevant configuration interface is defined as follows:
Figure 2-27 TCP parameter configuration definition for Netty
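Since the figure is not reproduced here, a hedged sketch of setting such options on the bootstrap (the values are illustrative, not recommendations from this article):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;

public class TcpOptionsDemo {
    // Values are illustrative placeholders.
    static void configure(ServerBootstrap b) {
        b.option(ChannelOption.SO_BACKLOG, 1024)            // server socket: accept queue length
         .childOption(ChannelOption.SO_RCVBUF, 128 * 1024)  // per-connection receive buffer
         .childOption(ChannelOption.SO_SNDBUF, 128 * 1024)  // per-connection send buffer
         .childOption(ChannelOption.TCP_NODELAY, true);     // disable Nagle for latency-sensitive traffic
    }
}
```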
2.3. Summary
Through the analysis of Netty's architecture and performance model, we find that Netty's high performance comes from careful architectural design and a high-quality implementation. Thanks to that, building a cross-node service invocation supporting 10W TPS on top of Netty is really not that difficult.
This article is part of the Netty series: The Road to Netty High Performance.