The Design of Message Middleware Based on ZeroMQ


This article is mainly about how a popular messaging layer, ØMQ, is designed and implemented, and what lessons can be drawn from that work.

ØMQ is a messaging system, or, if you like, you can call it "message-oriented middleware." It is used in a variety of environments such as financial services, game development, embedded systems, academic research and aerospace.

Messaging systems basically work like instant messaging for applications. An application decides to deliver an event to another application (or to several applications), assembles the data to be sent, hits the "send" button, and the messaging system takes care of the rest. Unlike instant messaging, though, a messaging system has no GUI, and there is no human at the endpoints who can intervene intelligently when something goes wrong. Messaging systems therefore have to be fault-tolerant and much faster than ordinary instant message delivery.

ØMQ was originally conceived as a fast messaging system for stock trading, so the focus was on extreme optimization. The first year of the project was spent devising a benchmarking methodology and trying to define an architecture that was as efficient as possible. Later, roughly in the second year of development, the focus shifted to providing a generic system for building distributed applications: supporting arbitrary messaging patterns, multiple transport mechanisms, arbitrary language bindings, and so on. During the third year, the focus was mainly on improving usability and flattening the learning curve: we adopted the BSD socket API, tried to clean up the semantics of individual messaging patterns, and so on. This article looks at how these three goals translated into the internal architecture of ØMQ, and offers some tips and tricks for those trying to solve similar problems.

Since the third year the ØMQ code base has grown rather large, so there is an initiative to standardize the wire protocols it uses and to experiment with a ØMQ-like messaging system implemented inside the Linux kernel. Those topics are not covered here, but more detailed material is available online.

Application vs. Library

ØMQ is a messaging library, not a messaging server. We had spent several years working on the AMQP protocol, a wire protocol with which the financial industry tried to standardize enterprise messaging, wrote a reference implementation of it, and took part in several large messaging-based projects. Eventually we realized that there is a problem with the classic client/server model of a smart messaging server (broker) and dumb messaging clients.

Our primary concern was performance: if there is a server in the middle, each message has to cross the network twice (from the sender to the broker, and from the broker to the receiver), which costs both latency and throughput. Moreover, if all messages pass through the broker, the server is bound to become a bottleneck at some point.

A secondary concern was large-scale deployment: the concept of a central authority managing the whole message flow no longer applies when deployment crosses organizational boundaries, such as different companies. Because of trade secrets and legal liability, no company is willing to hand control over to a server run by a different company. In practice, each company ends up running its own messaging server, connected to the other companies' systems by bridges. The whole system is thus heavily fragmented, and maintaining a large number of bridges for every company involved does not improve the situation. To solve this problem we need a fully distributed architecture, one in which every component can be controlled by a different business entity. Given that the unit of management in a server-based architecture is the server, we can address the problem by installing a separate server for each component; we can then optimize the design further by letting the server and the component share the same process. What we end up with is a messaging library.

When ØMQ started, we had an idea of how to make messaging work without a central server. It required inverting the whole concept of messaging: instead of an autonomous, centralized store of messages in the middle of the network, a "smart endpoint, dumb network" architecture based on the end-to-end principle. The technical consequence of that decision was that ØMQ would, from the very beginning, be a library, not an application.

We have been able to demonstrate that this architecture is both more efficient (lower latency, higher throughput) and more flexible than the standard approach (it is easy to build arbitrarily complex topologies instead of being limited to the classic hub-and-spoke model).

One of the unexpected results is that choosing the library model improved the usability of the product. Time and again, users are happy that they do not have to install and manage a separate messaging server. It turns out that having no server is preferred because it lowers operating costs (no need for a messaging-server administrator) and shortens time to market (no need to negotiate with the customer, or with the management and operations teams, about whether to run a server). The lesson learned is that when you start a new project, you should choose the library design if at all possible. It is easy to build an application from a library by writing a trivial program that calls it; however, it is almost impossible to create a library from an existing executable. The library model gives users far more flexibility and saves them unnecessary administrative work.

Global State

Global variables do not play well with libraries. A library may be loaded multiple times in a process, yet even then there is only a single set of global variables. Figure 1 shows the case of a ØMQ library being used from two different, independent libraries, with the application in turn using both of those libraries.

Figure 1: ØMQ being used by two different independent libraries

When this happens, both instances of ØMQ access the same variables, resulting in race conditions, strange failures, and undefined behavior. To prevent this problem, the ØMQ library has no global variables. Instead, the user of the library is responsible for explicitly creating the global state. The object that contains the global state is called the context. From the user's point of view the context looks more or less like a pool of worker threads, but from ØMQ's point of view it is just an object that stores whatever global state we happen to need. In the figure above, libA would have its own context and libB would have its own context as well. There is no way for one of them to destroy or corrupt the other.
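As a minimal sketch of the context idea, here is how two independent components can each create and tear down their own context using the public libzmq C API (the two functions stand in for the hypothetical libA and libB from the figure):

```cpp
// Minimal sketch: two independent components, each with its own ØMQ context.
// Uses the public libzmq C API; the names of the two "libraries" are hypothetical.
#include <zmq.h>

void liba_init_and_shutdown() {
    void *ctx = zmq_ctx_new();           // libA's private global state
    void *sock = zmq_socket(ctx, ZMQ_PUSH);
    // ... libA uses its socket here ...
    zmq_close(sock);
    zmq_ctx_term(ctx);                   // tearing this down cannot affect libB
}

void libb_init_and_shutdown() {
    void *ctx = zmq_ctx_new();           // libB's private global state
    void *sock = zmq_socket(ctx, ZMQ_SUB);
    // ... libB uses its socket here ...
    zmq_close(sock);
    zmq_ctx_term(ctx);
}
```

The two contexts share nothing: sockets, worker threads, and queues all hang off the context that created them.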

The lesson here is obvious: do not use global state in libraries. If you do, the library is likely to break when it happens to be instantiated twice in the same process.

Performance

When the ØMQ project started, its primary goal was to optimize performance. The performance of a messaging system is expressed with two metrics: throughput (how many messages can be passed in a given amount of time) and latency (how long it takes a message to get from one endpoint to the other).

Which metric should we focus on? What is the relationship between the two? Isn't it obvious? Run a test, divide the overall time of the test by the number of messages passed, and you get latency. Divide the number of messages by the time, and you get throughput. In other words, latency would seem to be simply the inverse of throughput. Simple, right?

Instead of starting to code straight away, we spent a few weeks investigating the performance metrics in detail, and we found that the relationship between throughput and latency is far subtler than that, and often counterintuitive.

Imagine A sending messages to B (see Figure 2). The overall time of the test is 6 seconds, and 5 messages are passed. Therefore the throughput is 0.83 messages per second (5/6) and the latency is 1.2 seconds (6/5), right?

Figure 2: Sending messages from A to B

Look at Figure 2 again. Each message takes a different time to travel from A to B: 2 seconds, 2.5 seconds, 3 seconds, 3.5 seconds, 4 seconds. The average is 3 seconds, which is quite different from the 1.2 seconds we calculated above. This example shows how easily intuition misleads us about performance metrics.

Now let's look at throughput. The overall time of the test is 6 seconds. However, it takes A only 2 seconds to send out all the messages, so from A's point of view the throughput is 2.5 msgs/sec (5/2). It takes B 4 seconds to receive all the messages, so from B's point of view the throughput is 1.25 msgs/sec (5/4). Neither of these numbers matches our original calculation of 0.83 msgs/sec.
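To make the arithmetic concrete, here is a tiny sketch that recomputes these numbers from illustrative timestamps consistent with Figure 2 (assumed values: A sends at 0, 0.5, 1, 1.5 and 2 seconds; B receives at 2, 3, 4, 5 and 6 seconds):

```cpp
// Worked example of the metrics discussed above. The timestamps are
// illustrative values chosen to match the per-message latencies in the text.
#include <cstdio>

int main() {
    const double sent[]     = {0.0, 0.5, 1.0, 1.5, 2.0};   // when A sent each message
    const double received[] = {2.0, 3.0, 4.0, 5.0, 6.0};   // when B received it
    const int n = 5;

    double latency_sum = 0.0;
    for (int i = 0; i < n; ++i)
        latency_sum += received[i] - sent[i];

    std::printf("average latency    : %.2f s\n", latency_sum / n);                       // 3.00
    std::printf("throughput at A    : %.2f msg/s\n", n / (sent[n - 1] - sent[0]));        // 2.50
    std::printf("throughput at B    : %.2f msg/s\n", n / (received[n - 1] - received[0])); // 1.25
    std::printf("naive whole-test   : %.2f msg/s\n", n / (received[n - 1] - sent[0]));    // 0.83
    return 0;
}
```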

Long story short: latency and throughput are two different metrics; that much is obvious. What matters is understanding the difference between the two and their relationship. Latency can only be measured between two different points in the system; there is no such thing as latency at point A alone. Each message has its own latency, and you can average the latencies of multiple messages, but there is no such thing as the latency of a stream of messages.

Throughput, on the other hand, can only be measured at a single point in the system. The sender has a throughput, the receiver has a throughput, and any intermediate point between the two has a throughput, but there is no overall throughput of the whole system. And throughput only makes sense for a set of messages; there is no such thing as the throughput of a single message.

As for the relationship between throughput and latency, it turns out there really is one; however, the formula involves integrals and we will not discuss it here. For more information, read the literature on queuing theory. There are many more pitfalls in benchmarking messaging systems, and we will not go further into them here. The lesson to take away is: make sure you understand the problem you are solving. Even a problem as simple as "make the program faster" takes a lot of work to understand correctly. More importantly, if you do not understand the problem, you are likely to build implicit assumptions and popular myths into your code, making the solution flawed, or at least much more complex or less useful than it could have been.
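As an aside that is not part of the original ØMQ material: the simplest steady-state link between the two metrics is Little's law from queuing theory, L = λ × W, where λ is the throughput, W is the average latency, and L is the average number of messages queued inside the system. It already hints at why letting queues grow directly inflates latency, a theme we return to in the batching section below.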

Critical Path

During optimization we found three factors that have a critical impact on performance: the number of memory allocations, the number of system calls, and the concurrency model.

However, not every memory allocation or every system call has the same effect on performance. The performance we are interested in for a messaging system is the number of messages we can transfer between two endpoints in a given amount of time, or alternatively, how long it takes a message to get from one endpoint to the other.

However, since ØMQ is designed for scenarios with long-lived connections, the time needed to establish a connection or to handle connection errors is largely irrelevant. These events happen rarely, so their impact on overall performance is negligible.

The part of a code base that is used very frequently, over and over again, is called the critical path; optimization should focus on the critical path.

Let's look at an example: ØMQ is not heavily optimized with respect to memory allocation in general. For example, when manipulating strings it often allocates a new string for each intermediate phase of the transformation. However, if we look strictly at the critical path (the actual message delivery), we find that it uses almost no memory allocation. If messages are small, there is only one memory allocation per 256 messages (the messages are held in one large allocated memory block). In addition, if the message flow is steady, without dramatic traffic spikes, the number of memory allocations on the critical path drops to zero (the allocated memory blocks are not returned to the system, but reused over and over again).
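The idea can be sketched roughly as follows (illustrative code, not ØMQ's actual allocator): messages are stored in blocks of 256, so the allocator is hit only once per 256 messages, and a fully drained block is kept as a spare for reuse instead of being freed.

```cpp
// Illustrative sketch of chunked allocation for small messages (not ØMQ's
// actual code). One allocation serves N messages; a drained chunk is kept
// as a spare instead of being returned to the system. Single-threaded here;
// the caller must not pop from an empty store. Destructor omitted for brevity.
#include <cstddef>

template <typename T, size_t N = 256>
class ChunkedStore {
    struct Chunk { T items[N]; Chunk* next; };

    Chunk* back_      = new Chunk;   // chunk currently being filled
    size_t back_pos_  = 0;
    Chunk* front_     = back_;       // chunk currently being drained
    size_t front_pos_ = 0;
    Chunk* spare_     = nullptr;     // recycled chunk, never freed

public:
    void push(const T& msg) {
        back_->items[back_pos_++] = msg;
        if (back_pos_ == N) {                      // one allocation per N messages
            back_->next = spare_ ? spare_ : new Chunk;
            spare_ = nullptr;
            back_ = back_->next;
            back_pos_ = 0;
        }
    }

    T pop() {
        T msg = front_->items[front_pos_++];
        if (front_pos_ == N) {                     // chunk fully drained: recycle it
            Chunk* drained = front_;
            front_ = front_->next;
            front_pos_ = 0;
            delete spare_;                         // keep at most one spare chunk
            spare_ = drained;
        }
        return msg;
    }
};
```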

Lesson learned: optimize where it makes a significant difference. Optimizing code that is not on the critical path is wasted effort.

Allocating Memory

Assuming that all the infrastructure has been initialized and the connection between the two endpoints has been established, only one thing needs to be allocated when sending a message: the message itself. Therefore, to optimize the critical path, we have to look at how memory is allocated for messages and how they are passed up and down the stack.

The common wisdom in high-performance networking is that optimal performance is achieved by carefully balancing the cost of allocating memory for a message against the cost of copying the message (for example, by treating small, medium, and large messages differently). For small messages, copying is much cheaper than allocating memory: it makes sense not to allocate new memory blocks at all and instead to copy the message into pre-allocated storage whenever needed. For large messages, on the other hand, copying is much more expensive than memory allocation: it makes sense to allocate the message once and pass a pointer to the allocated block around instead of copying the data. This approach is called "zero-copy".

ØMQ handles both cases transparently. A ØMQ message is represented by an opaque handle. The content of a very small message is encoded directly in the handle, so copying the handle actually copies the message data. When a message is larger, it is allocated in a separate buffer and the handle contains only a pointer to that buffer. Making a copy of the handle then does not copy the message data at all, which makes sense when the message is megabytes long (Figure 3). Note that in the latter case the buffer is reference-counted, so that it can be referenced by multiple handles without ever copying the data.
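A simplified sketch of the two representations might look as follows (illustrative only; the real zmq_msg_t has additional machinery, and the 29-byte inline threshold here is an assumption for the example): very small payloads live inline in the handle, larger payloads live in a shared, reference-counted buffer.

```cpp
// Simplified sketch of the two message representations described above.
// Small payloads are stored inline in the handle; large payloads live in a
// reference-counted buffer, so copying the handle never copies megabytes.
#include <atomic>
#include <cstddef>
#include <cstring>
#include <new>

class Message {
    static const size_t kMaxInline = 29;          // illustrative "very small" threshold

    struct SharedBuffer {
        std::atomic<int> refs{1};
        size_t size;
        unsigned char data[1];                    // allocated with extra room for the payload
    };

    bool inline_;
    size_t size_;
    unsigned char small_[kMaxInline];
    SharedBuffer* big_ = nullptr;

public:
    Message(const void* src, size_t n) : inline_(n <= kMaxInline), size_(n) {
        if (inline_) {
            std::memcpy(small_, src, n);          // copying beats allocating for tiny payloads
        } else {
            big_ = static_cast<SharedBuffer*>(operator new(sizeof(SharedBuffer) + n));
            new (big_) SharedBuffer();
            big_->size = n;
            std::memcpy(big_->data, src, n);      // allocated and copied exactly once
        }
    }

    Message(const Message& other)                 // "copying" a large message only bumps a counter
        : inline_(other.inline_), size_(other.size_), big_(other.big_) {
        if (inline_)
            std::memcpy(small_, other.small_, size_);
        else
            big_->refs.fetch_add(1, std::memory_order_relaxed);
    }

    Message& operator=(const Message&) = delete;  // assignment omitted in this sketch

    ~Message() {
        if (!inline_ && big_->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            big_->~SharedBuffer();
            operator delete(big_);
        }
    }
};
```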

Figure 3: Message copying (or lack thereof)

Lesson: when thinking about performance, do not assume there is a single best solution. It may well be that the problem has several subclasses (for example, small messages vs. large messages), each with its own optimal algorithm.

Batching

As already mentioned, the sheer number of system calls in a messaging system can become a performance bottleneck. In fact, the problem is more general than that: there is a non-trivial performance penalty whenever the call stack is traversed, so when creating high-performance applications it is wise to avoid traversing the stack more often than necessary.

Consider Figure 4. To send four messages, you have to traverse the entire network stack four times (ØMQ, glibc, the user/kernel space boundary, the TCP implementation, the IP implementation, the Ethernet layer, the NIC itself, and back up the stack again).

Figure 4: Sending four messages

However, if you decide to join those messages into a single batch, the stack is traversed only once (Figure 5). The impact on message throughput can be dramatic: up to two orders of magnitude, especially when the messages are small and hundreds of them can be packed into a single batch.

Figure 5: Batching messages

On the other hand, batching can have a negative impact on latency. Take, for example, the well-known Nagle algorithm implemented in TCP. It delays outbound data for a certain amount of time and merges all the accumulated data into a single packet. Obviously, the end-to-end latency of the first message in the packet is much worse than the latency of the last one. Thus it is common for applications that need consistently low latency to switch the Nagle algorithm off. Batching is often switched off at other levels of the stack as well (for example, the NIC's interrupt coalescing feature). But no batching at all means a lot of stack traversals and results in low message throughput. We seem to be caught in a trade-off between throughput and latency.
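For reference, switching Nagle off is a single socket option on POSIX systems (this is the standard BSD sockets call, not ØMQ-specific code):

```cpp
// Disabling the Nagle algorithm on a connected TCP socket (POSIX sockets API).
// Latency-sensitive applications commonly do this; it trades some throughput
// for lower per-message delay, as discussed above.
#include <netinet/in.h>
#include <netinet/tcp.h>   // TCP_NODELAY
#include <sys/socket.h>

int disable_nagle(int fd) {
    int flag = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}
```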

ØMQ tries to deliver both consistently low latency and high throughput by using the following strategy: when the message flow is sparse and does not exceed the bandwidth of the network stack, ØMQ turns off all batching to improve latency. The trade-off is somewhat higher CPU usage (we still have to traverse the stack frequently), but that is not considered a problem in most cases.

When the message rate exceeds the bandwidth of the network stack, messages have to be queued (stored in memory) until the stack is ready to accept them. Queuing means that latency grows: if a message spends one second in the queue, the end-to-end latency is at least one second. Worse, as the queue grows, latency increases progressively. If the queue size is unbounded, latency can exceed any limit.

It has been observed that even when the network stack is tuned for the lowest possible latency (Nagle's algorithm switched off, NIC interrupt coalescing switched off, and so on), latency can still be dismal because of the queuing effect described above.

In such situations it makes sense to start batching aggressively. There is nothing to lose, because latency is already high anyway. On the other hand, aggressive batching improves throughput and can empty the queue of pending messages, which in turn means that latency gradually drops as the queuing delay decreases. Once there are no pending messages left in the queue, batching can be turned off again to improve latency even further. One additional observation is that batching should only be done at the topmost level: if messages are batched there, the lower layers have nothing to batch anyway, and any batching algorithm below that level only introduces additional latency.
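A schematic sketch of that policy (illustrative only, not ØMQ's actual implementation): messages are sent one by one while the network keeps up, and the whole backlog is sent as a single batch as soon as messages start to queue.

```cpp
// Schematic sketch of the adaptive batching policy described above.
// While the network keeps up, each message is written individually for
// minimal latency; once a backlog forms, the backlog is written as one batch.
#include <deque>
#include <string>
#include <vector>

struct Network {
    bool busy = false;                               // toy stand-in for backpressure
    std::vector<std::string> wire;                   // what went "on the wire"
    bool ready_to_send() const { return !busy; }
    void send(const std::vector<std::string>& batch) {
        wire.insert(wire.end(), batch.begin(), batch.end());   // one stack traversal
    }
};

class Sender {
    std::deque<std::string> backlog_;
    Network net_;

public:
    void submit(std::string msg) {
        if (backlog_.empty() && net_.ready_to_send()) {
            net_.send({msg});                        // fast path: no batching, low latency
            return;
        }
        backlog_.push_back(std::move(msg));          // stack is busy: start queuing
    }

    // Called when the network stack signals it can accept more data.
    void on_ready() {
        if (backlog_.empty())
            return;
        std::vector<std::string> batch(backlog_.begin(), backlog_.end());
        backlog_.clear();
        net_.send(batch);                            // latency is lost anyway; recover throughput
    }
};
```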

Lesson: to get optimal throughput combined with optimal response time in an asynchronous system, turn off the batching algorithms at the low levels of the stack and batch at the topmost level. Batch only when new data is arriving faster than it can be processed.

Architecture Overview

So far we have focused on the generic principles that make ØMQ fast. Now let's look at the actual architecture of the system (Figure 6).

Figure 6: ØMQ architecture

The user interacts with ØMQ using so-called "sockets". They are very similar to TCP sockets, the main difference being that each socket can handle communication with multiple peers, a bit like unbound UDP sockets do.
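For example, using the public libzmq C API, a single socket can be connected to several peers, and a single send call is then routed according to the socket's messaging pattern (the endpoint addresses below are placeholders):

```cpp
// A single ØMQ socket talking to multiple peers (libzmq C API). The endpoint
// addresses are placeholders. With a PUB socket, each zmq_send() is delivered
// to every connected subscriber (subject to their subscriptions).
#include <zmq.h>
#include <cstring>

int main() {
    void *ctx = zmq_ctx_new();
    void *pub = zmq_socket(ctx, ZMQ_PUB);

    zmq_connect(pub, "tcp://peer-a.example.com:5555");   // one socket,
    zmq_connect(pub, "tcp://peer-b.example.com:5555");   // several peers

    const char *msg = "status update";
    zmq_send(pub, msg, std::strlen(msg), 0);

    zmq_close(pub);
    zmq_ctx_term(ctx);
    return 0;
}
```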

The socket object lives in the user's thread (see the discussion of the threading model in the next section). In addition, ØMQ runs multiple worker threads that handle the asynchronous parts of communication: reading data from the network, enqueuing messages, accepting incoming connections, and so on.

Various objects live in the worker threads. Each object is owned by exactly one parent object (ownership is denoted by a simple solid line in the diagram). The parent can live in a different thread than the child. Most objects are owned directly by sockets; however, in a couple of cases an object is owned by an object which is itself owned by a socket. What we get is a tree of objects, with one such tree per socket. The tree is used during shutdown: no object can shut itself down until it has closed all of its children. This way we can ensure that the shutdown process works as expected; for example, pending outbound messages are pushed to the network before the sending process is terminated.

Roughly speaking, there are two kinds of asynchronous objects: objects that are not involved in message passing and objects that are. The former mostly do connection management. For example, a TCP listener object listens for incoming TCP connections and creates an engine/session object for each new connection. Similarly, a TCP connector object tries to connect to a TCP peer and, when it succeeds, creates an engine/session object to manage the connection. If the connection fails, the connector object tries to re-establish it.

The latter are the objects that handle the data transfer itself. They consist of two parts: the session object is responsible for interacting with the ØMQ socket, and the engine object is responsible for communicating with the network. There is only one kind of session object, but there is a different engine type for each underlying protocol ØMQ supports. Thus we have a TCP engine, an IPC (inter-process communication) engine, a PGM engine (a reliable multicast protocol, see RFC 3208), and so on. The set of engines is extensible (in the future we might, for example, add a WebSocket engine or an SCTP engine).

Sessions exchange messages with sockets. Messages travel in both directions, and each direction is handled by a pipe object. Each pipe is essentially an optimized lock-free queue used to pass messages quickly between threads.

Finally, there is the context object (discussed in an earlier section but not shown in the diagram), which holds the global state and is accessible by all the sockets and all the asynchronous objects.

Concurrency Model

One of the requirements of ØMQ was to take advantage of multi-core machines; in other words, to scale throughput linearly with the number of available CPU cores. Our previous experience with messaging systems showed that using multiple threads in the classic way (critical sections, semaphores, and so on) does not yield much of a performance improvement. In fact, a multithreaded version of a messaging system can be slower than a single-threaded one, even when measured on a multi-core box. The individual threads spend too much time waiting for each other and, in the process, trigger a large number of context switches that slow the system down. Given these problems, we decided to go for a different model. The goal was to avoid locking entirely and let each thread run at full speed. Communication between threads would be provided by asynchronous messages (events) passed between them. This is the classic actor model.

The idea is to launch one worker thread per CPU core (having two threads share the same core only means more context switches with no particular advantage). Each internal ØMQ object, such as a TCP engine, is tied to a particular worker thread. That in turn means there is no need for critical sections, mutexes, semaphores, and the like. Additionally, these ØMQ objects are never migrated between CPU cores, which avoids the negative performance impact of cache pollution (Figure 7).
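A minimal sketch of this pattern is shown below (illustrative, not ØMQ's internal code; the real implementation posts commands over the lock-free pipes described later rather than over a mutex-protected mailbox):

```cpp
// Minimal sketch of the actor-style threading model described above. Each
// object is bound to one worker thread; other threads never touch it
// directly, they only post commands (closures here) to that worker's mailbox.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class Worker {
    std::queue<std::function<void()>> mailbox_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread thread_{[this] { run(); }};

    void run() {
        for (;;) {
            std::function<void()> cmd;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stop_ || !mailbox_.empty(); });
                if (stop_ && mailbox_.empty()) return;
                cmd = std::move(mailbox_.front());
                mailbox_.pop();
            }
            cmd();                                   // objects owned by this worker run here
        }
    }

public:
    void post(std::function<void()> cmd) {           // the only way in from other threads
        { std::lock_guard<std::mutex> lock(m_); mailbox_.push(std::move(cmd)); }
        cv_.notify_one();
    }

    ~Worker() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        cv_.notify_one();
        thread_.join();
    }
};
```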

Figure 7: Multiple worker threads

This design makes a lot of traditional multithreading problems disappear. Nevertheless, the worker threads have to be shared among many objects, which in turn means there has to be some kind of cooperative multitasking. This means we need a scheduler; that objects have to be event-driven rather than in control of the whole event loop; that we have to take care of arbitrary sequences of events, even very rare ones; that we have to make sure no object holds the CPU for too long; and so on.

In short, the whole system has to be fully asynchronous. No object can afford to do a blocking operation, because it would block not only itself but also all the other objects sharing the same worker thread. All objects have to become state machines, whether explicitly or implicitly. With hundreds or thousands of state machines running in parallel, you have to take care of all the possible interactions between them, and, most importantly, of the shutdown process.

It turns out that shutting down a fully asynchronous system in a clean way is a dauntingly complex task. Trying to shut down a thousand moving parts, some of them working, some idle, some in the process of starting up, and some already shutting themselves down, is prone to all kinds of race conditions, resource leaks, and the like. The shutdown subsystem is definitely the most complex part of ØMQ. A quick check of the bug tracker indicates that some 30%-50% of reported bugs are related to shutdown in one way or another.

Lesson learned: when striving for extreme performance and scalability, consider the actor model; it is almost the only game in town for such cases. However, if you are not using a specialized system like Erlang or ØMQ itself, you will have to write and debug a large amount of infrastructure by hand. Also, think about the shutdown procedure from the very beginning. It is going to be the most complex part of the codebase, and if you have no clear idea of how to implement it, you should probably reconsider using the actor model in the first place.

Lock-free Algorithms

Lock-free algorithms have been in vogue lately. They are simple mechanisms for inter-thread communication that do not rely on synchronization primitives provided by the kernel, such as mutexes and semaphores; instead they synchronize using atomic CPU operations, such as atomic compare-and-swap (CAS). It should be understood that they are not literally lock-free; instead, locking is done behind the scenes at the hardware level.

ØMQ uses lock-free queues in its pipe objects to pass messages between the user's threads and ØMQ's worker threads. There are two interesting aspects to how ØMQ uses lock-free queues.

First, each queue has exactly one writer thread and exactly one reader thread. If 1-to-N communication is needed, multiple queues are created (Figure 8). Given this arrangement, the queue never has to synchronize multiple writers (there is only one) or multiple readers (there is only one), so it can be implemented in an extra-efficient way.
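A minimal single-producer/single-consumer queue along these lines can be sketched as a bounded ring buffer with two atomic indices (a textbook structure, not ØMQ's actual unbounded, chunked pipe):

```cpp
// Minimal single-producer/single-consumer lock-free queue (a textbook bounded
// ring buffer, not ØMQ's actual pipe). Because exactly one thread writes and
// exactly one thread reads, two atomic indices are all the synchronization
// that is needed.
#include <atomic>
#include <cstddef>

template <typename T, size_t N>                      // N must be a power of two
class SpscQueue {
    T buffer_[N];
    std::atomic<size_t> head_{0};                    // written only by the reader
    std::atomic<size_t> tail_{0};                    // written only by the writer

public:
    bool push(const T& item) {                       // called by the single writer
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail - head_.load(std::memory_order_acquire) == N)
            return false;                            // full
        buffer_[tail % N] = item;
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {                              // called by the single reader
        size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;                            // empty
        item = buffer_[head % N];
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
};
```

Note that the correctness of the acquire/release pairing relies entirely on the one-writer/one-reader constraint described above; with two writers or two readers the structure would need a different design.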

Figure 8: Queues

Second, we realized that while lock-free algorithms are more efficient than classic mutex-based algorithms, atomic CPU operations are still rather expensive (especially when there is contention between CPU cores), and doing an atomic operation for every message written and/or every message read was slower than we were willing to accept.

The way to speed it up is, once again, batching. Imagine you have 10 messages to write to a queue. That can happen, for example, when you receive a network packet containing 10 small messages. Receiving the packet is an atomic event: you cannot get half of it. This atomic event results in a need to write 10 messages to the lock-free queue. There is not much point in doing an atomic operation for each of them. Instead, you can accumulate the messages in a "pre-write" portion of the queue, accessed solely by the writer thread, and then flush them all with a single atomic operation.

The same applies to reading from the queue. Imagine the 10 messages above have been flushed to the queue. The reader thread could extract each message from the queue with a separate atomic operation, but that is overkill; instead, it can move all the pending messages to the "read-ahead" portion of the queue with a single atomic operation. Afterwards it retrieves the messages one by one from the read-ahead buffer. The read-ahead buffer is owned and accessed solely by the reader thread, so no synchronization whatsoever is needed in that phase.
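The mechanism can be sketched as follows (illustrative only, not ØMQ's actual ypipe): the writer accumulates messages privately and publishes the whole batch with one atomic operation, and the reader grabs the whole published batch with one atomic operation and then drains it privately.

```cpp
// Illustrative sketch of the flush / read-ahead idea described above (not
// ØMQ's actual pipe). The writer accumulates messages in a private "pre-write"
// batch and publishes it with a single atomic operation; the reader takes the
// whole published batch with a single atomic operation and drains it without
// any further synchronization.
#include <atomic>
#include <string>
#include <vector>

class BatchedPipe {
    std::vector<std::string> prewrite_;                        // touched only by the writer
    std::atomic<std::vector<std::string>*> published_{nullptr};

public:
    // Writer side: accumulating costs no atomic operations at all.
    void write(std::string msg) { prewrite_.push_back(std::move(msg)); }

    // Writer side: one atomic operation publishes the whole batch. In this
    // simplified sketch the flush is simply retried later if the reader has
    // not yet collected the previous batch.
    bool flush() {
        if (prewrite_.empty()) return true;
        auto* batch = new std::vector<std::string>(std::move(prewrite_));
        std::vector<std::string>* expected = nullptr;
        if (!published_.compare_exchange_strong(expected, batch)) {
            prewrite_ = std::move(*batch);                     // put it back, try again later
            delete batch;
            return false;
        }
        prewrite_.clear();
        return true;
    }

    // Reader side: one atomic operation moves everything into the reader's
    // private read-ahead buffer, which is then consumed message by message.
    std::vector<std::string> read_batch() {
        std::vector<std::string>* batch = published_.exchange(nullptr);
        if (!batch) return {};
        std::vector<std::string> readahead = std::move(*batch);
        delete batch;
        return readahead;
    }
};
```

ØMQ's real pipe combines this idea with the chunked message storage described earlier, so that publishing a batch amounts to updating a single pointer with no extra allocation.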

The arrow on the left of Figure 9 shows how a pre-write buffer can be flushed into the queue simply by modifying a single pointer. The arrow on the right shows how the entire content of the queue can be shifted to the read-ahead buffer by modifying another pointer, with nothing else to do.

Figure 9: Lock-free queue

Lessons learned: lock-free algorithms are hard to invent, troublesome to implement, and almost impossible to debug. If at all possible, use an existing proven algorithm rather than inventing your own. And when optimal performance is required, do not rely on lock-free algorithms alone: while they are fast, performance can be improved significantly by doing smart batching on top of them, as described above.
