Inter-thread communication in Java at the speed of light


This story began with a very simple idea: to create a developer-friendly, simple, and lightweight framework for inter-thread communication in Java. It should need no locks, synchronizers, semaphores, waits, or notifications, and no queues, messages, events, or any other concurrency-specific terminology or tools.

POJOs communicate with each other through nothing but plain old Java interfaces.

This could resemble Akka's typed actors, but Akka may be overkill here: the new framework has to be ultra-lightweight and optimized for inter-thread communication on a single multi-core computer.

The Akka framework is great at inter-process communication, where actors cross the boundaries between JVM instances, whether on the same machine or on machines distributed across a network.

However, for smaller projects that only need inter-thread communication, Akka typed actors may be using a sledgehammer to crack a nut, although the typed actor remains an ideal model to implement.

In a few days I built a first solution using dynamic proxies, blocking queues, and a cached thread pool.

Figure 1 shows the high-level architecture of the framework:

Figure 1: High-level architecture of the framework


An SPSC queue is a single-producer/single-consumer queue. An MPSC queue is a multi-producer/single-consumer queue.

The dispatcher thread receives the messages sent by the actor threads and distributes them to the corresponding SPSC queues.

An actor thread that receives a message uses its data to call the matching method on the corresponding actor instance. Through the proxies of other actors, an actor instance can send messages to the MPSC queue, from which they are forwarded to the target actor thread.

I created a simple ping-pong example to test it:

    public interface PlayerA {
        void pong(long ball); // fire-and-forget method call
    }

    public interface PlayerB {
        void ping(PlayerA playerA, long ball); // fire-and-forget method call
    }

    public class PlayerAImpl implements PlayerA {
        @Override
        public void pong(long ball) {
        }
    }

    public class PlayerBImpl implements PlayerB {
        @Override
        public void ping(PlayerA playerA, long ball) {
            playerA.pong(ball);
        }
    }

    public class PingPongExample {
        public void testPingPong() {
            // The manager hides the complexity of inter-thread communication
            // and controls the actor proxies, actor implementations, and threads
            ActorManager manager = new ActorManager();

            // Register the actor implementations in the manager
            manager.registerImpl(PlayerAImpl.class);
            manager.registerImpl(PlayerBImpl.class);

            // Create the actor proxies. A proxy converts a method call into an
            // internal message that is sent between threads to a specific actor instance.
            PlayerA playerA = manager.createActor(PlayerA.class);
            PlayerB playerB = manager.createActor(PlayerB.class);

            for (int i = 0; i < 1000000; i++) {
                playerB.ping(playerA, i);
            }
        }
    }
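The article does not show how ActorManager turns a method call into a message. As a rough illustration only, the sketch below uses java.lang.reflect.Proxy to wrap each call in a Runnable, places it on a LinkedBlockingQueue, and runs it on a single worker thread. The names ProxyActorSketch, Counter, createActor, and demo are hypothetical and are not part of the author's framework.

```java
import java.lang.reflect.Proxy;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class ProxyActorSketch {
    // Hypothetical actor interface with a single fire-and-forget method
    interface Counter {
        void add(long amount);
    }

    /** Wraps target in a proxy whose method calls are queued and run later by whoever drains the mailbox. */
    @SuppressWarnings("unchecked")
    static <T> T createActor(Class<T> type, T target, BlockingQueue<Runnable> mailbox) {
        return (T) Proxy.newProxyInstance(type.getClassLoader(), new Class<?>[] {type},
            (proxy, method, args) -> {
                // The call becomes a message: a Runnable that invokes the real method later
                mailbox.put(() -> {
                    try {
                        method.invoke(target, args);
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException(e);
                    }
                });
                return null; // assumption: only void methods are supported in this sketch
            });
    }

    /** Sends `calls` messages through the proxy and returns the total seen by the actor. */
    static long demo(int calls) {
        BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
        AtomicLong total = new AtomicLong();
        Runnable stop = () -> { }; // sentinel message that ends the worker loop

        Thread worker = new Thread(() -> {
            try {
                for (Runnable message = mailbox.take(); message != stop; message = mailbox.take()) {
                    message.run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        Counter counter = createActor(Counter.class, total::addAndGet, mailbox);
        for (int i = 0; i < calls; i++) {
            counter.add(1); // returns immediately; the work happens on the worker thread
        }
        try {
            mailbox.put(stop);
            worker.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return total.get();
    }
}
```

Calling ProxyActorSketch.demo(1000) pushes 1,000 calls through the proxy. Every call pays for a reflective invocation and a lock-based queue operation, which is the kind of overhead the rest of the article sets out to remove.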

In testing, the speed was about 500,000 ping/pongs per second. So far so good. Then I compared it with the same code running in a single thread, and it suddenly looked much less impressive: the single-threaded code reached more than 2 billion (2,681,850,373) operations per second!

The difference is more than 5,000-fold. I was disappointed. In many cases, single-threaded code is more efficient than multi-threaded code.

I started looking for the reason my ping-pong players were so slow. After some research and testing, I found that the blocking queues were the problem: the queues I used to pass messages between actors were hurting performance.

Figure 2: SPSC queue with only one producer and one consumer

So I set out to find the fastest queue in Java to replace it. I found Nitsan Wakart's blog, where he published several posts on implementing a lock-free single-producer/single-consumer (SPSC) queue. The posts were inspired by Martin Thompson's talk on lock-free algorithms for ultimate performance.

Lock-free queues perform better than queues based on lock primitives. In a lock-based queue, while one thread holds the lock, the other threads wait for it to be released. In a lock-free algorithm, a producer thread can produce messages without being blocked by other producer threads, and consumers are not blocked by other consumers reading from the queue.

The performance of the SPSC queue described in Martin Thompson's talk and in Nitsan's blog is incredible: over 100 M ops/sec. That is more than 10 times faster than the JDK's concurrent queue implementations, which reach about 8 M ops/sec on a 4-core Intel Core i7.
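To make the lock-free idea concrete, here is a minimal bounded SPSC ring buffer in the style usually attributed to Lamport. It is a simplified illustration and not the optimized queue from Nitsan Wakart's posts. The class name SpscQueueSketch and its methods are hypothetical, and it is only safe with exactly one producer thread and one consumer thread.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SpscQueueSketch {
    private final long[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read, advanced only by the consumer
    private final AtomicLong tail = new AtomicLong(); // next slot to write, advanced only by the producer

    SpscQueueSketch(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        buffer = new long[capacity];
        mask = capacity - 1;
    }

    /** Producer side: returns false when the queue is full instead of blocking. */
    boolean offer(long value) {
        long t = tail.get();
        if (t - head.get() == buffer.length) {
            return false;
        }
        buffer[(int) (t & mask)] = value;
        tail.lazySet(t + 1); // ordered store: the slot is written before the new tail becomes visible
        return true;
    }

    /** Consumer side: returns null when the queue is empty instead of blocking. */
    Long poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null;
        }
        long value = buffer[(int) (h & mask)];
        head.lazySet(h + 1); // frees the slot for the producer
        return value;
    }
}
```

Neither side ever takes a lock. Each counter has a single writer, so plain ordered stores are enough and no compare-and-swap is needed. That single-writer property is exactly what is lost once a second producer is added.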

With great expectations, I replaced the linked blocking queues attached to every actor with lock-free SPSC queues. Sadly, the throughput tests did not show the significant improvement I expected. I soon realized, though, that the bottleneck was not the SPSC queues but the multi-producer/single-consumer (MPSC) queue.

Using an SPSC queue in the MPSC role is not straightforward: during put operations, multiple producers may overwrite each other's values. An SPSC queue has no code to coordinate puts from multiple producers, so even the fastest SPSC queue would not solve my problem.

To handle the multi-producer/single-consumer case, I decided to use LMAX Disruptor, a high-performance inter-thread messaging library based on a ring buffer.

Figure 3: LMAX Disruptor with a single producer and a single consumer

With the Disruptor, low-latency, high-throughput message passing between threads is easy to achieve. It also covers different combinations of producers and consumers. Several threads can read messages from the ring buffer without blocking each other:

Figure 4: LMAX Disruptor with a single producer and two consumers

Below is a scenario in which multiple producers write messages to the ring buffer and multiple consumers read messages from it.

Figure 5: LMAX Disruptor with two producers and two consumers

After a quick search for performance tests, I found a throughput test with three publishers and one consumer. That was exactly what I needed, and it gave the following results:




         BlockingQueue        Disruptor
Run 0    4,550,625 ops/sec    11,487,650 ops/sec
Run 1    4,651,162 ops/sec    11,049,723 ops/sec
Run 2    4,404,316 ops/sec    11,142,061 ops/sec


In the three-producer/one-consumer scenario, the Disruptor is more than twice as fast as Java's BlockingQueue. However, that is still a long way from the tenfold improvement I expected.

This frustrated me, and my mind kept searching for a solution. As fate would have it, I had recently changed my commute and started taking the subway instead of carpooling. Suddenly my mind began to map stations to producers and consumers. At a station there are both producers (people getting on the train) and consumers (the same people getting off the train).

I created the Railway class and used an AtomicInteger to track the train as it passes from station to station. I started with a simple scenario: a single train on the railway.

    import java.util.concurrent.atomic.AtomicInteger;

    public class Railway {
        // Assumption: two stations (one producer, one consumer);
        // the listing as published does not define stationCount
        private final int stationCount = 2;

        private final Train train = new Train();

        // stationIndex tracks the train and defines which station currently has it
        private final AtomicInteger stationIndex = new AtomicInteger();

        // Multiple threads call this method and wait for the train at a specific station
        public Train waitTrainOnStation(final int stationNo) {
            while (stationIndex.get() % stationCount != stationNo) {
                Thread.yield(); // needed to keep message-passing throughput high,
                                // but it burns CPU cycles while waiting for the train
            }
            // the busy loop exits only when the station number
            // equals stationIndex.get() % stationCount
            return train;
        }

        // Moves the train to the next station by incrementing the station index
        public void sendTrain() {
            stationIndex.getAndIncrement();
        }
    }

For testing, I used the same conditions as in the Disruptor performance tests and the SPSC queue tests: the tests pass long values between threads. I created the following Train class, which contains a long array:
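The Train listing is missing from this copy of the article. The sketch below is a hedged reconstruction inferred only from the calls made in the Railway test (getCapacity, goodsCount, addGoods, getGoods, and Train.CAPACITY). The field names, the capacity of 2048, and the way getGoods decrements the item count are assumptions.

```java
class Train {
    // Assumption: number of long values one train can carry
    public static final int CAPACITY = 2 * 1024;

    private final long[] goodsArray = new long[CAPACITY]; // cargo carried between stations
    private int index;                                    // number of items currently on the train

    public int getCapacity() {
        return CAPACITY;
    }

    // Returns the number of items currently loaded
    public int goodsCount() {
        return index;
    }

    // Loads one item onto the train
    public void addGoods(long item) {
        goodsArray[index++] = item;
    }

    // Reads the item at position i and counts it as unloaded
    public long getGoods(int i) {
        index--;
        return goodsArray[i];
    }
}
```

The class itself has no synchronization. In this sketch, visibility between threads relies on the hand-off through Railway's atomic stationIndex: a station touches the train only after it has observed the index update made by the previous station.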

Then I wrote a simple test: two threads pass long values to each other via the train.

Figure 6: Railway with a single producer and a single consumer using a single train

    public void testRailWay() {
        final Railway railway = new Railway();
        final long n = 20000000000L;

        // start the consumer thread
        new Thread() {
            long lastValue = 0;

            @Override
            public void run() {
                while (lastValue < n) {
                    Train train = railway.waitTrainOnStation(1); // wait for the train at station #1
                    int count = train.goodsCount();
                    for (int i = 0; i < count; i++) {
                        lastValue = train.getGoods(i); // unload the goods
                    }
                    railway.sendTrain(); // send the train on to the next station
                }
            }
        }.start();

        final long start = System.nanoTime();
        long i = 0;
        while (i < n) {
            Train train = railway.waitTrainOnStation(0); // wait for the train at station #0
            int capacity = train.getCapacity();
            for (int j = 0; j < capacity; j++) {
                train.addGoods(i++); // load the goods onto the train
            }
            railway.sendTrain();

            if (i % 100000000 == 0) { // report every 100M items
                final long duration = System.nanoTime() - start;
                // double arithmetic avoids overflowing long for large i
                final long ops = (long) (i * 1e9 / duration);
                System.out.format("ops/sec = %,d\n", ops);
                System.out.format("trains/sec = %,d\n", ops / Train.CAPACITY);
                System.out.format("latency nanos = %.3f%n\n",
                        duration / (float) (i) * (float) Train.CAPACITY);
            }
        }
    }

I was surprised when I ran this test with different train capacities:


[Table: throughput (ops/sec) and latency (ns) by train capacity; only a single latency entry, 742.9 ns, is preserved]

When the train capacity reached 32,768, the throughput of messages passed between the two threads reached 767,028,751 ops/sec. That is several times faster than the SPSC queue from Nitsan's blog.

Continuing with the railway idea, I wondered what would happen with two trains. I expected higher throughput and lower latency. Each station would have its own train: while one train is being loaded at the first station, the second train is being unloaded at the second station, and vice versa.

Figure 7: Railway with a single producer and a single consumer using two trains
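The article does not include the multi-train code, so the following is only a sketch of one way the idea could work. Each train carries its own atomic station index, and both stations visit the trains in the same round-robin order, so one train can be unloaded while another is being loaded. MultiTrainRailway, Wagon, and transfer are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicLong;

final class MultiTrainRailway {
    /** Hypothetical stand-in for the article's Train, with its own position counter. */
    static final class Wagon {
        final long[] goods;
        int count;                                        // items currently loaded
        final AtomicLong stationIndex = new AtomicLong(); // how many times this train has been sent on
        Wagon(int capacity) { goods = new long[capacity]; }
    }

    private final Wagon[] trains;
    private final int stationCount;

    MultiTrainRailway(int stationCount, int trainCount, int capacity) {
        this.stationCount = stationCount;
        trains = new Wagon[trainCount];
        for (int t = 0; t < trainCount; t++) trains[t] = new Wagon(capacity);
    }

    /** Spins until train trainNo has arrived at station stationNo. */
    Wagon waitTrainOnStation(int trainNo, int stationNo) {
        Wagon train = trains[trainNo];
        while (train.stationIndex.get() % stationCount != stationNo) {
            Thread.yield();
        }
        return train;
    }

    /** Publishes the train's cargo and moves it to the next station. */
    void sendTrain(Wagon train) {
        train.stationIndex.incrementAndGet();
    }

    /** Sends 0..n-1 from station 0 to station 1 and returns the sum seen by the consumer. */
    static long transfer(long n, int trainCount, int capacity) {
        MultiTrainRailway railway = new MultiTrainRailway(2, trainCount, capacity);
        long[] sum = new long[1];
        Thread consumer = new Thread(() -> {
            long received = 0;
            for (int t = 0; received < n; t = (t + 1) % trainCount) {
                Wagon train = railway.waitTrainOnStation(t, 1);
                for (int i = 0; i < train.count; i++) sum[0] += train.goods[i];
                received += train.count;
                train.count = 0;          // the train goes back empty
                railway.sendTrain(train);
            }
        });
        consumer.start();
        long next = 0;
        for (int t = 0; next < n; t = (t + 1) % trainCount) {
            Wagon train = railway.waitTrainOnStation(t, 0);
            while (train.count < capacity && next < n) train.goods[train.count++] = next++;
            railway.sendTrain(train);
        }
        try {
            consumer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sum[0];
    }
}
```

For example, MultiTrainRailway.transfer(1000, 4, 8) moves the values 0 to 999 over four trains of capacity 8. The sketch shows only the hand-off protocol and makes no claim about matching the article's throughput numbers.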

The throughput results are as follows:


[Table: throughput (ops/sec) and latency (ns) by train capacity with two trains; values not preserved]

The results were astonishing: more than 1.4 times faster than with a single train. With a train capacity of one, latency dropped from 192.6 nanoseconds to 133.5 nanoseconds, which is clearly an encouraging sign.

So my experiments were not over. The latency of passing messages between two threads with a train capacity of 2048 was 2,178.4 nanoseconds, which is too high. I started thinking about how to reduce it and created an example with many trains:

Figure 8: Railway with a single producer and a single consumer using multiple trains

I also reduced the train capacity to a single long value and started experimenting with the number of trains. The test results were as follows:

[Table: throughput (ops/sec) and latency (ns) by number of trains; values not preserved]

With 32,768 trains, the latency of sending one long value between threads dropped to 13.9 nanoseconds. By tuning the number of trains and the train capacity, throughput and latency can be balanced so that latency is not too high and throughput is not too low.

These numbers are great for a single producer and a single consumer (SPSC), but how do we make this work with multiple producers and consumers? The answer is simple: add more stations!

Figure 9: Railway with one producer and two consumers

Each thread waits for the next train, loads or unloads it, and then sends it on to the next station. Producers load goods onto the train, and consumers unload goods from it. The trains keep circulating from station to station.

To test the single-producer/multi-consumer (SPMC) case, I created a Railway test with eight stations. One station belongs to the producer, and the other seven belong to consumers. The results:

Number of trains = 256, train capacity = 32:

ops/sec = 116,604,397    latency (nanoseconds) = 274.4

Number of trains = 32, train capacity = 256:

ops/sec = 432,055,469    latency (nanoseconds) = 592.5

As you can see, even with eight worker threads the results are quite good: 32 trains with a capacity of 256 longs achieve 432,055,469 ops/sec. During the test, all CPU cores were loaded at 100%.

Figure 10: CPU usage during the Railway test with 8 stations

While playing with the Railway algorithm, I almost forgot my original goal: to improve performance for the multi-producer/single-consumer case.

Figure 11: Railway with three producers and one consumer

I created a new test with three producers and one consumer. Each train travels from station to station in a loop, and each producer fills only one third of each train's capacity. From every train, the consumer takes all three portions loaded by the three producers. The average results of the performance test were:

ops/sec = 162,597,109    trains/sec = 54,199,036    latency (nanoseconds) = 18.5

The results are quite good: the producers and the consumer work at a combined rate of over 160 M ops/sec.
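The three-producer test itself is not listed in the article. As a rough sketch of the scheme described above, the code below runs a single train around four stations: each producer station writes only its own third of the cargo, and the consumer station reads the whole train. The name ThreeProducerRailway, the slice size, and the use of one train are assumptions made to keep the example short.

```java
import java.util.concurrent.atomic.AtomicLong;

final class ThreeProducerRailway {
    static final int PRODUCERS = 3;
    static final int STATIONS = PRODUCERS + 1; // the consumer owns the last station
    static final int SLICE = 4;                // assumption: items each producer loads per lap

    private final long[] goods = new long[PRODUCERS * SLICE];
    private final AtomicLong stationIndex = new AtomicLong();

    void waitTrainOnStation(int stationNo) {
        while (stationIndex.get() % STATIONS != stationNo) {
            Thread.yield();
        }
    }

    void sendTrain() {
        stationIndex.incrementAndGet();
    }

    /** Runs the given number of laps; producer p loads the value p + 1. Returns the consumer's total. */
    static long run(int laps) {
        ThreeProducerRailway railway = new ThreeProducerRailway();
        Thread[] producers = new Thread[PRODUCERS];
        for (int p = 0; p < PRODUCERS; p++) {
            final int station = p;
            producers[p] = new Thread(() -> {
                for (int lap = 0; lap < laps; lap++) {
                    railway.waitTrainOnStation(station);
                    // each producer fills only its own third of the train
                    for (int i = 0; i < SLICE; i++) {
                        railway.goods[station * SLICE + i] = station + 1;
                    }
                    railway.sendTrain();
                }
            });
            producers[p].start();
        }

        long total = 0;
        for (int lap = 0; lap < laps; lap++) {
            railway.waitTrainOnStation(PRODUCERS); // the consumer's station
            for (long item : railway.goods) total += item; // takes all three portions
            railway.sendTrain();
        }
        try {
            for (Thread producer : producers) producer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }
}
```

Because the train visits the producers one after another, no two producers ever write to it at the same time. This is how the scheme avoids the overwrite problem that ruled out a plain SPSC queue for the multi-producer case.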

For comparison, here are the Disruptor results for the same case, three producers and one consumer:

  Run 0, Disruptor=11,467,889 ops/sec
  Run 1, Disruptor=11,280,315 ops/sec
  Run 2, Disruptor=11,286,681 ops/sec
  Run 3, Disruptor=11,254,924 ops/sec

Next is the Disruptor 3P:1C test with message batching (10 messages per batch):

  Run 0, Disruptor=116,009,280 ops/sec
  Run 1, Disruptor=128,205,128 ops/sec
  Run 2, Disruptor=101,317,122 ops/sec
  Run 3, Disruptor=98,716,683 ops/sec

Finally, here are the results of the same Disruptor tests run with Java's BlockingQueue in the 3P:1C scenario:

  Run 0, BlockingQueue=4,546,281 ops/sec
  Run 1, BlockingQueue=4,508,769 ops/sec
  Run 2, BlockingQueue=4,101,386 ops/sec
  Run 3, BlockingQueue=4,124,561 ops/sec

As you can see, the Railway's average throughput is 162,597,109 ops/sec, while the Disruptor's best result in the same case is only 128,205,128 ops/sec. For LinkedBlockingQueue, the best result is only 4,546,281 ops/sec.

The Railway algorithm offers an easy way to significantly increase throughput through event batching. By tuning the train capacity or the number of trains, it is easy to reach the desired balance of throughput and latency.

Railway can also handle complex cases that mix producers and consumers, where the same thread consumes messages, processes them, and returns the results to the loop:

Figure 12: Railway with mixed producers and consumers

Finally, here is an optimized, ultra-high-throughput single-producer/single-consumer test:

Figure 13: Railway with a single producer and a single consumer

On average, throughput exceeded 1.5 billion (1,569,884,271) operations per second, with a latency of 1.3 microseconds. For comparison, the single-threaded test described at the beginning of this article reached 2,681,850,373 operations per second.

I will leave you to draw your own conclusions.

I hope to write another article on how to support the Railway algorithm behind the Queue and BlockingQueue interfaces for different combinations of producers and consumers. Stay tuned.

Via InfoQ
