Concurrent Framework Disruptor Analysis

Source: Internet
Author: User


1. Introduction

Disruptor is an open-source Java framework designed to achieve the highest possible throughput (TPS) and the lowest possible latency on the producer-consumer problem (PCP). Disruptor is a key component of the LMAX online trading platform, which uses it to process orders at up to 6 million TPS. Beyond the financial sector, Disruptor can significantly improve performance in other general applications as well. In fact, Disruptor is not so much a framework as a design idea: for programs built around concurrency, buffering, the producer-consumer model, and transaction processing, Disruptor proposes a scheme that greatly improves throughput (TPS).

Many people have already written articles about Disruptor, but I still want to write this analysis; after all, different people understand it differently. I hope this article gives readers who have never touched Disruptor a preliminary understanding of it, and it provides some links for further reference.

2. What is Disruptor? Why is the speed faster?

In short, Disruptor is a high-performance Buffer plus a framework for using that Buffer. Why is it better? That question starts with the disadvantages of the PCP's traditional solutions.

We know that the core of the PCP, also known as the Bounded-Buffer problem, is to ensure error-free access to a shared Buffer in a multi-threaded environment. Using Java's ArrayBlockingQueue or LinkedBlockingQueue, the PCP model is easy to implement. That is fine for ordinary programs, but it is not suitable for systems with high concurrency and high TPS requirements.

* BlockingQueue relies on the locks in java.util.concurrent.locks. When multiple threads (for example several producers) write to the Queue at the same time, lock contention means only one producer can proceed while the others are suspended: their state switches from RUNNING to BLOCKED until the winning producer finishes with the Buffer and releases the lock, at which point a blocked thread switches back to RUNNABLE and must wait for a time slice before it can compete for the lock again. Storing one item in the Buffer usually takes very little time, and an OS thread context switch is also fast; however, as the number of threads grows, the switching overhead grows with it, and the repeated acquisition and release of the lock becomes the performance bottleneck.

* Beyond the direct cost of locking, BlockingQueue can suffer a secondary performance loss from the order in which threads win the contention: in practice the scheduling order is often far from ideal, and the operating system may repeatedly schedule only producers or only consumers for a stretch of time, so in extreme cases the buffer is completely filled or completely drained within a short period. (Ideally the buffer stays moderately full, with production and consumption proceeding at roughly the same speed.)
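For contrast, the lock-based model described above can be sketched in a few lines. This is a minimal single-producer, single-consumer pair on the bounded ArrayBlockingQueue mentioned earlier; the class and method names are illustrative, not from any particular benchmark.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal lock-based producer-consumer: put() and take() block (and contend
// on the queue's internal lock) whenever the bounded buffer is full or empty.
public class QueuePcp {
    static long run() throws InterruptedException {
        final BlockingQueue<Long> queue = new ArrayBlockingQueue<Long>(8); // small bounded buffer
        final long[] sum = new long[1];
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (long i = 1; i <= 100; i++) queue.put(i); // blocks while the buffer is full
                } catch (InterruptedException e) {}
            }
        });
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 100; i++) sum[0] += queue.take(); // blocks while empty
                } catch (InterruptedException e) {}
            }
        });
        producer.start(); consumer.start();
        producer.join(); consumer.join();
        return sum[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints 5050 (= 1 + 2 + ... + 100)
    }
}
```

Every put and take here goes through the queue's internal lock, which is exactly the per-item cost Disruptor sets out to remove.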

Disruptor's solution to the above problems is blunt: use no locks at all.

CAS (Compare and Swap/Set) is a general-purpose instruction supported by the CPU (for example the cmpxchg family of instructions). A CAS operation takes three operands, CAS(A, B, C): it compares the value at address A with the expected value B, and if they are equal it writes the new value C to address A. CAS is a very lightweight instruction implemented in hardware, and the CPU guarantees its atomicity. In Java it is exposed through methods such as boolean compareAndSwapInt(java.lang.Object arg0, long arg1, int arg2, int arg3). For a variable such as the Ring Buffer's write pointer, CAS avoids the confusion caused by multi-threaded access: if the compareAndSwap method returns true, the CAS succeeded and the new value was assigned; if it returns false, the value at address A was no longer equal to the expected value B, and the caller simply tries again. The logic for moving the write pointer with CAS looks like this:

```java
// move the write pointer forward by n
public long next(int n)
{
    // ......
    long current, next;
    do
    {
        // first back up the current value of the write pointer
        current = pointer.get();
        // the position we expect to move the write pointer to
        next = current + n;
        // ...... omitted: make sure the Slots from current to current + n
        //        have already been read by the consumer ......
        // *atomic operation*: if the write pointer is still the same as the
        // backup (meaning the calculation above is still valid), move it
        if (pointer.compareAndSet(current, next))
            break;
    } while (true); // if the CAS failed, try again
    return next;
}
```
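The same retry loop can be written as a complete runnable sketch using the JDK's AtomicLong, whose compareAndSet compiles down to a CPU CAS instruction. The class and method names here are illustrative, not part of Disruptor's API, and the real ring buffer would additionally wait for consumers before advancing.

```java
import java.util.concurrent.atomic.AtomicLong;

// A minimal CAS retry loop: several threads claim slots concurrently,
// and because compareAndSet is atomic, no claim is ever lost.
public class CasPointer {
    private final AtomicLong pointer = new AtomicLong(0);

    // Claim n slots by moving the write pointer forward with CAS.
    public long next(int n) {
        long current, next;
        do {
            current = pointer.get();  // snapshot the current write pointer
            next = current + n;       // where we want to move it
            // (a real ring buffer would also wait here until the slots
            //  from current to next have been consumed)
        } while (!pointer.compareAndSet(current, next)); // retry on contention
        return next;
    }

    public long get() { return pointer.get(); }

    public static void main(String[] args) throws InterruptedException {
        final CasPointer p = new CasPointer();
        // 4 threads each claim one slot 10,000 times
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 10000; j++) p.next(1);
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(p.get()); // prints 40000: no update was lost
    }
}
```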

Okay, now we have a Ring Buffer driven by CAS, which is much faster than the locked version. But CAS is not as cheap as we might think: according to the benchmark in the PDF from link [2], compared with a single thread performing a simple task with no synchronization at all, the locked version costs about two orders of magnitude more time, and the CAS version still costs about one order of magnitude more. So what else does Disruptor do to improve performance? The following lists the other optimization points besides lock-free programming.

  Cache Line Padding: the CPU cache is organized in cache lines, usually 64 bytes each. Data moves between main memory and the cache a whole line at a time, and each CPU core has its own cache (if one core modifies a cache line, every other core holding that line must synchronize it). The producer's and consumer's pointers are plain long values. With one producer and one consumer, the two pointers have no direct relationship, so as long as the threads do not step on each other, each can update its own pointer freely. That is a bit of a mouthful, but it sets up the question: what happens if the producer's and consumer's pointers (16 bytes in total) land in the same cache line? If, say, the consumer running on CPU core A modifies its pointer value (P1), the cache line containing P1 becomes invalid in every other core and must be reloaded from main memory, even though the producer's pointer never changed. The cost of this "false sharing" is obvious, but neither the CPU nor the compiler is smart enough to avoid it automatically, so the code must pad the cache line itself. The cause is subtle, but the fix is simple: instead of a single long for each buffer pointer, use (conceptually) a long array of length 8, so that the pointer fills an entire cache line by itself, and each thread's updates to its own pointer no longer interfere with the other's.
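A minimal sketch of that padding trick: the counter value is surrounded by seven unused longs on each side so that two counters can never share a 64-byte cache line. The field names are illustrative; note also that the JVM is free to reorder fields, which is why Disruptor's real Sequence class uses a class-hierarchy layout to force the padding, a detail omitted here.

```java
// A padded counter: the volatile value sits alone in its cache line,
// so updates to one counter do not invalidate the line holding another.
public class PaddedCounter {
    @SuppressWarnings("unused")
    private long p1, p2, p3, p4, p5, p6, p7;   // 7 x 8 bytes of padding before
    private volatile long value = 0;
    @SuppressWarnings("unused")
    private long p9, p10, p11, p12, p13, p14, p15; // 7 x 8 bytes of padding after

    public long get() { return value; }
    public void set(long v) { value = v; }

    public static void main(String[] args) {
        PaddedCounter producerPos = new PaddedCounter();
        PaddedCounter consumerPos = new PaddedCounter();
        producerPos.set(42);
        consumerPos.set(7);
        // behaves exactly like a plain long; padding only affects memory layout
        System.out.println(producerPos.get() + "," + consumerPos.get()); // prints 42,7
    }
}
```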

  Avoid GC: when writing Java, many people are used to new-ing objects and letting the GC collect them. However, under high load, frequent allocation inevitably leads to more frequent GC. Disruptor's policy for avoiding this is to allocate up front: when a RingBuffer instance is created, the constructor requires a Factory for the buffer's element type, and the Ring Buffer immediately fills the entire buffer with instances produced by that Factory. When a producer later publishes data, instead of the traditional approach (new an instance, add it to the buffer, discard it after consumption), it fetches the pre-allocated instance in the claimed slot and sets its value. By analogy, if the buffer is a stack of paper with information recorded on it, the old practice was to fetch a fresh sheet from the system for every item added to the buffer, write on it, and throw it away after consumption; the new practice is to prepare all the sheets in advance and simply erase the old information and write the new.

  Batch operation (Batch): the core operations on the Ring Buffer are producing and consuming; if the number of these operations can be reduced, performance rises accordingly, and Disruptor uses batching to reduce both. Here is how batching shows up in Disruptor's production and consumption flows. Producing into the RingBuffer takes two phases. Phase 1 claims space: after the claim, the producer obtains a slot range [low, high] and calls setValue on every object in that range (see the "Avoid GC" point above); if another producer makes a claim in the meantime, it receives a different, non-overlapping range. Phase 2 is publishing (for example ringBuffer.publish(low, high)); only after phase 2 can consumers read the claimed data. Disruptor encourages claiming and publishing in batches to reduce the synchronization cost incurred per item produced. Consuming from the RingBuffer also takes two phases. In phase 1 the consumer waits for the producer's (write) pointer to reach a requested sequence N (meaning the data up to N has been published); phase 1 then returns a sequence R, meaning every element up to subscript R in the Ring Buffer may be read. Phase 2 is the actual reading (details omitted). R is often greater than N, and in that case the consumer should batch-read and process all the data in the range [N, R].
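The two-phase batch protocol above can be demonstrated self-contained on a plain array instead of Disruptor's RingBuffer. All names (claim, publish, consumeBatch) are illustrative, and this sketch omits the wrap-around check against slow consumers that a real ring buffer needs.

```java
import java.util.concurrent.atomic.AtomicLong;

// Two-phase batch production: phase 1 claims a range of slots, phase 2
// publishes them with a single cursor update; a consumer that observes the
// published cursor then reads the entire available range in one batch.
public class BatchRing {
    private final long[] buffer = new long[16];              // power-of-2 capacity
    private final AtomicLong claimed = new AtomicLong(-1);   // last claimed sequence
    private final AtomicLong published = new AtomicLong(-1); // last visible sequence

    // Phase 1: claim n slots; returns the high end of the range [high-n+1, high]
    long claim(int n) { return claimed.addAndGet(n); }

    void set(long seq, long value) { buffer[(int) (seq & 15)] = value; }

    // Phase 2: one publish makes the whole claimed batch visible
    void publish(long high) { published.set(high); }

    // Consumer: read every published slot past lastRead in one batch,
    // returning the new read position
    long consumeBatch(long lastRead, StringBuilder out) {
        long available = published.get();
        for (long seq = lastRead + 1; seq <= available; seq++) {
            out.append(buffer[(int) (seq & 15)]).append(' ');
        }
        return available;
    }

    public static void main(String[] args) {
        BatchRing ring = new BatchRing();
        long high = ring.claim(3);                         // claim slots 0..2
        for (long s = high - 2; s <= high; s++) ring.set(s, s * 10);
        ring.publish(high);                                // single publish for 3 items
        StringBuilder out = new StringBuilder();
        ring.consumeBatch(-1, out);
        System.out.println(out.toString().trim());         // prints 0 10 20
    }
}
```

Three items cost one cursor update on each side, which is the whole point of batching.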

  LMAX architecture: (Note: this refers to the collection of design ideas LMAX used in building their trading platform. Strictly speaking, the LMAX architecture contains Disruptor rather than being part of it, but Disruptor's design reflects these ideas to some extent, so it deserves a mention. There is a great deal that could be written about the LMAX architecture, but limited by my own level, I can only sketch it briefly here. Also, this architecture is the product of an extreme pursuit of performance and is not necessarily suitable for everyone.) As shown in the figure (not reproduced here), the LMAX architecture has three parts: an input Disruptor, an output Disruptor, and the core business logic processor in between. All input enters the Input Disruptor, is read and processed by the business logic processor, is then written to the Output Disruptor, and is finally output elsewhere.

3. Hello Disruptor

Disruptor was first implemented in Java and now also has C/C++ and .NET versions. The Java version is the most up to date and the fastest, and its code comments are the easiest to follow. With so much said, this section gives a test example to demonstrate basic Disruptor usage. The example uses LinkedBlockingQueue and then Disruptor for a single-producer, single-consumer exchange of simple objects, and measures the time each side takes, for reference only.

In this example, Disruptor 3.2.1 is used. Some of Disruptor's terms may change between versions. In this version, the elements in the buffer are called events, a pointer (a subscript into the buffer) is called a Sequence, the producer's pointer is RingBuffer.sequencer (a private member), and a consumer obtains its pointer through ringBufferInstance.newBarrier().

```java
import java.text.DecimalFormat;
import java.util.concurrent.LinkedBlockingQueue;

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.SequenceBarrier;
import com.lmax.disruptor.Sequencer;
import com.lmax.disruptor.YieldingWaitStrategy;

// Simple object: the element in the buffer; it holds a single value and provides a setter
class TestObj {
    public long value;
    public TestObj(long value) { this.value = value; }
    public void setValue(long value) { this.value = value; }
}

public class Test {
    // number of objects to produce
    final long objCount = 1000000;
    // buffer size (must be a power of 2!)
    final long bufSize;
    { bufSize = getRingBufferSize(objCount); }

    // round up to the next power of 2 for the RingBuffer size
    static long getRingBufferSize(long num) {
        long s = 2;
        while (s < num) { s <<= 1; }
        return s;
    }

    // test with LinkedBlockingQueue
    public void testBlockingQueue() throws Exception {
        final LinkedBlockingQueue<TestObj> queue = new LinkedBlockingQueue<TestObj>();
        Thread producer = new Thread(new Runnable() { // producer
            @Override public void run() {
                try {
                    for (long i = 1; i <= objCount; i++) {
                        queue.put(new TestObj(i)); // produce
                    }
                } catch (InterruptedException e) {}
            }
        });
        Thread consumer = new Thread(new Runnable() { // consumer
            @Override public void run() {
                try {
                    TestObj readObj = null;
                    for (long i = 1; i <= objCount; i++) {
                        readObj = queue.take(); // consume
                        // DoSomethingAbout(readObj);
                    }
                } catch (InterruptedException e) {}
            }
        });
        long timeStart = System.currentTimeMillis(); // measure time
        producer.start();
        consumer.start();
        consumer.join();
        producer.join();
        long timeEnd = System.currentTimeMillis();
        DecimalFormat df = (DecimalFormat) DecimalFormat.getInstance();
        System.out.println((timeEnd - timeStart) + "/" + df.format(objCount)
                + " = " + df.format(objCount / (timeEnd - timeStart) * 1000));
    }

    // test with RingBuffer
    public void testRingBuffer() throws Exception {
        // create a single-producer RingBuffer; the EventFactory pre-fills the buffer,
        // and YieldingWaitStrategy is the policy consumers use while waiting for data
        final RingBuffer<TestObj> ringBuffer = RingBuffer.createSingleProducer(
                new EventFactory<TestObj>() {
                    @Override public TestObj newInstance() { return new TestObj(0); }
                },
                (int) bufSize,
                new YieldingWaitStrategy());
        // create the consumer barrier
        final SequenceBarrier barrier = ringBuffer.newBarrier();
        Thread producer = new Thread(new Runnable() { // producer
            @Override public void run() {
                for (long i = 1; i <= objCount; i++) {
                    long index = ringBuffer.next();    // claim the next buffer Slot
                    ringBuffer.get(index).setValue(i); // assign a value to the claimed Slot
                    ringBuffer.publish(index);         // publish, so the consumer can read it
                }
            }
        });
        Thread consumer = new Thread(new Runnable() { // consumer
            @Override public void run() {
                TestObj readObj = null;
                long readCount = 0;
                long readIndex = Sequencer.INITIAL_CURSOR_VALUE;
                while (readCount < objCount) { // stop after reading objCount elements
                    try {
                        long nextIndex = readIndex + 1; // current read pointer + 1, the next position to read
                        long availableIndex = barrier.waitFor(nextIndex); // wait until that position is readable
                        while (nextIndex <= availableIndex) { // read everything available (Batch!)
                            readObj = ringBuffer.get(nextIndex); // fetch the object from the Buffer
                            // DoSomethingAbout(readObj);
                            readCount++;
                            nextIndex++;
                        }
                        readIndex = availableIndex; // remember how far we have read
                    } catch (Exception ex) {
                        ex.printStackTrace();
                    }
                }
            }
        });
        long timeStart = System.currentTimeMillis(); // measure time
        producer.start();
        consumer.start();
        consumer.join();
        producer.join();
        long timeEnd = System.currentTimeMillis();
        DecimalFormat df = (DecimalFormat) DecimalFormat.getInstance();
        System.out.println((timeEnd - timeStart) + "/" + df.format(objCount)
                + " = " + df.format(objCount / (timeEnd - timeStart) * 1000));
    }

    public static void main(String[] args) throws Exception {
        Test ins = new Test();
        // run the tests
        ins.testBlockingQueue();
        ins.testRingBuffer();
    }
}
```

Test results:

319/1,000,000 = 3,134,000 // LinkedBlockingQueue: 1,000,000 simple objects moved through memory in 319 ms, about 3.13 million objects per second

46/1,000,000 = 21,739,000 // Disruptor: 1,000,000 simple objects moved through memory in 46 ms, about 21.74 million objects per second

On average, Disruptor is roughly seven times faster. (Results will vary across machines and environments.)

4. Random thoughts: Disruptor, completion ports, and Mechanical Sympathy

When pushing performance like this, it starts to become important to take account of the way modern hardware is constructed.

                      —The LMAX Architecture

"When the pursuit of performance reaches this level, the understanding of the modern hardware structure becomes more and more important ." This statement properly describes the performance pursuit and failure of Disruptor/LMAX. Failed, failed? Why? Disruptor is of course an excellent framework. I am talking about failure refers to the LMAX executor trying to improve the efficiency of concurrent programs, optimize, use locks, or use other models during its development, but these attempts eventually failed-then they built the Disruptor. Another question: do Java Programmers need to know a lot about hardware when trying to improve the performance of their programs? I think many people will answer "no". In the process of building a Disruptor, the developer may not need to answer this question at first, but they decided to create a new path after the attempt fails. In general, let's take a look at the Disruptor design: locking to CAS, filling in buffer lines, and avoiding GC. I feel these designs are all deliberately "accommodating" or "dependent" hardware design, these designs are more like a "(ugly) hack" (there is no doubt that Disruptor is one of the best solutions ).

Disruptor reminds me of the completion port (IOCP), said to be the fastest concurrent network "framework" on Windows: you only use the API to tell Windows which sockets you want to recv on; each recv operation is then executed at the kernel level and its result added to a queue, and finally worker threads process the queue. Windows does most of the work for you: no context switches among a large number of threads, and no locks. Doesn't that sound like Disruptor? In the pursuit of performance, both the completion port and Disruptor avoid parallelism, locks, heavy multithreading, and similar concepts. Those concepts all exist for good reasons, which I will not go into here, but for performance (that is, with the hardware in mind) they cannot be used at full strength, which suggests that hardware and software have developed out of step in how they handle concurrency and parallelism; perhaps the von Neumann machine is simply best suited to processing information in a single sequential "thread". Regarding this mismatch, I used to think hardware should gradually adapt to software, but some people have proposed the interesting notion of Mechanical Sympathy (link [6]); how it will develop in the future is beyond the scope of this post :).

(End)

Link:

[1] Translated articles: http://ifeve.com/disruptor/ (original list: https://code.google.com/p/disruptor/wiki/BlogsAndArticles)

[2] Disruptor on GitHub: http://lmax-exchange.github.io/disruptor/

The PDF at http://lmax-exchange.github.io/disruptor/files/Disruptor-1.0.pdf gives a particularly good description of Disruptor.

[3] completion port: http://blog.csdn.net/piggyxp/article/details/6922277

[4] A tribute to Disruptor: an efficient (pseudo) lock-free blocking queue practice based on CAS: http://www.majin163.com/2014/03/24/cas_queue/

[5] Disruptor source code analysis: http://huangyunbin.iteye.com/blog/1944232

[6] Mechanical Sympathy: http://mechanical-sympathy.blogspot.com/

-------------------------------- (I am a split line )----------------------------------

PS: This is my 5th blog post (though the first four were not technical blogs). I will try to post more problems and thoughts in the future. My level is limited, and you are welcome to point out any shortcomings!


Java concurrency framework Disruptor example

Hi.baidu.com/..3e83b1

LongEvent.java

```java
public class LongEvent {
    private long value;
    public void set(long value) { this.value = value; }
}
```

LongEventFactory.java

```java
import com.lmax.disruptor.EventFactory;

public class LongEventFactory implements EventFactory<LongEvent> {
    public static final LongEventFactory INSTANCE = new LongEventFactory();
    public LongEvent newInstance() { return new LongEvent(); }
}
```

LongEventHandler.java

```java
import com.lmax.disruptor.EventHandler;

public class LongEventHandler implements EventHandler<LongEvent> {
    public void onEvent(LongEvent event, long sequence, boolean endOfBatch) {
        System.out.println("Event: " + event);
    }
}
```

LongEventProducer.java

```java
import java.nio.ByteBuffer;
import com.lmax.disruptor.RingBuffer;

public class LongEventProducer {
    private final RingBuffer<LongEvent> ringBuffer;
    public LongEventProducer(RingBuffer<LongEvent> ringBuffer) {
        this.ringBuffer = ringBuffer;
    }
    public void onData(ByteBuffer bb) {
        long sequence = ringBuffer.next(); // grab the next sequence
        try {
            LongEvent event = ringBuffer.get(sequence); // get the entry in the Disruptor for the sequence
            event.set(bb.getLong(0));
            // ...... (remainder truncated in the original)
```

How does an LMAX Disruptor work? (answers from Stack Overflow)

It can be used to replace a queue, and it also shares many features with SEDA and Actors. Compared with a queue: the Disruptor, like a BlockingQueue, can pass messages to other threads and wake them when needed. However, there are three major differences. 2. Putting a message into the Disruptor takes two steps: first, a slot is claimed in the ring buffer, which gives the user an Entry that can be filled with the appropriate data; then the entry must be committed. Two-phase access is necessary for the flexible use of memory mentioned above, and it is the commit that makes the message visible to consumer threads. 3. It is the consumer's responsibility to keep track of which messages have been consumed. Keeping this tracking out of the ring buffer reduces write contention, because each reader thread maintains its own counter.

Compared with Actors: the Actor model is the programming model closest to the Disruptor, especially when the BatchConsumer/BatchHandler classes are used. These classes hide all the complexity of maintaining the consumed sequence numbers and provide simple callbacks when important events occur. There are two small differences, though. 1. The Disruptor uses a one-thread-one-consumer model, whereas Actors use a many-to-many model: you can have as many actors as you like, distributed over some number of threads (generally one per core). 2. The BatchHandler interface provides an extra (and important) callback, onEndOfBatch(), which allows slow operations, such as I/O, to be batched up and executed together for better throughput. Other Actor frameworks can also batch, but almost none of them provide a callback for the end of a batch, so you can only use a timeout to decide whether the batch is complete, which results in poor latency.

Compared with SEDA: LMAX designed the Disruptor pattern to replace SEDA. 1. The main improvement over SEDA is the ability to do work in parallel. The Disruptor supports multicasting messages: the same messages are sent, in the same order, to multiple consumers. This avoids the need for fork stages in the pipeline. 2. Consumers can depend on the results of other consumers without needing another queuing layer between them: a consumer can simply watch the sequence number of the consumer it depends on. This avoids the need for join stages in the pipeline.

Compared with memory barriers: the Disruptor can be thought of as a structured, ordered memory barrier, with the producer barrier playing the role of the write barrier and the consumer barrier playing the role of the read barrier.

The second answer (answered Jul 16 '11 by irreputable): First, the programming model it offers. There are one or more writers and one or more readers. There is a line of entries from old to new (pictured left to right). Writers can add new entries on the right. Every reader reads entries sequentially from left to right; a reader obviously cannot read past the writers. There is no deletion of entries. I say "reader" instead of "consumer" to avoid the implication that entries are consumed; but we know that entries to the left of the last reader are now useless. Generally, readers read concurrently and independently, but we can declare dependencies between readers, and these dependencies can form any acyclic graph. If reader B depends on reader A, B cannot read past A. The dependency may arise because A annotates an entry and B depends on that annotation: for example, A performs some calculation on an entry and saves the result in field a of the entry; A then moves on, and B can read the entry, including the value of a. If reader C does not depend on A, C should not attempt to read a. Now to performance, LMAX's main goal: the Disruptor uses a pre-allocated ring of entries. The ring is large enough, but bounded, so it will never be filled beyond capacity; if the ring is full, the writers wait until the slowest readers advance and free up space. Entry objects are pre-allocated and live forever, to reduce garbage collection cost. We do not add new... (remainder truncated in the original)
