DelayedOperationPurgatory's Timer


Purgatory timeout detection

When a DelayedOperation times out, the timeout needs to be detected and the operation's callback invoked. That sounds simple, but doing it well is not easy.

The implementation in Kafka 0.8.x is straightforward but inefficient. In those versions, delayed requests implement the Delayed interface required by java.util.concurrent.DelayQueue. The requests are put into a DelayQueue, and a dedicated thread polls them out; any element that is polled out has expired.

The disadvantage is that inserting into and deleting from a DelayQueue costs O(log n), which becomes very CPU-intensive when there are many elements. Moreover, in this implementation a request that is satisfied is not immediately removed from the DelayQueue (removing a specific element costs O(n)); instead, after every certain number of requests, the queue is traversed and completed elements are purged in bulk, which is cheaper than removing specific elements individually. As a result, if the purge interval is not set properly, the broker can run out of memory (OOM).
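The 0.8.x scheme can be sketched roughly like this (the class and field names below are illustrative, not Kafka's actual code): each delayed request implements java.util.concurrent.Delayed so a DelayQueue can order it by deadline, and a dedicated thread blocks in take() until the head expires.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Illustrative stand-in for a 0.8.x-style delayed request.
class DelayedRequest implements Delayed {
    final String name;
    final long deadlineMs; // absolute expiration time

    DelayedRequest(String name, long delayMs) {
        this.name = name;
        this.deadlineMs = System.currentTimeMillis() + delayMs;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        // remaining delay; <= 0 means expired and eligible for take()/poll()
        return unit.convert(deadlineMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

public class DelayQueueTimer {
    public static void main(String[] args) throws InterruptedException {
        DelayQueue<DelayedRequest> queue = new DelayQueue<>();
        queue.add(new DelayedRequest("b", 30));
        queue.add(new DelayedRequest("a", 10));
        // take() blocks until the head has expired; each add/remove costs O(log n)
        System.out.println(queue.take().name); // "a" expires first
        System.out.println(queue.take().name); // then "b"
    }
}
```

Note that removing an arbitrary (satisfied but unexpired) element from the heap-backed queue requires a linear scan, which is exactly the weakness described above.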

In 0.9.0, Purgatory adopted a new timing-wheel-based implementation, described in the earlier blog post Kafka Purgatory (translation).

Let's look at how timeout detection is implemented in the 0.9.0 source code.

TimingWheel

The principle of the timing wheel is explained in the article "Amazing time wheel timer". The Kafka Purgatory Redesign Proposal (translation) also covers how it is realized.

Kafka's implementation is broadly similar to a general timing wheel, but adds some features specific to Kafka's usage, such as using a DelayQueue to drive the wheel and implementing each bucket as a doubly linked list.

The example in the comments

The comments in the TimingWheel class also describe its rationale and give an example to make it easier to understand. Let's walk through that example. At a quick glance, the logic of the comment itself has some problems (unless I have misunderstood it), and the example only illustrates the concept.

u represents the minimum time granularity, and n represents the size of the time wheel, that is, how many buckets a time wheel has. Let u = 1, n = 3, and let the start time be c. At this point, the buckets at different levels are the following:

Level    Buckets
1        [c,c]      [c+1,c+1]  [c+2,c+2]
2        [c,c+2]    [c+3,c+5]  [c+6,c+8]
3        [c,c+8]    [c+9,c+17] [c+18,c+26]

A bucket expires based on its start time. So at time c+1, [c,c], [c,c+2], and [c,c+8] have all expired.

The level 1 clock moves to c+1, and [c+3,c+3] is created.

The clocks of levels 2 and 3 stay at c, because they move in units of 3 and 9 respectively. Therefore, no new buckets are created at levels 2 and 3.

Note that level 2's [c,c+2] will never receive any task, because that interval is already covered by level 1. The same holds for level 3's [c,c+8], which is covered by level 2. This wastes some buckets, but it simplifies the implementation.

Level    Buckets
1        [c+1,c+1]  [c+2,c+2]  [c+3,c+3]
2        [c,c+2]    [c+3,c+5]  [c+6,c+8]
3        [c,c+8]    [c+9,c+17] [c+18,c+26]

At time c+2, [c+1,c+1] expires. The level 1 clock moves to c+2 and creates [c+4,c+4].

Level    Buckets
1        [c+2,c+2]  [c+3,c+3]  [c+4,c+4]
2        [c,c+2]    [c+3,c+5]  [c+6,c+8]
3        [c,c+8]    [c+9,c+17] [c+18,c+26]

At time c+3, [c+2,c+2] expires. Level 1 moves to c+3 and creates [c+5,c+5]; level 2 also moves to c+3 and creates [c+9,c+11].

Level    Buckets
1        [c+3,c+3]  [c+4,c+4]  [c+5,c+5]
2        [c+3,c+5]  [c+6,c+8]  [c+9,c+11]
3        [c,c+8]    [c+9,c+17] [c+18,c+26]

Problems to consider when designing TimingWheel

The overall design follows the proposal mentioned above, but some details need consideration.

First, some definitions must be made clear:

    • Time unit: the granularity of a TimingWheel; advancing one slot corresponds to one time unit of physical time. In the source code, the tickMs constructor parameter of TimingWheel determines its time unit.
    • Bucket: a bucket represents a time range; its end time - start time + 1 equals the time unit, and its start time is always an integer multiple of the time unit. A TimingWheel consists of a fixed number of buckets whose ranges are disjoint and together completely cover the range the TimingWheel currently represents.
    • Size: the number of buckets a TimingWheel contains, that is, the time to go around once divided by the time unit.
    • Current time: the start time of the bucket the TimingWheel pointer is pointing to. So current time is always an integer multiple of the time unit.

In fact, the concepts above could be defined differently, and the chosen definitions determine a specific implementation of TimingWheel. The definitions above are consistent with the behavior of TimingWheel in the source code.
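To make the definitions concrete, here is a small sketch (hypothetical code, not from Kafka) that prints the bucket ranges of a wheel with tickMs = 2 and wheelSize = 3, whose current time has been rounded down to a multiple of the tick:

```java
public class WheelDefinitions {
    // start/end of the i-th bucket, counting from the bucket at currentTime
    static long[] bucket(long currentTime, long tickMs, int i) {
        long start = currentTime + i * tickMs;
        return new long[] { start, start + tickMs - 1 }; // end - start + 1 == tickMs
    }

    public static void main(String[] args) {
        long tickMs = 2, wheelSize = 3;
        long now = 7;
        long currentTime = now - (now % tickMs); // always an integer multiple of the time unit -> 6
        for (int i = 0; i < wheelSize; i++) {
            long[] b = bucket(currentTime, tickMs, i);
            System.out.println("[" + b[0] + "," + b[1] + "]");
        }
        // prints [6,7], [8,9], [10,11]: disjoint ranges covering the wheel's whole span
    }
}
```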

Does the bucket at current time expire?

With these concepts in place, we can settle a point that is somewhat confusing in the comments mentioned above: is the bucket that current time points to considered expired? This also determines whether, when the TimingWheel ticks, the newly pointed-to bucket expires or the previously pointed-to one does. In a hierarchical timing wheel, when a higher-level bucket expires, the elements inside it need to be reinserted into a lower-level wheel to preserve the timing accuracy of the overall timer (the timing wheels together form a timer). This implies that a bucket's expiration time is its start time. Combined with the definition of current time, it follows that the bucket a TimingWheel currently points to is already an expired bucket; that is, when the wheel advances, the bucket it newly points to becomes expired.

In Kafka, buckets correspond to the TimerTaskList class.

When does overflow happen?

A timing wheel can only host a bounded time range, determined by several parameters:

    • currentTime: the current time
    • tickMs: the size of each bucket; the unit of physical time in Kafka is milliseconds, so tickMs is how many milliseconds a bucket spans
    • wheelSize: how many buckets this timing wheel has

Based on these parameters, the expiration times of requests that a timing wheel can hold are distributed in [currentTime, currentTime + tickMs * wheelSize - 1], inclusive of both ends.

When a request is added to a timing wheel and its expiration time exceeds the right end of this range, it needs to overflow to a higher-level timing wheel.
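As a sketch (illustrative names, not Kafka's code), the overflow test follows directly from this range:

```java
public class OverflowCheck {
    // true if the expiration lies outside [currentTime, currentTime + tickMs * wheelSize - 1]
    static boolean overflows(long expirationMs, long currentTimeMs, long tickMs, long wheelSize) {
        long interval = tickMs * wheelSize;
        return expirationMs >= currentTimeMs + interval;
    }

    public static void main(String[] args) {
        // wheel: tickMs = 1, wheelSize = 4, currentTime = 12 -> covers [12, 15]
        System.out.println(overflows(15, 12, 1, 4)); // false: still fits in this wheel
        System.out.println(overflows(16, 12, 1, 4)); // true: goes to the overflow wheel
    }
}
```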

Putting elements into the appropriate bucket according to expiration time

The key to this operation is controlling its time complexity. It can be divided into two steps (a bit like the classic joke about the steps for putting an elephant into a refrigerator):

    1. Find the right bucket
    2. Put the item in the bucket

Find the right bucket

The number of buckets per timing wheel is fixed, so an array is a good fit for holding them: arrays support direct addressing. Kafka's implementation does indeed use an array.

So, given an item whose expiration time is known (and assuming no overflow), which bucket does it belong in?

If we partition the entire time axis into buckets of this timing wheel's tickMs, with the first bucket starting at 0 milliseconds of the Unix epoch, then the (global) number of the bucket an item belongs to can be computed from its expiration time: expirationTime / tickMs.

For example, suppose the current buckets in the timing wheel are numbered 0, 1, 2, 3, and currentTime is 0.

After a tick, bucket 0 can be reused, and we can naturally put global bucket 4 there. At this point, the wheel's bucket numbers are 4, 1, 2, 3, which still satisfies the range boundary mentioned earlier. After another tick, they become 4, 5, 2, 3. In effect, the array slot at a given index can hold any global bucket numbered index + n * wheelSize (n a non-negative integer).

This yields the formula for the array index of the bucket for a given expirationTime: (expirationTime / tickMs) % wheelSize.

    // Put in its own bucket
    val virtualId = expiration / tickMs
    val bucket = buckets((virtualId % wheelSize.toLong).toInt)
    bucket.add(timerTaskEntry)

    // Set the bucket expiration time
    if (bucket.setExpiration(virtualId * tickMs)) {
      queue.offer(bucket)
    }

Here we can also see that a bucket's expirationTime is virtualId * tickMs, which is its start time.
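A worked example of the index formula (a sketch, not Kafka's code): with tickMs = 1 and wheelSize = 4, expiration times 4 through 8 map to array slots 0, 1, 2, 3, 0, matching the bucket-reuse walkthrough above.

```java
public class BucketIndex {
    static int index(long expirationMs, long tickMs, long wheelSize) {
        long virtualId = expirationMs / tickMs; // global bucket number since the Unix epoch
        return (int) (virtualId % wheelSize);   // slot in the fixed-size bucket array
    }

    public static void main(String[] args) {
        for (long expiration = 4; expiration <= 8; expiration++) {
            System.out.println("expiration=" + expiration + " -> index=" + index(expiration, 1, 4));
        }
        // indexes printed: 0, 1, 2, 3, 0
    }
}
```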

Note that this formula is consistent with our earlier definitions only if every element in each timing wheel is guaranteed to have its expiration time in [currentTime, currentTime + tickMs * wheelSize - 1].

Put the item in the bucket

In Kafka, buckets are implemented as doubly linked lists, so inserting an element costs O(1).
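A minimal sketch of such a bucket (illustrative, not Kafka's TimerTaskList): with a sentinel node, adding is O(1), and removing an entry we already hold a reference to is also O(1), which is what makes cancelling or completing an operation cheap.

```java
public class Bucket {
    static class Entry {
        String task;
        Entry prev, next;
        Entry(String task) { this.task = task; }
    }

    final Entry root = new Entry(null);   // sentinel node
    { root.prev = root; root.next = root; }

    void add(Entry e) {                   // O(1): splice in before the sentinel
        e.prev = root.prev;
        e.next = root;
        root.prev.next = e;
        root.prev = e;
    }

    void remove(Entry e) {                // O(1): unlink without searching
        e.prev.next = e.next;
        e.next.prev = e.prev;
        e.prev = null;
        e.next = null;
    }

    public static void main(String[] args) {
        Bucket b = new Bucket();
        Entry e1 = new Entry("a"), e2 = new Entry("b");
        b.add(e1);
        b.add(e2);
        b.remove(e1);                     // "cancel" a without traversing the list
        System.out.println(b.root.next.task); // prints "b"
    }
}
```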

The operation of the timer

The conventions above only define a data structure and its rules of operation. The timing wheel does not advance with physical time by itself; that is, it needs an external drive.

How to drive it

For a timing wheel, driving means moving its current time forward. As the current time changes, buckets expire and can be taken out of the wheel (synchronization must be considered here: ticking requires holding the timing wheel's lock). Changing the current time corresponds, in Kafka's implementation, to the advanceClock method of the timing wheel.

Now two issues need to be addressed:

  1. How to advance the TimingWheel according to physical time. The Purgatory redesign proposal mentioned earlier describes a simple approach: a thread wakes up periodically to drive the TimingWheel forward. The downside is that when the elements in the wheel (more precisely, its non-empty buckets) are sparse, periodically waking the thread just to check is a waste. So Kafka wakes the thread "on demand" via a DelayQueue. Each bucket implements the Delayed interface, and getDelay (which reports the remaining delay) is based on the bucket's expiration time, i.e. its start time. The ExpiredOperationReaper thread blocks itself in the DelayQueue's poll method (for up to 200 milliseconds in the current implementation; since the default lowest-level wheel's tickMs is 1ms, this timeout is acceptable). When a bucket expires, the thread takes it out of the DelayQueue. The doWork method of Kafka's ExpiredOperationReaper (called repeatedly by the thread) looks like this:
    override def doWork() {
      timeoutTimer.advanceClock(200L)
      // Trigger a purge if the number of completed but still watched operations is large
      if (estimatedTotalOperations.get - delayed > purgeInterval) {
        estimatedTotalOperations.getAndSet(delayed)
        debug("Begin purging watch lists")
        val purged = allWatchers.map(_.purgeCompleted()).sum
        debug("Purged %d elements from watch lists.".format(purged))
      }
    }

    It calls the timer's advanceClock method to change the timer's current time. Well, not quite: it does a lot more than change the current time. Changing the timer's current time must go hand in hand with processing timed-out buckets. Also, advanceClock(200L) does not mean "advance the timing wheel by 200ms"; it passes in a 200ms poll timeout. The timer's advanceClock method actually looks like the following; a name like "processExpiredBucketsAndAdvanceClock" would arguably be more accurate:

    def advanceClock(timeoutMs: Long): Boolean = {
      var bucket = delayQueue.poll(timeoutMs, TimeUnit.MILLISECONDS) // take an expired bucket, blocking up to timeoutMs (200ms in the current implementation)
      if (bucket != null) {
        writeLock.lock() // hold the write lock (of a ReentrantReadWriteLock) while processing expired buckets
        try {
          while (bucket != null) {
            timingWheel.advanceClock(bucket.getExpiration()) // set the wheel's current time to this expired bucket's expiration time
            bucket.flush(reinsert)                           // process the bucket's elements
            bucket = delayQueue.poll()                       // keep draining other expired buckets without blocking
          }
        } finally {
          writeLock.unlock() // release the lock
        }
        true
      } else {
        false
      }
    }

    The confusing point here is the line timingWheel.advanceClock(bucket.getExpiration()), which sets the timing wheel's current time to the bucket's expiration time. Why this time? A bucket's expiration time is its start time, and poll always takes out the most recently expired bucket, so the polled bucket's expiration time is close to the current physical time. After this assignment, once the writeLock is released, the timer can insert new elements into the appropriate buckets based on their physical expiration times. If the wheel's current time lagged too far behind, the timer's timing would be inaccurate (the timer's accuracy is analyzed later).

  2. How are expired buckets handled? Since a Kafka bucket is an element of an array, unless all elements inside it are copied out upon expiration, the expiration callbacks would have to run, and the bucket be emptied, while holding the timing wheel's lock, before releasing it. If the thread driving the timer also executed the elements' callbacks (such as sending a response), it could hold the lock for a long time and could not advance the timer until each element finished; since processing time is unbounded, the timer would become inaccurate. So Kafka uses a separate thread pool to execute the callbacks.
    private def addTimerTaskEntry(timerTaskEntry: TimerTaskEntry): Unit = {
      if (!timingWheel.add(timerTaskEntry)) {
        // already expired or cancelled
        if (!timerTaskEntry.cancelled)
          taskExecutor.submit(timerTaskEntry.timerTask)
      }
    }

    When timingWheel.add returns false, the TimerTaskEntry (the bucket element that holds a TimerTask; DelayedOperation extends TimerTask) has either expired or been cancelled. If it has expired, the TimerTask it holds is submitted to the taskExecutor ExecutorService (TimerTask implements the Runnable interface). For a DelayedOperation, this means its forceComplete callback is executed; for DelayedFetch and DelayedProduce, this generates and sends a response.

    Expired buckets fall into two categories: (1) the items inside have expired and need to be submitted to the taskExecutor; (2) the bucket belongs to a higher-level timing wheel, so its elements need to be placed into a lower-level wheel. Kafka abstracts both cases uniformly: a single reinsert method completes the processing above, and reinsert calls the addTimerTaskEntry method shown earlier.
    private[this] val reinsert = (timerTaskEntry: TimerTaskEntry) => addTimerTaskEntry(timerTaskEntry)

    The addTimerTaskEntry method invokes TimingWheel's add method, which acts differently depending on the state of the task entry:

    def add(timerTaskEntry: TimerTaskEntry): Boolean = {
      val expiration = timerTaskEntry.timerTask.expirationMs
      if (timerTaskEntry.cancelled) {
        // Cancelled
        false
      } else if (expiration < currentTime + tickMs) {
        // Already expired
        false
      } else if (expiration < currentTime + interval) {
        // Put in its own bucket
        ...
        true
      } else {
        // Out of the interval. Put it into the parent timer
        if (overflowWheel == null) addOverflowWheel()
        overflowWheel.add(timerTaskEntry)
      }
    }

The accuracy of the timer

Ideally, the timer should detect a TimerTask's timeout exactly at its expiration time and then execute the callback. Is that actually the case? It depends on the following points:

    1. The timer is driven by the logic described above, rather than by direct calls to the JDK's time-related methods (like Thread.sleep). So we need to check whether that logic keeps the timer accurate.
    2. The precision of the timer itself, which is tickMs.

Suppose a bucket is polled out at physical time t, and t exactly equals the bucket's expiration time, i.e. its start time. A bucket's expiration time is determined by TimingWheel's add method:

    val virtualId = expiration / tickMs
    val bucket = buckets((virtualId % wheelSize.toLong).toInt)
    bucket.add(timerTaskEntry)

    // Set the bucket expiration time
    if (bucket.setExpiration(virtualId * tickMs)) {
      queue.offer(bucket)
    }

So the expiration time of the bucket a TimerTask is put into is determined by the TimerTask's own expiration, which is a physical moment. Thus a TimerTask is always polled out at the start time of the global bucket it belongs to (the bucket numbered virtualId, counting from 0 at the Unix epoch). But being polled out does not mean the TimerTask's callback will be executed; the actual execution time depends on what happens next.

Elements polled out of a bucket go through Timer#reinsert, then Timer#addTimerTaskEntry, then TimingWheel#add, which decides what to do with them. This process was described earlier; the key point is that only elements the add method considers expired are submitted to the thread pool for execution, and add judges expiration based on currentTime.

    val expiration = timerTaskEntry.timerTask.expirationMs
    if (timerTaskEntry.cancelled) {
      // Cancelled
      false
    } else if (expiration < currentTime + tickMs) {
      // Already expired
      false
    } else if (expiration < currentTime + interval) {
      // Put in its own bucket
      ...

Here lies the problem of inconsistency between physical time and the timer's local time, that is, the difference between expiration and currentTime.

currentTime is determined when a bucket is polled out, in the timer's advanceClock method mentioned earlier:

    if (bucket != null) {
      writeLock.lock()
      try {
        while (bucket != null) {
          timingWheel.advanceClock(bucket.getExpiration())
          bucket.flush(reinsert)
          bucket = delayQueue.poll()
        }
        ...

So the timer's local time always lags behind physical time; how much it lags depends on the gap between when poll returns and the bucket's expiration time, as well as on when the write lock is acquired. Note that this timingWheel is the lowest-level TimingWheel, and its advanceClock method updates the currentTime of all the wheels above it as well.

By the time reinsert runs, the TimingWheel's currentTime has already been updated, and TimerTaskList's flush method applies reinsert to all TimerTasks in the list. flush does not check each TimerTask's timeout individually, because they all belong to the same tick; this is the error inherent in tickMs.

The question now is: if what gets polled out is a bucket from a higher-level timing wheel, will the subsequent processing introduce error?

Assume the lowest-level TimingWheel's tickMs is 1ms and its wheelSize is 4; then the second level's tickMs is 4. Let the second level's wheelSize be 3, so the third level's tickMs is 12.
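The level arithmetic can be sketched as follows (hypothetical code): each overflow wheel's tickMs equals the full interval (tickMs * wheelSize) of the wheel below it.

```java
public class WheelHierarchy {
    // tickMs of each level, given the base tick and each level's wheelSize
    static long[] levelTicks(long baseTickMs, long[] wheelSizes) {
        long[] ticks = new long[wheelSizes.length];
        long tick = baseTickMs;
        for (int i = 0; i < wheelSizes.length; i++) {
            ticks[i] = tick;
            tick = tick * wheelSizes[i]; // the next level's tick is this level's interval
        }
        return ticks;
    }

    public static void main(String[] args) {
        long[] ticks = levelTicks(1, new long[] { 4, 3, 3 });
        // level 1 ticks at 1ms, level 2 at 4ms, level 3 at 12ms, as in the example above
        System.out.println(ticks[0] + ", " + ticks[1] + ", " + ticks[2]);
    }
}
```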

1. Suppose the lowest-level TimingWheel lags 2ms behind physical time.

Suppose the polled-out buckets are a, b, c, with expiration times 4, 5, 6 (physical time) respectively. reinsert executes at physical time 6, while the TimingWheel's currentTime is 4.

So when reinsert executes, all the elements in each of these buckets are judged to have expired, which is correct.

2. Suppose the lowest-level TimingWheel lags behind physical time by more than its higher levels' tickMs, say 8ms.

Suppose the polled-out bucket belongs to the third level; when reinsert runs, the physical time is 21, and the bucket was polled out at physical time 13, so the currentTime of all three levels of wheels was set to 12.

The elements a, b, c polled out of this bucket expire at 14, 20, 22 respectively. The lowest-level TimingWheel can hold expiration times in [12, 15], so when these elements are re-added through its add method, none of them is immediately submitted to the taskExecutor thread pool: a (14) is reinserted into the lowest-level wheel, while b and c exceed the lowest level's overflow threshold of 15. The second-level TimingWheel can manage TimerTasks in the range [12, 23], so b and c are handed to the second-level wheel.

But in any case, each element is always added to the right bucket (one whose expiration time matches the TimerTask's), and that bucket will be polled out in the next poll (or after a few more).

As long as the lowest-level TimingWheel's currentTime does not stay fixed at one value forever, an expired TimerTask is bound to be submitted for execution eventually. The time at which it runs depends on how many elements expired before it, so it will not be postponed indefinitely (starved), although it may still be shuffled between TimingWheel levels after it expires.

In summary, the actual time at which a TimerTask is executed depends on:

    1. The error introduced by tickMs. Because of tickMs, a TimerTask may be assigned to a bucket whose expiration time is smaller than its own. But DelayedOperationPurgatory sets tickMs to 1ms, so this error is negligible.
    2. The delay of the reaper thread polling elements from the DelayQueue and the delay of processing the polled elements. These are largely uncontrollable, and cause a TimerTask to be submitted some time after its expiration. GC pauses and CPU scheduling affect this delay. One important factor is the time to acquire the writeLock: only the reaper thread tries to acquire the write lock, so it competes only with threads acquiring the read lock, and the read lock is acquired every time a TimerTask is added to the timer. As long as this ReentrantReadWriteLock is a fair lock, that is fine. But Kafka did not set the ReentrantReadWriteLock to fair mode in the timer implementation:
        // Locks used to protect data structures while ticking
        private[this] val readWriteLock = new ReentrantReadWriteLock()
      Therefore, if lock contention is heavy, the reaper might in theory fail to acquire the write lock for a long time, which could cause the JVM to OOM.
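For reference, fairness is just a constructor flag on java.util.concurrent.locks.ReentrantReadWriteLock; the sketch below (not Kafka's code) shows the difference. A fair lock queues the reaper's write attempt ahead of later read attempts, at some throughput cost, which is presumably why Kafka left it non-fair.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairLockDemo {
    public static void main(String[] args) {
        ReentrantReadWriteLock nonFair = new ReentrantReadWriteLock();  // default: non-fair, as in the timer
        ReentrantReadWriteLock fair = new ReentrantReadWriteLock(true); // fair mode: roughly arrival-order acquisition
        System.out.println(nonFair.isFair()); // false
        System.out.println(fair.isFair());    // true
    }
}
```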

Summary

In short, when load is light (the reaper can quickly take elements from the DelayQueue, contention on the read/write lock is mild, and the reaper thread gets enough CPU time), the timer is quite accurate. Under very high throughput, it is hard to say. Moreover, because the read/write lock is unfair, TimerTasks may be inserted faster than they are removed, resulting in OOM.

Some of this analysis may not be correct; readers are welcome to point out mistakes.

