This article comes from the NetEase Cloud Community.
Time Wheel Implementation
A time wheel is a circular data structure divided into multiple slots.
Each slot represents a fixed period of time; the shorter that period, the higher the wheel's precision.
Each slot holds a list of the tasks that expire within it.
As time passes, a pointer advances around the wheel, and the expired tasks in the slot it reaches are executed.
Terminology:
- Slot: a cell in the circular structure that stores delayed tasks
- Pointer: points at the slot currently being processed, i.e. the current time
- Slot count: the number of slots in the time wheel
- Tick interval: the span of time each slot represents, which determines the wheel's precision
- Total span: slot count * tick interval, the full range of time the wheel can express
Single-Layer Time Wheel
For example, if each slot represents 1 second, the whole wheel covers a span of 8 s. If the pointer is currently at slot 2 and a task must run 3 s later, the task is placed in slot 5 (2 + 3); after the pointer advances 3 ticks, the task is executed.
The problem with a single-layer time wheel:
The number of slots is finite, so the span it can represent is limited. What happens when a task expires 10 s from now? The wheel overflows.
One solution is to store a round count with each task in the slot's list.
For a task due in 10 s, the round is 10 / 8 = 1 and the slot is 10 % 8 = 2, so the task goes into slot 2 with round = 1.
When the pointer reaches a slot, only tasks whose round is 0 are executed; the round of every other task in the list is decremented by 1.
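The round-and-slot arithmetic described above can be sketched as follows; this is a minimal illustration, and the names (`RoundWheel`, `slotFor`, `roundsFor`) are hypothetical rather than taken from any library:

```java
// Round-based single-layer wheel: a task due in `delayTicks` goes into
// slot (current + delay) % WHEEL_SIZE and must wait delay / WHEEL_SIZE
// full rounds before it may fire.
public class RoundWheel {
    static final int WHEEL_SIZE = 8; // 8 slots, 1 s per slot

    // Slot index where the task is stored.
    static int slotFor(int currentSlot, int delayTicks) {
        return (currentSlot + delayTicks) % WHEEL_SIZE;
    }

    // Number of complete rounds the pointer must make before the task fires.
    static int roundsFor(int delayTicks) {
        return delayTicks / WHEEL_SIZE;
    }
}
```

With the pointer at slot 2 and a 3 s delay, `slotFor(2, 3)` is 5 and `roundsFor(3)` is 0; a 10 s delay from slot 0 gives slot 2 with round 1, matching the example above.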
The problem with the round-based single-layer wheel:
If tasks span a wide range of delays and are numerous, rounds become large and the list in a single slot becomes very long; every tick then scans a long list and most of the checks are wasted. What then?
Layered Time Wheel
Tasks are only ever executed from the lowest wheel; tasks in the higher wheels are repeatedly demoted into the next lower wheel as they approach expiration.
Each wheel in a layered time wheel has its own slot count and tick interval; when the lowest wheel completes a full round, the wheel one level up advances by one slot.
Layering greatly extends the range of time that can be represented while reducing space consumption.
As an example:
A three-layer wheel with 8 slots per layer can express 8 * 8 * 8 = 512 s. A single-layer wheel would need 512 slots for the same range, while the layered wheel needs only 8 + 8 + 8 = 24. A layered wheel designed to cover one day could use three wheels of 24, 60, and 60 slots (hours, minutes, seconds).
Working principle:
There are two ways to advance the pointers:
- Each wheel advances on its own interval (the second wheel moves one slot per second, the minute wheel one slot per minute, the hour wheel one slot per hour)
- The lower wheel drives the one above it (a full round of the second wheel advances the minute wheel by one slot; a full round of the minute wheel advances the hour wheel by one slot)
When the pointer lands on a slot, there are two cases:
- In the lowest wheel, every task in that slot's list has expired and is executed
- In any higher wheel, the tasks in the slot are moved down into the finer-grained wheel below, e.g. from the hour wheel into the minute wheel
As an example (three wheels of 8 slots each; 1 s, 8 s, and 64 s per slot):
- Add a task to execute after 5 s
- The task belongs in slot 5 of the second wheel
- After the second-wheel pointer advances 5 ticks, the task is executed
- Add a task to execute after 50 s
- Its delay overflows the second wheel
- 50 / 8 = 6, so the task is saved in slot 6 of the minute wheel
- After the second wheel completes 6 rounds (6 * 8 s = 48 s), the minute-wheel pointer reaches slot 6
- The tasks in that slot are demoted to the second wheel; since 50 % 8 = 2, the task moves to slot 2 of the second wheel
- After the second-wheel pointer advances 2 more ticks (50 s in total), the task is executed
- Add a task to execute after 250 s
- Its delay overflows the minute wheel
- 250 / 8 / 8 = 3, so the task is saved in slot 3 of the hour wheel
- After the minute wheel completes 3 rounds (3 * 64 s = 192 s), the hour-wheel pointer reaches slot 3
- The tasks in that slot are demoted to the minute wheel; since (250 - 192) / 8 = 7, the task moves to slot 7 of the minute wheel
- After the second wheel completes 7 rounds (7 * 8 s = 56 s), the minute-wheel pointer reaches slot 7
- The tasks in that slot are demoted to the second wheel; since 250 - 192 - 56 = 2, the task moves to slot 2 of the second wheel
- After the second-wheel pointer advances 2 more ticks, the task is executed
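The initial-placement arithmetic used in this walk-through can be condensed into one function; the class and constant names are illustrative, and the `{level, slot}` return convention is an assumption of this sketch:

```java
// Where a task first lands in a three-layer wheel: 8 slots per layer,
// with 1 s, 8 s, and 64 s per slot (spans of 8 s, 64 s, and 512 s).
// Assumes delaySeconds < 512, the wheel's total span.
public class LayeredPlacement {
    static final int SECOND_SPAN = 8;  // second wheel covers 8 s
    static final int MINUTE_SLOT = 8;  // 8 s per minute-wheel slot
    static final int MINUTE_SPAN = 64; // minute wheel covers 64 s
    static final int HOUR_SLOT = 64;   // 64 s per hour-wheel slot

    // Returns {level, slot}: level 0 = second wheel, 1 = minute, 2 = hour.
    static int[] place(int delaySeconds) {
        if (delaySeconds < SECOND_SPAN)
            return new int[] {0, delaySeconds};
        if (delaySeconds < MINUTE_SPAN)
            return new int[] {1, delaySeconds / MINUTE_SLOT};
        return new int[] {2, delaySeconds / HOUR_SLOT};
    }
}
```

This reproduces the examples above: 5 s lands in slot 5 of the second wheel, 50 s in slot 6 of the minute wheel, and 250 s in slot 3 of the hour wheel.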
Advantages:
- High performance: inserting and deleting tasks is O(1), while DelayQueue, which keeps its elements sorted, pays O(log n) for insertion and removal
Disadvantages:
- Data lives in memory; persistence must be handled separately
- No built-in distribution; high availability must be built on top
- The maximum task delay is limited by the wheel's total span
Tasks beyond the wheel's range can be parked in a buffer (a queue, Redis, or a database); each time the top-level wheel advances one slot, tasks whose remaining delay now fits within the wheel's range are moved out of the buffer and into the wheel.
For example:
- Add task A, to execute after 600 s
- Its delay overflows the time wheel (range 512 s)
- So the task is saved to the buffer queue
- Each time the hour wheel advances one slot, tasks in the buffer whose remaining delay fits the range fall into the wheel
- Every task's remaining delay in the buffer is reduced by 64 s; task A drops to 600 - 64 = 536 s, still beyond the wheel's range, so it stays in the queue
- After the hour wheel advances one more slot, task A drops to 536 - 64 = 472 s, which is within range, so it falls into the hour wheel
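The buffer check above can be sketched as a small function; the names and the loop formulation are illustrative, not part of any particular implementation:

```java
// Overflow buffer: each advance of the top wheel (64 s here) shrinks
// every buffered task's remaining delay by 64 s; a task is admitted to
// the wheel once its remaining delay fits within the 512 s span.
public class OverflowBuffer {
    static final int WHEEL_SPAN = 512; // total range of the layered wheel
    static final int TOP_TICK = 64;    // seconds per top-wheel slot

    // Number of top-wheel ticks until a task with this delay leaves the buffer.
    static int ticksUntilAdmitted(int delaySeconds) {
        int ticks = 0;
        while (delaySeconds > WHEEL_SPAN) {
            delaySeconds -= TOP_TICK;
            ticks++;
        }
        return ticks;
    }
}
```

For task A, `ticksUntilAdmitted(600)` is 2: after one tick the remaining delay is 536 s (still too large), after a second it is 472 s and the task falls into the wheel.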
Earlier Design (DB / DelayQueue / ZooKeeper)
The scheduling system exposes task-operation interfaces for business systems to submit tasks, cancel tasks, report execution results, and so on.
For Dubbo calls, the callback is abstracted into a JobCallbackService interface, which the business system implements and registers as a service.
Overall architecture
Database:
- Responsible for saving all the task data
Memory Queue:
- The actual triggering of delayed tasks is driven by DelayQueue
- Holds at most 1000 tasks, only those expiring within the next N minutes
ZooKeeper:
- Coordinates the scheduling cluster
- Stores scheduling-node information
- Stores each node's shard assignment
Master node:
- Re-shards the data when nodes join or leave the cluster
Scheduling node:
- Provides Dubbo and HTTP interfaces for business systems to submit tasks, cancel tasks, report execution results, etc.
- Obtains its shard assignment from the ZooKeeper registry, then pulls soon-to-expire tasks from the database into the DelayQueue
- Invokes the callback service registered by the business system to initiate a dispatch
- Receives feedback from business systems, updates execution results, and removes tasks or initiates retries
Business System:
- Implements the callback interface JobCallbackService and registers it as a Dubbo service provider
- Calls the scheduling system's interfaces to operate on tasks wherever delayed tasks are needed
Database design
Table description
- job_callback_service: service configuration table; stores the business callback services, including protocol, callback service, and retry count
- job_delay_task: delayed-task table; stores the delayed tasks, including shard number, callback service, total call count, failure count, task status, callback parameters, etc.
- job_delay_task_execlog: execution-log table; records every callback the scheduling system initiates
- job_delay_task_backlog: scheduling-result table; records each task's final status and related information
Master-Slave switching
Using ZooKeeper's ephemeral sequential nodes, the node with the smallest sequence number becomes the master and all other nodes are slaves.
The master watches the cluster state and re-shards the data whenever that state changes.
Each slave watches the sibling node whose sequence number is immediately below its own; when that sibling changes, the slave finds its new predecessor and re-establishes the watch.
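The election rule can be sketched as a pure function over the node names; `ElectionRule` and the node-name format are illustrative, and a real implementation would set ZooKeeper watches rather than compute over a list:

```java
// Election rule: among ZooKeeper ephemeral sequential nodes, the smallest
// sequence number is the master; every other node watches its immediate
// predecessor so that only one watcher fires when a node disappears.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ElectionRule {
    // ZooKeeper's zero-padded sequence suffixes sort numerically as strings.
    static String master(List<String> nodes) {
        return Collections.min(nodes);
    }

    // The node that `self` should watch; the master ("" result) watches nobody.
    static String predecessorToWatch(List<String> nodes, String self) {
        List<String> sorted = new ArrayList<>(nodes);
        Collections.sort(sorted);
        int idx = sorted.indexOf(self);
        return idx <= 0 ? "" : sorted.get(idx - 1);
    }
}
```

Watching only the predecessor (rather than the master) avoids a herd of watchers all firing when one node goes down.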
Data sharding
Task status
- Delay: the initial state after a delayed task is submitted
- Ready: the expiration time has passed and the message has been pushed into the ready queue
- Running: the business subscriber has received the message and started processing
- Finished: business processing succeeded
- Failed: business processing failed
Main process
Service Loading
- Reads the service configurations from the DB
- Dynamically builds a consumer object for each configuration and registers it in the Spring container
Submit a Task
- Business systems submit tasks via the Dubbo or HTTP interface
- Determine whether the task's expiration time falls within one scan cycle
- If it does:
  - Set the shard number (picked at random from the shards owned by the current node)
  - Add the task to the in-memory queue
  - Save the task to the job_delay_task table
- If it does not:
  - Set the shard number (computed from the total shard count with a random algorithm)
  - Save the task to the job_delay_task table
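The branch above reduces to a single predicate; the names and the millisecond units are assumptions of this sketch:

```java
// Submit-path decision: a task expiring within one scan cycle goes into
// the in-memory DelayQueue (as well as the DB); a later task is only
// persisted and will be picked up by a future scan.
public class SubmitPath {
    static boolean goesToMemoryQueue(long expireAtMillis, long nowMillis,
                                     long scanIntervalMillis) {
        return expireAtMillis - nowMillis <= scanIntervalMillis;
    }
}
```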
Timer
- Run by a single thread
- The timer's period is the configured scan interval
- Each cycle computes a cutoff expiration time X-delay from the current time and the scan interval
- Loads from the DB all tasks whose expiration time falls before X-delay and puts them into the DelayQueue
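A minimal sketch of this trigger mechanism using java.util.concurrent.DelayQueue; the `DelayedTask` class and its fields are illustrative, and the real system loads tasks from the job_delay_task table rather than creating them in memory:

```java
// DelayQueue delivers elements only once their delay has elapsed, and
// always the element with the smallest remaining delay first.
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class DelayQueueSketch {
    static class DelayedTask implements Delayed {
        final long taskId;
        final long fireAtMillis; // absolute expiration time

        DelayedTask(long taskId, long delayMillis) {
            this.taskId = taskId;
            this.fireAtMillis = System.currentTimeMillis() + delayMillis;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(fireAtMillis - System.currentTimeMillis(),
                                TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                                other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DelayQueue<DelayedTask> queue = new DelayQueue<>();
        queue.put(new DelayedTask(2, 50)); // expires in 50 ms
        queue.put(new DelayedTask(1, 10)); // expires in 10 ms
        // take() blocks until the earliest task expires, so task 1 comes first.
        System.out.println(queue.take().taskId); // prints 1
        System.out.println(queue.take().taskId); // prints 2
    }
}
```

The scheduling threads described below simply block on `take()`, so no polling is needed between expirations.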
Scheduling Tasks
- Handled by a thread pool
- All threads block on DelayQueue.take()
- When a task is taken, fetch it from the DB and check whether it still exists
- If not, do nothing (the task has already succeeded or been deleted)
- If it exists, check whether the call count exceeds the configured limit
- If not exceeded:
  - Read the callback service configuration from the task
  - Get the corresponding consumer object from the Spring container
  - Invoke the business callback service asynchronously
  - Set the next retry time and record the call in job_delay_task_execlog
- If exceeded, move the task to job_delay_task_backlog
Task Feedback
- Update the task's invocation result
Advantages
- Full-featured: highly available, easy to scale, supports retries
Disadvantages
- Somewhat complex
- Service configurations must be turned into consumer objects dynamically
- Adding a new service requires notifying every scheduling node to refresh
- There is coupling (the scheduler invokes business services directly, coupling it to their protocols); what if a new business system uses Thrift?
- The scheduler handles task retries itself
- Because the scheduler calls business services directly, an unavailable service can trigger blind retries, and traffic is hard to control (the scheduler does not know the business service's capacity)
What if we introduce MQ? MQ can decouple the protocol from the service invocation, guarantee task retry, and let consumers throttle themselves based on their own processing capacity.
Another Design (DB / DelayQueue / ZooKeeper / MQ)
Overall architecture
Database design
Main process
Scheduling Tasks
- Handled by a thread pool
- All threads block on DelayQueue.take()
- When a task is taken, fetch it from the DB and check whether it still exists
- If not, do nothing (the task has already succeeded or been deleted)
- If it exists, move the task record to job_delay_task_execlog and publish a message to the message queue
Disadvantages
- The business system must take a dependency on MQ
This article is from the NetEase Cloud Community, published with the authorization of its author, Zhiliang.
Delayed task scheduling System (technology selection and design)