Youzan delay queue design

Last Update: 2016-03-23 Source: Internet

Author: User

A delay queue, as the name implies, is a message queue that supports delayed delivery. In what scenarios is such a queue needed?

Background

Consider the following business scenarios:

When an order stays unpaid, how do we close it in time and return the inventory?
How do we periodically check whether an order in the refunding state has been refunded successfully?
When a newly created store has not uploaded a product for N days, how does the system learn of this and send an activation SMS?

And so on.

The simplest and most direct way to solve these problems is to scan the database tables periodically, with each business maintaining its own scanning logic. As the business grows, it becomes clear that the scanning logic is very similar from one business to another, so it makes sense to extract it from the business-specific code into a shared service.
Does the open source world already have a ready-made solution? Yes. beanstalkd (http://kr.github.io/beanstalkd/) basically meets the requirements above. However, deleting a message in beanstalkd is not particularly convenient and is relatively costly. It is also written in C, while our team mainly works in PHP and Java, so we could not customize it ourselves. We therefore borrowed its design ideas and re-implemented a delay queue in Java.

Design goals

Reliable message delivery: once a message enters the delay queue, it is guaranteed to be consumed at least once.
Rich client support: at least PHP and Python must be supported because of business needs.
High availability: at minimum, multi-instance deployment is supported. When one instance goes down, a standby instance continues to serve.
Timeliness: a certain amount of timing error is acceptable.
Message deletion: the business side can delete a specified message at any time.

Overall structure

The delay queue consists of four parts:

The job pool stores the meta information of all jobs.
The delay buckets are a group of queues ordered by time. They hold all jobs that need to be delayed or that have been reserved (only the job ID is stored here).
The timer scans each bucket in real time and moves every job whose execution time has arrived (execution time less than or equal to the current time) into the corresponding ready queue.
The ready queue holds jobs in the ready state (only the job ID is stored here) for consumers to consume.

[Figure: overall structure of the delay queue]

Design essentials

Basic concepts

Job: a task that requires asynchronous processing. It is the basic unit of the delay queue and is associated with a specific topic.
Topic: a set (queue) of jobs of the same type, which consumers subscribe to.

Message structure

Each job must contain the following properties:

Topic: the job type. It can be understood as a specific business name.
Id: the unique identifier of the job, used to retrieve and delete the specified job.
Delay: how long the job should be delayed, in seconds. (The server converts it to an absolute time.)
TTR (time-to-run): the job execution timeout, in seconds.
Body: the job content that the consumer uses for its business processing, stored in JSON format.

[Figure: job structure]

TTR is designed to guarantee reliable message delivery: if the consumer does not respond within TTR seconds of reading a job, the job becomes available for consumption again.
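The properties listed above can be sketched as a small data class. This is only an illustration, not the authors' Java implementation: the class name, the execute_at helper, and the TTR value of 60 are assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Job:
    topic: str  # job type, e.g. a business name
    id: str     # unique identifier, used to retrieve and delete the job
    delay: int  # requested delay in seconds
    ttr: int    # time-to-run: execution timeout in seconds
    body: str   # job content, stored as a JSON string

    def execute_at(self, now: float) -> float:
        """Absolute execution time: the server converts the relative delay."""
        return now + self.delay


# Hypothetical job: the TTR of 60 seconds and the body are made up.
job = Job(
    topic="Orderclose",
    id="Ordercloseordernoxxx",
    delay=1800,
    ttr=60,
    body=json.dumps({"order_no": "xxx"}),
)
```

Storing the absolute time rather than the relative delay lets the timer compare a job directly against the current clock on every scan.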

Message State transitions

A job is in exactly one state at any time:

Ready: executable, waiting to be consumed.
Delay: not yet executable, waiting for its time to arrive.
Reserved: read by a consumer, but the consumer has not yet responded (delete or finish).
Deleted: consumption has completed, or the job has been deleted.

[Figure: transitions between the four states]
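The four states can be written down as a small transition table. The event names here are hypothetical, and the transitions are inferred from the job life cycle described in this article rather than copied from the original diagram. In particular, sending a reserved job back to ready on TTR timeout is an assumption.

```python
from enum import Enum


class State(Enum):
    DELAY = "delay"
    READY = "ready"
    RESERVED = "reserved"
    DELETED = "deleted"


# Allowed transitions, keyed by (current state, event).
TRANSITIONS = {
    (State.DELAY, "time_up"): State.READY,         # timer finds the job due
    (State.READY, "pop"): State.RESERVED,          # a consumer reads the job
    (State.RESERVED, "ttr_timeout"): State.READY,  # no response within TTR
    (State.RESERVED, "finish"): State.DELETED,     # consumer confirms the job
    (State.DELAY, "delete"): State.DELETED,
    (State.READY, "delete"): State.DELETED,
    (State.RESERVED, "delete"): State.DELETED,
}


def next_state(state: State, event: str) -> State:
    """Return the state after an event, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state.value}")
```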

Message store

Before selecting a storage medium, determine the following specific data structure:

The job pool stores job meta information and only needs a key/value structure. The key is the job ID and the value is the job struct.
The delay bucket is an ordered queue.
The ready queue is an ordinary list or queue.

Redis is the natural choice, because it meets all of these requirements at once.
Each bucket is stored as a Redis zset. The jobs are split across multiple buckets to improve scanning speed and reduce message latency.
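To make the three structures concrete, here is an in-memory stand-in written so that it runs without a Redis server. The bucket count and all names are assumptions. The comments note one possible mapping onto Redis commands, which is not necessarily the mapping the authors used.

```python
import bisect
from collections import deque

BUCKET_COUNT = 3  # hypothetical; the article does not state the bucket count

job_pool = {}      # job id -> job meta (Redis: plain key/value, SET / GET)
buckets = [[] for _ in range(BUCKET_COUNT)]  # sorted (execute_at, job_id) pairs (Redis: zset, ZADD)
ready_queues = {}  # topic -> deque of job ids (Redis: list, RPUSH / LPOP)
_next_bucket = 0


def put_in_bucket(job_id, execute_at):
    """Assign a job id to a bucket round-robin, keeping the bucket time-ordered."""
    global _next_bucket
    index = _next_bucket
    _next_bucket = (_next_bucket + 1) % BUCKET_COUNT
    bisect.insort(buckets[index], (execute_at, job_id))
    return index


def due_job_ids(index, now):
    """Job ids in one bucket whose time has arrived (Redis: ZRANGEBYSCORE key 0 now)."""
    return [job_id for execute_at, job_id in buckets[index] if execute_at <= now]
```

Because each bucket is ordered by execution time, a scan only needs to look at the head of the bucket, and splitting the jobs across buckets keeps each scan short.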

Communication protocols

To support clients written in multiple languages, we chose HTTP for communication and a text protocol (JSON) for interaction with clients. The following commands are currently supported:

Add: {"command": "add", "topic": "xxx", "id": "xxx", "delay": <seconds>, "TTR": <seconds>, "body": "xxx"}
Get: {"command": "pop", "topic": "xxx"}
Finish: {"command": "finish", "id": "xxx"}
Delete: {"command": "delete", "id": "xxx"}

The body is itself a JSON string.
Response structure: {"success": true/false, "error": "error reason", "id": "xxx", "value": "job body"}
Note that the job ID is chosen by the business side that uses the queue, so it must be globally unique. A combination of the topic and a business-unique ID is recommended.
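A client for this protocol only needs to build and parse small JSON documents. The helper names below are hypothetical. The lowercase command and field names are an assumption about the wire format, and the HTTP transport itself is left out.

```python
import json


def make_job_id(topic, business_id):
    """Recommended id shape: topic plus a business-unique id."""
    return f"{topic}{business_id}"


def build_add(topic, job_id, delay, ttr, body):
    """Serialize an add command; the body is itself encoded as a JSON string."""
    return json.dumps({
        "command": "add", "topic": topic, "id": job_id,
        "delay": delay, "TTR": ttr, "body": json.dumps(body),
    })


def build_pop(topic):
    return json.dumps({"command": "pop", "topic": topic})


def build_finish(job_id):
    return json.dumps({"command": "finish", "id": job_id})


def build_delete(job_id):
    return json.dumps({"command": "delete", "id": job_id})


def parse_response(text):
    """Return (job id, decoded body) on success, or raise with the error reason."""
    response = json.loads(text)
    if not response.get("success"):
        raise RuntimeError(response.get("error", "unknown error"))
    value = response.get("value")
    return response.get("id"), (json.loads(value) if value else None)
```

A consumer would send build_pop(topic), handle the job returned by parse_response, and then send build_finish(job_id) before the job's TTR expires.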

Illustrate the life cycle of a job

A user places an order. Once the system has created the order, it puts a job into the delay queue. The job structure is: {"topic": "Orderclose", "id": "Ordercloseordernoxxx", "delay": 1800, "TTR": <seconds>, "body": "XXXXXXX"}
When the delay queue receives the job, it first stores the job information in the job pool, then calculates the absolute execution time from the delay and puts the job ID into one of the buckets, chosen round-robin.
The timer polls each bucket continuously. After 1800 seconds (30 minutes), it finds that the job's execution time has arrived and uses the job ID to fetch the meta information from the job pool. If the job is in the deleted state, the timer skips it and continues polling. If the job is not deleted, the timer first verifies that the execution time recorded in the meta information has actually been reached (it is less than or equal to the current time). If so, it puts the job ID into the ready queue for the job's topic and removes it from the bucket. If not, it recalculates the delay time, puts the job ID into a bucket again, and removes the previous entry from the bucket.
The consumer polls the ready queue of the corresponding topic (the validity of the job must still be checked at this point), obtains the job, and runs its own business logic. At the same time, the server recalculates the execution time of the job the consumer has obtained, based on the job's TTR, and puts the job ID into a bucket.
After the consumer finishes processing, it sends a finish to the server, and the server deletes the corresponding meta information by job ID.
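The life cycle above can be traced with a compact in-memory simulation. It is a sketch under simplifying assumptions, not the authors' implementation: the clock is passed in explicitly as now, a deleted job is modeled as missing meta information, and the class name, method names, and the TTR of 60 are hypothetical.

```python
from collections import deque


class DelayQueue:
    """In-memory sketch of the add / timer scan / pop / finish flow."""

    def __init__(self, bucket_count=2):
        self.pool = {}                                    # job id -> meta
        self.buckets = [{} for _ in range(bucket_count)]  # job id -> execute_at
        self.ready = {}                                   # topic -> deque of job ids
        self._next = 0

    def _bucket_put(self, job_id, execute_at):
        self.buckets[self._next][job_id] = execute_at     # round-robin placement
        self._next = (self._next + 1) % len(self.buckets)

    def add(self, topic, job_id, delay, ttr, body, now):
        self.pool[job_id] = {"topic": topic, "ttr": ttr, "body": body,
                             "execute_at": now + delay}
        self._bucket_put(job_id, now + delay)

    def scan(self, now):
        """One timer pass over every bucket."""
        for bucket in self.buckets:
            for job_id, due in list(bucket.items()):
                if due > now:
                    continue
                del bucket[job_id]
                meta = self.pool.get(job_id)
                if meta is None:               # deleted or finished: skip it
                    continue
                if meta["execute_at"] <= now:  # really due: hand to consumers
                    self.ready.setdefault(meta["topic"], deque()).append(job_id)
                else:                          # rescheduled: back into a bucket
                    self._bucket_put(job_id, meta["execute_at"])

    def pop(self, topic, now):
        queue = self.ready.get(topic)
        while queue:
            job_id = queue.popleft()
            meta = self.pool.get(job_id)
            if meta is None:                   # validity check: job was deleted
                continue
            meta["execute_at"] = now + meta["ttr"]  # redeliver if no finish
            self._bucket_put(job_id, meta["execute_at"])
            return job_id, meta["body"]
        return None

    def finish(self, job_id):
        self.pool.pop(job_id, None)

    delete = finish  # in this sketch both simply drop the meta information


q = DelayQueue()
q.add("Orderclose", "Ordercloseordernoxxx", delay=1800, ttr=60,
      body="XXXXXXX", now=0)
q.scan(now=1800)
first = q.pop("Orderclose", now=1800)  # the job is delivered after 30 minutes
```

If the consumer never calls finish, a scan at now=1860 puts the job back into the ready queue, which is how the at-least-once guarantee follows from TTR.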

Existing physical topology

Storage is currently centralized. In a multi-instance deployment, the timer programs may run concurrently, which can cause the same job to be placed in the ready queue more than once. To solve this, we use the Redis setnx command to implement a simple distributed lock, ensuring that only one timer thread scans a given bucket at any time.
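A per-bucket lock of this kind might look like the sketch below. It uses SET with the NX and EX options, which behaves like setnx plus an expiry so that a crashed timer cannot hold a bucket forever. The key name, the timeout, and the helper names are assumptions. InMemoryClient is only a stand-in that mimics the set, get, and delete calls of a redis-py style client so the example runs without a server.

```python
import uuid


class InMemoryClient:
    """Stand-in for a Redis client; it ignores expiry."""

    def __init__(self):
        self.data = {}

    def set(self, name, value, nx=False, ex=None):
        if nx and name in self.data:
            return None            # key exists: SET ... NX does nothing
        self.data[name] = value
        return True

    def get(self, name):
        return self.data.get(name)

    def delete(self, name):
        return 1 if self.data.pop(name, None) is not None else 0


def try_lock_bucket(client, bucket_index, ttl_seconds=10):
    """Return a token if this timer now owns the bucket, else None."""
    token = str(uuid.uuid4())
    acquired = client.set(f"bucket-lock:{bucket_index}", token,
                          nx=True, ex=ttl_seconds)
    return token if acquired else None


def unlock_bucket(client, bucket_index, token):
    """Release the lock only if we still own it.

    The check and the delete are two separate calls, so this is not atomic.
    With a real client that returns bytes, the token comparison would also
    need the value decoded first.
    """
    key = f"bucket-lock:{bucket_index}"
    if client.get(key) == token:
        client.delete(key)


client = InMemoryClient()
token = try_lock_bucket(client, 0)   # this timer may now scan bucket 0
```

Each timer thread would call try_lock_bucket before scanning a bucket and skip the bucket when it gets None.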

Design shortcomings

The timer is implemented as an infinite loop in a dedicated thread, which wastes CPU when there are no ready jobs.
When reserving jobs, the consumer uses HTTP short polling and can fetch only one job at a time. When there are many ready jobs, this wastes network I/O.
Data is stored in Redis, so message persistence is limited by the characteristics of Redis.
Scale-out relies on a third-party component (Nginx).

Future architecture direction

Implement the timer in wait/notify mode.
Provide a TCP long-connection API, with push or long polling as the way to reserve messages.
Use our own storage scheme (an embedded database, or custom data structures written to file) to ensure message persistence.
Implement our own name server.
Consider direct support for recurring tasks.
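The wait/notify idea in the first item can be illustrated with a condition variable: the timer thread sleeps until the earliest job is due and is woken early when a new job is scheduled. This is only a sketch of the general technique, with hypothetical names, and not a description of how the authors planned to build it.

```python
import heapq
import threading
import time


class WaitNotifyTimer:
    """Sleeps until the earliest job is due instead of looping continuously."""

    def __init__(self, on_due):
        self._heap = []                  # (execute_at, job_id), earliest first
        self._cond = threading.Condition()
        self._on_due = on_due
        self._stopped = False
        threading.Thread(target=self._run, daemon=True).start()

    def schedule(self, job_id, delay_seconds):
        with self._cond:
            heapq.heappush(self._heap,
                           (time.monotonic() + delay_seconds, job_id))
            self._cond.notify()          # the earliest job may have changed

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()

    def _run(self):
        while True:
            with self._cond:
                while not self._stopped:
                    if not self._heap:
                        self._cond.wait()        # nothing scheduled: block
                        continue
                    remaining = self._heap[0][0] - time.monotonic()
                    if remaining <= 0:
                        break                    # the earliest job is due
                    self._cond.wait(timeout=remaining)
                if self._stopped:
                    return
                _, job_id = heapq.heappop(self._heap)
            self._on_due(job_id)                 # run the callback outside the lock
```

With no jobs scheduled the thread blocks inside wait() and uses no CPU, which addresses the infinite-loop shortcoming listed above.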

Unless otherwise stated, this document is copyrighted by the author and the technical team and is licensed under the Attribution-NonCommercial 4.0 International license.
When reproducing, please credit the source: Youzan technical team blog, http://tech.youzan.com/queuing_delay/

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
