Youzan Delay Queue Design

Source: Internet
Author: User

A delay queue, as the name implies, is a message queue that delivers messages after a delay. In which scenarios is such a queue needed?

Background

Consider the following business scenarios first:

    • When an order stays in the unpaid state, how do we close it in time and return the inventory?
    • How do we periodically check whether a refund has actually been completed?
    • When a newly created store has gone n days without uploading a product, how does the system find out and send an activation SMS? And so on.

To solve the problems above, the simplest and most direct approach is to scan database tables on a schedule, with each business maintaining its own scanning logic. As the business grows, the scanning logic in each part turns out to be very similar, so it makes sense to pull it out of the concrete business code and turn it into a shared component.
Does the open-source world already have a ready-made solution? Yes: beanstalkd (http://kr.github.io/beanstalkd/) basically meets the requirements above. However, deleting a message with it is not particularly convenient and costs more. It is also written in C, while our team mainly works in PHP and Java, so we could not easily customize it. We therefore borrowed its design ideas and re-implemented a delay queue in Java.

Design goals

    • Reliable message delivery: once a message enters the delay queue, it is guaranteed to be consumed at least once.
    • Rich client support: business needs require at least PHP and Python clients.
    • High availability: at least multi-instance deployment is supported, so that when one instance goes down a backup instance keeps serving.
    • Timeliness: a certain amount of timing error is allowed.
    • Message deletion: the business side can delete a specified message at any time.

Overall structure

The delay queue consists of four parts:

    • The job pool stores the meta information of all jobs.
    • A delay bucket is a queue ordered by time that holds all jobs that need to be delayed or have been reserved (only job IDs are stored here).
    • The timer scans each bucket in real time and moves every job whose delay time is less than or equal to the current time into the corresponding ready queue.
    • A ready queue holds jobs in the ready state (only job IDs are stored here) for consumer programs to consume.

[Figure: overall architecture of the delay queue]

Design essentials

Basic concepts

    • Job: a task that needs asynchronous processing; the basic unit in the delay queue. Each job is associated with a specific topic.
    • Topic: a set (queue) of jobs of the same type, which consumers subscribe to.

Message structure

Each job must contain the following properties:

    • topic: the job type. It can be understood as a specific business name.
    • id: the unique identifier of the job, used to retrieve and delete the specified job.
    • delay: how long the job should be delayed, in seconds. (The server converts it to an absolute time.)
    • TTR (time-to-run): the job execution timeout, in seconds.
    • body: the job content that consumers use for their business processing, stored as JSON.

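[Figure: job structure]

TTR is designed to guarantee the reliability of message delivery: a job that has been handed to a consumer is scheduled again according to its TTR, so it is not lost if the consumer never responds.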
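As a rough illustration of the properties listed above, a job can be modeled as a small record. This is only a sketch: the class name `Job`, the method names, and the sample values are assumptions made for the example, not taken from the original implementation (which is written in Java).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Job:
    topic: str   # job type, e.g. a business name
    id: str      # globally unique job identifier
    delay: int   # seconds to wait before the job becomes ready
    ttr: int     # time-to-run: seconds a consumer has to finish the job
    body: str    # JSON-encoded payload for the consumer

    def execute_at(self, now: float) -> float:
        """Absolute time (epoch seconds) at which the job becomes ready."""
        return now + self.delay

    def to_json(self) -> str:
        """Serialize the job meta information, e.g. for storage in the job pool."""
        return json.dumps(asdict(self))


# Hypothetical sample values, chosen only for illustration.
job = Job(topic="orderclose", id="orderclose:1001", delay=1800, ttr=60,
          body=json.dumps({"orderNo": "1001"}))
now = 1_700_000_000.0        # fixed timestamp so the example is deterministic
print(job.execute_at(now))   # 1700001800.0
```

Converting `delay` to an absolute time on the server side means the stored value stays valid no matter when the timer later looks at it.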

Message State transitions

Each job is in exactly one state at any time:

    • Ready: executable, waiting to be consumed.
    • Delay: not yet executable, waiting for its delay to elapse.
    • Reserved: read by a consumer, but no response (delete or finish) has been received from the consumer yet.
    • Deleted: consumption has completed, or the job has been deleted.

[Figure: transitions between the four states]
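The transitions between the four states can also be written down as a small table in code. The set of allowed transitions below is an assumption inferred from the life-cycle description later in this article (the timer moves Delay to Ready, a consumer moves Ready to Reserved, a TTR timeout moves Reserved back to Delay, and finish or delete ends in Deleted); the names `State`, `TRANSITIONS`, and `transition` are hypothetical.

```python
from enum import Enum


class State(Enum):
    DELAY = "delay"
    READY = "ready"
    RESERVED = "reserved"
    DELETED = "deleted"


# Allowed transitions, inferred from the life-cycle description (an assumption).
TRANSITIONS = {
    State.DELAY: {State.READY, State.DELETED},      # timer fires / job deleted
    State.READY: {State.RESERVED, State.DELETED},   # consumer pops / job deleted
    State.RESERVED: {State.DELAY, State.DELETED},   # TTR expires / finish or delete
    State.DELETED: set(),                           # terminal state
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

For example, `transition(State.READY, State.RESERVED)` succeeds, while `transition(State.DELETED, State.READY)` raises `ValueError`.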

Message store

Before choosing a storage medium, first pin down the data structures required:

    • The job pool stores job meta information and only needs a key/value structure: the key is the job ID and the value is the job struct.
    • A delay bucket is an ordered queue.
    • A ready queue is an ordinary list or queue.

Redis is the natural choice that satisfies all of these requirements at once.
Each bucket is a Redis zset. The jobs are split across multiple buckets to speed up scanning and reduce message latency.
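The three structures can be pictured with plain in-memory containers. The sketch below does not talk to Redis: a sorted Python list stands in for each zset (ordered by execution time), a dict for the key/value job pool, and a deque per topic for the ready queues. `BUCKET_COUNT` and the function names are assumptions; the article does not say how many buckets are used.

```python
import bisect
import itertools
from collections import defaultdict, deque

BUCKET_COUNT = 3  # hypothetical value

job_pool = {}                                  # job id -> job meta (key/value)
buckets = [[] for _ in range(BUCKET_COUNT)]    # each: sorted list of (execute_at, job id)
ready_queues = defaultdict(deque)              # topic -> queue of ready job ids
_next_bucket = itertools.cycle(range(BUCKET_COUNT))


def add_job(job: dict, now: float) -> int:
    """Store the meta information and index the job id by execution time."""
    job_pool[job["id"]] = job
    index = next(_next_bucket)                 # round-robin bucket choice
    bisect.insort(buckets[index], (now + job["delay"], job["id"]))
    return index


def due_job_ids(index: int, now: float) -> list:
    """Ids in one bucket whose execution time has been reached."""
    due = []
    for execute_at, job_id in buckets[index]:
        if execute_at > now:
            break                              # sorted: nothing later can be due
        due.append(job_id)
    return due


add_job({"id": "a", "topic": "t", "delay": 10}, now=100.0)  # bucket 0
add_job({"id": "b", "topic": "t", "delay": 5}, now=100.0)   # bucket 1
add_job({"id": "c", "topic": "t", "delay": 1}, now=100.0)   # bucket 2
add_job({"id": "d", "topic": "t", "delay": 2}, now=100.0)   # bucket 0
print(due_job_ids(0, now=105.0))  # ['d']
```

Because each bucket is ordered by execution time, a scan can stop at the first entry that is not yet due, and splitting jobs across buckets keeps each scan short.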

Communication protocols

To support clients in multiple languages, we chose HTTP as the transport and a text protocol (JSON) for interaction with clients. The following commands are currently supported:

    • Add: {"command": "add", "topic": "xxx", "id": "xxx", "delay": N, "TTR": N, "body": "xxx"}
    • Get: {"command": "pop", "topic": "xxx"}
    • Complete: {"command": "finish", "id": "xxx"}
    • Delete: {"command": "delete", "id": "xxx"}

Here xxx and N stand for caller-supplied values (N is a number of seconds). The body is also a JSON string.
Response structure: {"success": true/false, "error": "error reason", "id": "xxx", "value": "job body"}
Note: the job ID is chosen by the business side that uses the queue, so it must be globally unique. A combination of the topic and a business-unique ID is recommended.
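A client only has to build and parse these JSON messages. The helpers below are a minimal sketch: the function names, the ":" separator in the job ID, and the lowercase "error" key are assumptions, and no HTTP call is made because the article does not give the service URL.

```python
import json


def make_job_id(topic: str, business_id: str) -> str:
    """Combine topic and business id so ids stay unique across topics."""
    return f"{topic}:{business_id}"   # the ':' separator is an assumption


def add_request(topic: str, business_id: str, delay: int, ttr: int, body: dict) -> str:
    """Build the JSON text of an add command."""
    return json.dumps({
        "command": "add",
        "topic": topic,
        "id": make_job_id(topic, business_id),
        "delay": delay,               # seconds
        "TTR": ttr,                   # seconds
        "body": json.dumps(body),     # the body is itself a JSON string
    })


def parse_response(raw: str) -> dict:
    """Decode a response, raising if the server reported a failure."""
    response = json.loads(raw)
    if not response.get("success"):
        raise RuntimeError(response.get("error", "unknown error"))
    return response
```

For instance, `add_request("orderclose", "1001", 1800, 60, {"orderNo": "1001"})` produces a message whose id is `orderclose:1001`; the values used here are illustrative only.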

An example: the life cycle of a job

    • A user places an order. Once the system has created the order, it puts a job into the delay queue. The job is: {"topic": "Orderclose", "id": "Ordercloseordernoxxx", "delay": 1800, "TTR": N, "body": "XXXXXXX"} (N being the TTR in seconds).
    • When the delay queue receives the job, it first stores the job information in the job pool, then computes the absolute execution time from delay and puts the job ID into one of the buckets, chosen by round-robin.
    • The timer polls every bucket continuously. After 1800 seconds (30 minutes) it finds that the job's execution time has arrived and fetches the meta information from the job pool by job ID. If the job is in the deleted state, the timer skips it and continues polling. Otherwise it first re-checks that the execution time in the meta information is less than or equal to the current time. If so, it puts the job ID into the ready queue for the job's topic and removes it from the bucket. If not, it recalculates the delay time, puts the job ID into a bucket again, and removes the previous entry from the bucket.
    • The consumer polls the ready queue of the corresponding topic (the server still checks here that the job is valid) and, once it gets a job, runs its own business logic. At the same time, the server takes the job that the consumer has acquired, recalculates its execution time from the job's TTR, and puts it into a bucket.
    • When the consumer has finished processing, it sends a finish request to the server, and the server deletes the corresponding meta information by job ID.
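The life cycle above can be traced with a small in-memory model. This is a sketch under simplifying assumptions, not the Redis-backed Java implementation: the class and method names are hypothetical, time is passed in explicitly so the run is deterministic, and a bucket entry whose time no longer matches the meta information is simply skipped instead of being recalculated.

```python
import heapq
from collections import defaultdict, deque


class DelayQueue:
    """In-memory model of the job pool, buckets, timer scan and ready queues."""

    def __init__(self, bucket_count: int = 2):
        self.pool = {}                                     # job id -> meta
        self.buckets = [[] for _ in range(bucket_count)]   # heaps of (execute_at, job id)
        self.ready = defaultdict(deque)                    # topic -> ready job ids
        self._next = 0

    def _schedule(self, job_id: str, execute_at: float) -> None:
        self.pool[job_id]["execute_at"] = execute_at
        heapq.heappush(self.buckets[self._next], (execute_at, job_id))
        self._next = (self._next + 1) % len(self.buckets)  # round-robin

    def add(self, job: dict, now: float) -> None:
        self.pool[job["id"]] = dict(job)
        self._schedule(job["id"], now + job["delay"])

    def tick(self, now: float) -> None:
        """One timer pass: move due jobs from every bucket to their ready queue."""
        for bucket in self.buckets:
            while bucket and bucket[0][0] <= now:
                execute_at, job_id = heapq.heappop(bucket)
                meta = self.pool.get(job_id)
                if meta is None or meta["execute_at"] != execute_at:
                    continue                               # deleted, finished, or stale entry
                self.ready[meta["topic"]].append(job_id)

    def pop(self, topic: str, now: float):
        """Reserve a ready job and reschedule it TTR seconds ahead."""
        queue = self.ready[topic]
        while queue:
            job_id = queue.popleft()
            meta = self.pool.get(job_id)
            if meta is None:
                continue                                   # deleted while waiting
            self._schedule(job_id, now + meta["TTR"])
            return meta
        return None

    def finish(self, job_id: str) -> None:
        self.pool.pop(job_id, None)                        # leftover bucket entries are skipped

    delete = finish


q = DelayQueue()
q.add({"topic": "orderclose", "id": "orderclose:1001",
       "delay": 1800, "TTR": 60, "body": "{}"}, now=0)
q.tick(now=1799)
early = q.pop("orderclose", now=1799)     # None: not due yet
q.tick(now=1800)
first = q.pop("orderclose", now=1800)     # reserved by a consumer
q.tick(now=1860)                          # no finish within TTR: ready again
second = q.pop("orderclose", now=1860)
q.finish(second["id"])
q.tick(now=1920)
late = q.pop("orderclose", now=1920)      # None: job is gone
```

The second delivery at 1860 is what "consumed at least once" means in practice: a consumer that dies after popping a job does not cause the job to be lost.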
Existing physical topology

Storage is currently centralized. In a multi-instance deployment the timer programs may run concurrently, which can cause the same job to be placed into the ready queue more than once. To solve this, we use the Redis setnx command to implement a simple distributed lock, which ensures that each bucket is scanned by only one timer thread at a time.
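The locking idea can be sketched without a Redis server. `FakeRedis` below is a hypothetical in-memory stand-in that mimics only "set if not exists, with an expiry" and "delete if the value is still mine"; the key format, the TTL, and the function names are assumptions. With real Redis, the first operation corresponds to `SET key value NX EX seconds`, and the compare-and-delete would need to be done atomically, for example in a server-side script.

```python
import uuid


class FakeRedis:
    """Tiny in-memory stand-in for the two behaviours the lock needs."""

    def __init__(self):
        self._data = {}    # key -> (value, expires_at)

    def set_nx(self, key, value, ttl, now):
        """Set key only if it is absent or expired; return True on success."""
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False
        self._data[key] = (value, now + ttl)
        return True

    def delete_if_equals(self, key, value):
        """Delete key only if it still holds our token."""
        current = self._data.get(key)
        if current is not None and current[0] == value:
            del self._data[key]
            return True
        return False


def scan_bucket_exclusively(store, bucket_index, scan, now, ttl=5):
    """Run scan(bucket_index) only if this instance wins the per-bucket lock."""
    key = f"lock:bucket:{bucket_index}"
    token = uuid.uuid4().hex            # identifies this holder
    if not store.set_nx(key, token, ttl, now):
        return False                    # another timer is scanning this bucket
    try:
        scan(bucket_index)
    finally:
        store.delete_if_equals(key, token)
    return True
```

The expiry keeps a crashed instance from holding the lock forever, and the token check keeps an instance whose lock has already expired from releasing a lock that now belongs to someone else.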

Design shortcomings

The timer is implemented as an infinite loop in a dedicated thread, which wastes CPU when no job is ready.
When reserving jobs, consumers use HTTP short polling and can fetch only one job per request. When there are many ready jobs, this consumes a lot of network I/O.
Data is stored in Redis, so message persistence is limited by what Redis provides.
Horizontal scaling relies on a third party (Nginx).

Future architecture direction

Implement the timer with a wait/notify mechanism.
Provide a TCP long-connection API, so that jobs can be reserved via push or long polling.
Adopt our own storage scheme (an embedded database, or custom data structures written to files) to guarantee message persistence.
Implement our own name server.
Consider direct support for periodic tasks.
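The wait/notify idea for the timer could look roughly like the sketch below. It is only an illustration of the technique under assumed names (`WaitNotifyTimer`, `on_ready`), using a local heap rather than Redis buckets: the timer thread blocks on a condition variable until the earliest job is due or a new job arrives, instead of spinning in a busy loop.

```python
import heapq
import threading
import time


class WaitNotifyTimer:
    """Timer that sleeps until the earliest job is due instead of spinning."""

    def __init__(self, on_ready):
        self._heap = []                        # (execute_at, job id)
        self._cond = threading.Condition()
        self._on_ready = on_ready              # callback invoked with each due job id
        self._stopped = False

    def add(self, job_id: str, delay: float) -> None:
        with self._cond:
            heapq.heappush(self._heap, (time.monotonic() + delay, job_id))
            self._cond.notify()                # the earliest deadline may have changed

    def stop(self) -> None:
        with self._cond:
            self._stopped = True
            self._cond.notify()

    def run(self) -> None:
        with self._cond:
            while not self._stopped:
                if not self._heap:
                    self._cond.wait()          # idle: block until add() or stop()
                    continue
                remaining = self._heap[0][0] - time.monotonic()
                if remaining > 0:
                    self._cond.wait(remaining) # sleep until due or notified earlier
                    continue
                _, job_id = heapq.heappop(self._heap)
                self._on_ready(job_id)         # called while holding the lock; keep it short


ready = []
timer = WaitNotifyTimer(ready.append)
thread = threading.Thread(target=timer.run, daemon=True)
thread.start()
timer.add("slow", 0.2)
timer.add("fast", 0.05)
time.sleep(0.6)
timer.stop()
thread.join(timeout=1)
print(ready)   # expected: ['fast', 'slow']
```

While no job is scheduled the thread consumes no CPU at all, which addresses the busy-loop shortcoming listed earlier.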


Unless otherwise stated, the copyright of this article belongs to the author and the Youzan technical team, and it is licensed under the Attribution-NonCommercial 4.0 International license.
When reproducing, please credit the source: the Youzan technical team blog, http://tech.youzan.com/queuing_delay/
