Analysis of Storm's Batch Processing and Transaction Mechanism


1. Storm's transactional topology

For fault tolerance, Storm uses a system-level component, the acker, combined with an XOR check mechanism, to determine whether a message has been fully processed; if not, the spout re-sends the message, which guarantees that every message is processed at least once in the presence of failures. However, some scenarios have stronger transactional requirements and need exactly-once semantics, for example accurately counting the number of tuples. Storm 0.7.0 introduced the transactional topology, which guarantees that each tuple "is processed exactly once", so counting-style applications can be implemented in a very accurate, very scalable, and highly fault-tolerant way.
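The XOR check mentioned above can be sketched in a few lines: the acker keeps one XOR register per spout tuple, XORs in a tuple id once when the tuple is anchored and once when it is acked, and the register returns to zero exactly when every anchored tuple has been acked. A minimal self-contained model (tuple ids are plain longs here; real Storm uses random 64-bit ids and distributes the registers across acker tasks):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of Storm's acker: one XOR register per spout tuple tree.
// Anchoring and acking both XOR the tuple id into the register; the
// register is 0 exactly when every anchored tuple has been acked.
class AckerModel {
    private final Map<Long, Long> pending = new HashMap<>();

    // Called when a tuple is emitted (anchored) in the tree of rootId.
    void anchor(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    // Called when a bolt acks a tuple in the tree of rootId.
    void ack(long rootId, long tupleId) {
        pending.merge(rootId, tupleId, (a, b) -> a ^ b);
    }

    // The whole tuple tree is fully processed when the register is 0.
    boolean isFullyAcked(long rootId) {
        return pending.getOrDefault(rootId, 0L) == 0L;
    }
}
```

Because `x ^ x == 0`, the order in which anchors and acks arrive does not matter, which is what makes the check cheap and streaming-friendly.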

2. API Introduction

IBatchBolt has three methods:

execute(Tuple tuple)

finishBatch()

prepare(java.util.Map conf, TopologyContext context, BatchOutputCollector collector, T id)

ITransactionalSpout has the following main methods:

ITransactionalSpout.Coordinator<T> getCoordinator(java.util.Map conf, TopologyContext context)

ITransactionalSpout.Emitter<T> getEmitter(java.util.Map conf, TopologyContext context)
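To illustrate the IBatchBolt lifecycle (execute is called once per tuple of a batch, finishBatch once when the batch is complete), here is a hedged, simplified sketch of a counting batch bolt. For self-containment it uses plain Java types instead of Storm's Tuple, TopologyContext and BatchOutputCollector, so it models the lifecycle rather than plugging into a real topology:

```java
// Simplified stand-in for an IBatchBolt that counts the tuples in one batch.
// execute() is invoked once per tuple of the batch; finishBatch() is invoked
// once when the batch is complete, which is where a batch bolt emits its result.
class CountingBatchBolt {
    private long batchId;   // models the T id passed to prepare() in the real API
    private int count;

    void prepare(long id) {          // mirrors prepare(conf, context, collector, id)
        this.batchId = id;
        this.count = 0;
    }

    void execute(Object tuple) {     // mirrors execute(Tuple tuple)
        count++;
    }

    long[] finishBatch() {           // mirrors finishBatch(); returns (batchId, count)
        return new long[] { batchId, count };
    }
}
```

The key point the sketch shows is that per-tuple work and per-batch work are split across two callbacks, so the result is emitted exactly once per batch.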

3. Principles of the transaction mechanism

1) For exactly-once semantics, in principle every tuple must carry a transaction id (txid). When a tuple requires transactional processing, whether it is processed again depends on whether its txid has already been processed successfully; for this to work, the txid must be stored together with the processing result. Ordering must also be guaranteed: before the transaction with the current txid commits, all transactions with lower txids must already have committed.
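The txid check just described can be sketched as a tiny state store: the last committed txid is saved atomically together with the value, a replayed batch carrying an already-committed txid is skipped, and commits must arrive in txid order. This is a hypothetical in-memory store for illustration; in practice the state lives in an external database or Zookeeper:

```java
// Hypothetical in-memory state store illustrating exactly-once commits:
// the txid is stored atomically together with the result, so a replayed
// batch with the same txid is detected and skipped.
class TxStateStore {
    private long lastTxid = 0;   // highest committed transaction id
    private long total = 0;      // the state itself (e.g. a running count)

    // Returns true if the commit was applied, false if it was a replay.
    // Commits must arrive in txid order: txid == lastTxid + 1.
    boolean commit(long txid, long delta) {
        if (txid <= lastTxid) {
            return false;                 // already committed: replay, skip
        }
        if (txid != lastTxid + 1) {
            throw new IllegalStateException("commit out of order: " + txid);
        }
        total += delta;                   // update state and txid together
        lastTxid = txid;
        return true;
    }

    long total() { return total; }
}
```

The strict ordering is what lets a single stored txid suffice: if txid n has committed, every txid below n is known to have committed as well.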

Processing tuples one at a time is inefficient for transactions, so Storm introduces batch processing: a whole batch of tuples is assigned a single txid, and Storm uses a pipelined model to increase parallelism between batches. In the pipeline model, multiple transactions can execute in parallel, but commits happen strictly in order.

In Storm's concrete implementation, the computation of a batch is divided into two phases: a processing phase and a commit phase.

Processing phase: multiple batches can be computed in parallel. If, in the example, BOLT2 is an ordinary BatchBolt (implementing BaseBatchBolt), then BOLT2's tasks can process multiple batches in parallel, for example executing the execute or finishBatch methods (when these are called is described later) for batch 3 and batch 4 at the same time.

Committing phase: batches are forced to commit strictly in order. If BOLT3 implements BaseBatchBolt and is marked as requiring transactional processing (it implements the ICommitter interface, or the BatchBolt was added to the topology via the setCommitterBolt method of TransactionalTopologyBuilder), then Storm calls finishBatch when it decides the batch can be committed (when that is will be described below), and finishBatch performs the txid comparison and state-saving work. In the example, batch 2 must wait for batch 1 to commit before it can commit.

Storm's transactional topology looks complex: it requires batch commit management, failure detection, and batch emission and processing. Yet its internal implementation is built entirely on Storm's underlying primitives.

When using transactional topologies, Storm does these things for you:

    • Managing state: Storm keeps all the state necessary for transactional topologies in Zookeeper. This includes the current transaction id and the metadata that defines each batch.
    • Coordinating transactions: Storm manages everything needed to decide which transaction should be processing or committing at any point in time.
    • Fault detection: Storm uses the acking framework to efficiently detect when a batch has been successfully processed, successfully committed, or has failed, and then replays the batch accordingly. You don't need to do any acking or anchoring yourself; Storm handles it all.
    • Built-in batching API: Storm wraps a layer of API on top of ordinary bolts to provide batching support for tuples. Storm manages all the coordination, including deciding when a bolt has received all the tuples of a particular transaction, and it automatically cleans up the intermediate data generated by each transaction.
    • Finally, note that transactional topologies require a queuing system (message queue) that can replay the exact messages of a specific batch. storm-kafka in storm-contrib implements this.

In terms of implementation, a transactional topology consists of transactional spouts and transactional bolts.

2) A transactional spout must implement ITransactionalSpout, which contains two inner classes, Coordinator and Emitter. When the topology runs, the transactional spout itself forms a sub-topology, similar to the following structure:

Here the Coordinator is a spout and the Emitter is a bolt.

There are two types of tuples: transactional tuples, and the tuples of the actual batch.

The Coordinator emits a transactional tuple for each batch; the Emitter is responsible for actually emitting the batch's tuples.

In detail:

    • There is only one Coordinator; there can be multiple Emitter instances depending on the parallelism.
    • The Emitter subscribes to the Coordinator's "batch emit" stream with all grouping (broadcast).
    • The Coordinator (in fact an internal spout) opens a transaction and prepares to emit a batch, entering the transaction's processing phase, in which it emits a transactional tuple (TransactionAttempt & metadata) to the "batch emit" stream.

Note:

Tuples emitted in a transactional topology must carry a TransactionAttempt as their first field; Storm uses this field to determine which batch a tuple belongs to.

TransactionAttempt contains two values: a transaction id and an attempt id. The transaction id is the per-batch unique identifier described above, and it stays the same no matter how many times the batch is replayed. The attempt id is unique per attempt: for the same batch, the attempt id after a replay differs from the one before.

The attempt id can be understood as a replay count; Storm uses it to distinguish tuples emitted by different versions (replays) of a batch.

The metadata contains the point from which the current transaction can replay its data. It is stored in Zookeeper, and the spout serializes it to and deserializes it from Zookeeper with Kryo.

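The two fields of TransactionAttempt can be modeled directly: the transaction id stays constant across replays of a batch, while the attempt id changes on every replay, which is how Storm tells apart tuples from different replays of the same batch. A minimal sketch (a simplified stand-in class, not Storm's own):

```java
// Minimal model of TransactionAttempt: the txid identifies the batch and
// never changes across replays; the attempt id changes on every replay.
class Attempt {
    final long txid;
    final long attemptId;

    Attempt(long txid, long attemptId) {
        this.txid = txid;
        this.attemptId = attemptId;
    }

    // Same batch (possibly a different replay)?
    boolean sameBatch(Attempt other) {
        return this.txid == other.txid;
    }

    // Same batch AND same replay version?
    boolean sameAttempt(Attempt other) {
        return this.txid == other.txid && this.attemptId == other.attemptId;
    }
}
```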

    • After the Emitter receives this tuple, it emits the batch's tuples.
    • Storm uses the anchoring/acking mechanism to detect whether the transaction has completed its processing phase.
    • After the processing phase is complete and all previous transactions have committed, the Coordinator emits a tuple to the "commit" stream, and the transaction enters the commit phase.
    • Committing bolts subscribe to the "commit" stream with all grouping. After the transaction commits, the Coordinator again uses the anchoring/acking mechanism to confirm that the commit phase has completed; once it receives the ack, it marks the transaction as complete in Zookeeper.
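The coordinator behavior above amounts to a small state machine: several transactions can be in the processing phase at once, but a transaction may enter the commit phase only when its own processing has been acked and all lower txids have committed. A simplified simulation (no real streams or Zookeeper; names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified coordinator state machine: processing is pipelined,
// committing is strictly ordered by txid.
class CoordinatorModel {
    private final Set<Long> processed = new HashSet<>(); // processing phase acked
    private long lastCommitted = 0;

    // The acking framework signals that txid finished its processing phase.
    void processingDone(long txid) {
        processed.add(txid);
    }

    // A txid may commit only if its processing is done AND the previous
    // txid has already committed (strict commit ordering).
    boolean canCommit(long txid) {
        return processed.contains(txid) && txid == lastCommitted + 1;
    }

    void commit(long txid) {
        if (!canCommit(txid)) throw new IllegalStateException("cannot commit " + txid);
        lastCommitted = txid;
        processed.remove(txid);
    }
}
```

This is exactly the "batch 2 must wait for batch 1" behavior from the commit-phase description: batch 2's processing may finish first, but its commit is held back until batch 1 has committed.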

3) A transactional bolt inherits BaseTransactionalBolt. It processes a batch of tuples: execute is called once per tuple, and finishBatch is called when the whole batch has been processed. If the BatchBolt is marked as a committer, finishBatch is only called in the commit phase. Storm guarantees that the commit phase of a batch executes only after the previous batch has committed successfully, and it retries until all bolts in the topology have committed. So how does Storm know that a batch's processing is complete, i.e. that a bolt has received all the tuples in the batch? Internally, bolts use the CoordinatedBolt model.

The specific principles of CoordinatedBolt are as follows:

    • A CoordinatedBolt wraps the bolt that actually performs the computation; the bolt that really does the work is called the real bolt.
    • Each CoordinatedBolt records two things: which tasks send tuples to it (according to the topology's grouping information), and which tasks it sends tuples to (also according to the grouping information).
    • After the real bolt emits a tuple, its enclosing CoordinatedBolt records which task the tuple was sent to.
    • After all tuples have been sent, the CoordinatedBolt tells every task it sent tuples to, through another special stream using emitDirect, how many tuples it sent. Each downstream task compares this number with the number of tuples it has actually received; if they are equal, all tuples have been processed.
    • Downstream CoordinatedBolts repeat the steps above to notify their own downstream tasks.
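The counting protocol above can be sketched from the receiver's side: each upstream task announces, on a special stream, how many tuples it sent to this task, and the task considers the batch complete when the announced total equals the number of tuples it actually received. The class and method names below are hypothetical; real Storm delivers the counts via emitDirect inside CoordinatedBolt:

```java
// Model of the downstream side of CoordinatedBolt's completion check:
// compare the counts announced by upstream tasks against what we received.
class CoordinatedReceiver {
    private long received = 0;       // tuples actually received so far
    private long expected = -1;      // total announced on the "count" stream

    void onTuple(Object tuple) {
        received++;
    }

    // An upstream task announces how many tuples it sent us (done via
    // emitDirect on a special stream in real Storm).
    void onCountAnnouncement(long count) {
        expected = (expected < 0) ? count : expected + count;
    }

    // The batch is complete once every announced tuple has arrived.
    boolean batchComplete() {
        return expected >= 0 && received == expected;
    }
}
```

Note that the check tolerates reordering: the announcement may arrive before or after the last data tuples, and completion is only declared when both sides agree.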

One application of transactional topologies in Storm is Trident, which builds a higher level of abstraction on top of Storm's primitives and transactional semantics, providing consistent, exactly-once processing.
