How Storm Batch transactions work

Source: Internet
Author: User
Tags: emit

1. Transactions and Batch Processing

For fault tolerance, Storm uses a system-level component, the Acker, together with an XOR check mechanism to determine whether a tuple tree has been processed successfully; on error, the spout can resend the tuple, which guarantees that every tuple is processed at least once.
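The XOR check works because every tuple ID is XORed into a running value twice, once when the tuple is emitted and once when it is acked; the value returns to zero exactly when everything emitted has been acked. A minimal sketch of the idea (not Storm's actual Acker code):

```java
public class AckerXorSketch {
    // Running XOR of all emitted and acked tuple IDs for one spout tuple tree.
    private long ackVal = 0L;

    // Called when a tuple is emitted: XOR its ID into the running value.
    public void emitted(long tupleId) { ackVal ^= tupleId; }

    // Called when a tuple is acked: XOR its ID in again, cancelling the emit.
    public void acked(long tupleId) { ackVal ^= tupleId; }

    // The tree is fully processed exactly when every emitted ID was acked.
    public boolean fullyProcessed() { return ackVal == 0L; }

    public static void main(String[] args) {
        AckerXorSketch acker = new AckerXorSketch();
        acker.emitted(5L);
        acker.emitted(9L);
        acker.acked(5L);
        System.out.println(acker.fullyProcessed()); // false: tuple 9 not yet acked
        acker.acked(9L);
        System.out.println(acker.fullyProcessed()); // true: XOR returned to zero
    }
}
```

In Storm the IDs are random 64-bit values, so a nonzero XOR reliably signals an incomplete (or failed) tuple tree.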

However, when you need to count tuples exactly, such as in a sales-amount scenario, you want each tuple to be processed exactly once. Storm 0.7.0 introduced the transactional topology, which guarantees that each tuple is processed once and only once, so we can implement counting applications that are both accurate and highly fault tolerant.

Processing tuples one at a time in their own transactions adds a lot of overhead, such as writing to the database and emitting results far too often, so per-tuple transactions are inefficient. Storm therefore introduces batch processing.

Batching processes a group (batch) of tuples in one transaction and ensures the batch is processed successfully as a whole; if processing fails, Storm resends the failed batch, and it guarantees that each batch is processed once and only once.

2. API Introduction

IBatchBolt has three methods:

execute(Tuple tuple)

finishBatch() — processes the results of the entire batch after all of its tuples have been handled; for a committer bolt, this method runs when the batch commits

prepare(java.util.Map conf, TopologyContext context, BatchOutputCollector collector, T id)
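To illustrate the shape of these three methods, here is a counting bolt sketch. The types are simplified stand-ins (plain Object instead of Storm's Tuple and TopologyContext classes), so this is an assumption-laden model of the lifecycle, not compilable Storm code:

```java
import java.util.Map;

public class CountingBatchBoltSketch {
    private Object batchId;
    private int count;

    // prepare(): called once per batch, receives the batch/transaction id.
    public void prepare(Map<String, Object> conf, Object batchId) {
        this.batchId = batchId;
        this.count = 0;
    }

    // execute(): called once for every tuple in the batch; here we just count.
    public void execute(Object tuple) {
        count++;
    }

    // finishBatch(): called after every tuple of the batch has been seen;
    // a committer bolt would persist (count, txid) here, at commit time.
    public int finishBatch() {
        return count;
    }

    public static void main(String[] args) {
        CountingBatchBoltSketch bolt = new CountingBatchBoltSketch();
        bolt.prepare(Map.of(), 1);
        for (int i = 0; i < 5; i++) bolt.execute("tuple-" + i);
        System.out.println(bolt.finishBatch()); // 5
    }
}
```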

ITransactionalSpout has the following main methods:

ITransactionalSpout.Coordinator&lt;T&gt; getCoordinator(java.util.Map conf, TopologyContext context)

ITransactionalSpout.Emitter&lt;T&gt; getEmitter(java.util.Map conf, TopologyContext context)

3. Principle of the Transaction Mechanism

1) To process each tuple (a single tuple or a batch) only once, the tuple must carry the same transaction ID (txid) on every attempt. When it is processed, the txid is used to determine whether it has been processed before; once processed, the result is saved together with the txid so that later attempts can be compared against it. Order must also be guaranteed: all requests with a txid lower than the current one must have committed before the current txid may commit.

When transactions are batched, each batch of tuples is given its own txid. To improve parallelism between batches, Storm uses a pipeline model: multiple transactions can be processed in parallel, but they are committed in strictly sequential order.
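The pipeline model can be sketched as a commit gate: batches may finish processing in any order, but commits are released strictly by txid. This is a simplified single-threaded model of the ordering rule, not Storm's internal code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OrderedCommitSketch {
    private final Set<Long> processed = new HashSet<>(); // batches done processing
    private long nextToCommit = 1;                        // commits go in txid order
    private final List<Long> commitLog = new ArrayList<>();

    // Processing phase: batches may complete in any order (in parallel).
    public void finishedProcessing(long txid) {
        processed.add(txid);
        // Commit phase: drain every batch that is now allowed to commit,
        // i.e. whose every predecessor has already committed.
        while (processed.contains(nextToCommit)) {
            commitLog.add(nextToCommit);
            processed.remove(nextToCommit);
            nextToCommit++;
        }
    }

    public List<Long> commitLog() { return commitLog; }

    public static void main(String[] args) {
        OrderedCommitSketch pipeline = new OrderedCommitSketch();
        // Batches 3 and 2 finish processing before batch 1...
        pipeline.finishedProcessing(3);
        pipeline.finishedProcessing(2);
        pipeline.finishedProcessing(1);
        // ...but the commit order is still strictly 1, 2, 3.
        System.out.println(pipeline.commitLog()); // [1, 2, 3]
    }
}
```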

2) In Storm transaction processing, the computation of a batch is divided into two phases: a processing phase and a commit phase.

Processing phase: multiple batches can be computed in parallel.

Commit phase: batches are forced to commit in strict order.

Processing phase: multiple batches can be computed in parallel. In the example above, Bolt2 is an ordinary BatchBolt (implementing BaseBatchBolt), so the Bolt2 tasks can handle several batches in parallel; for instance, the execute or finishBatch methods of batch3 and batch4 may run concurrently (when these methods are called is described later).

Commit phase: batches are forced to commit in order. Bolt3 implements BaseBatchBolt and is marked as requiring transactional processing (it implements the ICommitter interface, or was added to the topology through the setCommitterBolt method of TransactionalTopologyBuilder). Storm calls finishBatch only when the batch is allowed to commit (exactly when that is will be introduced later), and finishBatch performs the txid comparison and state-saving work. In the example, batch2 must wait for batch1 to commit before it can commit.

A Storm transactional topology looks complex: it requires batch commit management, failure detection, and batch emission and processing. Internally, however, it is implemented entirely on top of Storm's primitive operations.

When using transactional topologies, Storm does these things for you:

    • Managing state: Storm keeps all the state needed by transactional topologies in Zookeeper. This includes the current transaction ID and the metadata that defines each batch.
    • Coordinating transactions: Storm decides for you, at any point in time, whether a bolt should be processing or committing.
    • Fault detection: Storm uses the acking framework to detect efficiently whether a batch was processed successfully, committed successfully, or failed, and then replays the batch accordingly. You don't need to do any acking or anchoring yourself; Storm handles it all.
    • Built-in batching API: Storm wraps a layer of API on top of ordinary bolts to provide batch support for tuples. Storm manages all the coordination, including deciding when a bolt has received all the tuples of a particular transaction, and it automatically cleans up the intermediate data generated by each transaction.
    • Finally, note that transactional topologies require a queue system (message queue) that can fully replay the messages of a specific batch. storm-kafka in storm-contrib implements this.

In terms of implementation, a transactional topology consists of a transactional spout and transactional bolts.

2) A transactional spout must implement ITransactionalSpout, which contains two inner classes, Coordinator and Emitter. When the topology runs, the transactional spout is itself a sub-topology, with a structure similar to the following:

Interface ITransactionalSpout.Coordinator&lt;X&gt;

Method summary:

void close()

X initializeTransaction(java.math.BigInteger txid, X prevMetadata) — initializes and starts a transaction; prevMetadata is the metadata of the previous transaction

boolean isReady() — returns true when the next transaction can be started

Here the Coordinator is a spout and the Emitter is a bolt.

There are two kinds of tuples: transactional tuples and the tuples of the actual batch.

The Coordinator emits the transactional batch-control tuples, while the Emitter is responsible for actually emitting the tuples of each batch.

Specifically:

    • There is only one Coordinator; the Emitter can have multiple instances depending on its parallelism.
    • The Emitter subscribes to the Coordinator's "batch emit" stream with all grouping (broadcast).
    • The Coordinator (in fact an internal spout) opens a transaction and prepares to emit a batch, entering the processing phase of the transaction; it emits a transactional tuple (TransactionAttempt & metadata) to the "batch emit" stream.

Note:

Every tuple sent in a TransactionalTopology must carry a TransactionAttempt as its first field; Storm uses this field to determine which batch the tuple belongs to.

A TransactionAttempt contains two values: a transaction ID and an attempt ID. The transaction ID is the per-batch unique value described above, and it stays the same no matter how many times the batch is replayed. The attempt ID is unique per attempt: when a batch is replayed, the new attempt ID differs from the one before the replay.

We can read the attempt ID as a replay count; Storm uses it to distinguish between the tuples emitted by different versions of the same batch.

The metadata contains the point from which the current transaction can replay its data. It is stored in Zookeeper, and the spout serializes and deserializes it to and from Zookeeper using Kryo.

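The distinction between the two IDs can be shown with a minimal TransactionAttempt-like class. The field and method names here are illustrative stand-ins, not Storm's real class:

```java
import java.math.BigInteger;

public class TransactionAttemptSketch {
    private final BigInteger transactionId; // same across all replays of a batch
    private final long attemptId;           // changes on every replay

    public TransactionAttemptSketch(BigInteger transactionId, long attemptId) {
        this.transactionId = transactionId;
        this.attemptId = attemptId;
    }

    // Two attempts belong to the same batch iff their transaction IDs match.
    public boolean sameBatch(TransactionAttemptSketch other) {
        return transactionId.equals(other.transactionId);
    }

    // A replay of the same batch is a different attempt (different attempt ID).
    public boolean sameAttempt(TransactionAttemptSketch other) {
        return sameBatch(other) && attemptId == other.attemptId;
    }

    public static void main(String[] args) {
        TransactionAttemptSketch first  = new TransactionAttemptSketch(BigInteger.ONE, 0);
        TransactionAttemptSketch replay = new TransactionAttemptSketch(BigInteger.ONE, 1);
        System.out.println(first.sameBatch(replay));   // true: same txid
        System.out.println(first.sameAttempt(replay)); // false: replayed attempt
    }
}
```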

    • After the Emitter receives this tuple, it emits the tuples of the batch.
    • Storm detects, through the anchoring/acking mechanism, whether the transaction has completed its processing phase.
    • After the processing phase completes and all earlier transactions have committed, the Coordinator emits a tuple to the "commit" stream, and the transaction enters the commit phase.
    • Committing bolts subscribe to the "commit" stream with all grouping. After the transaction commits, the Coordinator again uses the anchoring/acking mechanism to confirm that the commit phase has completed; once it receives the ack, it marks the transaction as complete in Zookeeper.


(Figure: transaction internal processing flowchart)

3) A transactional bolt extends BaseTransactionalBolt, which processes the tuples of a batch together: execute is called once for each tuple, and finishBatch is called when the processing of the entire batch is complete. If the BatchBolt is marked as a committer, finishBatch can only be called in the commit phase. Storm guarantees that the commit phase of a batch executes only after the previous batch has been committed successfully, and it retries until all the bolts inside the topology have committed. So how does a bolt know that the processing of a batch is complete, i.e. that it has received all the tuples of the batch? Inside the bolt, this is handled by the CoordinatedBolt mechanism.
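The txid comparison in finishBatch is what makes commits idempotent: because batches commit in strict txid order, a committer only has to check whether the stored txid already equals the current one. A hedged sketch, with an in-memory map standing in for Zookeeper or a database:

```java
import java.util.HashMap;
import java.util.Map;

public class IdempotentCountStore {
    // Value stored per key: {count, txid of the transaction that wrote it}.
    private final Map<String, long[]> store = new HashMap<>();

    // Apply a batch's partial count. Because batches commit in strict txid
    // order, a stored txid equal to the current txid means this update was
    // already applied (the batch is being replayed) and must be skipped.
    public void applyBatch(String key, long txid, long batchCount) {
        long[] entry = store.get(key);
        if (entry != null && entry[1] == txid) {
            return; // replayed batch: already counted, do nothing
        }
        long previous = (entry == null) ? 0 : entry[0];
        store.put(key, new long[]{previous + batchCount, txid});
    }

    public long count(String key) {
        long[] entry = store.get(key);
        return entry == null ? 0 : entry[0];
    }

    public static void main(String[] args) {
        IdempotentCountStore store = new IdempotentCountStore();
        store.applyBatch("sales", 1, 10);
        store.applyBatch("sales", 2, 5);
        store.applyBatch("sales", 2, 5); // replay of batch 2 is ignored
        System.out.println(store.count("sales")); // 15
    }
}
```

This is exactly why the strict commit ordering matters: if batches could commit out of order, storing a single txid per key would not be enough to detect replays.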

The principles of CoordinatedBolt are as follows:

    • A CoordinatedBolt wraps the bolt that actually performs the computation; the wrapped bolt is called the real bolt.
    • Each CoordinatedBolt records two things: which tasks send tuples to it (from the topology's grouping information), and which tasks it sends tuples to (also from the grouping information).
    • After the real bolt emits a tuple, its surrounding CoordinatedBolt records which task the tuple was sent to.
    • After all tuples have been sent, the CoordinatedBolt tells every task it sent tuples to, via a separate special stream (using emitDirect), how many tuples it sent. The downstream task compares this number with the number of tuples it actually received; if they are equal, all tuples have been processed.
    • Downstream CoordinatedBolts repeat the same steps to notify their own downstream tasks.
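The counting handshake described above can be modelled in a few lines. This is a simplified single-upstream model; the real CoordinatedBolt aggregates announced counts from every upstream task:

```java
public class BatchCompletionTracker {
    private long received = 0;   // tuples actually received so far
    private long expected = -1;  // count announced by upstream; -1 = not yet known

    // Called for every ordinary batch tuple that arrives.
    public void onTuple() { received++; }

    // Called when the upstream CoordinatedBolt announces, on its special
    // stream, how many tuples it sent to this task.
    public void onCount(long sent) { expected = sent; }

    // The batch is complete once the announced count matches what we received.
    public boolean batchComplete() { return expected >= 0 && received == expected; }

    public static void main(String[] args) {
        BatchCompletionTracker tracker = new BatchCompletionTracker();
        tracker.onTuple();
        tracker.onTuple();
        tracker.onCount(3);
        System.out.println(tracker.batchComplete()); // false: one tuple missing
        tracker.onTuple();
        System.out.println(tracker.batchComplete()); // true: 3 of 3 received
    }
}
```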

One application of transactional topologies in Storm is Trident, a higher-level abstraction built on Storm's primitives and transactional mechanism to achieve consistency and exactly-once semantics; later chapters will analyze Trident.
