For fault tolerance, Storm uses a system-level component, the acker, combined with an XOR check mechanism to determine whether a tuple has been processed successfully; if not, the spout re-emits the tuple, which guarantees that every tuple is processed at least once when an error occurs.
But when you need to count tuples exactly, for example in a sales scenario, you want every tuple to be processed once and only once. For this, Storm 0.7.0 introduced the transactional topology, which guarantees that every tuple is processed exactly once and lets us implement counting applications that are both accurate and highly fault tolerant.
Processing tuples individually, one at a time, adds a lot of overhead, for example writing to the database far too frequently.
Handling a transaction per single tuple is inefficient, so Storm introduces batch processing: a transaction covers a whole batch of tuples.
A transaction guarantees that the batch is processed successfully; if processing fails, Storm replays the failed batch, and it ensures that each batch is processed once and only once.
The principle behind the transaction mechanism:
To process each request only once, in principle every tuple that is sent must carry a transaction id (txid). When the tuple requires transactional handling, whether it is processed again depends on whether that txid has already been processed successfully, which means the txid must be stored together with the processing result. Ordering must also be guaranteed: before the request with the current txid commits, all requests with lower txids must already have been committed.
In the case of batch processing, one batch of tuples shares one txid. To improve parallelism across batches, Storm uses a pipeline processing model, so many transactions can be executed in parallel, but commits happen in strict order.
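As a minimal sketch of this idea (plain Java, not a Storm API; the class and method names are made up for illustration), the state store keeps the result together with the txid of the batch that last updated it, so a replayed batch with an already-applied txid is detected and skipped:

```java
import java.math.BigInteger;

// Hypothetical store illustrating the txid check described above.
class TxCountStore {
    private BigInteger lastCommittedTxid = null; // txid of the last batch that was applied
    private long count = 0;                      // the accumulated result

    // Apply a batch's partial count only if this txid has not been applied yet.
    synchronized void commit(BigInteger txid, long batchCount) {
        if (txid.equals(lastCommittedTxid)) {
            return; // replay of an already-committed batch: skip it
        }
        count += batchCount;
        lastCommittedTxid = txid; // in a real store, value and txid are saved atomically together
    }

    synchronized long getCount() {
        return count;
    }
}
```

Because commits happen in strict txid order, it is enough to check the single stored txid; no history of older txids is needed.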
In Storm's transaction processing, the computation of a batch is split into two phases, the processing phase and the commit phase:
Processing phase: multiple batches can be computed in parallel;
Commit phase: batches are committed in strict order.
Transactional topology:
Processing phase: multiple batches can be computed in parallel. For example, if bolt2 is an ordinary batch bolt (IBatchBolt), then multiple batches can be processed in parallel across bolt2's tasks.
Commit phase: batches are forced to commit in order. For example, if bolt3 implements IBatchBolt and is marked as transactional (it implements the ICommitter interface, or the batch bolt is added to the topology through TransactionalTopologyBuilder's setCommitterBolt method), then Storm calls finishBatch only when the batch is allowed to commit, and finishBatch is where the txid comparison and state persistence are done. A possible wiring is sketched below.
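Here is one way such a topology could be wired, following the pattern of Storm's transactional-topologies tutorial (the spout and bolt classes are the hypothetical examples sketched later in this section, and exact method signatures may vary between Storm versions):

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.transactional.TransactionalTopologyBuilder;

public class CountTopology {
    public static void main(String[] args) {
        // The spout parallelism (2) controls the number of Emitter tasks.
        TransactionalTopologyBuilder builder =
                new TransactionalTopologyBuilder("count-tx", "spout", new MyTransactionalSpout(), 2);

        // bolt2: an ordinary batch bolt, so different batches may be processed in parallel.
        builder.setBolt("bolt2", new PartialCountBolt(), 4).shuffleGrouping("spout");

        // bolt3: a committer (it implements ICommitter), so its finishBatch runs
        // only in the commit phase, in strict txid order.
        builder.setBolt("bolt3", new GlobalCommitBolt()).globalGrouping("bolt2");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("count-tx", new Config(), builder.buildTopology());
    }
}
```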
When using transactional topologies, Storm does the following for you:
Manages state: Storm keeps all the state needed for transactional topologies in ZooKeeper, including the current transaction id and some metadata for each batch;
Coordinates transactions: Storm manages everything needed to decide, at any point in time, whether a batch should be processing or committing;
Fault detection: Storm uses the acking framework to efficiently detect when a batch has been processed successfully, committed successfully, or has failed, and then replays the corresponding batch. You do not need to do any acking or anchoring (the action that happens on emit) yourself;
Built-in batching API: Storm wraps a layer of API on top of ordinary bolts to provide batch support for tuples. Storm manages all the coordination, including deciding when a bolt has received all the tuples of a particular batch, and automatically cleans up the intermediate data produced by each transaction. A processing-phase batch bolt in this style is sketched below.
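As an illustration, here is a minimal processing-phase batch bolt (assuming the pre-1.0 backtype.storm package layout; the class name is invented): execute is called once per tuple of the batch, and finishBatch is called once Storm has decided that every tuple of the batch has been received.

```java
import java.util.Map;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseTransactionalBolt;
import backtype.storm.transactional.TransactionAttempt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Counts the tuples of one batch and emits the partial count when the batch is done.
public class PartialCountBolt extends BaseTransactionalBolt {
    private TransactionAttempt attempt;
    private BatchOutputCollector collector;
    private int count = 0;

    @Override
    public void prepare(Map conf, TopologyContext context,
                        BatchOutputCollector collector, TransactionAttempt attempt) {
        this.collector = collector;
        this.attempt = attempt;
    }

    @Override
    public void execute(Tuple tuple) {
        count++; // called for every tuple of the current batch
    }

    @Override
    public void finishBatch() {
        // Emit this batch's partial count, keyed by the TransactionAttempt.
        collector.emit(new Values(attempt, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("attempt", "count"));
    }
}
```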
A transactional spout must implement ITransactionalSpout, which contains two inner interfaces, Coordinator and Emitter. When the topology runs, the transactional spout is itself expanded into a sub-topology.
There are two kinds of tuples: transactional tuples and the tuples inside a batch;
The Coordinator opens a transaction and, when it is ready to launch a batch, enters the processing phase of the transaction and emits a transactional tuple (TransactionAttempt & metadata) on the batch emit stream;
The Emitter subscribes to the Coordinator's batch emit stream with an all grouping and is responsible for actually emitting the tuples of each batch. Every tuple it emits must carry the TransactionAttempt as its first field; Storm uses this field to determine which batch a tuple belongs to.
There is only one Coordinator, while the Emitter can have multiple instances depending on its parallelism; a skeleton is sketched below.
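A skeleton of such a spout might look like the following (assuming the pre-1.0 backtype.storm packages; the exact interface signatures can differ slightly between Storm versions, and the Long offset metadata and batch size are illustrative assumptions):

```java
import java.math.BigInteger;
import java.util.Map;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.transactional.ITransactionalSpout;
import backtype.storm.transactional.TransactionAttempt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// The metadata type (here just a Long start offset) is what Storm stores in
// ZooKeeper for each batch, serialized with Kryo.
public class MyTransactionalSpout implements ITransactionalSpout<Long> {
    private static final long BATCH_SIZE = 10; // illustrative batch size

    // Exactly one Coordinator task: decides when a batch may start and what it covers.
    class MyCoordinator implements Coordinator<Long> {
        @Override
        public Long initializeTransaction(BigInteger txid, Long prevMetadata) {
            // The replay point of this batch: start right after the previous batch.
            return prevMetadata == null ? 0L : prevMetadata + BATCH_SIZE;
        }
        @Override
        public boolean isReady() { return true; } // a new transaction may be started
        @Override
        public void close() {}
    }

    // One or more Emitter tasks: they emit the actual batch tuples; the first field
    // of every tuple must be the TransactionAttempt so Storm knows which batch it belongs to.
    class MyEmitter implements Emitter<Long> {
        @Override
        public void emitBatch(TransactionAttempt tx, Long startOffset, BatchOutputCollector collector) {
            // Must be deterministic: the same txid and metadata always yield the same tuples.
            for (long i = startOffset; i < startOffset + BATCH_SIZE; i++) {
                collector.emit(new Values(tx, i));
            }
        }
        @Override
        public void cleanupBefore(BigInteger txid) {} // metadata older than txid may be dropped
        @Override
        public void close() {}
    }

    @Override
    public Coordinator<Long> getCoordinator(Map conf, TopologyContext context) {
        return new MyCoordinator();
    }
    @Override
    public Emitter<Long> getEmitter(Map conf, TopologyContext context) {
        return new MyEmitter();
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tx", "value"));
    }
    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```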
A TransactionAttempt contains two values: a transaction id and an attempt id.
The transaction id plays the role described above: it is carried by every tuple of a batch, is unique per batch, and stays the same no matter how many times the batch is replayed.
The attempt id is unique for each attempt at a batch: for the same batch, the attempt id after a replay is different from the attempt id before it.
We can think of the attempt id as a replay counter; Storm uses it to distinguish between the different versions of the tuples emitted for a batch.
The metadata contains the point from which the current transaction can replay its data. It is stored in ZooKeeper, and the spout serializes it to and deserializes it from ZooKeeper using Kryo.
Transactional bolts:
BaseTransactionalBolt
A BaseTransactionalBolt processes a batch of tuples together: execute is called for every tuple, and finishBatch is called when the processing of the entire batch is complete. If the batch bolt is marked as a committer, finishBatch can only be called in the commit phase. The commit phase of a batch is guaranteed by Storm to run only after the previous batch has been committed successfully, and it is retried until every bolt in the topology has committed the batch. So how does Storm know that the processing of a batch is finished, i.e. that a bolt has received all of the batch's tuples? Inside the bolt there is a CoordinatedBolt model; a committer bolt is sketched below.
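A hypothetical committer bolt in that style, with an in-memory map standing in for a real database, shows the txid comparison done in finishBatch (again assuming the pre-1.0 backtype.storm packages):

```java
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseTransactionalBolt;
import backtype.storm.transactional.ICommitter;
import backtype.storm.transactional.TransactionAttempt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Marked as a committer via ICommitter, so finishBatch runs in the commit phase,
// strictly in txid order.
public class GlobalCommitBolt extends BaseTransactionalBolt implements ICommitter {
    static class Value { BigInteger txid; long count; }
    private static final Map<String, Value> DATABASE = new HashMap<>(); // stand-in for a real store
    private static final String GLOBAL_COUNT_KEY = "global-count";

    private TransactionAttempt attempt;
    private BatchOutputCollector collector;
    private long sum = 0;

    @Override
    public void prepare(Map conf, TopologyContext context,
                        BatchOutputCollector collector, TransactionAttempt attempt) {
        this.collector = collector;
        this.attempt = attempt;
    }

    @Override
    public void execute(Tuple tuple) {
        sum += tuple.getInteger(1); // accumulate the partial counts of this batch
    }

    @Override
    public void finishBatch() {
        Value stored = DATABASE.get(GLOBAL_COUNT_KEY);
        // Apply the update only if this txid has not been committed yet; on a replay
        // of an already-committed batch the stored txid matches and the update is skipped.
        if (stored == null || !stored.txid.equals(attempt.getTransactionId())) {
            Value updated = new Value();
            updated.txid = attempt.getTransactionId();
            updated.count = (stored == null ? 0 : stored.count) + sum;
            DATABASE.put(GLOBAL_COUNT_KEY, updated); // value and txid saved together
            stored = updated;
        }
        collector.emit(new Values(attempt, stored.count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("attempt", "total"));
    }
}
```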
CoordinatedBolt:
Each CoordinatedBolt records two things: which tasks sent it tuples (according to the topology's grouping information), and which tasks it sends tuples to (also based on the grouping information).
After all of its tuples have been sent, a CoordinatedBolt tells every task it sent tuples to, through emitDirect on a separate special stream, how many tuples it sent to that task. The downstream task compares this number with the number of tuples it has actually received; when they are equal, it knows that all the tuples of the batch have been processed.
The downstream CoordinatedBolt then repeats the same steps to notify its own downstream tasks; a simplified sketch of this counting check follows.
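A simplified sketch of that counting check in plain Java (not Storm's actual CoordinatedBolt code; the class and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// A downstream task can declare the batch finished once every upstream task has
// announced how many tuples it sent, and the sum of those counts equals the
// number of tuples actually received.
class BatchCompletionTracker {
    private final Map<Integer, Long> announced = new HashMap<>(); // upstream taskId -> tuples it sent here
    private final int upstreamTasks; // how many upstream tasks will report a count
    private long received = 0;       // tuples actually received so far

    BatchCompletionTracker(int upstreamTasks) {
        this.upstreamTasks = upstreamTasks;
    }

    void onTuple() {
        received++; // a normal batch tuple arrived
    }

    // Called when an upstream task reports, on the special control stream,
    // how many tuples it emitted to this task for the current batch.
    void onCountAnnounced(int upstreamTaskId, long count) {
        announced.put(upstreamTaskId, count);
    }

    boolean batchFinished() {
        if (announced.size() < upstreamTasks) {
            return false; // not every upstream task has reported yet
        }
        long expected = 0;
        for (long c : announced.values()) {
            expected += c;
        }
        return expected == received;
    }
}
```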