Before reading this post, it is recommended to first read "How Storm batch transactions work".
Why batching (Batch)?
Processing tuples one at a time adds a lot of overhead, such as writing to the database and outputting results too frequently.
Transactional processing of single tuples is inefficient, so batch processing was introduced into Storm.
Tuples are processed one batch at a time, and a transaction ensures that the batch is processed successfully; if processing fails, Storm resends the failed batch and guarantees that each batch is processed exactly once.
There are three types of transactional spout:
1. ITransactionalSpout<T>, with BaseTransactionalSpout<T>: ordinary transactional spout
2. IPartitionedTransactionalSpout<T>, with BasePartitionedTransactionalSpout<T>: partitioned transactional spout
3. IOpaquePartitionedTransactionalSpout<T>, with BaseOpaquePartitionedTransactionalSpout<T>: opaque partitioned transactional spout
There are two types of transactional bolt:
1. IBatchBolt<T>, with BaseBatchBolt<T>: ordinary batch processing
2. BaseTransactionalBolt: transactional bolt
Implementing the ICommitter interface marks an IBatchBolt or BaseTransactionalBolt as a committer; such batch bolts are wrapped internally by CoordinatedBolt. The wiring sketch below shows how a transactional spout, batch bolts, and a committer fit together.
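To see how these pieces fit together, here is a minimal wiring sketch modeled on the classic global-count example from the Storm documentation. It assumes the pre-1.0 backtype.storm package layout; MemoryTransactionalSpout is Storm's built-in in-memory test spout, and BatchCount and UpdateGlobalCount are the batch bolts sketched in the BOLT section later in this post.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.MemoryTransactionalSpout;
import backtype.storm.transactional.TransactionalTopologyBuilder;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class GlobalCountTopology {
    public static void main(String[] args) throws Exception {
        // In-memory partitioned data source; in production this would be Kafka, RocketMQ, etc.
        Map<Integer, List<List<Object>>> data = new HashMap<Integer, List<List<Object>>>();
        List<List<Object>> partition0 = new ArrayList<List<Object>>();
        partition0.add(new Values("apple"));
        partition0.add(new Values("banana"));
        data.put(0, partition0);

        // Emit up to 3 tuples per partition per batch.
        MemoryTransactionalSpout spout =
                new MemoryTransactionalSpout(data, new Fields("word"), 3);

        // "global-count" names the transaction state kept in ZooKeeper.
        TransactionalTopologyBuilder builder =
                new TransactionalTopologyBuilder("global-count", "spout", spout, 2);
        builder.setBolt("partial-count", new BatchCount(), 3).noneGrouping("spout");
        // UpdateGlobalCount implements ICommitter, so its finishBatch() runs in txid order.
        builder.setBolt("sum", new UpdateGlobalCount()).globalGrouping("partial-count");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("global-count-topology", new Config(), builder.buildTopology());
    }
}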
Spout
1. Spout: ordinary transactional spout, ITransactionalSpout
Interface ITransactionalSpout.Coordinator<X>
Method Summary
void close()
X initializeTransaction(java.math.BigInteger txid, X prevMetadata): creates the metadata for a new transaction; when isReady() returns true, this metadata is emitted as a transaction tuple on the batch emit stream
boolean isReady(): 1. when it returns true, a transaction is opened and enters the processing phase; a transactional tuple is emitted on the batch emit stream, and the Emitter subscribes to the Coordinator's batch emit stream with a broadcast (all) grouping
Interface ITransactionalSpout.Emitter<X>
Method Summary
void cleanupBefore(java.math.BigInteger txid): cleans up the information kept for earlier transactions
void close()
void emitBatch(TransactionAttempt tx, X coordinatorMeta, BatchOutputCollector collector): 2. the Emitter receives the transaction tuple and emits the batch, sending the batch's tuples one by one
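As a concrete illustration of the Coordinator/Emitter split, here is a minimal sketch of an ITransactionalSpout that reads a shared in-memory log by offset. The RangeMeta metadata class, the field names, and the batch size are hypothetical, and the package names assume pre-1.0 backtype.storm.

import java.io.Serializable;
import java.math.BigInteger;
import java.util.Map;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.transactional.ITransactionalSpout;
import backtype.storm.transactional.TransactionAttempt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical batch metadata: which slice of the source log one batch covers.
class RangeMeta implements Serializable {
    long start;  // first offset of this batch
    int count;   // number of tuples in this batch
}

public class SketchTransactionalSpout implements ITransactionalSpout<RangeMeta> {
    private static final int BATCH_SIZE = 10;

    // The Coordinator runs as a single task and only decides what each batch covers.
    static class SketchCoordinator implements ITransactionalSpout.Coordinator<RangeMeta> {
        public RangeMeta initializeTransaction(BigInteger txid, RangeMeta prevMetadata) {
            RangeMeta meta = new RangeMeta();
            meta.start = (prevMetadata == null) ? 0 : prevMetadata.start + prevMetadata.count;
            meta.count = BATCH_SIZE;
            return meta; // sent to the Emitters as the transaction tuple's metadata
        }
        public boolean isReady() { return true; } // a real spout would check source availability
        public void close() {}
    }

    // Emitters run in parallel and emit the actual tuples described by the metadata.
    static class SketchEmitter implements ITransactionalSpout.Emitter<RangeMeta> {
        public void emitBatch(TransactionAttempt tx, RangeMeta meta, BatchOutputCollector collector) {
            for (long offset = meta.start; offset < meta.start + meta.count; offset++) {
                // The TransactionAttempt is emitted as the first field of every tuple.
                collector.emit(new Values(tx, "record-" + offset));
            }
        }
        public void cleanupBefore(BigInteger txid) {} // drop state kept for older transactions
        public void close() {}
    }

    public ITransactionalSpout.Coordinator<RangeMeta> getCoordinator(Map conf, TopologyContext context) {
        return new SketchCoordinator();
    }
    public ITransactionalSpout.Emitter<RangeMeta> getEmitter(Map conf, TopologyContext context) {
        return new SketchEmitter();
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("txid", "record"));
    }
    public Map<String, Object> getComponentConfiguration() { return null; }
}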
2. Spout: partitioned transactional spout, IPartitionedTransactionalSpout<T>
The partitioned transactional spout is the most widely used transactional spout, because the mainstream message queues all support partitioning. Partitioning increases the throughput of the MQ (each partition acts as a data source / send point); mainstream MQs such as Kafka and RocketMQ work this way.
Interface IPartitionedTransactionalSpout.Coordinator
Method Summary
void close()
boolean isReady(): when it returns true, a transaction is opened and enters the processing phase; a transactional tuple is emitted on the batch emit stream, and the Emitter subscribes to the Coordinator's batch emit stream with a broadcast (all) grouping
int numPartitions(): returns the number of partitions. When a new data source partition is added and a transaction is replayed, tuples from the new partition are not emitted, because the transaction already knows how many partitions it covers.
Interface IPartitionedTransactionalSpout.Emitter<X>
Method Summary
void close()
void emitPartitionBatch(TransactionAttempt tx, BatchOutputCollector collector, int partition, X partitionMeta): if the batch fails in a downstream bolt, emitPartitionBatch is responsible for re-emitting exactly the same batch
X emitPartitionBatchNew(TransactionAttempt tx, BatchOutputCollector collector, int partition, X lastPartitionMeta): emits a new batch and returns its metadata
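Under the assumption of a Kafka-like source where each partition is read by offset, the Emitter side might look like the sketch below. OffsetMeta and the fetch helper are hypothetical; the point is that emitPartitionBatchNew decides the range of a fresh batch and returns its metadata, while emitPartitionBatch replays exactly the range recorded in that metadata.

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.transactional.TransactionAttempt;
import backtype.storm.transactional.partitioned.IPartitionedTransactionalSpout;
import backtype.storm.tuple.Values;

// Hypothetical per-partition metadata that Storm stores for each (txid, partition).
class OffsetMeta implements Serializable {
    long startOffset;  // first offset covered by this batch
    int count;         // number of messages in this batch
}

class SketchPartitionedEmitter implements IPartitionedTransactionalSpout.Emitter<OffsetMeta> {
    private static final int BATCH_SIZE = 100;

    // First emission of a batch for this (txid, partition): choose its range and return it.
    public OffsetMeta emitPartitionBatchNew(TransactionAttempt tx, BatchOutputCollector collector,
                                            int partition, OffsetMeta lastPartitionMeta) {
        OffsetMeta meta = new OffsetMeta();
        meta.startOffset = (lastPartitionMeta == null)
                ? 0 : lastPartitionMeta.startOffset + lastPartitionMeta.count;
        meta.count = BATCH_SIZE;
        emitPartitionBatch(tx, collector, partition, meta); // emit the tuples themselves
        return meta; // stored by Storm; a replay reuses it unchanged
    }

    // Replay: re-emit exactly the range recorded in the stored metadata.
    public void emitPartitionBatch(TransactionAttempt tx, BatchOutputCollector collector,
                                   int partition, OffsetMeta meta) {
        for (String msg : fetch(partition, meta.startOffset, meta.count)) {
            collector.emit(new Values(tx, msg)); // the TransactionAttempt is the first field
        }
    }

    public void close() {}

    // Placeholder for a real Kafka/RocketMQ read.
    private List<String> fetch(int partition, long start, int count) {
        return new ArrayList<String>();
    }
}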
3. Spout: opaque partitioned transactional spout, IOpaquePartitionedTransactionalSpout
Interface IOpaquePartitionedTransactionalSpout.Coordinator
Method Summary
void close()
boolean isReady(): same as above
IOpaquePartitionedTransactionalSpout does not distinguish between emitting a new batch and re-emitting an old one; both go through emitPartitionBatch. The X returned by emitPartitionBatch is the metadata for the next batch (it is passed back in as the 4th parameter), and only the metadata of the last successfully processed batch is updated to ZooKeeper; if a batch fails and is re-emitted, the X that emitPartitionBatch reads is still the old one. Therefore the custom metadata X does not need to record both the start position of the current batch and the start position of the next batch; it only needs to record the start position of the next batch, for example:
public class BatchMeta {
    public long nextOffset; // start offset of the next batch
}
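Reusing the BatchMeta class above, an Emitter for an opaque partitioned spout could look roughly like this sketch (the fetch helper and batch size are hypothetical; package names assume pre-1.0 backtype.storm). Note that there is only one emitPartitionBatch method, used for both new batches and replays, and it returns the metadata for the next batch.

import java.util.ArrayList;
import java.util.List;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.transactional.TransactionAttempt;
import backtype.storm.transactional.partitioned.IOpaquePartitionedTransactionalSpout;
import backtype.storm.tuple.Values;

class SketchOpaqueEmitter implements IOpaquePartitionedTransactionalSpout.Emitter<BatchMeta> {
    private static final int BATCH_SIZE = 100;

    // Called for new batches and for replays alike; lastPartitionMeta is whatever was
    // last committed to ZooKeeper for this partition (null before the first batch).
    public BatchMeta emitPartitionBatch(TransactionAttempt tx, BatchOutputCollector collector,
                                        int partition, BatchMeta lastPartitionMeta) {
        long start = (lastPartitionMeta == null) ? 0 : lastPartitionMeta.nextOffset;
        // May return fewer messages (or none) if the partition is currently unreadable.
        List<String> messages = fetch(partition, start, BATCH_SIZE);
        for (String msg : messages) {
            collector.emit(new Values(tx, msg));
        }
        BatchMeta next = new BatchMeta();
        next.nextOffset = start + messages.size(); // only the next start position is recorded
        return next;
    }

    public int numPartitions() { return 16; } // e.g. the number of MQ partitions
    public void close() {}

    // Placeholder for a real Kafka/RocketMQ read.
    private List<String> fetch(int partition, long start, int count) {
        return new ArrayList<String>();
    }
}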
Both IPartitionedTransactionalSpout and IOpaquePartitionedTransactionalSpout
package tuples into batches for processing, guarantee that every tuple is fully processed, and support message replay. To support transactions they assign each batch a unique transaction ID (txid). Txids increase sequentially, and batches are processed in strict order: the batch with txid=1 must be fully processed before the batch with txid=2 can be processed.
The differences between them and how to use them:
Each tuple of an IPartitionedTransactionalSpout is bound to a fixed batch. No matter how many times a tuple is re-sent, it always belongs to the same batch with the same transaction ID, and a tuple never appears in two different batches. No matter how many times a batch is re-sent, it keeps one and the same transaction ID. This means that however many times a batch is re-sent, its contents are exactly the same.
IPartitionedTransactionalSpout does have a problem, although it is rare: suppose a batch fails while being consumed in a bolt and needs to be replayed by the spout, and at that moment the message middleware also happens to fail, for example one partition becomes unreadable. Because the spout must guarantee that the replayed batch contains exactly the same tuples, it can only wait for the message middleware to recover; it is stuck and cannot send any more messages to the bolts until the middleware is back. IOpaquePartitionedTransactionalSpout solves this problem.
To solve this problem, IOpaquePartitionedTransactionalSpout does not guarantee that a re-sent batch contains exactly the same tuples. This means a tuple may first appear in the batch with txid=2 and later appear in the batch with txid=5. This only happens when a batch fails, needs to be replayed, and the message middleware happens to fail at the same time. In that case IOpaquePartitionedTransactionalSpout does not wait for the middleware to recover; it reads the partitions that are still available. For example, the txid=2 batch fails during consumption and must be replayed, and exactly then 1 of the message middleware's 16 partitions (partition=3) becomes unreadable because of a fault. IOpaquePartitionedTransactionalSpout then reads the other 15 partitions and completes the replay of the txid=2 batch; the replayed batch actually contains fewer tuples than before. Suppose the middleware recovers around txid=5: the tuples that previously came from partition=3 in txid=2 are then re-sent and end up in the txid=5 batch.
When using IOpaquePartitionedTransactionalSpout, because the mapping between tuples and txids can change, storing only a txid alongside the business computation result is no longer enough to guarantee transactional semantics. The solution is a bit more involved: besides the business result itself, two more things must be saved: the result of the previous batch and the transaction ID of the current batch.
Take the simpler example of computing a global count, and suppose the current statistics are:
{
    value = 4,
    prevValue = 1,
    txid = 2
}
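Here is a minimal sketch of the commit logic behind this structure (the load/save helpers and class names are hypothetical): on each commit, compare the stored txid with the txid of the batch being committed; if they match, the batch is a replay and the new value is recomputed from prevValue, otherwise the window rolls forward.

// Sketch of opaque-transactional state update for a global count.
class Stats implements java.io.Serializable {
    long value;      // count as of the last committed batch
    long prevValue;  // count as of the batch before that
    long txid;       // transaction id that produced `value`
}

class OpaqueCountCommitter {
    void commit(long currTxid, long batchCount) {
        Stats stats = load(); // read current state from the external store
        if (stats.txid == currTxid) {
            // This txid was already applied once, but with an opaque spout the replayed
            // batch may contain different tuples, so recompute from prevValue.
            stats.value = stats.prevValue + batchCount;
        } else {
            // First time this txid is committed: roll the window forward.
            stats.prevValue = stats.value;
            stats.value += batchCount;
            stats.txid = currTxid;
        }
        save(stats);
    }

    private Stats load() { return new Stats(); }  // placeholder for a database read
    private void save(Stats stats) {}             // placeholder for a database write
}

With the numbers above, if the txid=2 batch is replayed with a partial count of 5, value becomes prevValue + 5 = 6 rather than being added on top of the possibly stale 4.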
BOLT
I. Interface IBatchBolt<T>: ordinary batch processing
Method Summary
void execute(Tuple tuple): called once for each tuple in the batch
void finishBatch(): called after the whole batch has been processed
void prepare(java.util.Map conf, TopologyContext context, BatchOutputCollector collector, T id): called before the batch is processed; the last parameter is the batch id
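As an example of an ordinary batch bolt, here is the per-batch counter used in the wiring sketch earlier in this post, modeled on the BatchCount example from the Storm documentation (package names assume pre-1.0 backtype.storm).

import java.util.Map;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBatchBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class BatchCount extends BaseBatchBolt<Object> {
    private Object id;                      // the batch id passed as prepare's 4th argument
    private BatchOutputCollector collector;
    private int count = 0;

    public void prepare(Map conf, TopologyContext context, BatchOutputCollector collector, Object id) {
        this.collector = collector;
        this.id = id;
    }

    public void execute(Tuple tuple) {
        count++;                            // called once per tuple of the batch
    }

    public void finishBatch() {
        collector.emit(new Values(id, count)); // called once, after the whole batch is processed
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "count"));
    }
}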
II. Class BaseTransactionalBolt: transactional batch processing
The only difference between a transactional batch bolt and an ordinary batch bolt is the fourth parameter of prepare: in IBatchBolt the last parameter of prepare is of type Object (the batch id), while in a transactional bolt it is a TransactionAttempt, as in the sketch below.
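Here is a sketch of the corresponding transactional committer bolt, UpdateGlobalCount, referenced in the wiring sketch earlier. Note that prepare now receives a TransactionAttempt, and that implementing ICommitter makes finishBatch() run during the commit phase in txid order; the external store access is only indicated in comments.

import java.util.Map;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseTransactionalBolt;
import backtype.storm.transactional.ICommitter;
import backtype.storm.transactional.TransactionAttempt;
import backtype.storm.tuple.Tuple;

// Implementing ICommitter makes finishBatch() run in the commit phase, in txid order.
public class UpdateGlobalCount extends BaseTransactionalBolt implements ICommitter {
    private TransactionAttempt attempt;  // prepare's 4th argument is now a TransactionAttempt
    private int sum = 0;

    public void prepare(Map conf, TopologyContext context,
                        BatchOutputCollector collector, TransactionAttempt attempt) {
        this.attempt = attempt;
    }

    public void execute(Tuple tuple) {
        sum += tuple.getInteger(1);      // accumulate the partial counts from upstream
    }

    public void finishBatch() {
        long txid = attempt.getTransactionId().longValue();
        // Hypothetical external store: only apply the update if this txid has not
        // been committed yet, which keeps the global count exactly-once, e.g.:
        // if (txid != storedTxid) { storedValue += sum; storedTxid = txid; save(...); }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt writes to an external store and emits nothing.
    }
}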
A detailed description of the Storm batch transaction API