Storm Getting Started Tutorial, Chapter 5: Consistency and Transactions (repost)


Storm is a distributed stream processing system that uses anchor and ack mechanisms to ensure that all tuples are successfully processed. If a tuple fails, it can be re-transmitted, but how do you ensure that a re-transmitted tuple is processed only once? Storm provides a set of transactional components, transactional topology, to solve this problem.

Transactional topology is no longer maintained; its functionality is now provided by Trident, but the underlying principles are the same.

5.1 The Design of Consistent Transactions

How does Storm process tuples in parallel while still guaranteeing transactional semantics? This section starts with the simplest possible transactional design and gradually works up to the principles behind transactional topology.

5.1.1 Simple Design One: Strongly Ordered Stream

The simplest way to ensure that each tuple is processed exactly once is to turn the tuple stream into a strongly ordered sequence and process only one tuple at a time. Starting from 1, each tuple is tagged with a sequential ID. When a tuple is processed successfully, its ID and the result of the computation are stored together in the database. When the next tuple arrives, its ID is compared with the ID in the database: if they are the same, the tuple has already been processed successfully and is ignored; if they differ, then by the strong ordering the tuple has not yet been processed, so its ID and the computation result are written to the database.

Take counting the total number of messages as an example. For each tuple, if the ID stored in the database differs from the current tuple's ID, add 1 to the total message count in the database and update the stored ID to the current tuple's ID.
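As a minimal sketch of this design (not Storm code), the following uses an in-memory map as a stand-in for the database; the class and field names are invented for illustration:

import java.util.HashMap;
import java.util.Map;

// Design one: process one tuple at a time; the stored ID deduplicates replays.
public class StrongOrderCounter {
    static class State { long lastId; long count; }

    private final Map<String, State> database = new HashMap<>();
    private static final String STATE_KEY = "global-count";

    void processTuple(long tupleId) {
        State state = database.get(STATE_KEY);
        if (state != null && state.lastId == tupleId) {
            return; // this tuple was already processed successfully; ignore the replay
        }
        // By strong ordering, an unseen ID means the tuple has not been processed yet.
        State newState = new State();
        newState.lastId = tupleId;
        newState.count = (state == null ? 0 : state.count) + 1;
        database.put(STATE_KEY, newState); // the ID and the result are stored together
    }
}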

However, this mechanism forces the system to process only one tuple at a time, so the computation cannot be distributed.

5.1.2 Simple Design Two: Strongly Ordered Batch Stream

To achieve distribution, we can process a group of tuples at a time, called a batch. The tuples within a batch can be processed in parallel.

We still want to ensure that each batch is processed only once, and the mechanism is similar to the previous section, except that the database now stores a batch ID. The intermediate results for a batch are kept in local variables; once all the tuples in the batch have been processed, the batch ID is checked, and if it differs from the ID in the database, the intermediate results are written to the database.
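Under the same assumptions as the previous sketch (an in-memory map standing in for the database, invented names), the batch version might look like this:

import java.util.HashMap;
import java.util.Map;

// Design two: tuples of a batch are counted in a local variable; only
// finishBatch touches the database, and the stored batch ID deduplicates replays.
public class StrongOrderBatchCounter {
    static class State { long lastBatchId; long count; }

    private final Map<String, State> database = new HashMap<>();
    private static final String STATE_KEY = "global-count";
    private long localCount = 0; // intermediate result for the current batch

    void processTuple() {
        localCount++; // in a real topology, each task would keep its own local count
    }

    void finishBatch(long batchId) {
        State state = database.get(STATE_KEY);
        if (state == null || state.lastBatchId != batchId) {
            State newState = new State();
            newState.lastBatchId = batchId;
            newState.count = (state == null ? 0 : state.count) + localCount;
            database.put(STATE_KEY, newState);
        } // otherwise this batch has already been committed; skip it
        localCount = 0;
    }
}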

How do you know when all the tuples in a batch have been processed? You can take advantage of the CoordinatedBolt provided by Storm.


However, the strongly ordered batch stream is still limited: only one batch can be processed at a time, with no parallelism across batches. To achieve truly distributed transactions, you can use the transactional topology provided by Storm. Before that, let's look at how CoordinatedBolt works in detail.

5.1.3 How CoordinatedBolt Works

CoordinatedBolt works as follows:

    • A CoordinatedBolt is wrapped around the bolt that actually performs the computation; the wrapped bolt is called the real bolt.
    • Each CoordinatedBolt records two things: which tasks have sent it tuples (according to the topology's grouping information), and which tasks it will send tuples to (also according to the grouping information).
    • Whenever the real bolt emits a tuple, its enclosing CoordinatedBolt records which task the tuple was sent to.
    • After all tuples have been sent, the CoordinatedBolt tells every task it sent tuples to, via emitDirect on a separate special stream, how many tuples it sent. Each downstream task compares that number with the number of tuples it has actually received; when the two are equal, it knows all tuples have been processed.
    • Each downstream CoordinatedBolt then repeats the same steps to notify its own downstream tasks (a toy model of this handshake appears after the figure below).

[Figure: the whole CoordinatedBolt coordination process]
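To make the counting handshake concrete, here is a toy model in plain Java. It is illustrative only; it does not use Storm's real CoordinatedBolt API, and all class and method names are invented for this sketch:

import java.util.HashMap;
import java.util.Map;

public class CoordinationSketch {

    // The wrapper counts how many tuples the real bolt sends to each
    // downstream task, then tells each task how many tuples to expect.
    static class CoordinatedWrapper {
        private final Map<Integer, Integer> sentPerTask = new HashMap<>();

        // Called whenever the wrapped real bolt emits a tuple to a downstream task.
        void recordEmit(int downstreamTaskId) {
            sentPerTask.merge(downstreamTaskId, 1, Integer::sum);
        }

        // Once the real bolt has emitted everything for this request/batch,
        // send each downstream task (on the special stream) the count it should expect.
        void finishAndNotify(Map<Integer, DownstreamTask> tasks) {
            sentPerTask.forEach((taskId, count) -> tasks.get(taskId).receiveExpectedCount(count));
        }
    }

    static class DownstreamTask {
        private int received = 0;
        private Integer expected = null;

        void receiveTuple() { received++; checkDone(); }

        void receiveExpectedCount(int count) { expected = count; checkDone(); }

        // When the expected count has arrived and matches what was received,
        // this task knows its upstream is finished; its own wrapper would now
        // repeat the same handshake with the next layer downstream.
        private void checkDone() {
            if (expected != null && expected == received) {
                System.out.println("all tuples for this request/batch processed");
            }
        }
    }
}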

CoordinatedBolt is mainly used in two scenarios:

    • DRPC
    • Transactional topology

CoordinatedBolt is intrusive to business logic: to use the features it provides, you must ensure that the first field of every tuple emitted by each of your bolts is the request-id. "My upstream has finished" then means that the current bolt has completed the work that this request-id requires. The request-id identifies a DRPC request in DRPC, and a batch in a transactional topology.
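For instance, a bolt that obeys this contract might look like the following sketch (a hypothetical pass-through bolt using the old backtype.storm API; it is not from Storm's source):

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt honoring the CoordinatedBolt contract: the first field
// of every emitted tuple is the request-id the tuple belongs to.
public class EchoBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        Object requestId = tuple.getValue(0);                       // request-id arrives first...
        _collector.emit(new Values(requestId, tuple.getValue(1)));  // ...and is emitted first
        _collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("request-id", "result"));
    }
}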

5.1.4 Transactional Topology

Storm provides transactional topology, which divides a batch computation into two phases: processing and commit. The processing phase can handle multiple batches at the same time and in no particular order; the commit phase enforces a strong ordering between batches, processing one batch at a time, so batch 2 cannot be committed until batch 1 has committed successfully.

For example, the following code comes from TransactionalGlobalCount in storm-starter.

MemoryTransactionalSpout spout = new MemoryTransactionalSpout(DATA, new Fields("word"), PARTITION_TAKE_PER_BATCH);

TransactionalTopologyBuilder builder = new TransactionalTopologyBuilder("global-count", "spout", spout, 3);

builder.setBolt("partial-count", new BatchCount(), 5).noneGrouping("spout");

builder.setBolt("sum", new UpdateGlobalCount()).globalGrouping("partial-count");

The TransactionalTopologyBuilder takes four parameters:

    • The ID of this transactional topology. The ID is used to store the topology's progress in ZooKeeper, so that if the topology is restarted it can continue from where it left off.
    • The ID of the spout within this topology.
    • A TransactionalSpout. A transactional topology can contain only one TransactionalSpout; in this example it is a MemoryTransactionalSpout, which reads its data from an in-memory variable (DATA).
    • The parallelism of the TransactionalSpout (optional).
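A topology built this way is submitted like any other. The following is a sketch in the spirit of the storm-starter example; Config, LocalCluster, and buildTopology are standard Storm APIs, but the exact settings here are illustrative:

Config config = new Config();
config.setMaxSpoutPending(3); // limit the number of batches in flight
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("global-count-topology", config, builder.buildTopology());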

Here is the definition of BatchCount:

public static class BatchCount extends BaseBatchBolt {
    Object _id;
    BatchOutputCollector _collector;

    int _count = 0;

    @Override
    public void prepare(Map conf, TopologyContext context, BatchOutputCollector collector, Object id) {
        _collector = collector;
        _id = id;
    }

    @Override
    public void execute(Tuple tuple) {
        _count++;
    }

    @Override
    public void finishBatch() {
        _collector.emit(new Values(_id, _count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "count"));
    }
}

The last parameter of BatchCount's prepare method is the batch ID; in a transactional topology this object is a TransactionAttempt.

Every tuple emitted in a transactional topology must carry a TransactionAttempt as its first field; Storm uses this field to determine which batch each tuple belongs to.

A TransactionAttempt contains two values: a transaction ID and an attempt ID. The transaction ID plays the role described above: it is unique per batch and stays the same no matter how many times the batch is replayed. The attempt ID is also unique per batch, but when a batch is replayed the new attempt ID differs from the old one; you can think of the attempt ID as a replay counter. Storm uses it to distinguish tuples emitted by different versions of the same batch.
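For example (a hypothetical helper, not part of storm-starter), both IDs can be read off the TransactionAttempt that a transactional bolt receives in its prepare method:

import backtype.storm.transactional.TransactionAttempt;

class AttemptLogger {
    static void log(TransactionAttempt attempt) {
        System.out.println("txid=" + attempt.getTransactionId()  // stable across replays of a batch
                + ", attempt=" + attempt.getAttemptId());        // differs between replays
    }
}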

The execute method runs once for every tuple in the batch; you should keep the batch's intermediate state in local variables. In this example, execute simply increments the tuple count.

Finally, when the bolt has received all the tuples of a batch, its finishBatch method is called. At that point the BatchCount in this example emits its local count into its output stream.

The following is the definition of the UpdateGlobalCount class:

public static class UpdateGlobalCount extends BaseTransactionalBolt implements ICommitter {
    TransactionAttempt _attempt;
    BatchOutputCollector _collector;

    int _sum = 0;

    @Override
    public void prepare(Map conf, TopologyContext context, BatchOutputCollector collector, TransactionAttempt attempt) {
        _collector = collector;
        _attempt = attempt;
    }

    @Override
    public void execute(Tuple tuple) {
        _sum += tuple.getInteger(1);
    }

    @Override
    public void finishBatch() {
        Value val = DATABASE.get(GLOBAL_COUNT_KEY);
        Value newval;
        if (val == null || !val.txid.equals(_attempt.getTransactionId())) {
            newval = new Value();
            newval.txid = _attempt.getTransactionId();
            if (val == null) {
                newval.count = _sum;
            } else {
                newval.count = _sum + val.count;
            }
            DATABASE.put(GLOBAL_COUNT_KEY, newval);
        } else {
            newval = val;
        }
        _collector.emit(new Values(_attempt, newval.count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "sum"));
    }
}

UpdateGlobalCount implements the ICommitter interface, so Storm invokes its finishBatch method only during the commit phase; the execute method may run during either phase.

In UpdateGlobalCount's finishBatch method, the current transaction ID is compared with the ID stored in the database. If they are the same, the batch is ignored; if they differ, the batch's result is added to the running total and the database is updated.

[Figure: how a transactional topology runs]

Here is a summary of the main features of transactional topology:

    • Transactional topology encapsulates the transactional mechanism, using CoordinatedBolt internally to determine when all the tuples in a batch have been processed.
    • There can be only one TransactionalSpout. It divides the tuple stream into batches and guarantees that a given batch always has the same transaction ID.
    • A BatchBolt processes the tuples of a batch together: execute is called once per tuple, and finishBatch is called when the whole batch has been processed.
    • If a BatchBolt is marked as a committer, its finishBatch method can only run in the commit phase. Storm guarantees that a batch's commit phase executes only after the previous batch has committed successfully, and it retries until every bolt in the topology has committed the batch.
    • Transactional topology hides the anchor/ack framework; instead, it provides a mechanism to fail an entire batch so that the batch is replayed.

5.2 Trident Introduction

Trident is a high-level abstraction on top of Storm that provides interfaces such as joins, groupings, aggregations, functions, and filters. If you have used Pig or Cascading, these interfaces will feel familiar.

Trident groups the tuples of a stream into batches for processing, and its API encapsulates the handling of those batches while ensuring that each tuple is processed exactly once. The intermediate results of batch processing are stored in TridentState objects.
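For a taste of the API, here is the word-count example from the Trident tutorial referenced below. Split is a user-defined function that splits a sentence into words, and spout is assumed to be defined elsewhere:

TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            .groupBy(new Fields("word"))
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

Because persistentAggregate stores the running counts in a TridentState, Trident can update the state exactly once per batch even when batches are replayed.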

Trident's transactional mechanism is not covered in detail here; interested readers can consult the references below.

References:

http://xumingming.sinaapp.com/736/twitter-storm-transactional-topolgoy/

http://xumingming.sinaapp.com/811/twitter-storm-code-analysis-coordinated-bolt/

https://github.com/nathanmarz/storm/wiki/Trident-tutorial
