Twitter Storm: Introduction to Transactional Topology


Overview

To provide reliable data processing, Storm guarantees that each tuple is processed at least once. The most frequently asked question about this is: "Since a tuple may be replayed, how can we compute statistics over a stream accurately on Storm? Won't Storm count some tuples twice?"

Storm 0.7.0 introduced Transactional Topology, which guarantees that each tuple is processed exactly once. This lets you implement counting applications that are accurate, scalable, and highly fault tolerant.

Like Distributed RPC, transactional topology is not really a separate feature of Storm; it is an abstraction built on top of Storm's underlying primitives: spout, bolt, topology, and stream.

This article explains the abstraction behind transactional topology, how to use its API, and some details of its implementation.

Concept

Let's build up the abstraction behind transactional topology step by step. We start with the simplest possible design and improve it incrementally, arriving at the abstraction actually used in Storm's code.

First design: the simplest abstraction

The core idea behind transactional topology is to impose a strong ordering on data processing. The simplest manifestation of strong ordering is also our first design: process only one tuple at a time, and do not move on to the next tuple until the current one has been processed successfully.

Each tuple is associated with a transaction id. If a tuple fails and needs to be replayed, it is re-emitted with exactly the same transaction id. A transaction id here is simply a number that is incremented for each new tuple: the first tuple gets transaction id 1, the second gets 2, and so on.

The strong ordering of tuples allows us to achieve exactly-once semantics even when tuples are replayed. Let's look at an example:

Suppose you want to count the total number of tuples in a stream. To keep the count accurate, you must store in the database not only the count but also the latest transaction id associated with that count. When your code is about to update the count in the database, it applies the update only if the new transaction id differs from the one stored in the database. Consider the two cases:

  • The transaction id in the database differs from the current transaction id. Because transactions are strongly ordered, we know the current tuple has definitely not been counted yet, so we can safely increment the count and store the new transaction id.
  • The transaction id in the database is the same as the current one: we know this tuple has already been counted, so we can skip the update. The tuple must have failed after updating the database but before being acked to Storm (for example, due to an ack timeout).
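The update logic for these two cases can be sketched as follows, using an in-memory map as a stand-in for the database. The class and method names here are illustrative, not part of Storm's API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the "count + transaction id" idempotent update.
// A HashMap stands in for the database (names are hypothetical).
class TxCounter {
    // row[0] = count, row[1] = last committed transaction id
    private final Map<String, long[]> db = new HashMap<>();

    public void applyBatch(String key, long txId, long partialCount) {
        long[] row = db.getOrDefault(key, new long[]{0L, 0L});
        // Apply only if this transaction id has not been committed yet.
        // Strong ordering of transactions makes this one check sufficient.
        if (row[1] != txId) {
            row[0] += partialCount;
            row[1] = txId;
            db.put(key, row);
        }
        // else: this is a replay that was already counted; skip it.
    }

    public long count(String key) {
        long[] row = db.get(key);
        return row == null ? 0L : row[0];
    }
}
```

Note that replaying the same transaction id any number of times leaves the count unchanged, which is exactly the property that turns at-least-once delivery into an exactly-once count.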

This logic, combined with the strong ordering of transactions, guarantees that the count in the database is accurate even when tuples are replayed. This idea (storing the count together with the transaction id) was proposed by the Kafka developers in their design documents.

Furthermore, a topology can update many different states within one transaction and still achieve exactly-once semantics: if a replayed update was already applied, it is skipped; if it previously failed, it is applied on replay. For example, if you are processing a stream of urls, you can update both the number of forwards per url and the number of forwards per domain.

There is one big problem with this simple design: you must wait for a tuple to be fully processed before starting the next one, so performance is very poor. It requires a huge number of database calls (at least one per tuple), and it does not exploit Storm's parallel computing capability at all, so it scales very badly.

Second design

Compared with processing one tuple at a time, a better solution is to process a batch of tuples in each transaction. For a counting application, each update then adds the batch's tuple count to the running total; if the batch fails, the whole batch is replayed. Accordingly, instead of one transaction id per tuple, there is one transaction id per batch. Processing is strongly ordered between batches, but within a batch it can run in parallel. (The original design diagram is omitted here.)

So if you process 1,000 tuples per batch, your application makes 1,000 times fewer database calls. It also exploits Storm's parallel computing capability, since each batch can be processed in parallel internally.
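The effect of batching on database traffic can be illustrated with a small sketch. The class below is a hypothetical stand-in, not Storm code; it simply buffers tuples and performs one "database write" per full batch:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why batching cuts database traffic: instead of one write per
// tuple, a batch of tuples produces a single write. The "database" here is
// just two in-memory counters (names are hypothetical).
class BatchWriter {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int dbWrites = 0; // how many times we "hit the database"
    private long total = 0;   // the running count the database would hold

    BatchWriter(int batchSize) { this.batchSize = batchSize; }

    public void add(String tuple) {
        buffer.add(tuple);
        if (buffer.size() == batchSize) flush();
    }

    private void flush() {
        total += buffer.size(); // one database call covers the whole batch
        dbWrites++;
        buffer.clear();
    }

    public int dbWrites() { return dbWrites; }
    public long total() { return total; }
}
```

With a batch size of 1,000, feeding in 2,000 tuples costs only 2 writes instead of 2,000.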

Although this design is much better than the first one, it is still not perfect: workers in the topology spend a lot of time waiting for the rest of the computation to finish. For example, after bolt 1 finishes processing a batch, it must sit idle until the remaining bolts finish that batch before the next batch can be emitted.

Third design (the design adopted by Storm)

The key insight is that, to get transactional behavior, not all of the work in processing a batch of tuples needs to be strongly ordered. For example, in a global count application the computation can be split into two parts:

  • Compute the partial count for the batch.
  • Add the batch's partial count to the total in the database.

The second step must be strongly ordered across batches, but the first step need not be, so it can run in parallel: while the first batch is updating its count in the database, batches 2 through 10 can already be computing their partial counts.

Storm splits the computation of a batch into two phases to exploit this:

  • Processing phase: many batches can be computed in parallel.
  • Commit phase: batches are committed in strict order, so the second batch commits only after the first batch has committed successfully.

These two phases together are called a transaction. At any point in time many batches can be in the processing phase, but only one batch can be in the commit phase. If a batch fails in either phase, the whole transaction is replayed.
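The processing/commit split can be sketched as a small coordinator that accepts partial results in any order but applies commits strictly by transaction id. This is a single-threaded simulation of the idea, not Storm's actual implementation:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the two-phase design: batches may finish processing in any
// order (the processing phase is parallel), but their results are committed
// strictly in transaction-id order (the commit phase is serialized).
class TwoPhase {
    private long nextCommitTx = 1;
    private final Map<Long, Long> pending = new TreeMap<>();
    private long total = 0;

    // Processing phase result arrives: may happen out of order.
    public void processed(long txId, long partialCount) {
        pending.put(txId, partialCount);
        drain();
    }

    // Commit phase: only the batch with the next expected tx id may commit.
    private void drain() {
        while (pending.containsKey(nextCommitTx)) {
            total += pending.remove(nextCommitTx);
            nextCommitTx++;
        }
    }

    public long total() { return total; }
}
```

If batch 2 finishes processing before batch 1, its result simply waits in `pending`; nothing is committed until batch 1 arrives, which preserves the strong ordering the count depends on.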

Design details

When you use Transactional Topologies, Storm does the following for you:

1) State management: Storm stores all the state required by Transactional Topologies in Zookeeper, including the current transaction id and the metadata that defines each batch.

2) Transaction coordination: Storm manages everything needed to decide which batches should be processing or committing at any point in time.

3) Fault detection: Storm uses its acking framework to efficiently detect when a batch has been processed successfully, committed successfully, or failed, and then replays the batch accordingly. You do not need to do any acking or anchoring manually; Storm handles all of it.

4) Built-in batch-processing API: Storm wraps an API layer around ordinary bolts to support batching of tuples. Storm handles all the coordination, including deciding when a bolt has received all the tuples of a particular transaction, and automatically cleans up the intermediate state produced by each transaction.

5) Finally, note that Transactional Topologies require a message queue system that can replay an exact batch of messages. Technologies such as Kestrel cannot do this, but Apache Kafka fits the requirement perfectly; storm-contrib provides storm-kafka for this purpose.

A basic example

You can use TransactionalTopologyBuilder to create a transactional topology. The following transactional topology definition computes the number of tuples in the input stream; the code comes from TransactionalGlobalCount in storm-starter.

```java
MemoryTransactionalSpout spout = new MemoryTransactionalSpout(DATA, new Fields("word"), PARTITION_TAKE_PER_BATCH);
TransactionalTopologyBuilder builder = new TransactionalTopologyBuilder("global-count", "spout", spout, 3);
builder.setBolt("partial-count", new BatchCount(), 5).shuffleGrouping("spout");
builder.setBolt("sum", new UpdateGlobalCount()).globalGrouping("partial-count");
```

TransactionalTopologyBuilder accepts the following parameters:

  • The id of this transactional topology.
  • The id of the spout in the entire topology.
  • A transactional spout.
  • An optional concurrency of the transactional spout.

The topology id is used to store the topology's progress in Zookeeper, so if you restart the topology it can resume from where it left off.

A transactional topology contains a single TransactionalSpout, which is specified in the TransactionalTopologyBuilder constructor. In this example, a MemoryTransactionalSpout reads data from an in-memory variable (DATA). The second parameter specifies the fields of the data, and the third parameter specifies the maximum number of tuples per batch. How to write a custom TransactionalSpout is covered later.
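The data flow of this topology can be simulated in plain Java without a Storm dependency. This is a sketch of the logic only, not Storm's actual BatchCount and UpdateGlobalCount classes:

```java
import java.util.List;

// Plain-Java simulation of the global-count topology's data flow:
// several "partial-count" tasks each count their share of a batch,
// then a single "sum" task adds the partial counts to the running total.
class GlobalCountSim {
    // What each "partial-count" task does: count the tuples it saw
    // in the current batch.
    public static long partialCount(List<String> tuples) {
        return tuples.size();
    }

    // What the "sum" task does: add every partial count for the batch
    // to the running total and return the new total.
    public static long sum(long runningTotal, List<Long> partials) {
        for (long p : partials) runningTotal += p;
        return runningTotal;
    }
}
```

The shuffle grouping in the real topology spreads a batch's tuples across the five partial-count tasks, and the global grouping funnels all partial counts into the one sum task, mirroring the two functions above.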


