Apache Storm Official Documentation -- Trident Spouts
(Reprinted from the Concurrent Programming network, ifeve.com)
As with the regular Storm API, spouts are the source of data for a Trident topology. To support more sophisticated processing guarantees, however, Trident spouts expose additional APIs on top of the ordinary Storm spout interface.
There is an inextricable link between how you source your data streams and how you update state (such as a database) based on those streams. The Trident state article explains this in detail, and understanding that link is essential for understanding the spout options available.
Most spouts used in Trident topologies are non-transactional. In a Trident topology, you can create a stream from a regular IRichSpout like this:
TridentTopology topology = new TridentTopology();
topology.newStream("myspoutid", new MyRichSpout());
Every spout in a Trident topology must be given an identifier, and that identifier must be unique across the entire Storm cluster. Trident uses this identifier to store metadata in ZooKeeper about what the spout has consumed, including the txid and other spout metadata.
You can use the following configuration options to set which ZooKeeper instance is used to store that spout metadata (translator's note: you typically do not need to set these, because by default Storm stores the data in the cluster's own ZooKeeper):
- transactional.zookeeper.servers: the list of ZooKeeper servers to use
- transactional.zookeeper.port: the port of the ZooKeeper cluster
- transactional.zookeeper.root: the root path in ZooKeeper under which the metadata is stored; metadata is stored directly under this root path
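These options can also be set programmatically on the topology configuration. Below is a minimal sketch assuming the constants exposed by org.apache.storm.Config; the server names, port, and root path are placeholder values, and, as noted above, you normally do not need to set them at all.

```java
import java.util.Arrays;

import org.apache.storm.Config;

// A sketch of overriding where Trident stores its spout metadata.
// By default Storm simply reuses the cluster's own ZooKeeper, so this is rarely needed.
public class TransactionalZkConfigExample {
    public static void main(String[] args) {
        Config conf = new Config();
        // Placeholder ZooKeeper hosts; replace with the servers you actually want to use.
        conf.put(Config.TRANSACTIONAL_ZOOKEEPER_SERVERS,
                 Arrays.asList("zk1.example.com", "zk2.example.com"));
        conf.put(Config.TRANSACTIONAL_ZOOKEEPER_PORT, 2181);
        // Root path under which txids and other spout metadata are stored.
        conf.put(Config.TRANSACTIONAL_ZOOKEEPER_ROOT, "/transactional");
    }
}
```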
Pipelining
By default, Trident processes only one batch at a time, waiting until that batch succeeds or fails before starting the next one. You can significantly increase throughput and reduce the processing latency of each batch by pipelining batches. The maximum number of batches processed concurrently is configured with topology.max.spout.pending.
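As a minimal sketch, the same property can be set programmatically through the topology Config; the value 10 below is an arbitrary example, not a recommendation.

```java
import org.apache.storm.Config;

// A sketch of enabling batch pipelining for a Trident topology:
// with this setting, up to 10 batches may be in flight at the same time.
public class PipeliningConfigExample {
    public static void main(String[] args) {
        Config conf = new Config();
        conf.setMaxSpoutPending(10);  // corresponds to topology.max.spout.pending
    }
}
```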
However, even when multiple batches are being processed at the same time, Trident orders the state updates by batch. For example, suppose you are aggregating a global count and updating the result in a database: while the count for batch 1 is being written to the database, Trident can already be computing the counts for batches 2 through 10. But it will not apply the state updates for batch 2 (or any later batch) until the state update for batch 1 has completed. This ordering is essential for achieving the exactly-once processing semantics discussed in the Trident state article.
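To make that scenario concrete, here is a minimal sketch of such a global-count topology. FixedBatchSpout and MemoryMapState are testing helpers from the Storm codebase, standing in here for a real data source and a real database-backed state.

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// A sketch of the global-count scenario: counts for many batches may be computed
// in parallel, but Trident applies the resulting state updates in batch order.
public class GlobalCountExample {
    public static void main(String[] args) {
        // A fixed test spout that emits small batches of sentences.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("the man went to the store"));

        TridentTopology topology = new TridentTopology();
        topology.newStream("counts", spout)
                // In-memory state for illustration; a real topology would use a database-backed State.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
    }
}
```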
Trident Spout Types
Some of the available spout API interfaces are listed below:
- ITridentSpout: the most general API, supporting both transactional and opaque transactional semantics. You will generally use one of its existing implementations rather than implementing the interface directly.
- IBatchSpout: a non-transactional spout that emits a batch of tuples at a time (a minimal sketch follows this list).
- IPartitionedTridentSpout: a transactional spout that reads from a partitioned data source, such as a cluster of Kafka servers.
- IOpaquePartitionedTridentSpout: an opaque transactional spout that reads from a partitioned data source.
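As a rough illustration of the simplest of these options, below is a minimal sketch of an IBatchSpout that emits the same small batch of sentences on every call. The class name, field name, and sentences are made up for the example, and the method signatures follow the org.apache.storm package layout of recent Storm releases.

```java
import java.util.Map;

import org.apache.storm.task.TopologyContext;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.spout.IBatchSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// A minimal non-transactional spout: every batch contains the same few sentences.
// Because it is non-transactional, a replayed batch is not guaranteed to contain
// the same tuples as the original emission.
public class SentenceBatchSpout implements IBatchSpout {

    @Override
    public void open(Map<String, Object> conf, TopologyContext context) {
        // No resources to set up in this sketch.
    }

    @Override
    public void emitBatch(long batchId, TridentCollector collector) {
        collector.emit(new Values("the cow jumped over the moon"));
        collector.emit(new Values("the man went to the store"));
    }

    @Override
    public void ack(long batchId) {
        // A real spout would discard any state kept for this batch here.
    }

    @Override
    public void close() {
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }

    @Override
    public Fields getOutputFields() {
        return new Fields("sentence");
    }
}
```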
Of course, as mentioned at the beginning of this article, you can also always use a regular IRichSpout instead of these APIs.