The core data model in Storm Trident is the "stream", processed as a series of batches. A stream is partitioned among the nodes of the cluster, and operations applied to a stream run in parallel across each partition.
Trident has five kinds of operations on streams:
1. Partition-local operations that require no network transfer
2. Repartitioning operations that require network transfer but do not change the contents of the stream
3. Aggregation operations, where network transfer is part of the operation
4. Stream grouping (groupBy) operations
5. Merges and joins
Partition-local operations:
Partition-local operations require no network transfer and run independently on each partition.
Functions
A function takes in a set of input fields and emits zero or more tuples. The output fields are appended to the original input tuple. If a function emits no tuples, the original tuple is filtered out; if it emits multiple tuples, the original tuple is duplicated for each output tuple. For example:
public class MyFunction extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
        for (int i = 0; i < tuple.getInteger(0); i++) {
            collector.emit(new Values(i));
        }
    }
}
Suppose there is an input stream called "mystream" with the fields ["a", "b", "c"] and the following tuples:
[1, 2, 3]
[4, 1, 6]
[3, 0, 8]
If you run the following code:
mystream.each(new Fields("b"), new MyFunction(), new Fields("d"))
The resulting tuples will have the four fields ["a", "b", "c", "d"]:
[1, 2, 3, 0]
[1, 2, 3, 1]
[4, 1, 6, 0]
Filters
A filter takes in a tuple and decides whether or not to keep it. For example:
public class MyFilter extends BaseFilter {
    public boolean isKeep(TridentTuple tuple) {
        return tuple.getInteger(0) == 1 && tuple.getInteger(1) == 2;
    }
}
Suppose you have the following input tuples with fields ["a", "b", "c"]:
[1, 2, 3]
[2, 1, 1]
[2, 3, 4]
Run the following code:
mystream.each(new Fields("b", "a"), new MyFilter())
The resulting tuples will be:
[2, 1, 1]
Partition aggregation (partitionAggregate)
partitionAggregate runs a function on each partition of a batch of tuples. Unlike functions, the tuples emitted by a partition aggregation replace the input tuples rather than being appended to them. Consider this example:
mystream.partitionAggregate(new Fields("b"), new Sum(), new Fields("sum"))
Suppose the input stream contains the fields "a" and "b", with the following partitions of tuples:

Partition 0:
["a", 1]
["b", 2]

Partition 1:
["a", 3]
["c", 8]

Partition 2:
["e", 1]
["d", 9]
["d", 10]

The output stream will contain a single field called "sum":

Partition 0:
[3]

Partition 1:
[11]

Partition 2:
[20]
There are three different interfaces for defining aggregators: CombinerAggregator, ReducerAggregator, and Aggregator.
CombinerAggregator:
public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}
A CombinerAggregator returns a single tuple with a single field as its output. It runs the init function on each input tuple and uses combine to combine values until only one value is left. If the partition contains no tuples, it returns the result of the zero function. For example, here's a Count implemented as a CombinerAggregator:
public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) {
        return 1L;
    }

    public Long combine(Long val1, Long val2) {
        return val1 + val2;
    }

    public Long zero() {
        return 0L;
    }
}
The benefit of a CombinerAggregator is seen when it is used with the aggregate method instead of partitionAggregate: in that case Trident automatically optimizes the computation by doing partial aggregations on each partition before any network transfer (similar to a combiner in MapReduce). The ReducerAggregator interface looks like this:
public interface ReducerAggregator<T> extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}
A ReducerAggregator produces an initial value with init and then iterates over every input tuple to reduce it into a single value, which is emitted as a single output tuple. For example, here's Count defined as a ReducerAggregator:
public class Count implements ReducerAggregator<Long> {
    public Long init() {
        return 0L;
    }

    public Long reduce(Long curr, TridentTuple tuple) {
        return curr + 1;
    }
}
A ReducerAggregator can also be used with persistentAggregate, as you will see later.
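As a minimal sketch of that combination (the stream variable mystream is an assumption, and Trident's in-memory MemoryMapState is used only for illustration), the Count ReducerAggregator above could be persisted like this:

// Hypothetical usage sketch: count tuples across all batches of the stream
// and keep the running total in an in-memory map state.
TridentState countState =
    mystream.persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));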
The most general aggregation interface is Aggregator, which looks like this:
public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T state, TridentTuple tuple, TridentCollector collector);
    void complete(T state, TridentCollector collector);
}
Aggregators can emit any number of output tuples, each containing any number of fields, and output can be emitted at any point during execution. They execute in the following manner:
1. The init method is called before the batch is processed. Its return value is a state object that is passed into the aggregate and complete methods.
2. The aggregate method is called for every input tuple in the batch partition. This method can update the state and optionally emit tuples.
3. The complete method is called once all tuples of the batch partition have been processed by aggregate.
Here's how you would implement Count as an Aggregator:
public class CountAgg extends BaseAggregator<CountState> {
    static class CountState {
        long count = 0;
    }

    public CountState init(Object batchId, TridentCollector collector) {
        return new CountState();
    }

    public void aggregate(CountState state, TridentTuple tuple, TridentCollector collector) {
        state.count += 1;
    }

    public void complete(CountState state, TridentCollector collector) {
        collector.emit(new Values(state.count));
    }
}
If you want to run multiple aggregators at the same time (a "chained" aggregation), use a call chain like this:
mystream.chainedAgg()
        .partitionAggregate(new Count(), new Fields("count"))
        .partitionAggregate(new Fields("b"), new Sum(), new Fields("sum"))
        .chainEnd()
This code runs the Count and Sum aggregators on each partition. The output contains a single tuple with the fields "count" and "sum".
stateQuery and partitionPersist
stateQuery and partitionPersist query and update sources of state, respectively. See the Trident state documentation: https://github.com/nathanmarz/storm/wiki/Trident-state
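As a hedged illustration only (the variables stream and wordCounts and the field names are assumptions; MapGet is Trident's built-in query function for map states), a stateQuery could look like this:

// Look up the stored count for each incoming "word" in an existing TridentState
// (for example, one built with persistentAggregate) and emit it as "count".
stream.stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"));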
Projection
The projection operation keeps only the specified fields of a stream. If you have a stream with the fields "a", "b", "c", "d" and you run this code:
mystream.project(new Fields("b", "d"))
the output stream will contain only the two fields "b" and "d".
Repartitioning (repartition) operations
A repartitioning operation runs a function that changes how tuples are distributed across tasks. The number of partitions can also change as a result (for example, if the parallelism hint is larger after the repartitioning). Repartitioning requires network transfer. The repartitioning functions are listed below; a short usage sketch follows the list:
1. shuffle: uses a random round-robin algorithm to evenly redistribute tuples across the target partitions
2. broadcast: every tuple is replicated to all target partitions. This is useful in DRPC, for example if you want to run a stateQuery on every partition of data.
3. partitionBy: partitions the stream semantically according to a set of fields. The target partition is computed by taking the hash of those fields modulo the number of target partitions, so partitionBy guarantees that tuples with the same fields always go to the same partition.
4. global: all tuples are sent to the same partition, and every batch of the stream goes to that same partition.
5. batchGlobal: all tuples in a batch are sent to the same partition, but different batches may go to different partitions.
6. partition: this method takes a custom partitioning function that implements the backtype.storm.grouping.CustomStreamGrouping interface.
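A minimal sketch of how these methods are called (the stream variable mystream and the field name "word" are assumptions; each call returns a new, repartitioned stream that you would normally keep chaining operations onto):

mystream.shuffle();                        // randomly and evenly redistribute tuples
mystream.broadcast();                      // replicate every tuple to all target partitions
mystream.partitionBy(new Fields("word"));  // the same "word" always lands in the same partition
mystream.global();                         // send every tuple of every batch to one partition
mystream.batchGlobal();                    // send each batch to a single partition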
Aggregation operations
Trident has the aggregate and persistentAggregate methods for doing aggregations on streams. aggregate runs independently on each batch of the stream, while persistentAggregate aggregates across all batches and stores the result in a source of state.
Running aggregate on a stream does a global aggregation. With a ReducerAggregator or an Aggregator, the stream is first repartitioned into a single partition and the aggregation function is then run on that partition. With a CombinerAggregator, Trident first does partial aggregations on each partition, then repartitions into a single partition to finish the aggregation after the network transfer. CombinerAggregators are much more efficient and should be used whenever possible.
Here's an example of using aggregate to get a global count for each batch:
mystream.aggregate(new Count(), new Fields("count"))
As with partitionAggregate, aggregators for aggregate can be chained. However, if a CombinerAggregator is chained with a non-CombinerAggregator, Trident cannot apply the partial-aggregation optimization.
Stream grouping operations
The groupBy operation repartitions the stream by doing a partitionBy on the specified fields, so that tuples whose grouping fields are equal end up in the same partition.
If you run aggregators on a grouped stream, the aggregation is run per group instead of per batch (the same as GROUP BY in a relational database). persistentAggregate can also be run on a grouped stream, in which case the results are stored in a MapState whose keys are the grouping fields. See https://github.com/nathanmarz/storm/wiki/Trident-state.
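For example, here is a hedged sketch of the classic word-count pattern (the Split function, the field names, and the use of MemoryMapState are assumptions here, mirroring the full example at the end of this article):

TridentState wordCounts =
    mystream.each(new Fields("sentence"), new Split(), new Fields("word"))  // split each sentence into words
            .groupBy(new Fields("word"))                                    // identical words end up in the same partition
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));  // per-word running count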
As with regular streams, aggregators on grouped streams can be chained.
Merges and joins
The last part of the API deals with combining different streams. The simplest way is to merge multiple streams into one stream, which you can do with TridentTopology#merge, like this:
topology.merge(stream1, stream2, stream3);
Trident names the output fields of the merged stream after the fields of the first stream.
The other way to combine streams is with a join. A SQL-style join works over finite input, but the input of a stream is unbounded, so Trident cannot join streams the way SQL does. Joins in Trident are applied only within each batch that comes off the spout.
Here's an example of a join between a stream containing the fields "key", "val1", "val2" and another stream containing the fields "x", "val1":
topology.join(stream1, new Fields("key"), stream2, new Fields("x"), new Fields("key", "a", "b", "c"));
Stream1 "Key" is associated with the "X" of stream2. Trident requires all fields to be named, because the original name will be overwritten. The input of the join will contain:
1. First, the join field. In the example, "key" in stream1 corresponds to "x" in Stream2.
2. Next, the non-join fields are listed sequentially, in the order in which they are passed to the join. In the example "a", "B" corresponds to "val1" and "Wal2" in Stream1, "C" corresponds to "Val1" in stream2.
When a join happens between streams originating from different spouts, those spouts are synchronized in how they emit batches: a batch of processing includes tuples from each spout.
You might wonder how to do a "windowed join", where tuples from one side of the join are joined against, say, the last hour of tuples from the other side.
To do that, you can use partitionPersist and stateQuery: the last hour of tuples from one side is stored in a source of state, keyed by the join field, and during the join stateQuery looks up the stored tuples to complete the join.
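A very rough sketch of that shape follows; every custom class here is hypothetical (RecentTuplesState.Factory, RecentTuplesUpdater, and RecentTuplesQuery are not part of Trident, they stand in for a state that keeps the last hour of tuples keyed by the join field):

// Side A: keep the most recent hour of tuples, keyed by "key" (hypothetical state classes).
TridentState recentTuples =
    streamA.partitionPersist(new RecentTuplesState.Factory(), new Fields("key", "val"), new RecentTuplesUpdater());

// Side B: for each incoming tuple, look up the stored tuples with the same "key" to complete the join.
streamB.stateQuery(recentTuples, new Fields("key"), new RecentTuplesQuery(), new Fields("joinedVal"));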
Finally, here is a complete example. For testing, the spout can be a FixedBatchSpout:
FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
        new Values("the cow jumped over the moon"),
        new Values("the man went to the store and bought some candy"),
        new Values("four score and seven years ago"),
        new Values("how many apples can you eat"),
        new Values("to be or not to be the person"));
spout.setCycle(true);
Or:
BrokerHosts brokerHosts = new ZkHosts(ConfigFactory.getConfigProps().getString(ConfigProps.KEY_ZOOKEEPER_KAFKA));
TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(brokerHosts, ConfigProps.TOPIC_USER, "pv");
TransactionalTridentKafkaSpout kafkaSpout = new TransactionalTridentKafkaSpout(kafkaConfig);

TridentTopology topology = new TridentTopology();
TridentState tridentState = topology.newStream("spout1", spout).parallelismHint(16)
        .each(new Fields("sentence"), new Split(), new Fields("item"))
        .each(new Fields("item"), new LowerCase(), new Fields("word"))  // lowercase each item and output it as "word"
        .groupBy(new Fields("word"))                                    // group by the "word" produced in the previous step
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))  // aggregate and persist
        .parallelismHint(6);
topology.newDRPCStream("words", localDRPC)
        .each(new Fields("args"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .stateQuery(tridentState, new Fields("word"), new MapGet(), new Fields("count"))  // use the TridentState as the input source
        .each(new Fields("count"), new FilterNull())
        .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
return topology.build();
Storm Trident Example