[Translation] Storm Trident Detailed Introduction

Source: Internet
Author: User
Tags: cassandra, emit, shuffle

1. What does Trident add on top of Storm?
2. How is Trident intelligent about maximizing the performance of a topology?
3. How does Storm ensure that each message is processed exactly once?

Trident is a high-level abstraction for doing realtime computing on top of Storm. It provides the ability to process high-throughput streams of data while offering low-latency distributed queries and stateful stream processing. If you are familiar with high-level batch processing tools such as Pig or Cascading, Trident should be easy to understand, because many of its concepts and ideas are similar: Trident provides joins, aggregations, grouping, functions, and filters. In addition, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store.


An Illustrative Example

Let's take a look at an illustrative example of Trident. This example does two things:

    • Reads an input stream of sentences and computes a running count of each word
    • Implements queries that return the sum of the current counts for a given list of words


For the purposes of illustration, this example will read the sentences from an infinite input stream that looks like the following:

    FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
        new Values("the cow jumped over the moon"),
        new Values("the man went to the store and bought some candy"),
        new Values("four score and seven years ago"),
        new Values("how many apples can you eat"));
    spout.setCycle(true);

This spout cycles through that set of sentences over and over to produce the sentence stream. The following code uses that stream as input and computes the running count of each word:


    TridentTopology topology = new TridentTopology();
    TridentState wordCounts =
        topology.newStream("spout1", spout)
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            .groupBy(new Fields("word"))
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
            .parallelismHint(6);

Let's read through this code. First we create a TridentTopology object; the TridentTopology class exposes the interface for constructing all parts of a Trident computation. When we call the newStream method of the TridentTopology, we pass in a spout; the spout reads data from an external source and emits it into the topology, creating a new stream of data within it. In this example we use the FixedBatchSpout defined above, but the input source could equally be a queueing system such as Kestrel or Kafka. Trident keeps a small amount of state in Zookeeper to track its progress through the data, and the string "spout1" we specify in the code names the Zookeeper znode under which that metadata is stored.

Trident processes the input stream as small batches of tuples. For example, the incoming sentence stream might be divided into batches as follows: [diagram of the stream divided into batches]. Generally, these small batches will contain on the order of thousands to millions of tuples, depending entirely on your input throughput.

Trident provides a fully fledged batch-processing API for operating on these small batches. The API is very similar to what you see in high-level batch tools like Pig or Cascading: you can do group bys, joins, aggregations, run functions, execute filters, and so on. Of course, processing each small batch in isolation is not particularly interesting, so Trident also provides functions for aggregating across batches and persisting those aggregations, whether in memory, in Memcached, in Cassandra, or in some other store. Finally, Trident has first-class features for querying sources of real-time state; that state can be updated by Trident or can stand alone as an independent source of state.

Back to the example: the spout emits a stream containing a single field, "sentence". The next line of the topology applies the Split function to each tuple in the stream; Split reads the "sentence" field and splits it into several word tuples. Each sentence tuple may be converted into many word tuples; for instance, "the cow jumped over the moon" is converted into six "word" tuples. Here is the definition of Split:

    public class Split extends BaseFunction {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String sentence = tuple.getString(0);
            for (String word : sentence.split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

As you can see, it's really simple: it splits the sentence on spaces and emits each word as a tuple. The rest of the topology computes the word counts and keeps the results persistently stored. First, the stream is grouped by the "word" field, and then each group is persistently aggregated using the Count aggregator. The persistentAggregate function knows how to store and update the result of an aggregation in a source of state. In this example, the word counts are kept in memory, but it is trivial to swap in another store such as Memcached or Cassandra. To store the counts in Memcached, for example, simply replace the persistentAggregate line with the following, where "serverLocations" is the list of host/port pairs for the Memcached cluster:

    .persistentAggregate(MemcachedState.transactional(serverLocations), new Count(), new Fields("count"))
The values stored by persistentAggregate are the result of aggregating over all the batches ever processed. One of the cool things about Trident is that it is fully fault-tolerant and has exactly-once processing semantics, which makes it easy to reason about your realtime processing: Trident persists state in such a way that, when failures occur, the state can be restored as needed and replayed batches do not double-count.

The persistentAggregate method transforms a data stream into a TridentState object. In this example, the TridentState object represents all the word counts, and we will use it to implement the distributed-query portion of the computation.

The next part of the topology implements a low-latency distributed query over the word counts. The query takes as input a space-separated list of words and returns the sum of their counts. These queries are executed just like normal RPC calls, except that they are parallelized across the Storm cluster behind the scenes. Here is an example of executing one of these queries:
    DRPCClient client = new DRPCClient("drpc.server.location", 3772);
    System.out.println(client.execute("words", "cat dog the man"));
    // prints the JSON-encoded result, e.g.: "[[5078]]"

As you can see, this looks like a regular RPC call, except that it executes in parallel on the Storm cluster. The latency of a simple query like this is typically around 10 milliseconds. More involved DRPC queries can of course take longer, although the latency largely depends on how many resources you allocate to the computation. The implementation of this distributed query portion of the topology is as follows:

    topology.newDRPCStream("words")
        .each(new Fields("args"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
        .each(new Fields("count"), new FilterNull())
        .aggregate(new Fields("count"), new Sum(), new Fields("sum"));

We again use the TridentTopology object, this time to create the DRPC stream, and we name the function "words". The function name corresponds to the first argument passed to the DRPC client when executing a query.

Each DRPC request is treated as its own little batch containing a single tuple. During processing, the request is represented by that one input tuple, which contains a single field called "args" holding the argument string supplied by the client. In this example, the argument is a space-separated list of words.

First, the Split function splits the argument string into its constituent words. The stream is then grouped by "word", and stateQuery is used to query the TridentState object created by the first part of the topology. stateQuery takes a source of state (in this case, the word counts our topology computed) and a function for querying that state; here, the MapGet function is invoked to fetch the count for each word. Since the DRPC stream is grouped exactly the same way as the TridentState (by the "word" field), each word's query is routed to exactly the partition of the TridentState object that manages and updates that word's count.

Next, words that were never seen are filtered out with the FilterNull filter, and the counts are summed using the Sum aggregator to produce the result. Trident then automatically sends the result back to the waiting client.

Trident is quite intelligent about how it executes a topology with maximal performance. Two interesting things happen automatically in this topology:
    • Operations that read from and write to state (such as persistentAggregate and stateQuery) automatically batch their operations against that state. So if 20 updates need to be made to a store, rather than performing 20 separate reads and 20 separate writes, Trident automatically batches them together, doing one read and one write. This gives you both convenience and very good performance.
    • Trident aggregators are heavily optimized. Rather than transferring all the tuples of a group to the same machine and then aggregating, Trident performs partial aggregations before sending anything over the network. For example, the Count aggregator first computes a count on each partition, then sends those partial counts over the network to be summed into the final count. This is the same technique as combiners in MapReduce; a sketch follows this list.
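To make the partial-aggregation point concrete, here is a minimal sketch of a count written as a combiner aggregator, using the same CombinerAggregator interface that the One aggregator later in this article implements. It mirrors what Trident's built-in Count does, though it is written here as an illustration rather than the library source; the import package names assume a pre-Apache (0.9.x era) Storm release.

    import storm.trident.operation.CombinerAggregator;
    import storm.trident.tuple.TridentTuple;

    public class CombinerCount implements CombinerAggregator<Long> {
        public Long init(TridentTuple tuple) {
            return 1L;            // each tuple contributes one to its partition's partial count
        }

        public Long combine(Long val1, Long val2) {
            return val1 + val2;   // merge two partial counts (per partition, then globally)
        }

        public Long zero() {
            return 0L;            // the count of an empty partition
        }
    }

Because init and combine can run independently on every partition, only one partial count per partition ever needs to cross the network.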
Let's look at another example of Trident.

Reach

The next example is a pure DRPC topology, one that computes the reach of a URL on demand. Reach is defined as the number of unique people exposed to a given URL on Twitter. To compute reach, you need to fetch everyone who tweeted the URL, then fetch all the followers of all those people, take the unique set of those followers, and count them. Computing reach on a single machine is far too demanding, since it can require thousands of database calls and tens of millions of tuple reads. With Storm and Trident, these computation steps can be parallelized across the whole cluster.

This topology reads from two sources of state: a database that maps URLs to the list of people who tweeted them, and a database that maps a person to the list of that person's followers. The topology is defined as follows:

    TridentState urlToTweeters =
        topology.newStaticState(getUrlToTweetersState());
    TridentState tweetersToFollowers =
        topology.newStaticState(getTweeterToFollowersState());

    topology.newDRPCStream("reach")
        .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
        .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
        .shuffle()
        .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
        .parallelismHint(200)
        .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
        .groupBy(new Fields("follower"))
        .aggregate(new One(), new Fields("one"))
        .parallelismHint(20)
        .aggregate(new Count(), new Fields("reach"));

This topology uses the newStaticState method to create TridentState objects representing the external databases, which can then be queried from within the topology. As with all sources of state, lookups against these databases are automatically batched for maximum efficiency.

The definition of the topology is very intuitive; it reads like a simple batch-processing job. First, the urlToTweeters database is queried to get the list of people who tweeted the URL in this request. That query returns a list, so the ExpandList function is applied to create one tuple per tweeter.

Next, the followers of each tweeter must be fetched. The shuffle operation distributes the tweeters evenly across all the workers running this part of the topology so that the lookups happen concurrently; the followers database is then queried to get the followers of each tweeter. You can see that this portion of the topology is given a large parallelism hint, since it is the most resource-intensive part of the whole computation.

The stream is then grouped by the "follower" field, and the One aggregator is run on each group. The One aggregator simply emits a single tuple containing the value one for each group; counting those ones then gives the number of distinct followers, which is the reach. The One aggregator is defined as follows:

    public class One implements CombinerAggregator<Integer> {
        public Integer init(TridentTuple tuple) {
            return 1;
        }

        public Integer combine(Integer val1, Integer val2) {
            return 1;
        }

        public Integer zero() {
            return 1;
        }
    }

This is a "rollup aggregator" that maximizes performance by performing a local summary of the delivery results to other worker summaries. Sum is also a rollup aggregator, so the final operation with sum as the topology is very efficient. Next, let's take a look at some of Trident's details.

Fields and tuples

The Trident data model is the TridentTuple, a named list of values. In a topology, tuples are incrementally built up through a sequence of operations. An operation generally takes a set of input fields as input and emits a set of "function fields" as output. The input fields select the subset of the incoming tuple that the operation sees, while the function fields name the operation's output. Consider an example. Suppose you have a stream called "stream" that contains the three fields "x", "y", and "z". To run a filter MyFilter that reads "y" as its input, you would write:

    stream.each(new Fields("y"), new MyFilter())


Suppose the implementation of MyFilter is this:

    public class MyFilter extends BaseFilter {
        public boolean isKeep(TridentTuple tuple) {
            return tuple.getInteger(0) < 10;
        }
    }

This keeps all tuples whose "y" field is less than 10. The TridentTuple passed as input to MyFilter will contain only the "y" field. Note that Trident is able to project a subset of a tuple extremely efficiently when input fields are selected. Now let's take a look at how "function fields" work. Suppose you had this function:

    public class AddAndMultiply extends BaseFunction {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            int i1 = tuple.getInteger(0);
            int i2 = tuple.getInteger(1);
            collector.emit(new Values(i1 + i2, i1 * i2));
        }
    }


This function takes two numbers as input and emits two new values: the sum and the product of those numbers. Suppose you had a stream containing the fields "x", "y", and "z". You would use this function like this:

    stream.each(new Fields("x", "y"), new AddAndMultiply(), new Fields("added", "multiplied"));

The function fields of the output are appended to the input tuple, so each tuple will now contain the five fields "x", "y", "z", "added", and "multiplied"; "added" and "multiplied" correspond to the first and second values emitted by AddAndMultiply. With aggregators, in contrast, the function fields replace the input tuples. If you had a stream containing the fields "val1" and "val2", you could do this:

    stream.aggregate(new Fields("val2"), new Sum(), new Fields("sum"))

The output stream would contain only a single field called "sum", representing the cumulative sum of all the "val2" values. On a grouped stream, the output contains the grouping fields followed by the fields emitted by the aggregator. For example:

    stream.groupBy(new Fields("val1"))
        .aggregate(new Fields("val2"), new Sum(), new Fields("sum"))

In this example, the output would contain the fields "val1" and "sum": for every distinct value of "val1", there is one output tuple holding that value and the sum of the matching "val2" values.
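To make these semantics concrete, here is a minimal plain-Java sketch (no Trident dependencies; the tuple values are invented for illustration) that simulates what grouping by "val1" and summing "val2" produces:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class GroupBySumDemo {
        public static void main(String[] args) {
            // Invented ["val1", "val2"] tuples for illustration.
            List<Object[]> tuples = new ArrayList<Object[]>();
            tuples.add(new Object[]{"a", 1});
            tuples.add(new Object[]{"a", 3});
            tuples.add(new Object[]{"b", 2});

            // Simulates groupBy(new Fields("val1")) followed by Sum over "val2".
            Map<String, Integer> sums = new LinkedHashMap<String, Integer>();
            for (Object[] t : tuples) {
                String key = (String) t[0];
                Integer val = (Integer) t[1];
                Integer prev = sums.get(key);
                sums.put(key, prev == null ? val : prev + val);
            }

            // Output tuples contain the grouping field plus the aggregate:
            // prints [a, 4] and [b, 2].
            for (Map.Entry<String, Integer> e : sums.entrySet()) {
                System.out.println("[" + e.getKey() + ", " + e.getValue() + "]");
            }
        }
    }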

State

One of the main problems in realtime computation is how to manage state so that updates are idempotent in the face of failures and retries. Failures cannot be eliminated: when a node dies, or something else goes wrong, batches need to be reprocessed. The question is how to do state updates so that each message is processed exactly once.

This is a tricky problem, best illustrated with an example. Suppose you are doing a count aggregation of your stream and want to store the running count in a database. If all you store in the database is the count, then when you come to apply an update, you have no way of knowing whether that same state was updated before. The update may have been attempted earlier, succeeded in the database, and then failed in a later step; or the earlier attempt may have failed while writing to the database. You simply cannot tell. Trident solves this problem by doing two things:
    • Each batch is given a unique id called the "transaction id". If a batch is retried, it has exactly the same transaction id as before.
    • State updates are ordered among batches. That is, the state updates for batch 3 are not applied until the state updates for batch 2 have succeeded.
With these two primitives, you can achieve exactly-once semantics for your state updates. Rather than storing only the count in the database, you store the count and the latest transaction id together as a single atomic value. Then, when updating the count, you compare the transaction id stored in the database with the transaction id of the current batch: if they are the same, you skip the update, because it has already been applied; if they are different, you increment the count (a sketch of this logic appears below).

Of course, you do not need to handle this logic manually in your topologies; it is encapsulated within the State abstraction and performed automatically. Nor is your State object required to implement transaction-id tracking if you do not want to pay the cost of storing the id. If you want to learn more about how to implement a State and the fault-tolerance trade-offs involved, you can refer to this article.

A State may use whatever strategy it likes to store state: it can live in an external database, or be kept in memory and backed up to HDFS. States are not required to hold state forever; for example, an in-memory State implementation might keep only the last X hours of data and discard anything older. For an example State implementation, take a look at the Memcached integration.
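As an illustration of the transaction-id technique described above, here is a minimal plain-Java sketch. StoredValue and applyBatch are invented names for this illustration and are not part of the Trident API; a real State implementation would persist the (txid, count) pair atomically in the database.

    public class TxIdCountSketch {

        // The count together with the txid of the batch that last updated it,
        // stored as one atomic value in the database.
        static class StoredValue {
            final long txid;
            final long count;
            StoredValue(long txid, long count) {
                this.txid = txid;
                this.count = count;
            }
        }

        // Apply one batch's partial count exactly once.
        static StoredValue applyBatch(StoredValue current, long batchTxid, long partialCount) {
            if (current != null && current.txid == batchTxid) {
                // This batch already updated the state before a retry; skip it.
                return current;
            }
            long base = (current == null) ? 0L : current.count;
            return new StoredValue(batchTxid, base + partialCount);
        }

        public static void main(String[] args) {
            StoredValue v = applyBatch(null, 1, 10);  // batch 1 applied: count = 10
            v = applyBatch(v, 2, 5);                  // batch 2 applied: count = 15
            v = applyBatch(v, 2, 5);                  // batch 2 retried: skipped
            System.out.println(v.count + " @ txid " + v.txid);  // prints "15 @ txid 2"
        }
    }

Because state updates are applied in batch order, comparing a single stored txid is sufficient: if the stored txid equals the current batch's txid, that exact update has already been applied.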

Execution of Trident topologies

Trident topologies compile down into Storm topologies that are as efficient as possible. Tuples are only sent over the network when a repartitioning of the data is required, such as by a groupBy or a shuffle. So, given a Trident topology like this: [diagram of an example Trident topology] it compiles into a Storm topology like the following: [diagram of the compiled Storm topology]


Conclusion

Trident makes realtime computation elegant. You have seen how high-throughput stream processing, state maintenance, and low-latency querying can be combined through Trident's API. Trident lets you express realtime computations in a natural way while still achieving maximal performance.

