Storm from Getting Started to Mastering, Part 3: Storm's Messaging Mechanism

Storm has an important messaging mechanism: it guarantees that every message emitted by a spout is fully processed. This section explains how Storm guarantees message integrity and reliability.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com",
                                               22133,
                                               "sentence_queue",
                                               new StringScheme()));
builder.setBolt("split", new SplitSentence(), 10)
        .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 20)
        .fieldsGrouping("split", new Fields("word"));


The topology above reads sentences off a Kestrel queue, splits each sentence into its constituent words, and then emits, for each word, the number of times that word has been seen so far. A tuple coming off the spout can trigger the creation of many tuples based on it: one tuple for each word in the sentence, and one tuple for the updated count of each word. The tree structure of the messages looks like this:

When the tuple tree has been exhausted and every message in it has been processed, Storm considers the tuple coming off the spout to be "fully processed". A tuple is considered failed when its tree of messages is not fully processed within a specified timeout. This timeout can be configured per topology with Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS and defaults to 30 seconds.
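As an illustration, the timeout can be raised for topologies whose downstream work is slow. A minimal sketch using Storm's Config class (the 60-second value and the topology name are just examples):

Config conf = new Config();
// Give each spout tuple's message tree up to 60 seconds to complete
// before it is failed and replayed.
conf.setMessageTimeoutSecs(60);  // same as conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 60)
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());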

What happens when a message is fully processed or fails

To understand this, let's trace the life cycle of a tuple coming off a spout. For reference, here is the interface that spouts implement (see the Javadoc for more information):
public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void nextTuple();
    void ack(Object msgId);
    void fail(Object msgId);
}

First, Storm requests a tuple from the spout by calling the spout's nextTuple method. The spout uses the SpoutOutputCollector provided in its open method to emit a tuple to one of its output streams. When emitting a tuple, the spout provides a "message ID" that is used to identify the tuple later. For example, KestrelSpout reads a message off the Kestrel queue and, when emitting it, uses the ID that Kestrel assigned to the message as the "message ID". Emitting a message to the SpoutOutputCollector looks like this:

_collector.emit(new Values("field1", "field2", 3), msgId);

The tuple is then sent to the bolts that consume it, and Storm tracks the tree of messages created from it. If Storm detects that a tuple has been "fully processed", it calls the ack method on the spout task that emitted the tuple, passing the "message ID" that the spout provided to Storm. Likewise, if the tuple times out, Storm calls the spout's fail method. Note that a tuple can only be acked or failed by the exact spout task that created it. So even if a spout is running as many tasks across the cluster, a tuple is acked or failed only by the task that created it, never by any other task.

Let's use KestrelSpout again as an example of what a spout must do to guarantee message processing. When KestrelSpout takes a message off the Kestrel queue, it "opens" the message. This means the message is not actually taken off the queue yet; it is placed in a "pending" state while Kestrel waits for confirmation that the message has been fully processed. While pending, a message is not sent to other consumers of the queue. Additionally, if the client disconnects, all of that client's pending messages are put back on the queue. When a message is opened, Kestrel gives the client the message data along with a unique ID for the message. KestrelSpout uses that exact ID as the "message ID" when it emits the tuple to the SpoutOutputCollector. Later, when ack or fail is called on KestrelSpout, it sends an ack or fail message to Kestrel with that message ID, which takes the message off the queue or puts it back on, respectively.
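To make the pattern concrete, here is a minimal sketch of a reliable spout in the same spirit as KestrelSpout. The QueueClient class and its open/ack/requeue methods are hypothetical stand-ins for a real queue client, not part of Storm's API:

public class ReliableQueueSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private QueueClient _queue;  // hypothetical queue client

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _queue = new QueueClient("queue.example.com", 22133);
    }

    public void nextTuple() {
        // "Open" a message: it stays pending on the queue until acked.
        QueueClient.Message msg = _queue.open();
        if (msg != null) {
            // Use the queue's unique message ID as the tuple's message ID.
            _collector.emit(new Values(msg.body()), msg.id());
        }
    }

    public void ack(Object msgId) {
        _queue.ack(msgId);      // fully processed: remove from the queue for good
    }

    public void fail(Object msgId) {
        _queue.requeue(msgId);  // failed or timed out: put back for replay
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}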
Message reliability

To benefit from Storm's reliability guarantees, two conditions must be met: notify Storm when a tuple is created, and notify Storm when a tuple has finished processing.

Notify Storm when a tuple is created

Adding a child node to Storm's message tree (tuple tree) is called anchoring. Storm performs the anchoring behind the scenes when the application emits a new tuple. Returning to the streaming word-count example, consider the following code snippet:

public class SplitSentence extends BaseRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            _collector.emit(tuple, new Values(word));
        }
        _collector.ack(tuple);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
Each word tuple is anchored by passing the input tuple as the first argument to emit. Through anchoring, Storm learns the relationship between tuples (the input tuple triggered the new tuple) and can construct the entire message tree triggered by the spout tuple. So when downstream processing fails, Storm can notify the spout that the spout tuple at the root of the message tree failed, and the spout can replay it. Conversely, if the input tuple is not specified at emit time, the tuple is said to be unanchored:
_collector.emit(new Values(word));
Emitting a word tuple this way leaves it unanchored, so Storm has no message tree for it and cannot trace whether it is fully processed; a downstream failure cannot be reported back to the upstream spout task. Different applications have different fault-tolerance needs, and sometimes emitting unanchored tuples is exactly what is required.
An output tuple can be anchored to more than one input tuple; this is called multi-anchoring. Multi-anchoring is useful when merging or aggregating streams. If a multi-anchored tuple fails to be processed, all of the input tuples it is anchored to are replayed from their spouts. Multi-anchoring is done by specifying a list of input tuples instead of a single tuple. For example:
List<Tuple> anchors = new ArrayList<Tuple>();
anchors.add(tuple1);
anchors.add(tuple2);
_collector.emit(anchors, new Values(word));
Multi-anchoring adds the new output tuple to multiple message trees. Note that multi-anchoring can break the tree structure of the messages, producing a directed acyclic graph (DAG) instead; Storm's implementation supports both trees and DAGs. In this article, "message tree" is used interchangeably with DAG. An example of message relationships forming a DAG is shown in the following illustration:

Spout tuple A triggers tuples B and C, and those two tuples together act as the inputs that trigger tuple D.
Notify Storm when a tuple has finished processing

Anchoring specifies the structure of the tuple tree; the next step is to notify Storm when an individual tuple in the tuple tree has finished processing. Notification is done through the ack and fail methods on the OutputCollector. For example, in the SplitSentence class that implements the split bolt of the streaming word-count example above, the sentence is split into words, and once all the word tuples have been emitted, the input tuple is acked.
You can use the OutputCollector's fail method to immediately notify Storm that the spout tuple at the root of the current message tree has failed. For example, an application might catch an exception from a database client and fail the input tuple. By explicitly failing the tuple, the spout tuple can be replayed sooner than if it had to wait for the timeout.
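A minimal sketch of this fail-fast pattern, assuming a hypothetical database client db whose save method can throw:

public void execute(Tuple tuple) {
    try {
        db.save(tuple.getString(0));  // hypothetical database call that may throw
        _collector.emit(tuple, new Values(tuple.getString(0)));
        _collector.ack(tuple);        // success: confirm the input tuple
    } catch (Exception e) {
        _collector.fail(tuple);       // fail fast so the spout tuple is replayed
    }                                 // without waiting for the timeout
}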
Storm uses memory to track each tuple, so every tuple that is processed must be acked or failed. If you don't ack or fail every tuple, the task will eventually run out of memory.
Bolts that aggregate or join may delay acking a tuple until a result has been computed from a whole batch of tuples. Such bolts usually also multi-anchor their output tuples. A sketch of this pattern follows.
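Here is a minimal sketch (not from the original example) of a bolt that buffers input tuples, multi-anchors its aggregated output to all of them, and only then acks the batch. The batch size of 10 is arbitrary:

public class SumBatchBolt extends BaseRichBolt {
    private OutputCollector _collector;
    private List<Tuple> _batch = new ArrayList<Tuple>();
    private long _sum = 0;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        _batch.add(tuple);  // hold the input tuple; do not ack it yet
        _sum += tuple.getLong(0);
        if (_batch.size() >= 10) {
            // Multi-anchor the result to every buffered input tuple,
            // so a downstream failure replays the whole batch.
            _collector.emit(_batch, new Values(_sum));
            for (Tuple t : _batch) {
                _collector.ack(t);  // confirm the batch only after emitting
            }
            _batch.clear();
            _sum = 0;
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sum"));
    }
}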
How Storm implements reliability efficiently: acker tasks

A Storm topology has a set of special "acker" tasks that track the processing state of the message tree triggered by each spout tuple. When an acker sees that the message tree produced by a spout tuple has been fully processed, it notifies the spout task that created that spout tuple of the success. You can set the number of acker executors in a topology with the configuration item Config.TOPOLOGY_ACKER_EXECUTORS. By default, Storm sets TOPOLOGY_ACKER_EXECUTORS equal to the number of workers configured for the topology (for an introduction to executors and workers, see Understanding Storm Concurrency). For topologies that handle a large volume of messages, you will need to increase the number of acker executors.
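Setting this in code is a one-liner; a sketch (the value 4 is arbitrary):

Config conf = new Config();
// Run four acker executors instead of the default of one per worker.
conf.setNumAckers(4);  // same as conf.put(Config.TOPOLOGY_ACKER_EXECUTORS, 4)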
The life cycle of tuples

The best way to understand how Storm's reliability works is to look at the life cycle of a tuple and the DAG of tuples it forms. When a tuple is created in a spout or bolt of a topology, it is given a random 64-bit identifier (message ID). Acker tasks use these IDs to track the processing state of the DAG produced by each spout tuple. When a new tuple is produced in a bolt, the message IDs of all the spout tuples at the roots of its tuple trees are copied into it from the input tuples it is anchored to, so every tuple carries the message IDs of the spout tuples at the roots of its trees. When a tuple is acked, Storm sends a specific message to the corresponding acker task, informing it that a particular message in the tree produced by this spout tuple has been processed, and that this particular message produced these new messages in the tree (the new messages are anchored to it).
For example, suppose tuples "D" and "E" were created based on tuple "C". The following illustration shows how the tuple tree changes when "C" is acked. The tuples drawn with dashed boxes have been removed from the message tree:


Since "D" and "E" are added to the tree at the same moment "C" is removed from it, the tree can never be completed prematurely.

As mentioned above, a topology can have any number of acker tasks. This raises two questions:
1. When a tuple in the topology is acked, or a new tuple is created, which acker task should Storm notify?
2. Once an acker task is notified, how does it know which spout task to notify?


Storm maps a tuple to an acker task by hashing the spout tuple message ID carried in the tuple, modulo the number of acker tasks (so all messages in the same message tree map to the same acker task). Since every tuple carries the message IDs of the spout tuple(s) at the roots of its tuple trees, Storm can determine which acker task(s) to notify.
When a spout task emits a new tuple, it simply sends a message containing its own task ID to the corresponding acker (chosen by hashing the spout tuple's message ID), telling the acker that this spout task is responsible for the message. When the acker sees that a message tree has been fully processed, it uses the spout tuple message ID carried in the acked tuples to determine the task ID of the spout task that produced the spout tuple, and then notifies that spout task that the message tree is complete (by calling the spout task's ack method).
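The routing itself is just a deterministic function of the root message ID. A sketch of the idea, not Storm's actual internals:

// Map a spout tuple's 64-bit message ID to one of numAckers acker tasks.
// Every tuple carrying the same root ID lands on the same acker.
int ackerIndex = (int) (((rootMsgId % numAckers) + numAckers) % numAckers);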
Implementation details

For a huge tuple tree with tens of thousands of nodes (or more), tracking every tuple in the tree would exhaust the memory available to the ackers. Acker tasks therefore do not track the tuple tree explicitly (they do not record the full tree structure); instead, they use a strategy that requires only a fixed amount of space per spout tuple (about 20 bytes). This tracking algorithm is key to how Storm works and is one of its major breakthroughs.
An acker task stores a map from a spout tuple's message ID to a pair of values: spout-message-id -> <spout-task-id, ack-val>. The first value is the task ID of the spout task that created the spout tuple; it is used to notify that task when processing completes. The second value is a 64-bit number called the "ack val". It is simply the XOR of the message IDs of all the tuples that have been created and/or acked in the message tree. Each message's ID is XORed into the "ack val" once when the message is created and once when it is acked; since a XOR a = 0, when the "ack val" becomes 0 the entire tuple tree has been fully processed. The "ack val" represents the processing state of every message in the tree, no matter how large or small the tree is. Because tuple message IDs are random 64-bit integers, collisions between the message IDs of different tuples in the same tree are extremely unlikely, so the probability of an "ack val" accidentally becoming 0 is vanishingly small. Even if it did happen, it would only cause data loss if the tuple in question actually failed.
The idea of using XOR to track the state of a message tree is elegant. The number of messages can run into the tens of thousands, so tracking each one individually (the reader can think about how that would work) is inefficient and does not scale horizontally. Better still, the XOR approach does not depend on the order in which the acker receives messages.
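To see why the XOR trick works, here is a small self-contained sketch (a simulation, not Storm code) of an acker's "ack val" for a tiny tree: spout tuple A creates B and C, and then B and C are acked:

import java.util.Random;

public class AckValDemo {
    public static void main(String[] args) {
        Random rand = new Random();
        long a = rand.nextLong(), b = rand.nextLong(), c = rand.nextLong();

        long ackVal = 0;
        ackVal ^= a;          // spout emits A (A created)
        ackVal ^= a ^ b ^ c;  // A is acked, having created B and C
        ackVal ^= b;          // B is acked, creating nothing
        ackVal ^= c;          // C is acked, creating nothing

        // Each ID was XORed in exactly twice (once on creation, once on ack),
        // so the ack val returns to 0: the tree is fully processed.
        System.out.println(ackVal == 0);  // prints: true
    }
}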
Now that we understand the reliability algorithm, let's look at every failure scenario and see how Storm avoids data loss in each:
A bolt task dies: the tuples it was processing are never acked; in this scenario, the spout tuples at the roots of the affected message trees time out and are replayed.
An acker task dies: all the spout tuples the acker was tracking when it died time out and are replayed.
A spout task dies: in this scenario, the application needs its own checkpoint mechanism that records how far the spout has successfully processed; when the spout task dies and restarts, it resumes from the checkpoint, so the failed tuples are replayed.
Adjusting reliability
Acker tasks are lightweight, so a topology does not need many of them. You can observe acker performance through the Storm UI (the component with ID "__acker"). If the throughput doesn't look right, you will need to add more acker tasks.
Removing reliability

If reliability doesn't matter to you, for example if you don't care about losing messages when tuples fail, you can improve performance by not tracking tuple processing. Not tracking a tuple tree halves the number of messages transferred, since there is normally one ack message for every tuple in the tree. It also reduces the memory needed per tuple (fewer spout message IDs stored in each tuple), which cuts bandwidth usage.
There are three ways to remove reliability:
1. Set Config.TOPOLOGY_ACKERS to 0. In this case, Storm calls the spout's ack method immediately after the spout emits a tuple, and the tuple tree is not tracked.
2. Turn off tracking for an individual tuple tree by omitting the message ID parameter when calling emit in the spout.
3. If you don't care whether a particular subset of downstream tuples is processed successfully, emit them unanchored. Since they are not anchored to any spout tuple, their failure cannot cause any spout tuple to fail.
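A short sketch illustrating all three options:

// 1. Turn off all ackers: every spout tuple is acked immediately on emit.
Config conf = new Config();
conf.setNumAckers(0);  // Config.TOPOLOGY_ACKERS = 0

// 2. In a spout: emit without a message ID, so this tuple tree is not tracked.
_collector.emit(new Values("field1", "field2", 3));

// 3. In a bolt: emit without an anchor, so a downstream failure
//    cannot fail any spout tuple.
_collector.emit(new Values(word));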









