Storm grouping mechanism detailed (two references included)

Source: Internet
Author: User
Reference 1:

Shuffle Grouping

Tuples from the upstream component (for example a spout) are distributed randomly across the tasks of this bolt, so that each task receives a roughly equal number of tuples.

Fields Grouping

This grouping guarantees that tuples with the same value in the grouping field always go to the same task. This is critical for a word count: if occurrences of the same word went to different tasks, each task would hold only a partial count and the totals would be wrong.
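A minimal sketch of why equal values end up on the same task, assuming the target is chosen from a hash of the grouping values modulo the number of tasks. The class and method names are hypothetical and this is a simplified model, not Storm's own code.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of fields grouping (hypothetical names, not Storm's implementation).
class FieldsGroupingSketch {
    // Picks a task index from the hash of the grouping field values.
    // floorMod keeps the result in [0, numTasks) even for negative hash codes.
    static int chooseTask(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 5;
        for (String word : Arrays.asList("nathan", "mike", "nathan", "golda", "nathan")) {
            int task = chooseTask(Arrays.<Object>asList(word), numTasks);
            System.out.println(word + " -> task " + task);
        }
    }
}
```

Every "nathan" line prints the same task index because the index depends only on the value, not on when the tuple arrives.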

All Grouping

Broadcast: each tuple is copied to every task of the bolt.

Global Grouping

All tuples in the stream are sent to a single task of the bolt, specifically the task with the lowest task ID.
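The all and global groupings can be contrasted in a few lines. This is an illustrative model with hypothetical names, assuming the consuming bolt's tasks are identified by integer IDs; it is not Storm's own code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative model of the all and global groupings (hypothetical names).
class BroadcastAndGlobalSketch {
    // All grouping: every task of the consuming bolt receives a copy.
    static List<Integer> allGrouping(List<Integer> targetTaskIds) {
        return targetTaskIds;
    }

    // Global grouping: only the task with the lowest ID receives the tuple.
    static List<Integer> globalGrouping(List<Integer> targetTaskIds) {
        return Collections.singletonList(Collections.min(targetTaskIds));
    }

    public static void main(String[] args) {
        List<Integer> tasks = Arrays.asList(7, 4, 9);
        System.out.println(allGrouping(tasks));    // prints [7, 4, 9]
        System.out.println(globalGrouping(tasks)); // prints [4]
    }
}
```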

None Grouping

This grouping says that you do not care how the stream is grouped. At present it is equivalent to shuffle grouping. The intent is for Storm to eventually run a bolt with a none grouping in the same thread as the upstream bolt or spout it subscribes to, where possible.

Direct Grouping

With direct grouping, the producer of a tuple decides which task of the consuming bolt receives it. This is a special kind of grouping: the sender of a message specifies which task of the receiver handles the message. Only streams that have been declared as direct streams can use this grouping, and tuples on such a stream must be emitted with the emitDirect method. A component can obtain the task IDs of its consumers from the TopologyContext, or from the return value of OutputCollector.emit, which returns the IDs of the tasks the tuple was sent to.

The descriptions above are excerpted from: http://xumingming.sinaapp.com/127/twitter-storm%E5%A6%82%E4%BD%95%E4%BF%9D%E8%AF%81%E6%B6%88%E6%81%af%e4%b8%8d%e4%b8%a2%e5%a4%b1/

If you know Storm, most of these groupings should be easy to understand. The one I want to highlight is fields grouping, which is also the hardest to understand.

Fields grouping partitions the stream by the values of the named fields in each tuple. Here is my test code:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("words", new TestWordSpout(), 2);
builder.setBolt("exclaim2", new DefaultStringBolt(), 5)
        .fieldsGrouping("words", new Fields("word"));

The spout is the test spout that ships with Storm. The bolt code is as follows:

public void execute(Tuple tuple) {
    // Log each received value so the handling thread shows up in the log
    LOG.info("rev a message:" + tuple.getString(0));
    collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
    collector.ack(tuple);
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
}

Storm's bundled TestWordSpout randomly emits one of the strings in new String[] {"nathan", "mike", "jackson", "golda", "bertels"}. That makes it a good spout for testing fields grouping.

When I first started with Storm, I assumed that because the grouping is by field, each bolt task would receive only one field value. Repeated tests showed otherwise: a single bolt task receives several different values. For a long time I was stuck, because I never dug into what exactly Storm's grouping guarantees.

Storm guarantees that all tuples with the same field value arrive at the same bolt task, but it does not guarantee that a bolt task handles only one value.

That is, every tuple whose value is "nathan" reaches the same bolt task, but that task may also receive other values, for example both "nathan" and "mike".

Once you understand this, you understand fields grouping.

Here is the test log:

9144 [Thread-35-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:bertels
9234 [Thread-35-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:mike
9245 [Thread-33-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:nathan
9335 [Thread-26-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:golda
9346 [Thread-26-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:golda
9437 [Thread-35-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:jackson
9447 [Thread-35-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:mike
9537 [Thread-26-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:golda
9548 [Thread-35-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:jackson
9639 [Thread-33-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:nathan
9649 [Thread-35-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:jackson
9740 [Thread-33-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:nathan
9749 [Thread-35-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:jackson
9841 [Thread-35-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:bertels
9850 [Thread-26-exclaim2] INFO  cn.pointways.dstorm.bolt.DefaultStringBolt - rev a message:golda

As the log shows, all tuples with the value golda were handled by one bolt task, on thread Thread-26-exclaim2. Each of the other values is likewise always handled by a single thread, while one thread (Thread-35-exclaim2) handles several values: bertels, mike, and jackson.
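The same reading of the log can be checked mechanically. The sketch below copies the thread number and word from each of the 15 log entries above into an array (threads are shortened to their numbers, and the class name is hypothetical), then groups them in both directions.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Tallies the (thread, value) pairs taken from the test log above.
class LogTally {
    // Pairs copied from the log: thread number, then the word it received.
    static final String[][] ENTRIES = {
        {"35", "bertels"}, {"35", "mike"}, {"33", "nathan"}, {"26", "golda"},
        {"26", "golda"}, {"35", "jackson"}, {"35", "mike"}, {"26", "golda"},
        {"35", "jackson"}, {"33", "nathan"}, {"35", "jackson"}, {"33", "nathan"},
        {"35", "jackson"}, {"35", "bertels"}, {"26", "golda"}
    };

    // Groups the entries of column valueCol by the entries of column keyCol.
    static Map<String, Set<String>> group(int keyCol, int valueCol) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String[] e : ENTRIES) {
            result.computeIfAbsent(e[keyCol], k -> new TreeSet<>()).add(e[valueCol]);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints values per thread: {26=[golda], 33=[nathan], 35=[bertels, jackson, mike]}
        System.out.println("values per thread: " + group(0, 1));
        // prints threads per value: {bertels=[35], golda=[26], jackson=[35], mike=[35], nathan=[33]}
        System.out.println("threads per value: " + group(1, 0));
    }
}
```

Each value maps to exactly one thread, while thread 35 handles three values, which is the behavior fields grouping promises.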


Reference 2:

While studying Storm's stream groupings recently, I could not quite understand fields grouping and shuffle grouping. Reading WordCountTopology did not help either, until it occurred to me to add one line of code and run it again, and then it all became clear. Evidently I had not really understood Storm's basic concepts. (For the WordCountTopology example, see storm-starter.)

public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    // This line was added to check whether equal words are handled by the
    // same instance; testing shows that they are
    System.out.println(this + "===" + word);
    Integer count = counts.get(word);
    if (count == null)
        count = 0;
    count++;
    counts.put(word, count);
    collector.emit(new Values(word, count));
}

After repeated testing, here is my personal summary. If anything is missing or wrong, I will correct it promptly.

The official documentation contains this sentence: "if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task".

A task is an instance of the processing logic, so fields grouping routes tuples by the value of a named field of the stream. The field name is the xxx declared as follows:

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("xxx"));
}

Tuples are routed by the concrete value of xxx: tuples whose xxx values are equal are handled by the same task instance.

"The concept of a field in Storm"


For example:

Suppose the bolt first emits three tuples, in which xxx takes the three values luonq, pangyang, and qinnl, and that three task instances are created to process them:

luonq -> instance1
pangyang -> instance2
qinnl -> instance3

Then it emits four tuples, in which xxx takes the four values luonq, qinnanluo, py, and pangyang, still handled by the same three task instances:

luonq -> instance1
qinnanluo -> instance2
py -> instance3
pangyang -> instance2

Finally it emits two tuples, in which xxx takes the two values py and qinnl, again handled by the same three task instances:

py -> instance3
qinnl -> instance3


Finally, here is what each of the three task instances handled, and how many times:

instance1: luonq (2 times)
instance2: pangyang (2 times), qinnanluo (1 time)
instance3: qinnl (2 times), py (2 times)
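The per-instance totals can be reproduced by replaying the three rounds. In this sketch the value-to-instance assignments are copied from the example above rather than computed, and the class name is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Replays the three rounds of the example and counts values per instance.
class ExampleTally {
    // {value, instance} pairs in emission order, copied from the example.
    static final String[][] ROUTED = {
        {"luonq", "instance1"}, {"pangyang", "instance2"}, {"qinnl", "instance3"},
        {"luonq", "instance1"}, {"qinnanluo", "instance2"}, {"py", "instance3"},
        {"pangyang", "instance2"},
        {"py", "instance3"}, {"qinnl", "instance3"}
    };

    // Returns instance -> (value -> number of times handled).
    static Map<String, Map<String, Integer>> tally() {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String[] r : ROUTED) {
            counts.computeIfAbsent(r[1], k -> new TreeMap<>()).merge(r[0], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {instance1={luonq=2}, instance2={pangyang=2, qinnanluo=1}, instance3={py=2, qinnl=2}}
        System.out.println(tally());
    }
}
```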

Conclusion:
1. Which task instance handles a value the first time it is emitted looks random (in practice it is determined by hashing the field value), but whenever that value appears again it is handled by the same task instance, until the topology ends.

2. A single task instance can handle several different values.

3. The difference between shuffle grouping and fields grouping is that with shuffle grouping, when the same value is emitted again, the task that handles it is random.
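The third conclusion can be illustrated with a toy comparison. This sketch makes two assumptions that are not taken from the text above: shuffle grouping is modeled as round-robin over a randomly ordered task list, and fields grouping as the hash of the value modulo the task count. All names are hypothetical and this is not Storm's own code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Toy comparison of shuffle and fields grouping for one repeated value.
class ShuffleVsFields {
    // Shuffle model: the word is ignored, because the choice of task
    // does not depend on the tuple's content.
    static Set<Integer> shuffleTargets(String word, int copies, int numTasks, long seed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) order.add(i);
        Collections.shuffle(order, new Random(seed));
        Set<Integer> seen = new TreeSet<>();
        for (int i = 0; i < copies; i++) seen.add(order.get(i % numTasks));
        return seen;
    }

    // Fields model: the task depends only on the value.
    static Set<Integer> fieldsTargets(String word, int copies, int numTasks) {
        Set<Integer> seen = new TreeSet<>();
        for (int i = 0; i < copies; i++) seen.add(Math.floorMod(word.hashCode(), numTasks));
        return seen;
    }

    public static void main(String[] args) {
        // Ten copies of the same word sent to a bolt with five tasks.
        System.out.println("shuffle reaches " + shuffleTargets("nathan", 10, 5, 42L).size() + " tasks"); // 5
        System.out.println("fields reaches " + fieldsTargets("nathan", 10, 5).size() + " task");        // 1
    }
}
```

Under the shuffle model the same word is spread over every task, whereas under the fields model it always lands on one.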
