Some common topology patterns in Storm

Original address: http://storm.apache.org/documentation/Common-patterns.html
A Chinese translation by Xu Mingming is available at: http://xumingming.sinaapp.com/189/twitter-storm-storm%E7%9A%84%E4%B8%80%E4%BA%9B%E5%B8%B8%E8%A7%81%E6%A8%A1%E5%BC%8F/
The latest documentation has been updated slightly since that translation, so this is a re-collation. This article lists some common patterns of Storm topologies:

  • Stream aggregation (joins)

  • Batch processing

  • BasicBolt

  • In-memory caching + fields grouping combination

  • Streaming top N calculation

  • Using TimeCacheMap to efficiently keep a cache of recently updated data

  • Distributed RPC: CoordinatedBolt and KeyedFairBolt

Aggregation (Joins)

Stream aggregation (a streaming join) combines two or more data streams based on common fields. Whereas an ordinary database join has finite input and clearly defined semantics, a streaming join has infinite input and its semantics are far less clear-cut.

Each application needs a join with different semantics. Some applications join all tuples from two streams over a finite window of time; others expect exactly one tuple from each side of the join per join field; and still others have entirely different join logic. The common pattern among all these join types is partitioning the multiple input streams in the same way. This is easy to do in Storm by using a fields grouping on the same fields for every input stream into the joiner bolt, for example:

builder.setBolt("join", new MyJoiner(), parallelism)
  .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
  .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
  .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));

Of course, the "same" fields can have different names across the different streams.
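As a rough illustration (not part of the original documentation), here is a minimal sketch of what MyJoiner might look like, assuming the older backtype.storm package names and exactly one tuple per stream per join key; the class, fields, and buffering strategy are all assumptions:

import java.util.*;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical joiner: buffers tuples per join key until one tuple from each
// of the three input streams ("1", "2", "3") has arrived, then emits the join.
public class MyJoiner extends BaseRichBolt {
  private OutputCollector collector;
  // join key -> pending tuples, keyed by the component they came from
  private Map<List<Object>, Map<String, Tuple>> pending;

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    this.pending = new HashMap<>();
  }

  @Override
  public void execute(Tuple tuple) {
    List<Object> key = tuple.select(new Fields("joinfield1", "joinfield2"));
    Map<String, Tuple> parts = pending.computeIfAbsent(key, k -> new HashMap<>());
    parts.put(tuple.getSourceComponent(), tuple);

    if (parts.size() == 3) {  // one tuple from each input stream has arrived
      // multi-anchor the output to all three inputs for reliability;
      // a real joiner would also emit the joined (non-key) fields
      collector.emit(new ArrayList<>(parts.values()),
                     new Values(key.get(0), key.get(1)));
      for (Tuple t : parts.values()) {
        collector.ack(t);     // ack only once the join has been emitted
      }
      pending.remove(key);
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("joinfield1", "joinfield2"));
  }
}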

Batch processing (Batching)

Sometimes, for performance or other reasons, you want to process a group of tuples as a batch rather than individually. For example, you might want to batch updates to a database or do a streaming aggregation of some sort.

If you want reliability in your data processing, the correct approach is to hold on to references to the tuples while the bolt buffers them into a batch. Once the batch operation completes, you ack all of those tuples.

If the bolt emits tuples of its own, you may want to use multi-anchoring to ensure reliability; it all depends on the specific application. See Guaranteeing message processing for more details on how reliable processing works.
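As an illustration (not from the original documentation), here is a minimal sketch of a batching bolt that buffers tuple references and acks them only after the batch is flushed; BATCH_SIZE and flushBatch are hypothetical:

import java.util.*;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical batching bolt: holds tuple references until the batch has been
// written out, then acks them so the spout knows they were fully processed.
public class BatchingDatabaseBolt extends BaseRichBolt {
  private static final int BATCH_SIZE = 100;
  private OutputCollector collector;
  private List<Tuple> buffer;

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    this.buffer = new ArrayList<>();
  }

  @Override
  public void execute(Tuple tuple) {
    buffer.add(tuple);          // keep the reference; do NOT ack yet
    if (buffer.size() >= BATCH_SIZE) {
      flushBatch(buffer);       // e.g. one bulk insert into the database
      for (Tuple t : buffer) {
        collector.ack(t);       // ack only once the batch has succeeded
      }
      buffer.clear();
    }
  }

  private void flushBatch(List<Tuple> batch) {
    // write the whole batch to the external store in one operation
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // this bolt emits nothing; it only writes batches to the database
  }
}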

BasicBolt

Many bolts follow a similar pattern:

  • read an input tuple

  • emit zero or more tuples based on that input tuple

  • ack the input tuple immediately, at the end of the execute method

Bolts that follow this pattern are things like functions and filters. This pattern is so common that Storm encapsulates it in a single interface: IBasicBolt. See Guaranteeing message processing for more information.
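For example, a simple filter written against this pattern might look like the following sketch (not from the original document; the field names are made up). With a BasicBolt, emitted tuples are anchored to the input and the input is acked for you:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A filter in the BasicBolt style: emit zero or one tuple per input,
// and let Storm handle anchoring and acking automatically.
public class PositiveCountFilter extends BaseBasicBolt {
  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    long count = tuple.getLongByField("count");
    if (count > 0) {
      collector.emit(new Values(tuple.getValueByField("value"), count));
    }
    // no explicit ack needed: the input tuple is acked when execute returns
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("value", "count"));
  }
}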

In-memory caching + fields grouping combination (In-memory caching + fields grouping combo)

It is common to keep an in-memory cache in a Storm bolt. Caching becomes particularly powerful when you combine it with a fields grouping. For example, suppose you have a bolt that expands short URLs (such as bit.ly or t.co links) into long URLs. You can improve performance and avoid issuing the same HTTP request repeatedly by keeping an LRU cache of short-URL to long-URL mappings. Suppose component "urls" emits short URLs and component "expand" expands them into long URLs, keeping a cache internally. Consider the difference between the following two snippets of code:

builder.setBolt("expand", new ExpandUrl(), parallelism)
  .shuffleGrouping("urls");

builder.setBolt("expand", new ExpandUrl(), parallelism)
  .fieldsGrouping("urls", new Fields("url"));

The second approach has vastly more effective caching than the first, because the same URL is always routed to the same task. This avoids duplicating the cache across multiple tasks and makes it far more likely that a given short URL hits the cache.
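For illustration (not from the original documentation), an ExpandUrl bolt with an LRU cache might look roughly like this sketch; the cache size, the resolveOverHttp helper, and the field names are assumptions:

import java.util.LinkedHashMap;
import java.util.Map;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Expands short URLs to long URLs, keeping an LRU cache so the same
// short URL is not fetched over HTTP more than once per task.
public class ExpandUrl extends BaseBasicBolt {
  private static final int CACHE_SIZE = 10000;
  private Map<String, String> cache;

  @Override
  public void prepare(Map conf, TopologyContext context) {
    // LinkedHashMap in access order + removeEldestEntry gives a simple LRU cache
    cache = new LinkedHashMap<String, String>(CACHE_SIZE, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > CACHE_SIZE;
      }
    };
  }

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String shortUrl = tuple.getStringByField("url");
    String longUrl = cache.get(shortUrl);
    if (longUrl == null) {
      longUrl = resolveOverHttp(shortUrl);  // hypothetical HTTP lookup
      cache.put(shortUrl, longUrl);
    }
    collector.emit(new Values(shortUrl, longUrl));
  }

  private String resolveOverHttp(String shortUrl) {
    // follow the short link's redirect and return the final URL
    return shortUrl;  // placeholder
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("url", "long-url"));
  }
}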

Streaming top N calculation (Streaming top N)

A common continuous computation done with Storm is a "streaming top N" of some sort. Suppose you have a bolt that emits tuples of the form ["value", "count"] and you want a bolt that emits the top N of those tuples by count. The simplest way to do this is to have a bolt that does a global grouping on the stream and maintains an in-memory list of the top N items.

This approach obviously does not scale to large streams, because the entire stream has to pass through a single task. A better approach is to compute partial top N's in parallel across partitions of the stream on many machines, and then have a single bolt merge those partial results into the final top N (the MapReduce idea). The pattern looks like this:

builder.setBolt("rank", new RankObjects(), parallelism)
  .fieldsGrouping("objects", new Fields("value"));

builder.setBolt("merge", new MergeObjects())
  .globalGrouping("rank");

This pattern works because of the fields grouping done by the first bolt, which produces the partitioning needed for the parallel algorithm to be semantically correct.
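As a rough sketch (not from the original documentation and simplified compared to the storm-starter version) of how the partial-ranking bolt could look, assuming each task simply keeps counts for its partition and re-emits its current top N as a list; the RankObjects shown here is hypothetical:

import java.util.*;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Each task sees only its partition of the "value" field (thanks to the fields
// grouping) and maintains a partial top-N ranking, which it re-emits whenever
// the ranking changes. A downstream MergeObjects bolt, fed via globalGrouping,
// would combine these partial rankings into the final top N.
public class RankObjects extends BaseBasicBolt {
  private static final int N = 10;
  private final Map<Object, Long> counts = new HashMap<>();

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    Object value = tuple.getValueByField("value");
    long count = tuple.getLongByField("count");
    counts.put(value, count);  // latest running count wins for this value

    // build a [value, count] list, sort by count descending, keep the top N
    List<List<Object>> ranking = new ArrayList<>();
    for (Map.Entry<Object, Long> e : counts.entrySet()) {
      ranking.add(Arrays.asList(e.getKey(), e.getValue()));
    }
    ranking.sort((a, b) -> Long.compare((Long) b.get(1), (Long) a.get(1)));
    if (ranking.size() > N) {
      ranking = new ArrayList<>(ranking.subList(0, N));
    }
    collector.emit(new Values(ranking));  // this task's partial top N
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("ranking"));
  }
}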

You can find an example of this pattern in the storm-starter project.

Using TimeCacheMap to efficiently keep a cache of recently updated data (TimeCacheMap for efficiently keeping a cache of things that have been recently updated)

You sometimes want to keep a cache in memory of items that have recently been "active," and to have items that have been inactive for some time expire automatically. TimeCacheMap is an efficient data structure for exactly this, and it provides hooks so you can register a callback that is invoked whenever an item expires. (For why TimeCacheMap is efficient, see this analysis article.)
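A minimal usage sketch (not from the original documentation), assuming the backtype.storm.utils.TimeCacheMap class with an expiration-seconds constructor and an ExpirationCallback hook:

import backtype.storm.utils.TimeCacheMap;

public class RecentUrlCacheExample {
  public static void main(String[] args) {
    // entries not written for roughly 600 seconds are expired automatically;
    // the callback fires once for each expired entry
    TimeCacheMap<String, String> recent = new TimeCacheMap<String, String>(
        600,
        new TimeCacheMap.ExpirationCallback<String, String>() {
          public void expire(String shortUrl, String longUrl) {
            System.out.println("expired: " + shortUrl);
          }
        });

    recent.put("http://bit.ly/abc", "http://example.com/a-long-url");
    if (recent.containsKey("http://bit.ly/abc")) {
      System.out.println("cached: " + recent.get("http://bit.ly/abc"));
    }
  }
}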

Distributed RPC: CoordinatedBolt and KeyedFairBolt (CoordinatedBolt and KeyedFairBolt for Distributed RPC)

When building distributed RPC applications on top of Storm, two patterns are typically needed. They are encapsulated by CoordinatedBolt and KeyedFairBolt, which are part of the standard library that ships in Storm's code base.

CoordinatedBolt wraps the bolt containing your logic and figures out when that bolt has received all of the tuples for any given request. It makes heavy use of direct streams to do this.

KeyedFairBolt also wraps the bolt containing your logic and makes sure your topology processes multiple DRPC calls at the same time, instead of serially, one at a time.

For more on distributed RPC, see the Distributed RPC documentation.
