Some common topology patterns in Storm

Original address: http://storm.apache.org/documentation/Common-patterns.html
A Chinese translation by Xu Mingming is available at: http://xumingming.sinaapp.com/189/twitter-storm-storm%E7%9A%84%E4%B8%80%E4%BA%9B%E5%B8%B8%E8%A7%81%E6%A8%A1%E5%BC%8F/
The latest documentation has been updated slightly since that translation, so this is a re-collation. This article lists some common patterns of Storm topologies:

  • Stream aggregation (joins)

  • Batch processing

  • BasicBolt

  • In-memory caching + fields grouping combination

  • Streaming top N calculation

  • Using TimeCacheMap to efficiently keep a cache of recently updated data

  • Distributed RPC: CoordinatedBolt and KeyedFairBolt

Aggregation (Joins)

Stream aggregation (a streaming join) combines two or more data streams based on common fields. Whereas an ordinary database join has finite input and clearly defined semantics, a streaming join has infinite input and its semantics are far less clear-cut.

Each application needs a join with different semantics. Some applications join all tuples from two streams over a finite window of time; others expect exactly one tuple from each side of the join per join field; and still others have entirely different join logic. The common pattern among all these join types is partitioning the multiple input streams in the same way. This is easy to do in Storm by using a fields grouping on the same fields for every input stream into the joiner bolt, for example:

builder.setBolt("join", new MyJoiner(), parallelism)
  .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
  .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
  .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));

Of course, the "same" fields can have different names across the different streams.
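As a rough illustration (not part of the original documentation), here is a minimal sketch of what MyJoiner might look like, assuming the older backtype.storm package names and exactly one tuple per stream per join key; the class, fields, and buffering strategy are all assumptions:

import java.util.*;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical joiner: buffers tuples per join key until one tuple from each
// of the three input streams ("1", "2", "3") has arrived, then emits the join.
public class MyJoiner extends BaseRichBolt {
  private OutputCollector collector;
  // join key -> pending tuples, keyed by the component they came from
  private Map<List<Object>, Map<String, Tuple>> pending;

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    this.pending = new HashMap<>();
  }

  @Override
  public void execute(Tuple tuple) {
    List<Object> key = tuple.select(new Fields("joinfield1", "joinfield2"));
    Map<String, Tuple> parts = pending.computeIfAbsent(key, k -> new HashMap<>());
    parts.put(tuple.getSourceComponent(), tuple);

    if (parts.size() == 3) {  // one tuple from each input stream has arrived
      // multi-anchor the output to all three inputs for reliability;
      // a real joiner would also emit the joined (non-key) fields
      collector.emit(new ArrayList<>(parts.values()),
                     new Values(key.get(0), key.get(1)));
      for (Tuple t : parts.values()) {
        collector.ack(t);     // ack only once the join has been emitted
      }
      pending.remove(key);
    }
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("joinfield1", "joinfield2"));
  }
}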

Batch processing (Batching)

Sometimes, for performance or other reasons, you want to process a group of tuples as a batch rather than individually. For example, you might want to batch updates to a database or do a streaming aggregation of some sort.

If you want reliability in your data processing, the correct approach is to hold on to references to the tuples while the bolt buffers them into a batch. Once the batch operation completes, you ack all of those tuples.

If the bolt emits tuples of its own, you may want to use multi-anchoring to ensure reliability; it all depends on the specific application. See Guaranteeing message processing for more details on how reliable processing works.
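As an illustration (not from the original documentation), here is a minimal sketch of a batching bolt that buffers tuple references and acks them only after the batch is flushed; BATCH_SIZE and flushBatch are hypothetical:

import java.util.*;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// Hypothetical batching bolt: holds tuple references until the batch has been
// written out, then acks them so the spout knows they were fully processed.
public class BatchingDatabaseBolt extends BaseRichBolt {
  private static final int BATCH_SIZE = 100;
  private OutputCollector collector;
  private List<Tuple> buffer;

  @Override
  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    this.buffer = new ArrayList<>();
  }

  @Override
  public void execute(Tuple tuple) {
    buffer.add(tuple);          // keep the reference; do NOT ack yet
    if (buffer.size() >= BATCH_SIZE) {
      flushBatch(buffer);       // e.g. one bulk insert into the database
      for (Tuple t : buffer) {
        collector.ack(t);       // ack only once the batch has succeeded
      }
      buffer.clear();
    }
  }

  private void flushBatch(List<Tuple> batch) {
    // write the whole batch to the external store in one operation
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // this bolt emits nothing; it only writes batches to the database
  }
}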

BasicBolt

Many bolts follow a similar pattern:

  • read an input tuple

  • emit zero or more tuples based on that input tuple

  • ack the input tuple immediately, at the end of the execute method

Bolts that follow this pattern are things like functions and filters. This pattern is so common that Storm encapsulates it in a single interface: IBasicBolt. See Guaranteeing message processing for more information.
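For example, a simple filter written against this pattern might look like the following sketch (not from the original document; the field names are made up). With a BasicBolt, emitted tuples are anchored to the input and the input is acked for you:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A filter in the BasicBolt style: emit zero or one tuple per input,
// and let Storm handle anchoring and acking automatically.
public class PositiveCountFilter extends BaseBasicBolt {
  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    long count = tuple.getLongByField("count");
    if (count > 0) {
      collector.emit(new Values(tuple.getValueByField("value"), count));
    }
    // no explicit ack needed: the input tuple is acked when execute returns
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("value", "count"));
  }
}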

In-memory caching + fields grouping combination (In-memory caching + fields grouping combo)

It is common to keep an in-memory cache in a Storm bolt. Caching becomes particularly powerful when you combine it with a fields grouping. For example, suppose you have a bolt that expands short URLs (such as bit.ly or t.co links) into long URLs. You can improve performance and avoid issuing the same HTTP request repeatedly by keeping an LRU cache of short-URL to long-URL mappings. Suppose component "urls" emits short URLs and component "expand" expands them into long URLs, keeping a cache internally. Consider the difference between the following two snippets of code:

builder.setBolt("expand", new ExpandUrl(), parallelism)
  .shuffleGrouping("urls");

builder.setBolt("expand", new ExpandUrl(), parallelism)
  .fieldsGrouping("urls", new Fields("url"));

The second approach has vastly more effective caching than the first, because the same URL is always routed to the same task. This avoids duplicating the cache across multiple tasks and makes it far more likely that a given short URL hits the cache.
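For illustration (not from the original documentation), an ExpandUrl bolt with an LRU cache might look roughly like this sketch; the cache size, the resolveOverHttp helper, and the field names are assumptions:

import java.util.LinkedHashMap;
import java.util.Map;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Expands short URLs to long URLs, keeping an LRU cache so the same
// short URL is not fetched over HTTP more than once per task.
public class ExpandUrl extends BaseBasicBolt {
  private static final int CACHE_SIZE = 10000;
  private Map<String, String> cache;

  @Override
  public void prepare(Map conf, TopologyContext context) {
    // LinkedHashMap in access order + removeEldestEntry gives a simple LRU cache
    cache = new LinkedHashMap<String, String>(CACHE_SIZE, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > CACHE_SIZE;
      }
    };
  }

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String shortUrl = tuple.getStringByField("url");
    String longUrl = cache.get(shortUrl);
    if (longUrl == null) {
      longUrl = resolveOverHttp(shortUrl);  // hypothetical HTTP lookup
      cache.put(shortUrl, longUrl);
    }
    collector.emit(new Values(shortUrl, longUrl));
  }

  private String resolveOverHttp(String shortUrl) {
    // follow the short link's redirect and return the final URL
    return shortUrl;  // placeholder
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("url", "long-url"));
  }
}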

Streaming top N calculation (Streaming top N)

A common continuous computation done with Storm is a "streaming top N" of some sort. Suppose you have a bolt that emits tuples of the form ["value", "count"] and you want a bolt that emits the top N of those tuples by count. The simplest way to do this is to have a bolt that does a global grouping on the stream and maintains an in-memory list of the top N items.

This approach obviously does not scale to large streams, because the entire stream has to pass through a single task. A better approach is to compute partial top N's in parallel across partitions of the stream on many machines, and then have a single bolt merge those partial results into the final top N (the MapReduce idea). The pattern looks like this:

builder.setBolt("rank", new RankObjects(), parallelism)
  .fieldsGrouping("objects", new Fields("value"));

builder.setBolt("merge", new MergeObjects())
  .globalGrouping("rank");

This pattern works because of the fields grouping done by the first bolt, which produces the partitioning needed for the parallel algorithm to be semantically correct.
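As a rough sketch (not from the original documentation and simplified compared to the storm-starter version) of how the partial-ranking bolt could look, assuming each task simply keeps counts for its partition and re-emits its current top N as a list; the RankObjects shown here is hypothetical:

import java.util.*;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Each task sees only its partition of the "value" field (thanks to the fields
// grouping) and maintains a partial top-N ranking, which it re-emits whenever
// the ranking changes. A downstream MergeObjects bolt, fed via globalGrouping,
// would combine these partial rankings into the final top N.
public class RankObjects extends BaseBasicBolt {
  private static final int N = 10;
  private final Map<Object, Long> counts = new HashMap<>();

  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    Object value = tuple.getValueByField("value");
    long count = tuple.getLongByField("count");
    counts.put(value, count);  // latest running count wins for this value

    // build a [value, count] list, sort by count descending, keep the top N
    List<List<Object>> ranking = new ArrayList<>();
    for (Map.Entry<Object, Long> e : counts.entrySet()) {
      ranking.add(Arrays.asList(e.getKey(), e.getValue()));
    }
    ranking.sort((a, b) -> Long.compare((Long) b.get(1), (Long) a.get(1)));
    if (ranking.size() > N) {
      ranking = new ArrayList<>(ranking.subList(0, N));
    }
    collector.emit(new Values(ranking));  // this task's partial top N
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("ranking"));
  }
}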

You can find an example of this pattern in the storm-starter project.

Using TimeCacheMap to efficiently keep a cache of recently updated data (TimeCacheMap for efficiently keeping a cache of things that have been recently updated)

You sometimes want to keep a cache in memory of items that have recently been "active," and to have items that have been inactive for some time expire automatically. TimeCacheMap is an efficient data structure for exactly this, and it provides hooks so you can register a callback that is invoked whenever an item expires. (For why TimeCacheMap is efficient, see this analysis article.)
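A minimal usage sketch (not from the original documentation), assuming the backtype.storm.utils.TimeCacheMap class with an expiration-seconds constructor and an ExpirationCallback hook:

import backtype.storm.utils.TimeCacheMap;

public class RecentUrlCacheExample {
  public static void main(String[] args) {
    // entries not written for roughly 600 seconds are expired automatically;
    // the callback fires once for each expired entry
    TimeCacheMap<String, String> recent = new TimeCacheMap<String, String>(
        600,
        new TimeCacheMap.ExpirationCallback<String, String>() {
          public void expire(String shortUrl, String longUrl) {
            System.out.println("expired: " + shortUrl);
          }
        });

    recent.put("http://bit.ly/abc", "http://example.com/a-long-url");
    if (recent.containsKey("http://bit.ly/abc")) {
      System.out.println("cached: " + recent.get("http://bit.ly/abc"));
    }
  }
}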

Distributed RPC: CoordinatedBolt and KeyedFairBolt (CoordinatedBolt and KeyedFairBolt for Distributed RPC)

When building distributed RPC applications on top of Storm, two patterns are typically needed. They are encapsulated by CoordinatedBolt and KeyedFairBolt, which are part of the standard library that ships in Storm's code base.

CoordinatedBolt wraps the bolt containing your logic and figures out when that bolt has received all of the tuples for any given request. It makes heavy use of direct streams to do this.

KeyedFairBolt also wraps the bolt containing your logic and makes sure your topology processes multiple DRPC calls at the same time, instead of serially, one at a time.

For more on distributed RPC, see the Distributed RPC documentation.
