Storm Document (7)----Basic concepts


When reprinting, please credit the source: http://blog.csdn.net/beitiandijun/article/details/41546195

Source Address: http://storm.apache.org/documentation/Concepts.html

This article describes the main concepts of Storm and links to further reading on each of them. The concepts discussed are:

1. Topologies

2. Streams

3. Spouts

4. Bolts

5. Stream groupings

6. Reliability

7. Tasks

8. Workers

Topologies

The logic of a real-time application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. The key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (unless you kill it). A topology is a graph of spouts and bolts connected with stream groupings. These concepts are described below.
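As an illustrative sketch of how such a graph is wired together in Java (the `SentenceSpout`, `SplitBolt`, and `CountBolt` classes are hypothetical placeholders, and the `org.apache.storm` package names assume a Storm 2.x release):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // A spout feeding sentences, run with parallelism 2.
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // A bolt splitting sentences into words; shuffle grouping distributes tuples randomly.
        builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences");
        // Fields grouping ensures the same word always reaches the same counting task.
        builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("split", new Fields("word"));

        // Local mode for development/testing; a production deployment would use StormSubmitter.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
            Thread.sleep(10_000);
        }
    }
}
```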

Information:

- TopologyBuilder: use this class in Java to construct topologies

- Running topologies on a production cluster

- Local mode: how to develop and test topologies in local mode

Streams

The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream's tuples. By default, tuple fields can be integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define custom serializers so that your own types can be used within tuples.

Every stream is given an id when it is declared. Since single-stream spouts and bolts are the common case, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id; in that case, the stream is given the id "default".
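For illustration, declaring both a default stream and an explicitly named stream might look like the following fragment from inside a spout or bolt implementation (the "errors" stream id and field names are hypothetical):

```java
// Inside a spout or bolt class:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Convenience form: declares a single stream under the id "default".
    declarer.declare(new Fields("word", "count"));
    // Explicit form: declares an additional stream with its own id.
    declarer.declareStream("errors", new Fields("reason"));
}
```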

Information:

- Tuple: streams are composed of tuples

- OutputFieldsDeclarer: the mechanism used to declare streams and their schemas

- Serialization: information on Storm's dynamic typing of tuples and on declaring custom serializations

- ISerialization: custom serializers must implement this interface

- CONFIG.TOPOLOGY_SERIALIZATIONS: custom serializers can be registered using this configuration option

Spouts

A spout is a source of streams in a topology. Generally, spouts read tuples from an external source (for example, a Kestrel queue or the Twitter API) and emit them into the topology. Spouts can be reliable or unreliable: a reliable spout can replay a tuple if Storm fails to process it, whereas an unreliable spout forgets about a tuple as soon as it is emitted.

Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer, and specify the stream to emit to when using the emit method of SpoutOutputCollector.

The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block in any spout implementation, because Storm calls all the spout methods on the same thread. (If nextTuple blocks, that thread is blocked, and Storm cannot use it to call the spout's other methods.)
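A minimal spout sketch under these rules, assuming Storm's BaseRichSpout base class, with an in-memory sentence list standing in for a real external source:

```java
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"the cow jumped over the moon", "an apple a day"};
    private int index = 0;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Must not block: emit one tuple (or nothing) and return quickly,
        // since Storm invokes all spout methods on the same thread.
        // The second argument is a message id, which makes this a reliable emit.
        collector.emit(new Values(sentences[index]), index);
        index = (index + 1) % sentences.length;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```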

The other main methods on spouts are ack and fail. They are called when Storm detects that a tuple emitted from the spout either completed successfully or failed. ack and fail are only called for reliable spouts. See the Javadoc for more information.

Information:

- IRichSpout: the interface that spouts must implement

- Guaranteeing message processing

Bolts

All processing in topologies is done in bolts. Bolts can do anything: filtering, functions, aggregations, joins, talking to databases, and so on.

A bolt can do simple stream transformations; complex stream transformations often require multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to maintain a rolling count of retweets for each image, and one or more bolts to emit the top X images (this particular stream transformation can be done in a more scalable way with three bolts rather than two).

Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer, and specify the stream to emit to when using the emit method of OutputCollector.

When you declare a bolt's input streams, you always subscribe to specific streams of another component. If you want to subscribe to all of a component's streams, you must subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id: declarer.shuffleGrouping("1") subscribes to the default stream on component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).

The main method in bolts is the execute method, which takes a new tuple as input. Bolts emit new tuples using an OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process, so that Storm knows when the tuple is completed (and can eventually determine that it is safe to ack the original spout tuple). For the common case of processing an input tuple and emitting tuples based on it, Storm provides the IBasicBolt interface, which does the acking automatically.
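A minimal bolt sketch showing execute, anchored emits, and the explicit ack (assuming Storm's BaseRichBolt base class; with IBasicBolt, the anchoring and acking below would be handled for you):

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getStringByField("sentence").split(" ")) {
            // Anchoring: passing `input` links the new tuple into the tuple tree,
            // so a downstream failure causes the original spout tuple to be replayed.
            collector.emit(input, new Values(word));
        }
        // Tell Storm this input tuple has been fully processed.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```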

It is perfectly fine to launch new threads in bolts that do processing asynchronously. OutputCollector is thread-safe and can be called at any time.

Information:

- IRichBolt: the general interface for bolts

- IBasicBolt: a convenience interface for defining bolts that do filtering or simple functions

- OutputCollector: bolts emit tuples to their output streams using an instance of this class

- Guaranteeing message processing

Stream Grouping

Part of defining a topology is specifying, for each bolt, which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks.

There are seven built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:

1. Shuffle grouping: tuples are randomly distributed across the bolt's tasks in a way that guarantees each task gets an equal number of tuples.

2. Fields grouping: the stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" always go to the same task, while tuples with different "user-id"s may go to different tasks.

3. All grouping: the stream is replicated across all the bolt's tasks. Use this grouping with care.

4. Global grouping: the entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.

5. None grouping: this grouping means you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually, though, Storm will push bolts with none groupings to execute on the same thread as the bolt or spout they subscribe to (when possible).

6. Direct grouping: this is a special kind of grouping. A stream grouped this way means that the producer of a tuple decides which task of the consumer will receive it. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers either through the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids the tuple was sent to).

7. Local or shuffle grouping: if the target bolt has one or more tasks in the same worker process, tuples are shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
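The routing behavior of a fields grouping can be illustrated with plain Java: a common way to implement it is to hash the grouping field and take the result modulo the number of target tasks. This is a simplified model for illustration, not Storm's actual internal code:

```java
public class FieldsGroupingDemo {
    // Simplified model: route a tuple to a task index by hashing its grouping field.
    static int chooseTask(String userId, int numTasks) {
        return Math.abs(userId.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same "user-id" deterministically maps to the same task...
        System.out.println(chooseTask("alice", tasks) == chooseTask("alice", tasks));
        // ...while different "user-id"s may map to different tasks.
        System.out.println(chooseTask("alice", tasks) + " vs " + chooseTask("bob", tasks));
    }
}
```

This is why a fields grouping guarantees co-location of tuples sharing a key but says nothing about load balance across keys.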

Information:

- TopologyBuilder: use this class to define topologies

- InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder; it is used for declaring a bolt's input streams and how those streams should be grouped

- CoordinatedBolt: this bolt is used by distributed RPC topologies and makes heavy use of direct streams and direct groupings

Reliability

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by each spout tuple and determining when that tree has been successfully completed. Every topology has a message timeout associated with it. If Storm fails to detect that a spout tuple's tree has been completed within that timeout, it fails the tuple and replays it later.

To take advantage of Storm's reliability capabilities, you must tell Storm when new edges are being created in a tuple tree and when you have finished processing an individual tuple. Both are done using the OutputCollector object that bolts use to emit tuples: anchoring is done in the emit method, and you declare that you have finished processing a tuple with the ack method.

This is all explained in much more detail in the guide on guaranteeing message processing.

Tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another. You set the parallelism for each spout or bolt via the setSpout and setBolt methods of TopologyBuilder.

Workers

Topologies execute across one or more worker processes. Each worker process is a physical JVM (a process, not a machine) and executes a subset of all the tasks of the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.
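The arithmetic in that example can be checked directly (plain Java, not Storm code; assumes the parallelism divides evenly, as it does here):

```java
public class WorkerMath {
    // Tasks per worker when Storm spreads tasks evenly.
    static int tasksPerWorker(int totalParallelism, int numWorkers) {
        return totalParallelism / numWorkers;
    }

    public static void main(String[] args) {
        System.out.println(tasksPerWorker(300, 50)); // prints 6
    }
}
```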

Information:

- Config.TOPOLOGY_WORKERS: this configuration option sets the number of workers to allocate for executing the topology

