Twitter Storm Learning II - Introduction to Basic Concepts


2.1 Storm Basic Concepts

Before running a Storm task, you need to understand a few basic concepts:

    1. Topologies
    2. Streams
    3. Spouts
    4. Bolts
    5. Stream groupings
    6. Reliability
    7. Tasks
    8. Workers
    9. Configuration

On the surface, a Storm cluster looks much like a Hadoop cluster. But Hadoop runs MapReduce jobs, while Storm runs topologies, and the two are quite different. One key difference is that a MapReduce job eventually finishes, whereas a topology keeps running forever (unless you kill it manually).

There are two kinds of nodes in a Storm cluster: the master node and the worker nodes. The master node runs a daemon called Nimbus, which plays a role similar to Hadoop's JobTracker. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.

Each worker node runs a daemon called the Supervisor. The Supervisor listens for work assigned to its machine and starts and stops worker processes as needed. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.

All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster. In addition, both the Nimbus and Supervisor daemons are fail-fast and stateless; all state lives in ZooKeeper or on local disk. This means you can kill -9 Nimbus or the Supervisors and restart them as if nothing had happened. This design makes Storm remarkably stable.

2.1.1 Topologies

A topology is a graph of spouts and bolts connected by stream groupings.

A topology runs until you kill it manually. Storm automatically reassigns failed tasks, and it guarantees that no data is lost (if reliability is turned on). If a machine goes down unexpectedly, all the tasks running on it are moved to other machines.

Running a topology is simple. First, package all your code and the jars it depends on into a single jar. Then run a command like the following:

storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2

This command runs the class backtype.storm.MyTopology with the arguments arg1 and arg2. The main function of that class defines the topology and submits it to Nimbus. The storm jar part takes care of connecting to Nimbus and uploading the jar.
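
As a rough sketch (not the original example's code), the main function of such a class might look like the following; the class name, the use of args[0] as the topology name, and the Config settings are illustrative assumptions:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class MyTopology {
    public static void main(String[] args) throws Exception {
        // Build the topology graph; spouts and bolts would be added here.
        TopologyBuilder builder = new TopologyBuilder();
        // ... builder.setSpout(...) / builder.setBolt(...) ...

        Config conf = new Config();
        conf.setNumWorkers(2);  // example setting; adjust as needed

        // Submit to Nimbus; here the first command-line argument (arg1)
        // is used as the topology name.
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    }
}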

Topologies are defined as Thrift structures, and Nimbus is a Thrift service, so you can create and submit topologies from any programming language. The method above is simply the easiest way to do it from a JVM-based language.

2.1.2 Streams

The stream of messages is the key abstraction in Storm. A stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed fashion. A stream is defined by naming the fields in its tuples. By default, the fields of a tuple can be of type integer, long, short, byte, string, double, float, boolean, or byte array. You can also use your own types, as long as you implement an appropriate serializer.

Each stream is given an ID when it is declared. Because spouts and bolts that emit a single stream are very common, OutputFieldsDeclarer provides convenience methods for declaring a stream without specifying an ID; in that case the stream is given the default ID "default".
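
A minimal sketch of what this looks like in a component's declareOutputFields; the field and stream names are made up, and string stream IDs as shown here are how more recent Storm releases work (very old releases used integer stream IDs):

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // The default stream: consumers that don't name a stream read this one.
    declarer.declare(new Fields("word"));
    // An additional, explicitly named stream.
    declarer.declareStream("errors", new Fields("reason"));
}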

The basic primitives Storm provides for processing streams are spouts and bolts. You implement the interfaces they provide to run your own business logic.

2.1.3 Spouts

A spout is a source of messages in a Storm topology. Generally, a spout reads data from an external source and emits it into the topology as tuples. Spouts can be reliable or unreliable: a reliable spout can re-emit a tuple if Storm fails to process it, while an unreliable spout forgets about a tuple as soon as it has been emitted.

A spout can emit more than one stream. Use OutputFieldsDeclarer.declareStream to declare multiple streams, and specify which stream to emit to when calling SpoutOutputCollector.emit.

The most important method on a spout is nextTuple, which either emits a new tuple into the topology or simply returns if there is nothing new to emit. Note that nextTuple must not block, because Storm calls all of a spout's methods on the same thread.

The other two important spout methods are ack and fail. Storm calls ack when it detects that a tuple emitted by the spout has been fully processed by the topology, and calls fail otherwise. ack and fail are only called for reliable spouts.
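
Putting nextTuple, ack, and fail together, a reliable spout might look roughly like the sketch below. It uses the BaseRichSpout convenience class from later Storm releases, and a small in-memory queue stands in for a real external source; all names are illustrative.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Map;
import java.util.Queue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    // Stand-in for an external source such as a Kestrel queue.
    private final Queue<String> pending = new ArrayDeque<String>(
            Arrays.asList("the cow jumped over the moon"));
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void nextTuple() {
        // Must not block: emit one tuple if available, otherwise just return.
        String sentence = pending.poll();
        if (sentence == null) {
            return;
        }
        // Emitting with a message ID makes the tuple tracked, i.e. reliable.
        collector.emit(new Values(sentence), sentence);
    }

    public void ack(Object msgId) {
        // The tuple tree rooted at msgId completed; nothing more to do here.
    }

    public void fail(Object msgId) {
        // Processing failed or timed out; put the sentence back to retry it.
        pending.add((String) msgId);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}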

2.1.4 Bolts

All message processing logic is encapsulated in bolts. Bolts can do many things: filtering, aggregating, querying databases, and so on.

A bolt can also simply pass a stream through. Complex stream processing often takes multiple steps and therefore multiple bolts. For example, finding the most-forwarded images takes at least two steps: one bolt counts the forwards of each image, and another computes the top 10. (Making this computation more scalable may require even more steps.)

Bolts can emit multiple streams. Declare each stream with OutputFieldsDeclarer.declareStream, and select which stream to emit to with OutputCollector.emit.

The main method on a bolt is execute, which takes a tuple as input. A bolt emits new tuples using OutputCollector, and it must call OutputCollector's ack method for every tuple it processes, so that Storm knows the tuple is complete and can eventually notify the spout that emitted it. The usual pattern is: process an input tuple, emit zero or more tuples based on it, and then call ack. Storm provides an IBasicBolt interface that calls ack automatically.
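
For contrast with IBasicBolt, here is a rough sketch of a bolt that does its own acking, using the BaseRichBolt convenience class from later Storm releases; the class and field names are illustrative:

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExclaimBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        // Emit a new tuple anchored to the input tuple, so a failure anywhere
        // downstream causes the original spout tuple to be replayed.
        collector.emit(input, new Values(input.getString(0) + "!"));
        // Tell Storm this bolt is done with the input tuple.
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("exclaimed"));
    }
}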

2.1.5 Stream Groupings

Part of defining a topology is specifying, for each bolt, which streams it receives as input. A stream grouping defines how a stream's tuples are partitioned among the tasks of the consuming bolt.

There are seven types of stream groupings in Storm; a short usage sketch follows the list.

  1. Shuffle Grouping: tuples are distributed randomly across the bolt's tasks in a way that guarantees each task receives roughly the same number of tuples.
  2. Fields Grouping: the stream is partitioned by the specified fields. For example, if grouped by userId, tuples with the same userId always go to the same task, while tuples with different userIds may go to different tasks.
  3. All Grouping: the stream is broadcast; every task of the bolt receives a copy of every tuple.
  4. Global Grouping: the entire stream goes to a single one of the bolt's tasks, specifically the task with the lowest ID.
  5. Non Grouping: this grouping means you don't care how the stream is grouped. Currently it behaves the same as shuffle grouping, with one difference: when possible, Storm executes the bolt in the same thread as the component it subscribes to.
  6. Direct Grouping: a special grouping in which the producer of a tuple decides which task of the consumer receives it. It can only be used on streams declared as direct streams, and such tuples must be emitted with one of the emitDirect methods. The producer can learn the task IDs of its consumers through the TopologyContext, or from the return value of OutputCollector.emit, which is the list of task IDs the tuple was sent to.
  7. Local or Shuffle Grouping: if the target bolt has one or more tasks in the same worker process, tuples are shuffled among just those tasks; otherwise it behaves like a normal shuffle grouping.
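
As a rough sketch of how different groupings are wired up when building a topology (the component classes and IDs here are placeholders, not part of the original example):

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(1, new SentenceSpout());

// Shuffle grouping: sentences are spread evenly over the 10 SplitSentence tasks.
builder.setBolt(2, new SplitSentence(), 10).shuffleGrouping(1);

// Fields grouping: tuples with the same "word" value always reach the same task.
builder.setBolt(3, new WordCount(), 20).fieldsGrouping(2, new Fields("word"));

// Global grouping: all count updates flow to a single reporting task.
builder.setBolt(4, new ReportTopWords()).globalGrouping(3);
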
2.1.6 Reliability

Storm guarantees that every tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by each spout tuple (a bolt processing a tuple may emit further tuples, forming a tree) and determining when that tree has been completely processed. Every topology has a message timeout; if Storm does not detect that a tuple tree has completed within that timeout, it marks the tuple as failed and replays it later.

To take advantage of Storm's reliability features, you must tell Storm when you create new tuples and when you finish processing a tuple. Both are done through OutputCollector: the emit method is used to anchor new tuples to the tuple being processed, and the ack method marks a tuple as processed.

Storm's reliability mechanisms are covered in more detail in Chapter 4.

2.1.7 Tasks

Each spout and bolt is executed as many tasks spread across the cluster. Each executor corresponds to a thread that runs one or more tasks, and stream groupings define how tuples are sent from one set of tasks to another. You set the parallelism (that is, the number of tasks) for each component with the TopologyBuilder methods setSpout and setBolt.

2.1.8 Workers

A topology is executed by one or more worker processes, each of which is a separate JVM that runs a portion of the topology. For example, for a topology with a combined parallelism of 300 executed by 50 worker processes, each worker runs 6 tasks. Storm spreads the work as evenly as possible across all the workers.

2.1.9 Configuration

Storm has a variety of configuration parameters that adjust the behavior of Nimbus, the Supervisors, and running topologies; some are system-level settings, others are per-topology. defaults.yaml contains all the default values. You can override them by placing a storm.yaml on your classpath, and you can also set topology-specific configuration in code (passed in when submitting with StormSubmitter).
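
A small sketch of setting topology-level configuration in code; the particular setters and the Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS constant are from current Storm releases and may not all exist in very old versions:

Config conf = new Config();
// Convenience setters for common per-topology settings:
conf.setDebug(true);     // log every emitted tuple (debugging only)
conf.setNumWorkers(4);   // ask the cluster for 4 worker processes
// Config is just a Map, so any named setting can also be put directly;
// check your version's defaults.yaml for the valid keys.
conf.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 60);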

2.2 Build Topology

1. Goals to achieve:

We are going to design a topology that counts how often each word appears in a stream of sentences. This is a simple example whose purpose is to get you started with topologies quickly and give you an initial feel for them.

2. Design topology structure:

The first step in developing a Storm project is designing the topology, i.e. determining your data processing logic. The topology we build today is also very simple.

The entire topology is divided into three parts:

KestrelSpout: the data source, responsible for emitting sentences

SplitSentence: responsible for splitting sentences into words

WordCount: responsible for counting the frequency of each word

3. Design Data Flow

This topology reads sentences from a Kestrel queue, splits each sentence into words, and then counts the occurrences of each word. One kind of tuple carries the sentences, and another carries a word together with its current count.

4. Code implementation:

1) Set up the Maven environment:

To develop a Storm topology, you need the Storm-related jars on your classpath: either add all of them manually, or use Maven to manage the dependencies. Storm's jars are published to Clojars (a Maven repository). If you use Maven, add the following to your project's pom.xml:

<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>

<dependency>
  <groupId>storm</groupId>
  <artifactId>storm</artifactId>
  <version>0.5.3</version>
  <scope>test</scope>
</dependency>

2) Define topology:

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout(1, new KestrelSpout("kestrel.backtype.com", 22133,
                                     "sentence_queue",
                                     new StringScheme()));

builder.setBolt(2, new SplitSentence(), 10)
       .shuffleGrouping(1);

builder.setBolt(3, new WordCount(), 20)
       .fieldsGrouping(2, new Fields("word"));

The spout of this topology reads sentences from the queue "sentence_queue" on a Kestrel server at kestrel.backtype.com, port 22133.

The setSpout method inserts the spout into the topology with a unique ID. Every node in a topology must have an ID, and that ID is used by other bolts to subscribe to the node's output streams. The KestrelSpout's ID in this topology is 1.

setBolt is used to insert a bolt into the topology. The first bolt defined here is the sentence-splitting bolt, which transforms the stream of sentences into a stream of words.

Let's take a look at the SplitSentence implementation:

public class SplitSentence implements IBasicBolt {

    public void prepare(Map conf, TopologyContext context) {
    }

    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    public void cleanup() {
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

The key method is execute. As you can see, it splits the sentence into words and emits each word as a new tuple. The other important method is declareOutputFields, which declares the schema of the bolt's output tuples: here it declares that the bolt emits tuples with a single field named "word".

The last parameter to setBolt is the parallelism you want for the bolt. SplitSentence is given a parallelism of 10, which means 10 threads will execute it in parallel across the Storm cluster. When a bolt becomes a bottleneck in your topology, you simply increase its parallelism.

The setBolt method returns an object used to declare the bolt's inputs. Here, the SplitSentence bolt subscribes to the output stream of component 1 (the KestrelSpout) using a shuffle grouping. Stream groupings were explained above; for now, the important point is that the SplitSentence bolt consumes every tuple emitted by the KestrelSpout.

Now let's look at the implementation of WordCount:

public class WordCount implements IBasicBolt {

    private Map<String, Integer> _counts = new HashMap<String, Integer>();

    public void prepare(Map conf, TopologyContext context) {
    }

    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        int count;
        if (_counts.containsKey(word)) {
            count = _counts.get(word);
        } else {
            count = 0;
        }
        count++;
        _counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    public void cleanup() {
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

SplitSentence emits a new tuple for every word in a sentence, and WordCount keeps a word-to-count map in memory; each time it receives a word, it updates its in-memory count and emits the word along with its new count.

5. Running the topology

Storm runs in two modes: local mode and distributed mode.

1) Local mode:

In local mode, Storm simulates all the spouts and bolts with threads inside a single process. Local mode is most useful for development and testing. When you run the topologies in storm-starter, they run in local mode, and you can see what messages each component emits.

2) Distributed mode:

In distributed mode, Storm runs on a cluster of machines. When you submit a topology to the master, you also submit the topology's code. The master distributes your code and assigns worker processes to your topology. If a worker process dies, the master reassigns its work to another node.

3) Here is the code that runs in local mode:

Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(2);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test", conf, builder.createTopology());
Utils.sleep(10000);
cluster.killTopology("test");
cluster.shutdown();

First, this code creates an in-process cluster by instantiating a LocalCluster object. Submitting a topology to this virtual cluster works exactly the same as submitting it to a distributed cluster. The submitTopology method takes three parameters: the name of the topology to run, a configuration object, and the topology itself.

The topology's name uniquely identifies it, so you can later use that name to kill the topology. As mentioned earlier, a topology runs forever until you explicitly kill it.

The Conf object can hold many settings; the following two are the most common:

TOPOLOGY_WORKERS (setNumWorkers) defines how many worker processes you want the cluster to allocate for executing this topology. Each component in the topology executes as some number of threads, specified via setBolt and setSpout, and those threads run inside the worker processes. For example, if you specify 300 threads in total and 50 worker processes, each worker process runs 6 threads, and those 6 threads may belong to different components (spouts or bolts). You tune a topology's performance by adjusting the parallelism of each component and the number of worker processes those threads run in.

TOPOLOGY_DEBUG (setDebug), when set to true, makes Storm log every message emitted by every component. This is useful for debugging a topology locally, but it hurts performance in production.

Conclusion:

This chapter walked through building and defining a topology using a simple example, starting from Storm's basic concepts and moving on to setting up a Storm development environment. Hopefully you now have a basic understanding of Storm and can already build a simple topology of your own.
