Twitter Storm Learning Notes


Official English Document: http://storm.apache.org/documentation/Documentation.html

These are study notes, combining reproduced material with translation, mainly for the convenience of learning.

I. Basic Concepts

Reference: http://storm.apache.org/documentation/Concepts.html

This section is taken from: http://xumingming.sinaapp.com/117/twitter-storm%E7%9A%84%E4%B8%80%E4%BA%9B%E5%85%B3%E9%94%AE%E6%A6%82%E5%BF%B5/

    1. Topologies
    2. Streams
    3. Spouts
    4. Bolts
    5. Stream groupings
    6. Reliability
    7. Tasks
    8. Workers
    9. Configuration

Let's first look at the various objects in Storm:

(Figure: all the objects in Storm)

Computation topology: Topologies

The logic of a real-time computing application is packaged into a topology object in Storm; we call it a computation topology. A topology in Storm is roughly the equivalent of a MapReduce job in Hadoop. The key difference is that a MapReduce job eventually finishes, whereas a Storm topology runs forever unless you explicitly kill it. A topology is a graph composed of spouts and bolts, and the links between spouts and bolts are stream groupings.
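
Below is a minimal sketch of wiring and submitting such a topology. The MySpout and MyBolt classes and the topology name are hypothetical placeholders, and the sketch assumes the org.apache.storm packages of newer Storm releases (older releases use backtype.storm):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class TopologySketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("source", new MySpout());                         // hypothetical message source
            builder.setBolt("processor", new MyBolt()).shuffleGrouping("source"); // hypothetical bolt, linked by a stream grouping

            // The topology runs until it is explicitly killed, e.g. `storm kill demo-topology`.
            StormSubmitter.submitTopology("demo-topology", new Config(), builder.createTopology());
        }
    }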

Message Flow: Streams

The stream (message flow) is the most central abstraction in Storm. A stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed fashion. Defining a stream mainly means defining the tuples within it: each field in a tuple is given a name, and the field at a given position must have the same type in every tuple of the stream. That is, the first field of every tuple must share one type and the second field another type, but the first and second fields may differ from each other. By default, a tuple's fields can be of type integer, long, short, byte, string, double, float, boolean, or byte array. You can also use custom types, as long as you implement a corresponding serializer.

Every stream is given an id when it is declared. Because spouts and bolts that emit a single stream are so common, OutputFieldsDeclarer provides a way to declare a stream without specifying an id; in that case the stream gets the default id of 1.
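
The sketch below shows where such declarations live, assuming hypothetical field and stream names; declare() uses the default stream id, while declareStream() names an additional stream explicitly:

    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.tuple.Fields;

    public class StreamDeclarationSketch {
        // This method would normally live inside a spout or bolt implementation.
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Default stream: every tuple carries two named fields.
            declarer.declare(new Fields("user-id", "tweet-text"));
            // An additional, explicitly named stream emitted by the same component.
            declarer.declareStream("retweets", new Fields("user-id", "original-tweet-id"));
        }
    }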

Message Source: Spouts

A spout is a source of messages inside a Storm topology. In general, a spout reads data from an external source and emits it into the topology as tuples. Spouts can be either reliable or unreliable: a reliable spout can re-emit a tuple if Storm fails to process it, while an unreliable spout forgets about a tuple as soon as it is emitted and can never send it again.

A spout can emit multiple streams. To do this, define the streams with OutputFieldsDeclarer.declareStream and then use the SpoutOutputCollector to emit to the specified stream.

The most important method in a spout is nextTuple, which either emits a new tuple into the topology or simply returns if there is nothing new to emit. Note that nextTuple must not block, because Storm invokes all of a spout's methods on the same thread.

The other two important spout methods are ack and fail. Storm calls ack when it detects that a tuple has been successfully processed by the entire topology, and calls fail otherwise. Storm only calls ack and fail for reliable spouts.
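
Here is a minimal sketch of a reliable spout following the rules above; the in-memory queue and the message-id bookkeeping are illustrative assumptions, not Storm API:

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class QueueSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        private final Map<Object, String> pending = new ConcurrentHashMap<>();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // Must not block: if there is nothing to emit, just return.
            String msg = queue.poll();
            if (msg == null) {
                return;
            }
            Object msgId = UUID.randomUUID().toString();
            pending.put(msgId, msg);
            // Emitting with a message id makes the tuple trackable, i.e. reliable.
            collector.emit(new Values(msg), msgId);
        }

        @Override
        public void ack(Object msgId) {
            // The tuple tree completed successfully: forget the message.
            pending.remove(msgId);
        }

        @Override
        public void fail(Object msgId) {
            // Processing failed or timed out: queue the message for re-emission.
            String msg = pending.remove(msgId);
            if (msg != null) {
                queue.offer(msg);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("message"));
        }
    }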

Message Processor: Bolts

All the message processing logic is encapsulated inside the bolts. Bolts can do a lot of things: filtering, aggregating, querying databases, and so on.

Bolts can also simply pass streams along. Complex stream processing often takes many steps and therefore many bolts. For example, computing the most-forwarded images out of a pile of images takes at least two steps: the first step computes the number of forwards for each image, and the second step finds the 10 most-forwarded images. (Making the computation more scalable may require even more steps.)

A bolt can emit multiple streams: define each stream with OutputFieldsDeclarer.declareStream and select the stream to emit to with OutputCollector.emit.

The main method of a bolt is execute, which takes a tuple as input. A bolt emits tuples through the OutputCollector, and it must call the OutputCollector's ack method for every tuple it processes so that Storm knows the tuple has been handled and can in turn notify the spout that emitted it. The general flow is: the bolt processes an input tuple, emits zero or more tuples, and then calls ack to tell Storm it has finished with that input tuple. Storm also provides an IBasicBolt interface that calls ack automatically.
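
A minimal bolt sketch following this flow, assuming the "message" field emitted by the QueueSpout sketch above and a simple split-on-whitespace as the processing logic:

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class SplitBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            // Emit zero or more tuples, anchored to the input so Storm can track the tuple tree.
            for (String word : input.getStringByField("message").split("\\s+")) {
                collector.emit(input, new Values(word));
            }
            // Tell Storm this input tuple has been fully processed by this bolt.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

If a bolt only needs this emit-then-ack pattern, extending BaseBasicBolt (the IBasicBolt family) removes the explicit ack call.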

Stream groupings: Message distribution Policy

One step in defining a topology is specifying which streams each bolt receives as input. A stream grouping defines how a stream should be partitioned among the tasks of the bolt that consumes it.

There are six kinds of stream grouping in Storm (a wiring sketch follows the list):

    1. Shuffle grouping: tuples in the stream are distributed randomly, so that each bolt task receives roughly the same number of tuples.
    2. Fields grouping: the stream is partitioned by the specified fields; for example, when grouping by userid, tuples with the same userid always go to the same bolt task, while tuples with different userids may go to different tasks.
    3. All grouping: the stream is broadcast; every bolt task receives every tuple.
    4. Global grouping: the entire stream goes to a single one of the bolt's tasks; more specifically, to the task with the lowest id.
    5. None grouping: this grouping means the stream does not care who processes its tuples. It currently behaves the same as shuffle grouping, with the small difference that Storm will try to run the bolt in the same thread as the component it subscribes to.
    6. Direct grouping: a special grouping in which the sender of a tuple decides which task of the receiving bolt will process it. It can only be used on streams declared as direct streams, and such tuples must be emitted with the emitDirect method. The receiving component can get the task ids of its consumers from the TopologyContext (the OutputCollector.emit method also returns the task ids a tuple was sent to).
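
The sketch below shows how these groupings are chosen when wiring a topology. The component ids, the field name, and the WordCountBolt/MonitorBolt/ReportBolt classes are hypothetical, while QueueSpout and SplitBolt refer to the earlier sketches:

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class GroupingSketch {
        public static TopologyBuilder wire() {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("tweets", new QueueSpout());

            // Shuffle grouping: tuples are spread randomly but evenly across tasks.
            builder.setBolt("splitter", new SplitBolt()).shuffleGrouping("tweets");

            // Fields grouping: tuples with the same "word" value always reach the same task.
            builder.setBolt("counter", new WordCountBolt()).fieldsGrouping("splitter", new Fields("word"));

            // All grouping: every task of this bolt receives every tuple.
            builder.setBolt("monitor", new MonitorBolt()).allGrouping("splitter");

            // Global grouping: the whole stream goes to the task with the lowest id.
            builder.setBolt("reporter", new ReportBolt()).globalGrouping("counter");

            return builder;
        }
    }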

Reliability

Storm guarantees that every tuple will be fully processed by the topology. It tracks the tuple tree generated by each spout tuple (a bolt processing one tuple may emit further tuples, forming a tree) and detects when that tree has been completely processed. Every topology has a message timeout setting; if Storm cannot determine that a tuple tree has completed within this timeout, it marks the spout tuple as failed and re-emits it later.

To take advantage of Storm's reliability features, you must tell Storm when you emit a new tuple and when you finish processing a tuple. Both are done through the OutputCollector: its emit method announces a new tuple, and its ack method announces that a tuple has been processed.
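
A small sketch of these two obligations inside a bolt, plus where the message timeout would be configured; the 30-second value is only an illustrative assumption:

    import org.apache.storm.Config;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class ReliabilitySketch {
        public void process(Tuple input, OutputCollector collector) {
            // Anchored emit: the new tuple becomes a child of `input` in the tuple tree.
            collector.emit(input, new Values(input.getString(0).toUpperCase()));
            // Ack: this node of the tree is finished; once every node is acked within
            // the timeout, the original spout tuple counts as fully processed.
            collector.ack(input);
        }

        public Config timeoutConfig() {
            Config conf = new Config();
            conf.setMessageTimeoutSecs(30); // tuple trees not completed within 30s are failed and re-emitted
            return conf;
        }
    }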

Tasks

Each spout and bolt is executed as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how tuples are sent from one set of tasks to another. You set a component's parallelism, that is, its number of tasks, when calling TopologyBuilder.setSpout() and TopologyBuilder.setBolt().
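
A sketch of setting parallelism when declaring components. Note that in newer Storm releases the parallelism hint is the number of executors (threads) and setNumTasks controls the number of tasks separately; this article's older terminology treats them as the same thing:

    import org.apache.storm.topology.TopologyBuilder;

    public class ParallelismSketch {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("tweets", new QueueSpout(), 4);   // 4 parallel executors for the spout
            builder.setBolt("splitter", new SplitBolt(), 8)    // 8 executors for the bolt...
                   .setNumTasks(16)                            // ...running 16 tasks in total
                   .shuffleGrouping("tweets");
        }
    }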

Worker processes: Workers

A topology may run in one or more worker processes, each of which executes part of the topology. For example, a topology with a parallelism of 300 running on 50 worker processes gives each worker 6 tasks (in practice, 6 threads are allocated in each worker process). Storm tries to spread the work evenly across all worker processes.

Configuration

Storm has a pile of parameters that can be configured to adjust the behavior of Nimbus, the supervisors, and running topologies; some are system-level settings and some are topology-level settings. All options with default values are defined in defaults.yaml. You can override these defaults with a storm.yaml in your classpath, and you can also set topology-specific configuration in code when submitting with StormSubmitter. The priority order of these configurations is: defaults.yaml < storm.yaml < topology-specific configuration.
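
A sketch of topology-specific configuration set in code and submitted with StormSubmitter; the particular values are illustrative assumptions:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class ConfigSketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("tweets", new QueueSpout(), 4);
            builder.setBolt("splitter", new SplitBolt(), 8).shuffleGrouping("tweets");

            Config conf = new Config();
            conf.setNumWorkers(2);          // worker processes for this topology
            conf.setDebug(false);
            conf.setMaxSpoutPending(1000);  // cap on pending (un-acked) spout tuples

            // Topology-specific settings override storm.yaml and the built-in defaults.
            StormSubmitter.submitTopology("config-sketch", conf, builder.createTopology());
        }
    }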

II. Roles in the Storm Cluster

Resources:

http://www.aboutyun.com/thread-6873-1-1.html

    • Nimbus: Responsible for resource allocation and task scheduling.
    • Supervisor: accepts the tasks assigned by Nimbus and starts and stops the worker processes under its management.
    • Worker: a process that runs the concrete processing-component logic.
    • Task: each spout/bolt thread inside a worker is called a task. Since Storm 0.8, a task no longer corresponds to a physical thread; tasks of the same spout/bolt may share one physical thread, which is called an executor.
The following diagram (omitted here) depicts the relationships between the above roles.

III. Storm Multi-language Support

1. How ShellBolt works

In Storm, the topology is the basic unit of execution. A topology is made up of spouts and bolts: spouts handle data ingestion, and bolts are the real processors of the data in the topology.

So all we need is a way to encapsulate an arbitrary program as a bolt inside a topology. Storm provides exactly this with ShellBolt: ShellBolt lets developers wrap their own programs (arbitrary programs) into a bolt that runs in the topology. Isn't that neat?

Let's take a look at how ShellBolt works.

1) ShellBolt is essentially a bolt;

2) ShellBolt receives a shell command and creates a ShellProcess from it, then starts two threads that send messages to and read messages from that ShellProcess;

3) ShellProcess, in turn, uses ProcessBuilder to create an operating-system process and interacts with it through the process's input stream, output stream, and error stream.

At this point, things become clearer:

ShellBolt essentially launches a new process from a shell command and interacts with it through the process's stdin, stdout, and stderr.
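
Wrapping an external program therefore looks like the sketch below, following the pattern of Storm's multilang examples; the script name splitsentence.py and the output field are assumptions:

    import java.util.Map;

    import org.apache.storm.task.ShellBolt;
    import org.apache.storm.topology.IRichBolt;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.tuple.Fields;

    public class PythonSplitBolt extends ShellBolt implements IRichBolt {

        public PythonSplitBolt() {
            // ShellBolt launches this command as a child process and talks to it
            // over stdin/stdout using a JSON-based protocol.
            super("python", "splitsentence.py");
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }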

The approach looks simple; in fact it embodies the principle that simple is best.

2. Potential problems with ShellBolt

As described above, ShellBolt starts a new process via a shell command and interacts with it through that process's stdin, stdout, and stderr. This leads to the following concerns:

1) Hidden risks

The interaction consists entirely of reading and writing the process's stdin and stdout, which requires that the wrapped program never touch stdin or stdout on its own. In other words, the program must contain absolutely no printf, scanf, cout, cin, System.out, System.in, or anything of the kind.

For small programs, or programs developed from scratch, this rule can simply be agreed upon. But when wrapping or refactoring an existing program it becomes unreliable: only God knows whether someone secretly wrote to stdout or read from stdin somewhere, which will inevitably crash the program, and you may never track down where.

2) Low efficiency

Again because of this interaction style, all data exchanged over stdin/stdout must be text; Storm implements the exchange with JSON encoding. Between encoding, decoding, and reading and writing stdin/stdout, the efficiency of this mode of interaction is undoubtedly relatively low.

3) Zombie Process

ShellBolt starts a new process from a shell command, and there is no reliable way to guarantee that every process it launched gets killed. In practice, I often found that after a topology was stopped, not-yet-dead processes still lingered in the background.

4) Resource Usage

ShellBolt starts a new process per task, so when there are many tasks it launches many processes and consumes more resources. This is not necessarily a disadvantage: each process has its own address space, so when your program needs a lot of resources, running it as a separate process is a good choice.

IV. Storm Performance Test

Reference: http://blog.csdn.net/jmppok/article/details/17614431

Performance of Storm when using external handlers

This test case primarily measures the overall performance of the system when an external handler is used. With an external handler, Storm runs the external program as a subprocess and exchanges data with it in JSON format. In this test, a Python script was used as the external handler.

    1. The test schematic is as follows (figure omitted):

      In the test, the sender, processor, and the other components each run on a single node, so the results reflect the processing capacity of a single processing pipeline. The test results are as follows (figures omitted):
    2. Overall CPU utilization
    3. CPU utilization of each process
    4. Memory usage
    5. Throughput (tuples/s)
    6. Tuple processing delay

Analysis of the test results: as the figures above show, when an external handler is used the system's throughput in tuples/s drops sharply; the performance degradation is obvious. The analysis attributes the steep drop to two causes: 1) every tuple must be converted to and from JSON to interact with the external program, and the format conversion consumes CPU cycles; 2) Storm communicates with the external handler subprocess through a Linux pipe, and because a pipe relays data in 4 KB pages, performance suffers when the data volume is large; in this test each message was at least 1 KB, so the pipe buffer fills quickly and writers end up blocked waiting. Increasing the number of external programs (that is, the parallelism of the processor unit, without exceeding the number of CPUs in the system) gives a roughly linear increase in performance.

The processing-delay results show that with an external handler, the tuple processing delay is roughly 10 times higher than with Storm's built-in processing mechanism.

Test conclusions

From the above tests, we can draw the following conclusions:

    • Storm's single-pipeline processing capacity is approximately 20,000 tuples/s (with tuples of about 1,000 bytes)
    • The Storm system's processing latency is at the millisecond level
    • Scaling out across the cluster increases the system's processing capacity; the measured gain was about 1.6x
    • Storm uses a large number of threads: even a single-pipeline system runs more than 10 threads at once, so nearly all 16 CPUs were busy, with a load average of about 3.5
    • JVM garbage collection generally has limited impact on system performance, but when memory is tight, GC can become a bottleneck
    • Performance degradation with external handlers is significant, so under high performance requirements use Storm's built-in processing mode as much as possible

