Storm Study Notes

Source: http://blog.csdn.net/sheen1991/article/details/51745673
1. Introduction

The Storm version used in this article is 1.0.1.

Storm is a free, open-source distributed real-time computation system that makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop does for batch processing. Storm is simple to use and supports many mainstream programming languages.

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, distributed ETL, and more. It scales easily, is fault-tolerant, guarantees that your data will be processed, and is easy to set up and operate.

The figure below is a conceptual diagram of Storm's "streaming data processing": data flows from the source into the stream and passes through each node; each node processes the data in real time according to its own needs and passes the results downstream.

2. Storm Architecture

A Storm cluster consists of one master node and multiple worker nodes. The master node runs a daemon called "Nimbus" (hence it is also known as the Nimbus node), which distributes code, assigns tasks, and detects failures. Each worker node runs a daemon called "Supervisor" (hence it is also known as a Supervisor node), which listens for work assigned to its machine and starts and stops worker processes. Both Nimbus and Supervisor are fail-fast (they self-destruct whenever they encounter any abnormal situation) and stateless (all state is kept in ZooKeeper or on local disk), which makes them very robust. Between Storm's master node and the worker nodes sits another layer: a ZooKeeper cluster. ZooKeeper coordinates the work between the Nimbus node and the Supervisor nodes and manages the different components in the cluster. A Supervisor node hosts many worker processes, and the actual tasks are performed by those workers.

The architecture of a Storm cluster is shown below:

In a real Storm deployment, the officially recommended minimum size of the ZooKeeper cluster is 3 nodes (3 machines), expanded later if necessary. The number of ZooKeeper nodes should be odd, because a ZooKeeper ensemble consists of a leader node and several follower nodes; when the leader goes down, a new leader must be elected, and an odd node count guarantees that a majority election can succeed. The number of Storm nodes depends on the throughput requirements of the business; typically one machine serves as the Nimbus node and the others serve as Supervisor nodes. A worker is a logical concept: a single Supervisor machine can host several worker processes. The concept of workers is described in more detail below.

Before version 1.0.0, the Nimbus node in a Storm cluster was a single point, meaning there was no standby node to take over for it. However, Storm's original design took this into account: Nimbus is only responsible for monitoring Supervisor status and distributing tasks, not for the actual execution logic, so its load is very light. When Nimbus goes down, the running worker nodes are unaffected, and Nimbus can resume its duties after a restart. Still, this design does reduce the system's fault tolerance: while Nimbus is down, new tasks cannot be distributed, and the work of a failed Supervisor cannot be reassigned. As a result, Storm versions from 1.0.0 onward provide an HA Nimbus mechanism, similar to the leader and follower nodes of a ZooKeeper cluster: when the master Nimbus node goes down, another node can be elected as the new Nimbus. I looked at the Apache Storm website and found the update describing this; here is the original text:

Experienced Storm users will recognize that the Storm Nimbus service was not a single point of failure in the strictest sense (i.e. loss of the Nimbus node would not affect running topologies). However, the loss of the Nimbus node does degrade functionality for deploying new topologies and reassigning work across a cluster.

In Storm 1.0 this "soft" point of failure has been eliminated by supporting an HA Nimbus. Multiple instances of the Nimbus service run in a cluster and perform leader election when a Nimbus node fails, and Nimbus hosts can join or leave the cluster at any time. HA Nimbus leverages the distributed cache API for replication to guarantee the availability of topology resources in the event of a Nimbus node failure.

Link: http://storm.apache.org/2016/04/12/storm100-released.html

3. Basic Concepts

Topology (topology)

A Storm topology encapsulates the application logic of a real-time computation. It is analogous to a MapReduce job, except that a MapReduce job always finishes once its result is produced, whereas a topology runs in the cluster until you terminate it manually. A topology can also be understood as a graph of spouts and bolts connected by stream groupings; spouts and bolts are called the components of the topology.

In Java, topologies are built with TopologyBuilder; a complete example appears in the stream grouping section below.

Stream (data stream)

Streams are the core abstraction in Storm. A stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed environment. A stream is defined by a schema that names the fields of the tuples it carries. By default a tuple can contain primitive-type values such as integers, longs, shorts, bytes, doubles, floats, booleans, and byte arrays (plus strings). You can also use a custom tuple field type by defining a serializable object.

When declaring a stream you need to give it a valid ID. However, since spouts and bolts in real applications mostly emit a single stream, there is usually no need for IDs to distinguish streams, so you can use OutputFieldsDeclarer directly to define a stream "without an ID"; the system actually assigns such a stream the default ID "default".

Tuple (tuple)

A tuple is the smallest unit of data transmission in Storm. It can be understood as a list of values or as a set of key-value pairs: the keys (sometimes called "field names" or "fields"; represented in Storm by the Fields class) are defined in the spout or bolt through the declareOutputFields() method, and the values are supplied in the emit() method (see the spout/bolt sections below for details). The values in a tuple can be of any type, and tuple fields are dynamically typed, so their types do not need to be declared. By default Storm supports the primitive types, strings, and byte arrays as field values; to use any other type, you must make it serializable.
The default field types of a tuple are: integer, float, double, long, short, string, byte, and binary (byte[]). The typical structure of a tuple is shown in the following figure:
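As a hedged aside on that serialization requirement: a custom field type (here FraudTrans, the sample domain class used in the examples later in this article) can be registered with Storm's Kryo serialization through the topology Config, for example:

    Config conf = new Config();
    // Register the custom tuple field type with Kryo so Storm can
    // serialize it between workers; FraudTrans is this article's sample class.
    conf.registerSerialization(FraudTrans.class);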
Data Source (Spout)

A spout is a source of streams in a topology. Typically a spout reads tuples from an external data source and emits them into the topology. Depending on the requirements, a spout can be defined as either a reliable or an unreliable data source. A reliable spout can re-emit a tuple when its processing fails, guaranteeing that every tuple is handled correctly; an unreliable spout does no further bookkeeping for a tuple once it has been emitted.

The key methods in a spout are ack, fail, open, nextTuple, and declareOutputFields; each is described below.

nextTuple: The first and most critical method is nextTuple, which a spout executes to emit a tuple. As the name implies, nextTuple either emits a new tuple into the topology or simply returns when there is nothing to emit. Within this method, tuples are emitted by calling the emit method of SpoutOutputCollector. Note that Storm calls all of a spout's methods on the same thread, so nextTuple must not be blocked by any of the spout's other methods, otherwise the stream is directly interrupted. (For this reason Alibaba's JStorm modified the spout model to handle message emission on a separate thread. That approach has pros and cons: it allows more flexible spout implementations, but it makes the system's scheduling model more complex; which to choose depends on the scenario. Translator's note.)

ack and fail: The other two key methods of a spout are ack and fail, which are invoked for further processing after Storm detects that an emitted tuple either has been fully processed or has failed. Note that ack and fail are only invoked for the "reliable" spouts described above. A reliable stream guarantees that every tuple the spout emits receives feedback confirming it was processed, but this also adds overhead, so the design must trade off reliability against performance according to business requirements.

Reliable spout:

    collector.emit(List<Object> tuple, Object messageId);

Unreliable spout:

    collector.emit(List<Object> tuple);

When using the reliable message guarantee mechanism, the execute method of the downstream bolt must call:

    collector.ack(Tuple input);

to notify the upstream component that the tuple has been handled by the current node. If the spout does not receive an ack or a fail acknowledgement within the timeout period, Storm considers the processing failed and calls the spout's fail method; correspondingly, if processing succeeds, Storm calls the spout's ack method. For more detail, see below or the section "Message Reliability Assurance".
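As a hedged aside, that timeout is a topology-level setting (30 seconds by default) and can be adjusted through Config; the value below is illustrative:

    Config conf = new Config();
    // A tuple tree not fully acked within 10 seconds is considered failed,
    // which triggers the spout's fail() for that tuple's message ID.
    conf.setMessageTimeoutSecs(10);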

declareOutputFields: This method is used to define the spout's streams and their tuples. In a topology, a spout may emit many different messages, and downstream there may be many bolts receiving them; but often a bolt does not want all of the spout's output and only needs the data of a particular stream. Storm provides a "subscription" mechanism: a spout can emit several different streams, and downstream bolts subscribe only to the ones they need. The key method for implementing this is declareOutputFields. The following is a typical implementation:

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Default stream: tuples with three fields
        declarer.declare(new Fields("institude", "institude2", "institude3"));
        // A second stream with its own ID and two fields
        declarer.declareStream("testStreamId", new Fields("institude4", "institude5"));
    }

    @Override
    public void nextTuple() {
        FraudTrans ft = new FraudTrans();
        FraudTrans ft2 = new FraudTrans();
        FraudTrans ft3 = new FraudTrans();

        // Emit on the default stream (three values match the three declared fields)
        collector.emit(new Values(ft, ft2, ft3));
        // Emit on the stream "testStreamId" (two values match its two fields)
        collector.emit("testStreamId", new Values(ft2, ft3));
    }

Here, the first line of declareOutputFields declares a tuple with three field names; the values corresponding to those field names are emitted by the first emit call in nextTuple. The second line of declareOutputFields defines a tuple with two field names and binds it to a stream whose ID is "testStreamId"; nextTuple then emits on that stream ID. In this way, a downstream bolt can choose whether or not to subscribe to that stream ID, and can fetch the value for a particular field name instead of receiving all of the spout's data.

open: The open method is called by Storm when the spout is initialized; it is effectively an initialization hook where any necessary setup code can run. Note, however, that open is called once per executor thread: if the spout runs with multiple executors, the method is called that many times. The concept of executors is discussed below. Therefore, in real projects, any service resources touched here must be handled with care.

Precautions when using a spout

The most common pattern is to use a thread-safe queue such as a BlockingQueue: the spout's main thread reads data from the queue, while one or more other threads read data from the data source (various message middleware, a DB, etc.) and put it into the queue. (A sketch of this pattern appears after these notes.)

If you are not concerned about data loss (for example, in a typical data statistics/analysis scenario), do not enable the ack mechanism.

A spout's nextTuple and ack methods are executed on the same thread (this may not initially seem like a bottleneck, and for simple implementations a single thread is fine; JStorm changed this to multiple threads). Therefore you must not block the current thread in nextTuple or ack, as doing so directly affects the spout's processing speed, which is critical.

When nextTuple emits data, it must not block the current thread (see the previous note). For example, when fetching data from a queue, use the poll interface instead of take, giving poll a bounded wait so it returns immediately (or after a fixed time) when the queue is empty. If there are multiple items ready to send, you can traverse them and emit them all within a single nextTuple call.

Since version 0.8.1, if a call to nextTuple emits no tuple, the spout sleeps for 1 ms by default. This policy is configurable, so it can be tuned for your specific scenario to make reasonable use of CPU resources.

For a given spout task, there may be emitted tuples that are still waiting to be processed, with neither an ack nor a fail yet; such tuples are counted as pending so that they do not block subsequent emits. The Config.TOPOLOGY_MAX_SPOUT_PENDING setting limits the maximum number of tuples allowed to be pending at once, for example 1000 or 5000 or another reasonable value, to prevent an excessive number of pending tuples from exhausting memory. Setting this in a production environment is recommended, but the value must be chosen according to system throughput: too small and the spout may emit too slowly, too large and it may use too much memory. Also note that this configuration only applies to reliably processed tuples (those emitted with a message ID).
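Below is a minimal sketch, under assumptions of this article rather than code from it, that ties these notes together: a reliable spout fed by a BlockingQueue, a non-blocking poll in nextTuple, and replay on fail. All names (QueueBackedSpout, the "payload" field, the queue wiring) are illustrative.

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class QueueBackedSpout extends BaseRichSpout {
        private transient BlockingQueue<String> queue;   // filled by a separate reader thread
        private transient Map<Object, String> pending;   // messageId -> payload, kept for replay
        private transient SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            // Called once per executor thread, so create per-executor resources here.
            this.collector = collector;
            this.queue = new LinkedBlockingQueue<>();
            this.pending = new ConcurrentHashMap<>();
            // In a real spout, a background thread would feed this.queue from the source.
        }

        @Override
        public void nextTuple() {
            String payload = queue.poll();   // poll, never take: must not block this thread
            if (payload == null) {
                return;                      // nothing to emit; Storm sleeps ~1 ms by default
            }
            Object msgId = UUID.randomUUID().toString();
            pending.put(msgId, payload);
            collector.emit(new Values(payload), msgId);   // message ID makes this a reliable emit
        }

        @Override
        public void ack(Object msgId) {
            pending.remove(msgId);           // fully processed, forget it
        }

        @Override
        public void fail(Object msgId) {
            String payload = pending.remove(msgId);
            if (payload != null) {
                queue.offer(payload);        // re-queue for a later re-emit
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("payload"));
        }
    }

When submitting such a topology, conf.setMaxSpoutPending(5000); caps the number of in-flight (unacknowledged) tuples per spout task, as discussed above.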

Note: when declaring streams and emitting with stream IDs, a declared stream is allowed to have no subscribers; but once a bolt subscribes to a stream ID, that stream ID must already be defined, otherwise the topology will fail at submission with an InvalidTopologyException.

Data Stream Processing Component (Bolt)

All data processing in a topology is done by bolts. Through filtering, functions, aggregations, joins, database interaction, and other operations, bolts can satisfy almost any data processing requirement.

A single bolt implements a simple stream transformation, while more complex transformations usually require multiple bolts completing the work in multiple steps. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: one bolt keeps a rolling count of retweets for each image, and one or more bolts output the "most retweeted images" result (using three bolts instead of two makes this transformation more scalable).

The key methods in a bolt are execute, prepare, declareOutputFields, and cleanup.

declareOutputFields: As with spouts, a bolt can output multiple streams. To do so, declare each stream with the declareStream method of OutputFieldsDeclarer, and then pass the stream ID as a parameter to the emit method of OutputCollector when sending data.

When defining a bolt's input streams, you must subscribe to specific streams of other components. If you need to subscribe to streams from several other components, you must register each subscription separately when defining the bolt.

execute: The key method of a bolt is execute, which receives a tuple as input and emits new tuples through the OutputCollector object. When receiving, a specific value can be fetched with Tuple.getValueByField(), or by the index of the value in the tuple's value list.

If message reliability guarantees are needed, a bolt must call OutputCollector's ack method for every tuple it processes, so that Storm knows when the tuple is fully processed (and can ultimately decide whether it is safe to acknowledge the original spout tuple's tuple tree). In general, for each input tuple, a bolt may emit zero or more new tuples after processing and then acknowledge (ack) the input tuple. The IBasicBolt interface provides automatic acknowledgement.
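As a hedged sketch of that convenience (class and field names are illustrative), a bolt extending BaseBasicBolt gets anchoring and acking for free: emits are anchored to the input tuple, and the input is acked automatically when execute returns.

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class UppercaseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // Emits are automatically anchored to `input`, and `input` is
            // automatically acked when execute returns (or failed on exception).
            collector.emit(new Values(input.getStringByField("payload").toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("payload"));
        }
    }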

For topologies that need message reliability guarantees, a bolt must also anchor the incoming tuple when emitting, passing it as the anchor argument to emit; see below or the section "Message Reliability Assurance" for the related concepts (the sketch after the cleanup note below shows an anchored emit).

prepare: This method is similar to the spout's open method and is called when the bolt is initialized. Likewise, if a bolt has multiple executor threads, the method is executed multiple times.

cleanup: Executed when the bolt is shut down; resources can be released here. Note that in local mode (LocalCluster) this method is guaranteed to run, but in cluster mode Storm does not guarantee that it will ever be executed.
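The following minimal sketch (illustrative names, not code from the original article) pulls the four bolt methods together, including the field-name lookup, the anchored emit, and the explicit ack that a reliable topology requires:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class CountingBolt extends BaseRichBolt {
        private transient OutputCollector collector;
        private transient Map<String, Long> counts;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            // Called once per executor thread, like a spout's open().
            this.collector = collector;
            this.counts = new HashMap<>();
        }

        @Override
        public void execute(Tuple input) {
            // Read a value either by field name or by index.
            String key = input.getStringByField("payload");
            long n = counts.merge(key, 1L, Long::sum);
            // Anchored emit: the new tuple becomes part of the input's tuple tree.
            collector.emit(input, new Values(key, n));
            // Explicit ack tells Storm this bolt has finished with the input tuple.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("payload", "count"));
        }

        @Override
        public void cleanup() {
            // Guaranteed to run only in local mode; avoid relying on it in production.
            counts.clear();
        }
    }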

Stream Grouping (data stream grouping)

Determining the input streams for each bolt is an important part of defining a topology. A stream grouping defines how a stream is partitioned among the tasks of the bolt that consumes it.

Storm has eight built-in stream groupings (the original documentation was out of date; there are now eight grouping models. Translator's note), and you can also implement a custom grouping model through the CustomStreamGrouping interface (a sketch follows this list). The eight groupings are:

Shuffle grouping: tuples are randomly distributed across the bolt's tasks so that each task processes roughly the same number of tuples, ensuring cluster load balancing.

Fields grouping: the stream is partitioned by the specified fields. For example, if a stream is grouped by a field named "user-id", all tuples containing the same "user-id" go to the same task, ensuring consistent message processing.

Partial key grouping: similar to fields grouping in that the stream is partitioned by the specified fields, but it also balances the processing load across the downstream bolt's tasks, giving better performance when the keys of the input data are skewed. Interested readers can refer to this paper, which explains in detail how this grouping works and its merits.

All grouping: the stream is replicated to all of the bolt's tasks (i.e. the same tuple is copied and processed by every task), so use this grouping with special care.

Global grouping: the entire stream goes to a single one of the bolt's tasks, specifically the task with the smallest ID.

None grouping: indicates that you do not care how the stream is grouped. Currently this is exactly equivalent to shuffle grouping, but the Storm community may eventually use none grouping to make a bolt execute in the same thread as the spout or bolt it subscribes to.

Direct grouping: a special grouping in which the emitter of a tuple decides which downstream task receives it. Direct grouping can only be used on streams that have been declared as direct streams, and tuples must be emitted with one of the emitDirect methods of OutputCollector. A bolt can obtain the task IDs of its downstream consumers through the TopologyContext, or from the return value of OutputCollector's emit method (which returns the IDs of the tasks the emitted tuple was sent to).

Local or shuffle grouping: if the target bolt has one or more tasks in the same worker process as the source component, tuples are shuffled only among those in-process tasks; otherwise this behaves like an ordinary shuffle grouping.
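As a hedged sketch of the custom grouping mentioned above (the class name and hashing policy are illustrative), an implementation of CustomStreamGrouping chooses target task IDs per tuple:

    import java.io.Serializable;
    import java.util.Collections;
    import java.util.List;
    import org.apache.storm.generated.GlobalStreamId;
    import org.apache.storm.grouping.CustomStreamGrouping;
    import org.apache.storm.task.WorkerTopologyContext;

    // Routes each tuple to a target task chosen by hashing its first value.
    public class HashFirstFieldGrouping implements CustomStreamGrouping, Serializable {
        private List<Integer> targetTasks;

        @Override
        public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                            List<Integer> targetTasks) {
            this.targetTasks = targetTasks;   // task IDs of the subscribing bolt
        }

        @Override
        public List<Integer> chooseTasks(int taskId, List<Object> values) {
            int idx = Math.floorMod(values.get(0).hashCode(), targetTasks.size());
            return Collections.singletonList(targetTasks.get(idx));
        }
    }

It would be attached when defining the topology, e.g. builder.setBolt("statistics", new StatisticsBolt()).customGrouping("dataSource", new HashFirstFieldGrouping());.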

The grouping used by a particular bolt is declared when the topology is defined. The following is a typical implementation:

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("dataSource", new DataSpout());

    // Shuffle grouping: tuples are randomly distributed
    builder.setBolt("statistics", new StatisticsBolt()).shuffleGrouping("dataSource");

    // Fields grouping: tuples with the same field values are handled by the same task
    builder.setBolt("institude_analysis", new InstitudeAnalysisBolt(), 2)
           .fieldsGrouping("statistics", new Fields("institude", "institude2"));

    // Subscribe only to the stream whose ID is "testStreamId"
    builder.setBolt("institude_analysis2", new InstitudeAnalysisBolt2(), 2)
           .fieldsGrouping("statistics", "testStreamId", new Fields("institude4", "institude5"));

    builder.setBolt("fraudsource_analysis", new FraudSourceAnalysisBolt())
           .shuffleGrouping("institude_analysis");

    Config conf = new Config();
    conf.setDebug(false);

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("testA", conf, builder.createTopology());
    Utils.sleep(120000);
    cluster.killTopology("testA");
    cluster.shutdown();
Reliability Assurance (reliability)

Storm can guarantee that every tuple emitted by a spout is processed correctly and completely. It does this by tracking the tuple tree formed by each tuple the spout emits and determining whether that tuple tree has been fully processed.
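As a final, hedged recap (the configuration values are illustrative), the moving parts of this guarantee shown earlier meet in the topology configuration: the spout emits with a message ID, bolts anchor and ack, and Config controls the ackers, the timeout, and the pending cap:

    Config conf = new Config();
    conf.setNumAckers(1);             // acker tasks track each spout tuple's tuple tree
    conf.setMessageTimeoutSecs(30);   // trees not completed in time are failed and can be replayed
    conf.setMaxSpoutPending(5000);    // cap on spout tuples still awaiting ack/fail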
