Analysis of the stream-processing framework Storm

Source: Internet
Author: User

Objective
During a recent project I helped design a sentinel-style traffic-monitoring feature and researched two frameworks for stream computation: Storm and Spark Streaming. I was responsible for the Storm research. I spent about a week reading the documentation on the official website and other material on the web, and this document summarizes what I learned so that colleagues who are interested in Storm can get started.

Storm background
As the Internet has developed further, from portal-style information browsing, to search-style information seeking, to SNS-style interaction, along with e-commerce and online travel, more and more links in everyday life and commerce have moved online. The demand for efficiency keeps pushing the demand for real-time processing higher, and the exchange of information is evolving from point-to-point into information chains and even information networks, so data must be cross-linked across many dimensions and a data explosion is unavoidable. Streaming and NoSQL products emerged to address, respectively, the real-time processing framework and the storage and computation of this data.

Twitter open-sourced Storm in 2011. Previously, when Internet developers built a real-time application, besides the application's own processing logic they also had to worry a great deal about the real-time transport, interaction, and distribution of the data. Now developers can quickly stand up a robust, easy-to-use real-time stream-processing framework and, combined with SQL products, NoSQL products, or a MapReduce computing platform, build at low cost many real-time products that used to be hard to imagine: for example, several products under the Quantum Hengdao (量子恒道) brand of Etao's (一淘) data department are built on a real-time stream-processing platform.

Storm's languages
Storm's main development language is Clojure, which implements the core functional logic; the auxiliary development languages are Python and Java.

Features of Storm

  1. Simple programming model

    Like Hadoop, Storm provides simple and elegant primitives for real-time computation over big data, which greatly reduces the complexity of developing parallel real-time processing and helps you build applications quickly and efficiently.

  2. Scalable

    Three kinds of entities actually run a topology in a Storm cluster: worker processes, threads, and tasks. Each machine in a Storm cluster can run multiple worker processes, each worker process can create multiple threads, and each thread can execute multiple tasks. The task is the actual data-processing entity, and the spouts and bolts we develop are executed as one or more tasks. Computation is therefore carried out in parallel across multiple threads, processes, and servers, which supports flexible horizontal scaling; see the configuration sketch after this item.
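    As a rough illustration of how that parallelism is configured (a minimal sketch assuming the standard Java TopologyBuilder/Config API; WordSpout and CountBolt are hypothetical components invented for this sketch, and exact signatures may differ slightly between Storm versions):

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

// Fragment only: WordSpout and CountBolt are hypothetical spout/bolt classes.
TopologyBuilder builder = new TopologyBuilder();

// 2 executor threads for the spout.
builder.setSpout("words", new WordSpout(), 2);

// 4 executor threads for the bolt, running 8 tasks in total,
// so 2 tasks share each executor thread.
builder.setBolt("count", new CountBolt(), 4)
       .setNumTasks(8)
       .shuffleGrouping("words");

Config conf = new Config();
conf.setNumWorkers(2);   // spread the executors over 2 worker (JVM) processes
```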

  3. High reliability

    Storm guarantees that every message emitted by a spout is "fully processed". A spout message may trigger the generation of thousands of downstream messages, which can be pictured as a message tree with the spout's message at the root. Storm tracks the processing of this tree and considers the spout's message "fully processed" only when every message in the tree has been processed. If processing of any message in the tree fails, or the whole tree is not fully processed within a time limit, the spout's message is re-emitted. To minimize memory consumption, Storm does not track every message in the tree individually; instead it uses a special strategy that tracks the tree as a whole, XOR-ing the unique IDs of all messages in the tree to decide whether the spout's message has been fully processed. This greatly saves memory and simplifies the decision logic; the mechanism is described in detail later.

    In this mode, every message that is sent is accompanied by a synchronous ack/fail message, which consumes some network bandwidth. If the reliability requirement is not high, you can use a different emit interface (one that does not anchor the tuple) to turn this mode off; see the sketch after this item.

    As mentioned above, Storm guarantees that each message is processed at least once, but some computations strictly require each message to be processed exactly once. Storm 0.7.0 introduced transactional topologies to solve this problem.
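    The following minimal sketch shows how a bolt participates in this mechanism by anchoring the tuples it emits and then acking the input (SplitBolt and its field names are invented for this sketch, and exact signatures differ slightly between Storm versions):

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch of a "reliable" bolt: every emitted tuple is anchored to the
// input tuple, and the input is acked (or failed) explicitly.
public class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            for (String word : input.getString(0).split(" ")) {
                // Anchoring: the new tuple becomes a child of `input`
                // in the message tree that Storm tracks.
                collector.emit(input, new Values(word));
            }
            collector.ack(input);   // tell Storm this node of the tree is done
        } catch (Exception e) {
            collector.fail(input);  // triggers a replay from the spout
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```

    Emitting without the anchor, e.g. collector.emit(new Values(word)), detaches the new tuple from the tree; this is the "different emit interface" mentioned above for turning reliability off.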

  4. High fault tolerance

    If exceptions occur while messages are being processed, Storm reschedules the problematic processing unit. Storm ensures that a processing unit runs forever (unless you explicitly kill it). Of course, if intermediate state is held inside the processing unit, then when Storm restarts it, the unit itself must take care of recovering that intermediate state.

  5. Fast

    Speed here mainly refers to latency. Storm transfers data directly over the network and computes in memory, so its latency is necessarily much lower than Hadoop's, which moves data through HDFS. When the computation model is suited to streaming, Storm's streaming approach eliminates the time batch processing spends collecting data, and because Storm runs as a long-lived service rather than submitting jobs, it also eliminates job-scheduling delay. So in terms of latency, Storm is faster than Hadoop.

    Consider a typical scenario: thousands of log producers generate log files that need some ETL operations before being stored in a database.

    Suppose Hadoop is used. The data first has to be written into HDFS and cut into files at, say, one-minute granularity (which is already extremely fine; any finer and HDFS would fill up with small files). By the time Hadoop starts to compute, one minute has already passed; scheduling the job takes roughly another minute; then the job runs and, assuming plenty of machines, finishes in a few seconds; writing to the database is assumed to take very little time. So from the moment the data is generated to the moment it can be used, at least two minutes have passed.

    With streaming computation, by contrast, a program continuously monitors the log as it is produced; each new line is passed through a transport system to the streaming computation system, which processes it directly and writes the result straight into the database. With sufficient resources, each piece of data can go from production to the database with millisecond-level latency.

  6. Supports multiple programming languages

    In addition to implementing spouts and bolts in Java, you can use any programming language you are familiar with, thanks to Storm's so-called multi-lang protocol. The multi-lang protocol is a special protocol inside Storm that lets a spout or bolt exchange messages over standard input and standard output, encoded as single lines of text or as multi-line JSON; see the sketch after this item.
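    For example, following the pattern used in the storm-starter examples (a sketch; the script name splitsentence.py is hypothetical), the Java side of a non-Java bolt only declares how the external process is launched and which fields it outputs, while the actual processing lives in the external script speaking the multi-lang JSON protocol over stdin/stdout:

```java
import java.util.Map;

import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Java-side shell of a bolt whose logic is implemented in Python.
// Storm launches "python splitsentence.py" and exchanges tuples with it
// as JSON over standard input/output.
public class SplitSentence extends ShellBolt implements IRichBolt {

    public SplitSentence() {
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```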

  7. Supports local mode

    Storm has a "local mode" that simulates all the functionality of a Storm cluster inside a single process. Running a topology in local mode is similar to running it on a cluster, which is very useful for development and testing; see the sketch after this item.
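    A minimal sketch of local mode (assuming the standard LocalCluster API; the topology name is a placeholder and the spout/bolt wiring is elided, and exact signatures vary slightly between Storm versions):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

// Fragment: assumes `builder` has been wired with spouts and bolts,
// e.g. as in the parallelism sketch earlier in this article.
TopologyBuilder builder = new TopologyBuilder();
// builder.setSpout(...); builder.setBolt(...).shuffleGrouping(...);

Config conf = new Config();
conf.setDebug(true);

// Local mode: the whole "cluster" runs inside the current JVM process.
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("test-topology", conf, builder.createTopology());

Thread.sleep(10_000);                  // let the topology run for a while
cluster.killTopology("test-topology");
cluster.shutdown();
```

    On a real cluster, the same topology object is submitted with StormSubmitter instead of LocalCluster; nothing else needs to change.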

The composition of Storm
There are two types of nodes in a Storm cluster: the control node (master node) and the worker nodes. The control node runs a daemon called Nimbus, which plays a role similar to the JobTracker in Hadoop: Nimbus is responsible for distributing code within the cluster, assigning computation tasks to machines, and monitoring their status.

Each worker node runs a daemon called the Supervisor. The Supervisor listens for the work assigned to its machine and starts or stops worker processes as needed. Each worker process executes a subset of one topology; a running topology consists of many worker processes spread over many machines. All coordination between Nimbus and the Supervisors is done through a ZooKeeper cluster. In addition, the Nimbus process and the Supervisor processes are both fail-fast and stateless: all state lives either in ZooKeeper or on local disk. This means you can kill the Nimbus and Supervisor processes with kill -9 and restart them as if nothing had happened, a design that makes Storm exceptionally stable. Let us look at these roles more concretely. Nimbus: responsible for resource allocation and task scheduling. Supervisor: responsible for accepting the tasks assigned by Nimbus and for starting and stopping the worker processes it manages. Worker: a process that runs the concrete processing-component logic. Task: each spout/bolt thread inside a worker is called a task; since Storm 0.8, a task no longer corresponds to a physical thread, and tasks of the same spout/bolt may share a physical thread, which is called an executor. The figure below shows the relationship between these roles.

[Figure: relationship between Nimbus, Supervisor, Worker, executor, and Task]

Topology Fundamentals
On the surface a Storm cluster looks very similar to a Hadoop cluster, but what runs on Hadoop are MapReduce jobs, while what runs on Storm are topologies, and the two are very different. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (unless you kill it manually).

1. Topologies: A topology is a graph made up of spouts and bolts, connected by stream groupings, for example:

[Figure: an example topology graph of spouts and bolts connected by stream groupings]

A topology runs until you kill it manually. Storm automatically reassigns failed tasks, and it guarantees that no data is lost (if high reliability is turned on). If some machines stop unexpectedly, all the tasks running on them are moved to other machines.

2. Streams: The data stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed environment. A stream is defined by a schema that names the fields of its tuples. By default a tuple can contain integers, longs, shorts, bytes, doubles, floats, booleans, and byte arrays, i.e. the primitive types; you can also put custom types into tuples by defining their serialization.

3. Spouts: A spout is a source of streams in a topology. Typically a spout reads tuples from an external data source and emits them into the topology. Depending on requirements, a spout can be defined as either reliable or unreliable: a reliable spout can re-emit a tuple when its processing fails, ensuring that every tuple is handled correctly, while an unreliable spout does nothing further with a tuple once it has been emitted. A spout can emit more than one stream; to do so, declare the different streams with the declareStream method of OutputFieldsDeclarer, and pass the stream ID as a parameter when emitting data with the emit method of SpoutOutputCollector. The key method of a spout is nextTuple: as the name implies, nextTuple either emits a new tuple into the topology or simply returns when there is nothing to emit. Note that because Storm calls all the spout methods on the same thread, nextTuple must not be blocked by any other spout method, or the stream will be interrupted. The other two key spout methods are ack and fail, which Storm calls after it detects that an emitted tuple has been fully processed or has failed; ack and fail are only invoked for the "reliable" spouts described above.

4. Bolts: All data processing in a topology is done by bolts. With capabilities such as filtering, functions, aggregations, joins, and database interaction, bolts can satisfy almost any data-processing need. A single bolt implements a simple stream transformation; more complex transformations usually require several bolts working in multiple steps. For example, transforming a stream of tweets into a stream of trending images takes at least two steps: one bolt keeps a rolling count of the retweets of each image, and one or more bolts emit the "most retweeted images" result stream (using more than two bolts makes this transformation more scalable). Like spouts, bolts can emit more than one stream; to do so, declare the streams with the declareStream method of OutputFieldsDeclarer, and pass the stream ID as a parameter when emitting data with the emit method of OutputCollector. When defining a bolt's input streams you subscribe to specific streams of other components; if you need to subscribe to all the streams of another component, you must subscribe to each one separately when defining the bolt. For streams declared with the default ID ("default"), InputDeclarer provides syntactic sugar: subscribing to the stream of component "1" with declarer.shuffleGrouping("1") is equivalent to the declaration declarer.shuffleGrouping("1", DEFAULT_STREAM_ID). The key method of a bolt is execute, which receives a tuple as input and emits new tuples through the OutputCollector object. If message reliability is required, the bolt must call OutputCollector's ack method for every tuple it processes, so that Storm knows when the tuple is complete (and can eventually decide whether it is safe to acknowledge the tuple tree back to the originating spout). In general, for each input tuple a bolt may emit zero or more new tuples and then acknowledge (ack) the input tuple; the IBasicBolt interface can perform this acknowledgement automatically.

5. Stream groupings: Defining a stream grouping for each bolt is an important part of defining a topology. The stream grouping determines how the stream is partitioned among the bolt's tasks. Storm has eight built-in stream groupings, and you can implement a custom grouping through the CustomStreamGrouping interface. The eight groupings are:

  1. Shuffle grouping: tuples are distributed randomly across the bolt's tasks, so each task receives roughly the same number of tuples and the cluster load stays balanced.

  2. Fields grouping: the stream is partitioned by the declared fields. For example, if the stream is grouped by a field named "user-id", all tuples with the same "user-id" go to the same task, which keeps message processing consistent per key.

  3. Partial key grouping: similar to fields grouping in that the stream is partitioned by the declared fields, but this grouping also balances the load across the downstream bolt's tasks, giving better performance when the keys in the input are skewed. Interested readers can refer to the paper on this grouping, which explains in detail how it works and what its advantages are.

  4. All grouping: the stream is sent to every task of the bolt (each tuple is replicated and processed by all tasks), so use this grouping with special care.

  5. Global grouping: the entire stream goes to a single task of the bolt, specifically the task with the smallest ID.

  6. None grouping: declares that you do not care how the stream is grouped. Currently this behaves exactly like shuffle grouping, but in the future the Storm community may use none groupings to run a bolt in the same thread as the spout or bolt it subscribes to.

  7. Direct grouping: a special grouping in which the producer of a tuple decides which task of the consumer receives it. Direct groupings can only be used on streams that have been declared as direct streams, and tuples on such streams must be emitted with one of the emitDirect methods of OutputCollector. A bolt can obtain the task IDs of its consumers either from the TopologyContext or by tracking the return value of OutputCollector's emit method, which is the list of task IDs the emitted tuple was sent to.

  8. Local or shuffle grouping: if the target bolt has one or more tasks in the same worker process as the source component, tuples are shuffled only among those in-process tasks; otherwise this behaves like an ordinary shuffle grouping.

6. Tasks: Each spout and bolt executes as several tasks across the cluster, and each task corresponds to one thread of execution. Stream groupings determine how tuples are sent from one set of tasks to another. The parallelism of a spout or bolt is set with the setSpout and setBolt methods of TopologyBuilder.

7. Workers: A topology runs in one or more worker processes, each of which is a real JVM process executing a subset of the topology. For example, if the combined parallelism of a topology is 300 and 50 worker processes are allocated, each worker process runs 6 tasks (as threads within the process). Storm spreads the tasks evenly across all the workers to balance the load on the cluster.
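To make the stream-ID and grouping mechanics above concrete, here is a hedged sketch (component names, stream IDs, and field names are invented for illustration, and exact signatures vary slightly between Storm versions) of a bolt that declares two named output streams, followed by the TopologyBuilder wiring that subscribes to them with different groupings:

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A bolt that emits on two named streams instead of the default one.
public class RouterBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String userId = input.getStringByField("user-id");
        // The first argument of emit selects the output stream;
        // the second anchors the new tuple to the input.
        if (userId != null) {
            collector.emit("valid", input, new Values(userId));
        } else {
            collector.emit("invalid", input, new Values(input.toString()));
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("valid", new Fields("user-id"));
        declarer.declareStream("invalid", new Fields("raw"));
    }
}
```

The wiring below (EventSpout, UserCountBolt, and ErrorLogBolt are hypothetical) subscribes to the two named streams with different groupings: a fields grouping keeps all tuples of one "user-id" in the same task, while a shuffle grouping simply load-balances.

```java
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new EventSpout(), 2);
builder.setBolt("router", new RouterBolt(), 4).shuffleGrouping("events");

// Subscribe to the "valid" stream with a fields grouping on "user-id",
// and to the "invalid" stream with a shuffle grouping.
builder.setBolt("per-user", new UserCountBolt(), 4)
       .fieldsGrouping("router", "valid", new Fields("user-id"));
builder.setBolt("errors", new ErrorLogBolt(), 1)
       .shuffleGrouping("router", "invalid");

Config conf = new Config();
conf.setNumWorkers(2);
```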
