Three kinds of frameworks for streaming big data processing: Storm, Spark and Samza



Many distributed computing systems can process big data streams in real time or near real time. This article briefly introduces three such Apache frameworks, Storm, Spark, and Samza, and then gives a quick, high-level overview of their similarities and differences.


Apache Storm

In Storm, we first design a graph structure for real-time computation, which we call a topology. The topology is submitted to the cluster; the master node distributes the code and assigns tasks to the worker nodes. A topology contains two kinds of roles: spouts and bolts. A spout emits the data stream as a sequence of tuples, while a bolt transforms the stream, performing operations such as computation and filtering; a bolt can also emit data onward to other bolts. The tuples emitted by a spout are immutable, each corresponding to a fixed set of key-value pairs.
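The spout/bolt pipeline described above can be sketched without any framework. The following is a minimal, framework-free illustration (it does not use the real Storm API): a spout emits immutable tuples, one bolt splits them, and another bolt aggregates them.

```python
# A framework-free sketch of the spout/bolt idea (not the real Storm API):
# a spout emits immutable tuples, and bolts transform or aggregate them.
from collections import Counter

def sentence_spout():
    """Spout: emits the data stream as a sequence of immutable tuples."""
    for sentence in ["the cat", "the dog", "the cat"]:
        yield (sentence,)          # a tuple of fixed fields, here just one

def split_bolt(tuples):
    """Bolt: transforms each incoming tuple, emitting one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)

def count_bolt(tuples):
    """Bolt: aggregates the stream, keeping a running word count."""
    counts = Counter()
    for (word,) in tuples:
        counts[word] += 1
    return dict(counts)

# Wire the "topology": spout -> split bolt -> count bolt.
result = count_bolt(split_bolt(sentence_spout()))
print(result)  # {'the': 3, 'cat': 2, 'dog': 1}
```

In real Storm, each spout and bolt runs as parallel tasks across the cluster rather than as chained generators on one machine.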

Apache Spark

Spark Streaming is an extension of the core Spark API. Unlike Storm, it does not process the data stream one record at a time; instead, it splits the stream into batches at fixed intervals before processing. Spark's abstraction for a continuous stream is called a DStream (Discretized Stream), which is a sequence of micro-batches, each a RDD (Resilient Distributed Dataset). An RDD is a distributed dataset that can be operated on in parallel in two ways: by applying arbitrary transformation functions, and by sliding-window operations over the data.
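The micro-batching idea can be illustrated without Spark itself. The toy sketch below (not the Spark API) groups a stream into fixed-size micro-batches and then applies a transformation over a sliding window of batches:

```python
# A toy illustration (not the Spark API) of how Spark Streaming discretizes
# a stream: records are grouped into micro-batches, and a transformation is
# applied over a sliding window of the most recent batches.
def micro_batches(stream, batch_size):
    """Split a continuous stream into fixed-size micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def sliding_window(batches, window_len):
    """Each result covers the last `window_len` micro-batches."""
    window = []
    for batch in batches:
        window.append(batch)
        if len(window) > window_len:
            window.pop(0)
        # an arbitrary transformation applied over the window: here, a sum
        yield sum(x for b in window for x in b)

stream = iter(range(1, 9))                      # records 1..8
windowed = list(sliding_window(micro_batches(stream, 2), 2))
print(windowed)  # [3, 10, 18, 26]
```

In real Spark Streaming the batching interval is a wall-clock duration rather than a record count, and each batch is an RDD processed in parallel across the cluster.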

Apache Samza

Samza processes each received message individually as it consumes a data stream. Samza's stream unit is neither a tuple nor a DStream, but a message. In Samza, a stream is divided into partitions, each of which is an ordered sequence of read-only messages, and each message has a unique ID (offset). The system also supports batching, that is, processing several messages from the same stream partition in succession. Samza's execution and streaming modules are both pluggable, although Samza is characterized by its dependence on Hadoop's YARN (Yet Another Resource Negotiator) and Apache Kafka.
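The partition-and-offset model can be sketched in a few lines. The following is a minimal illustration (not the Samza API): a partition is an ordered, read-only log whose messages are addressed by offset and processed one at a time.

```python
# A minimal sketch (not the Samza API) of a partitioned stream: a partition
# is an ordered, read-only log of messages, each addressed by its offset.
class Partition:
    def __init__(self, messages):
        self._messages = list(messages)  # read-only, ordered log

    def read(self, offset):
        """Return the message stored at a specific offset."""
        return self._messages[offset]

    def __iter__(self):
        # yields (offset, message) pairs in log order
        return iter(enumerate(self._messages))

partition = Partition(["click", "view", "click"])
processed = []
for offset, message in partition:        # one message at a time
    processed.append((offset, message.upper()))
print(processed)  # [(0, 'CLICK'), (1, 'VIEW'), (2, 'CLICK')]
```

Because every message carries its offset, a consumer that crashes can resume from the last offset it committed, which is how Kafka-backed systems recover.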

What they have in common

All three are open-source, distributed, real-time computation systems with low latency, scalability, and fault tolerance. They all let you run your stream-processing code in parallel across a cluster of fault-tolerant machines. In addition, each provides a simple API that hides the complexity of the underlying implementation.

The three frameworks use different terminology, but the concepts they represent are very similar:

Comparison chart

The following table summarizes some of the differences:

Message delivery guarantees fall into three main categories:

    1. At most once (at-most-once): messages may be lost. This is usually the least desirable outcome.
    2. At least once (at-least-once): messages may be redelivered (no loss, but possible duplicates). This is sufficient for many use cases.
    3. Exactly once (exactly-once): each message is delivered once and only once (no loss, no duplicates). This is the ideal, although it is hard to guarantee in all use cases.
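A common way to obtain exactly-once behavior on top of an at-least-once channel is to deduplicate by message ID on the consumer side. The toy example below (an illustration, not any framework's API) shows the idea:

```python
# At-least-once delivery may hand the consumer the same message twice.
# Tracking message ids turns it into an effectively-once stream.
def dedupe(delivered):
    """Drop redelivered messages by remembering the ids already seen."""
    seen, out = set(), []
    for msg_id, payload in delivered:
        if msg_id not in seen:
            seen.add(msg_id)
            out.append(payload)
    return out

# An at-least-once delivery: message 2 was retried and arrives twice.
delivered = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
print(dedupe(delivered))  # ['a', 'b', 'c']
```

In production systems the `seen` set must itself be stored durably (and trimmed), which is exactly why exactly-once semantics are hard to guarantee in all cases.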

Another aspect is state management: the frameworks use different strategies for storing state. Spark Streaming writes data to a distributed file system (for example, HDFS); Samza uses an embedded key-value store; and in Storm, you either roll your own state management at the application level or use the higher-level abstraction Trident.
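The embedded key-value approach can be sketched as follows. This is a toy stand-in (a plain dictionary in place of an embedded store such as RocksDB, which Samza can use) showing why local state makes per-message updates cheap:

```python
# A sketch of embedded key-value state: each task keeps its state in a
# local store rather than writing to a remote file system on every update.
class LocalKVState:
    def __init__(self):
        self._store = {}  # stands in for an embedded store (e.g. RocksDB)

    def get(self, key, default=0):
        return self._store.get(key, default)

    def put(self, key, value):
        self._store[key] = value

def count_events(events):
    """Maintain a per-user event count as messages are processed."""
    state = LocalKVState()
    for user in events:
        state.put(user, state.get(user) + 1)
    return {u: state.get(u) for u in sorted(set(events))}

print(count_events(["alice", "bob", "alice"]))  # {'alice': 2, 'bob': 1}
```

Keeping the store on the same machine as the task avoids a network round trip per message; durability is then recovered by replaying the input log from a committed offset.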


All three frameworks perform well and efficiently when processing large volumes of real-time data continuously, so which one should you use? There is no hard rule for choosing; at best there are a few guidelines.

If you want a high-speed event-processing system that allows incremental computation, Storm is the best choice. It can also handle distributed computations on demand, while the client waits for the results, using its out-of-the-box distributed RPC (DRPC). Last but not least: because Storm uses Apache Thrift, you can write topologies in any programming language. If you need durable state and/or exactly-once delivery, you should look at the higher-level Trident API, which also provides micro-batching.

Companies using Storm include Twitter, Yahoo, Spotify, and The Weather Channel.

As for micro-batching: if you need stateful computation and exactly-once delivery and do not mind higher latency, consider Spark Streaming, especially if you also plan to do graph processing, machine learning, or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient, unified programming model. In particular, streaming algorithms such as streaming k-means make it easy to apply Spark to real-time decision making.

Companies using Spark include Amazon, Yahoo, NASA JPL, eBay, and Baidu.

If you have large amounts of state to manage, for example many gigabytes per partition, Samza is a good choice. Because Samza colocates storage and processing on the same machine, it can maintain efficient processing without loading extra state into memory. The framework offers a flexible, pluggable API: its default execution, messaging, and storage engines can each be replaced with your own choice. Moreover, if you have many stream-processing stages owned by separate teams in different codebases, Samza's fine-grained jobs are particularly useful, since they can be added or removed with minimal cross-team impact.

Companies using Samza include LinkedIn, Intuit, Metamarkets, Quantiply, and Fortscale.


In this article we have only taken a brief look at these three Apache frameworks, without covering their many features and subtler differences. The comparison is also limited by the fact that all three frameworks are constantly evolving, which we should keep in mind.

Original link: Streaming Big Data: Storm, Spark and Samza (compiled by Wei Sun; edited by Zhou Jianding)

