Streaming Big Data: Storm, Spark and Samza (reprint)


Original address: http://www.javacodegeeks.com/2015/02/streaming-big-data-storm-spark-samza.html

There are a number of distributed computation systems that can process Big Data in real time or near real time. This article starts with a short description of three Apache frameworks, and attempts to provide a quick, high-level overview of some of their similarities and differences.

Apache Storm

In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster, where the master node distributes the code among worker nodes that execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter, etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline.
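
For illustration, here is a minimal sketch in Java of how such a topology is wired together. The spout class (SentenceSpout) and the topology/stream names are hypothetical and assumed to exist elsewhere; only the bolt and the wiring are shown concretely.

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class SentenceTopology {

        // A bolt transforms the incoming stream: one sentence tuple in,
        // several word tuples out.
        public static class SplitSentenceBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                for (String word : tuple.getStringByField("sentence").split(" ")) {
                    collector.emit(new Values(word));
                }
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // The spout is the stream source; the bolt subscribes to its output.
            // SentenceSpout is a hypothetical spout emitting a "sentence" field.
            builder.setSpout("sentences", new SentenceSpout(), 1);
            builder.setBolt("split", new SplitSentenceBolt(), 2)
                   .shuffleGrouping("sentences");

            // Run locally for a few seconds; on a real cluster you would use
            // StormSubmitter.submitTopology(...) instead.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("sentence-topology", new Config(), builder.createTopology());
            Thread.sleep(10_000);
            cluster.shutdown();
        }
    }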

Apache Spark

Spark Streaming (an extension of the core Spark API) doesn't process streams one at a time like Storm. Instead, it slices them into small batches of time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a micro-batch of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated on in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations).
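
As a rough sketch of this model (assuming the Spark 2.x Java API and an illustrative socket source on localhost:9999), here is a DStream pipeline with a windowed word count: each 5-second batch becomes an RDD, and the window operation recomputes counts over the last 30 seconds.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import scala.Tuple2;

    public class StreamingWordCount {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
            // Each DStream batch covers a 5-second slice of the input
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
            JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

            // Windowed computation: counts over the last 30 seconds, updated every 5
            JavaPairDStream<String, Integer> counts = words
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKeyAndWindow((i1, i2) -> i1 + i2, Durations.seconds(30), Durations.seconds(5));

            counts.print();
            jssc.start();
            jssc.awaitTermination();
        }
    }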

Apache Samza

Samza's approach to streaming is to process messages as they are received, one at a time. Samza's stream primitive is not a tuple or a DStream, but a message. Streams are divided into partitions, and each partition is an ordered sequence of read-only messages in which each message has a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza's execution and streaming modules are both pluggable, although Samza typically relies on Hadoop's YARN (Yet Another Resource Negotiator) and Apache Kafka.
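
A minimal sketch of this one-message-at-a-time model, using Samza's classic low-level API, might look like the following. The "kafka" system name and the stream names are illustrative assumptions, as is the idea of filtering out empty messages.

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    public class FilterTask implements StreamTask {
        private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-pageviews");

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            // Samza delivers messages from a stream partition in offset order,
            // one envelope per call to process().
            String message = (String) envelope.getMessage();
            if (message != null && !message.isEmpty()) {
                collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
            }
        }
    }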

Common Ground

All three real-time computation systems are open-source, low-latency, distributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines, with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations.

The three frameworks use different vocabularies for similar concepts: Storm's stream primitive is the tuple, Spark Streaming's is the DStream, and Samza's is the message.

Comparison Matrix

A few of the differences are summarized below:

There are three general categories of delivery patterns:

    1. at-most-once: messages may be lost. This is usually the least desirable outcome.
    2. at-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases.
    3. exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature, although it is difficult to guarantee in all cases.

Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you'll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident.
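
To illustrate Samza's embedded key-value store, here is a sketch using the classic (pre-1.0) InitableTask/StreamTask API. The store name "counts", the assumption that each message carries a page ID, and the stream wiring are illustrative; the store itself would also have to be declared in the job configuration (factory, serdes, changelog).

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    public class PageViewCounterTask implements StreamTask, InitableTask {
        private KeyValueStore<String, Integer> counts;

        @Override
        @SuppressWarnings("unchecked")
        public void init(Config config, TaskContext context) {
            // State lives next to the task, backed by local disk and replicated
            // to a changelog stream for fault tolerance.
            counts = (KeyValueStore<String, Integer>) context.getStore("counts");
        }

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            // Assumes each message is a page ID; increment its local count.
            String pageId = (String) envelope.getMessage();
            Integer current = counts.get(pageId);
            counts.put(pageId, current == null ? 1 : current + 1);
        }
    }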

Use Cases

All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines.

If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you'll have distributed RPC (DRPC) out of the box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching.
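
As a sketch of what the Trident API looks like (modeled on the word-count style of the Trident documentation; the input sentences here are illustrative), the topology below keeps running word counts in Trident-managed state, which is what gives it exactly-once update semantics.

    import org.apache.storm.generated.StormTopology;
    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.trident.operation.BaseFunction;
    import org.apache.storm.trident.operation.TridentCollector;
    import org.apache.storm.trident.operation.builtin.Count;
    import org.apache.storm.trident.testing.FixedBatchSpout;
    import org.apache.storm.trident.testing.MemoryMapState;
    import org.apache.storm.trident.tuple.TridentTuple;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class TridentWordCount {

        // Splits each sentence tuple into one tuple per word.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static StormTopology buildTopology() {
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                    new Values("the cow jumped over the moon"),
                    new Values("the man went to the store"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            // persistentAggregate keeps the running counts in Trident-managed state
            // (an in-memory map here; a database-backed state in production).
            topology.newStream("sentences", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

            return topology.build();
        }
    }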

A few companies using Storm: Twitter, Yahoo!, Spotify, the Weather Channel...

Speaking of micro-batching, if you must have stateful computations and exactly-once delivery, and don't mind a higher latency, you could consider Spark Streaming... especially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real time.
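
As a rough sketch of that combination (assuming MLlib's StreamingKMeans and an illustrative socket source emitting space-separated 2-D points), the model's cluster centers are updated as each micro-batch arrives:

    import org.apache.spark.SparkConf;
    import org.apache.spark.mllib.clustering.StreamingKMeans;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingKMeansExample {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("StreamingKMeans").setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            // Each line is assumed to be a space-separated 2-D point, e.g. "0.5 1.2"
            JavaDStream<Vector> points = jssc.socketTextStream("localhost", 9999)
                    .map(line -> {
                        String[] parts = line.split(" ");
                        double[] values = new double[parts.length];
                        for (int i = 0; i < parts.length; i++) {
                            values[i] = Double.parseDouble(parts[i]);
                        }
                        return Vectors.dense(values);
                    });

            // k=3 clusters over 2-D points; centers updated per micro-batch
            StreamingKMeans model = new StreamingKMeans()
                    .setK(3)
                    .setDecayFactor(1.0)
                    .setRandomCenters(2, 0.0, 42L);
            model.trainOn(points);
            model.predictOn(points).print();

            jssc.start();
            jssc.awaitTermination();
        }
    }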

A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu...

If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing you to work efficiently with state that won't fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza's fine-grained jobs would be particularly well-suited, since they can be added/removed with minimal ripple effects.

A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale...

Conclusion

We only scratched the surface of the three Apaches. We didn't cover a number of other features and more subtle differences between these frameworks. Also, it is important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.
