Three Apache frameworks commonly used to process big data streams: Storm, Spark, and Samza (mainly about Storm)

Last Update:2017-08-02 Source: Internet

Author: User

The most common way to process real-time big data streams is with a distributed computing system. This article describes the three main Apache frameworks for processing big data streams:

Apache Storm

Apache Storm is a distributed real-time big data processing system. It is designed to process large amounts of data in a fault-tolerant and horizontally scalable way, and it is a streaming data framework capable of very high ingestion rates. Although Storm is stateless, it manages the distributed environment and the cluster state through Apache ZooKeeper. It is simple to use and supports running all kinds of operations on real-time data in parallel. Apache Storm continues to be a leader in real-time data analytics because it is easy to set up and operate, and it guarantees that each message will be processed through the topology at least once.

A Storm application is usually designed as a graph-like structure for real-time computation, called a topology. After the topology is submitted to the cluster, the master node distributes the code and assigns tasks to the worker nodes. Two kinds of components do the work in a topology: spouts and bolts. A spout is a message source that emits the data stream as tuples (immutable, fixed sets of key-value pairs). A bolt transforms these streams: computation, filtering, and other operations happen inside bolts, and bolts can also send messages on to other bolts. Here is Storm's cluster design and its internal architecture.
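The spout and bolt roles described above can be illustrated with a minimal sketch in plain Python. It does not use the real Storm API; every function name here is hypothetical, and the whole pipeline runs in one process rather than on a cluster.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

# All names below are hypothetical. This only mimics the spout/bolt roles
# in a single process and is not the Storm API.


def sentence_spout(lines: Iterable[str]) -> Iterator[Tuple[str]]:
    """Spout: turn a raw source into a stream of one-field tuples."""
    for line in lines:
        yield (line,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: transform each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def filter_bolt(stream: Iterable[Tuple[str]], min_length: int = 3) -> Iterator[Tuple[str]]:
    """Bolt: drop words shorter than min_length."""
    for (word,) in stream:
        if len(word) >= min_length:
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Counter:
    """Bolt: aggregate the stream into word counts."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


def run_topology(lines: Iterable[str]) -> Counter:
    """Wire the graph: spout -> split -> filter -> count."""
    return count_bolt(filter_bolt(split_bolt(sentence_spout(lines))))


counts = run_topology(["Storm processes streams", "Storm is fast"])
```

In a real topology each stage would run as many parallel tasks on different worker nodes; chaining generators only shows how tuples flow from one component to the next.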

Twitter uses the Storm framework to handle streaming big data scenarios:

The input to Twitter analytics comes from the Twitter Streaming API. A spout reads users' tweets through the Twitter Streaming API and emits them as a stream of tuples. Each tuple from the spout holds a Twitter username and a single tweet as comma-separated values. The stream of tuples is then forwarded to a bolt, which splits each tweet into individual words, computes the word counts, and saves the results to the configured data source. The results can then be retrieved by querying that data source.

Apache Storm vs Hadoop

Both the Hadoop and Storm frameworks are used to analyze big data. They complement each other and differ in some respects. Apache Storm performs all operations except persistence, while Hadoop is good at everything but lags in real-time computation. The following table compares the properties of Storm and Hadoop.

Storm	Hadoop
Real-time stream processing	Batch processing
Stateless	Stateful
Master/slave architecture with ZooKeeper-based coordination. The master node is called Nimbus, and the slave nodes are supervisors.	Master/slave architecture with or without ZooKeeper-based coordination. The master node is the JobTracker, and the slave nodes are TaskTrackers.
A Storm streaming process can handle tens of thousands of messages per second on a cluster.	The Hadoop Distributed File System (HDFS) uses the MapReduce framework to process large amounts of data, which can take minutes or hours.
A Storm topology runs until the user shuts it down or an unexpected unrecoverable failure occurs.	MapReduce jobs are executed in sequential order and eventually complete.
Distributed and fault-tolerant	Distributed and fault-tolerant
If Nimbus or a supervisor dies, restarting it makes it continue from where it stopped, so nothing is affected.	If the JobTracker dies, all running jobs are lost.
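The first rows of the table, stream processing versus batch processing, can be made concrete with a small sketch in plain Python. Neither function uses Storm or Hadoop; the names are hypothetical and the only point is when a result becomes available.

```python
from typing import Iterable, Iterator, List


def batch_total(records: List[int]) -> int:
    """Batch style: needs the complete data set, returns once, then finishes."""
    return sum(records)


def streaming_totals(records: Iterable[int]) -> Iterator[int]:
    """Stream style: emits an updated result after every incoming message."""
    total = 0
    for value in records:
        total += value
        yield total


data = [3, 1, 4]
batch_result = batch_total(data)         # one answer after all data is read
running = list(streaming_totals(data))   # an answer after each message
```

Here `running` holds one intermediate total per message, while `batch_result` only exists once the whole input has been consumed.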

Examples of using Apache Storm

Apache Storm is well known for real-time big data stream processing, so many companies use Storm as an integral part of their systems. Some notable examples follow.

Twitter - Twitter uses Apache Storm for its "Publisher Analytics" products, which process every tweet and click on the Twitter platform. Apache Storm is deeply integrated with the Twitter infrastructure.

NaviSite - NaviSite uses Storm for its event log monitoring and auditing system. Every log generated in the system passes through Storm. Storm checks each message against the configured set of regular expressions, and if there is a match, that message is saved to the database.
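The log-monitoring pattern in the NaviSite example can be sketched as a single matching step. This is plain Python with hypothetical patterns and an in-memory list standing in for the database; it is an illustration of the idea, not NaviSite's actual implementation.

```python
import re
from typing import Iterable, List

# Hypothetical rule set; a real system would load these from configuration.
ALERT_PATTERNS = [
    re.compile(r"\bERROR\b"),
    re.compile(r"login failed for user \w+", re.IGNORECASE),
]


def matching_bolt(log_lines: Iterable[str], patterns, store: List[str]) -> None:
    """Save every log line that matches at least one configured pattern."""
    for line in log_lines:
        if any(pattern.search(line) for pattern in patterns):
            store.append(line)  # stand-in for a database write


database: List[str] = []
matching_bolt(
    [
        "INFO service started",
        "ERROR disk full",
        "WARN Login failed for user alice",
        "INFO heartbeat",
    ],
    ALERT_PATTERNS,
    database,
)
```

Only the two lines that match a pattern end up in `database`; the informational lines pass through without being stored.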

Wego - Wego is a travel metasearch engine based in Singapore. Travel-related data comes from many sources around the world and arrives at different times. Storm helps Wego search real-time data, resolve concurrency problems, and find the best match for the end user.

Advantages of Apache Storm

Storm is a real-time, continuous distributed computing framework: once a topology is running, it stays in a processing or waiting state until you kill it, which is not how Spark and Hadoop jobs behave. Each of these frameworks has its own advantages and its own best-fit scenarios, and Storm is the strongest fit for stream computing. Storm is written in Java and Clojure and does its computation entirely in memory, which is why its authors position it as a distributed real-time computing system. Storm means for real-time computing what Hadoop means for batch processing.

Storm's application scenarios:
1) Stream data processing. Storm can process incoming messages as they arrive and write the results to a data store.
2) Distributed RPC. Because Storm's processing components are distributed and processing latency is extremely low, Storm can be used as a general-purpose distributed RPC framework.
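The distributed RPC scenario follows a request/response pattern: a client submits arguments, the request is tagged with an id and sent through the processing pipeline, and the result is routed back to the waiting client by that id. The sketch below imitates this with a background thread in plain Python. The class name `DrpcServerSketch` and its methods are hypothetical and are not part of Storm.

```python
import itertools
import queue
import threading
from typing import Callable, Dict


class DrpcServerSketch:
    """Hypothetical stand-in for a DRPC server plus a one-function topology."""

    def __init__(self, function: Callable[[str], str]) -> None:
        self._function = function
        self._requests: queue.Queue = queue.Queue()
        self._pending: Dict[int, queue.Queue] = {}  # request id -> reply queue
        self._ids = itertools.count(1)
        worker = threading.Thread(target=self._run_topology, daemon=True)
        worker.start()

    def _run_topology(self) -> None:
        """Consume requests forever, compute, and route results back by id."""
        while True:
            request_id, args = self._requests.get()
            result = self._function(args)
            self._pending.pop(request_id).put(result)

    def execute(self, args: str, timeout: float = 5.0) -> str:
        """Submit a request and block until its result arrives."""
        request_id = next(self._ids)
        reply: queue.Queue = queue.Queue(maxsize=1)
        self._pending[request_id] = reply
        self._requests.put((request_id, args))
        return reply.get(timeout=timeout)


server = DrpcServerSketch(lambda text: text.upper())
answer = server.execute("storm")
```

The caller sees an ordinary blocking function call, while the actual work happens in a separate, continuously running component, which is the property that makes low processing latency matter for this use case.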

Apache Spark

Spark Streaming is an extension of the core Spark API. Unlike Storm, it does not process a data stream one record at a time; instead it splits the stream into batches at fixed time intervals and processes each batch as a small batch job. Spark's abstraction for a continuous data stream is called a DStream (Discretized Stream). A DStream is a sequence of micro-batches, each of which is an RDD (Resilient Distributed Dataset). An RDD is a distributed data set that can be operated on in parallel in two ways: transformations by arbitrary functions and sliding-window operations over the data. There are two ways that Spark submits jobs:
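The micro-batching and sliding-window ideas described above can be sketched in plain Python. This is not the Spark API: the helper names are hypothetical, events are assumed to be (timestamp, value) pairs with non-negative timestamps, and ordinary lists stand in for RDDs.

```python
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]], interval: float) -> List[List[str]]:
    """Group events into consecutive batches of `interval` seconds each."""
    by_index: Dict[int, List[str]] = {}
    for timestamp, value in events:
        by_index.setdefault(int(timestamp // interval), []).append(value)
    if not by_index:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [by_index.get(i, []) for i in range(max(by_index) + 1)]


def sliding_windows(batches: List[List[str]], window_length: int, slide: int) -> List[List[str]]:
    """Merge `window_length` batches at a time, advancing by `slide` batches."""
    windows = []
    for end in range(window_length, len(batches) + 1, slide):
        merged = [value for batch in batches[end - window_length:end] for value in batch]
        windows.append(merged)
    return windows


events = [(0.2, "a"), (0.7, "b"), (1.1, "c"), (2.5, "d"), (3.0, "e")]
batches = micro_batches(events, interval=1.0)
windows = sliding_windows(batches, window_length=2, slide=1)
```

With a one-second interval the five events fall into four batches, and a window of two batches sliding by one batch sees each interior batch twice.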

Apache Samza

Samza processes a data stream one message at a time, handling each message individually as it is received. Samza's unit of streaming is neither a tuple nor a DStream, but a message. In Samza, a data stream is divided into partitions; each partition is an ordered sequence of read-only messages, and each message has a specific ID (its offset). The system also supports batching, that is, processing several messages from the same stream partition in succession. Samza's execution and streaming modules are pluggable, although Samza characteristically relies on Hadoop YARN (Yet Another Resource Negotiator) and Apache Kafka.
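The partition and offset model described above can be sketched with a small in-memory class. This is plain Python, not the Samza or Kafka API; the class `PartitionedStream`, its methods, and the choice of a CRC32 hash of the key to pick a partition are all assumptions made for illustration.

```python
import zlib
from typing import List, Tuple


class PartitionedStream:
    """Hypothetical in-memory stream: each partition is an append-only log."""

    def __init__(self, partition_count: int) -> None:
        self.partitions: List[List[str]] = [[] for _ in range(partition_count)]

    def append(self, key: str, message: str) -> Tuple[int, int]:
        """Append a message; return the (partition, offset) it was stored at."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int, max_messages: int = 1) -> List[Tuple[int, str]]:
        """Read up to max_messages in order, starting at the given offset."""
        log = self.partitions[partition]
        return list(enumerate(log[offset:offset + max_messages], start=offset))


stream = PartitionedStream(partition_count=2)
first = stream.append("user-1", "login")
second = stream.append("user-1", "click")
third = stream.append("user-1", "logout")

# Batching: read two successive messages from the same partition.
batch = stream.read(first[0], offset=1, max_messages=2)
```

Messages with the same key land in the same partition, so their relative order is preserved, and a consumer only needs to remember one offset per partition to resume where it left off.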

Comparison of the three frameworks:

What they have in common:

All three real-time computing systems are open source and distributed, with low latency, scalability, and fault tolerance. They all let you run your stream processing code in parallel across a set of fault-tolerant machines, and they all provide simple APIs that hide the complexity of the underlying implementation.

The differences:

From an application perspective, Storm is the best choice if you want a high-speed event processing system that allows incremental computation. It can also handle further distributed computation while a client waits for the result, using its out-of-the-box distributed RPC (DRPC). Last but not least, Storm uses Apache Thrift, so you can write topologies in any programming language. If you need persistent state and/or exactly-once processing, look at the higher-level Trident API, which also provides micro-batching.
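One way micro-batching can give exactly-once results on top of at-least-once delivery is to store, next to the state, the id of the last batch applied, and to skip a batch whose id has already been recorded. The sketch below shows that idea in plain Python. It is not the Trident API: the class name is hypothetical, and it assumes batches are applied in order and that a replayed batch has the same id and contents as the original.

```python
from typing import List, Optional


class TransactionalCounter:
    """Hypothetical state that remembers the id of the last applied batch."""

    def __init__(self) -> None:
        self.count = 0
        self.last_txid: Optional[int] = None

    def apply_batch(self, txid: int, batch: List[str]) -> bool:
        """Apply a batch once; return False if it is a replay and was skipped."""
        if txid == self.last_txid:
            return False  # already counted, ignore the replay
        self.count += len(batch)
        self.last_txid = txid
        return True


state = TransactionalCounter()
state.apply_batch(1, ["a", "b"])
state.apply_batch(1, ["a", "b"])  # replay after a simulated failure
state.apply_batch(2, ["c"])
```

Because the replayed batch is recognized by its id, the count reflects each message exactly once even though the batch was delivered twice.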

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
