From http://www.dataguru.cn/article-9532-1.html
The demand for distributed stream processing is growing, driven by payment transactions, social networks, the Internet of Things (IoT), system monitoring, and more. The industry offers several frameworks for stream processing, so let's compare their similarities and differences.
Distributed stream processing is the continuous processing, aggregation, and analysis of unbounded data sets. It is a general computation model, like MapReduce, but we expect latencies of milliseconds or seconds. Such systems are generally modeled as a directed acyclic graph (DAG).
A DAG is a graph of the task chain; we use it to describe the topology of a stream processing job. Data flows from sources through the processing task chain to sinks. A single machine can run a DAG, but this article focuses on running DAGs across multiple machines.
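To make the DAG idea concrete, here is a minimal, purely illustrative Scala sketch of a topology model; all the types and names (Source, Task, Sink, Topology) are invented for this example and belong to no framework.

    // Purely illustrative: model a stream topology as a DAG whose
    // records flow from sources through tasks to sinks.
    sealed trait Node
    case class Source(name: String) extends Node
    case class Task(name: String)   extends Node
    case class Sink(name: String)   extends Node

    case class Topology(edges: List[(Node, Node)])

    object DagExample extends App {
      val src   = Source("sentences")
      val split = Task("split")
      val count = Task("count")
      val out   = Sink("stdout")

      // The task chain: source -> split -> count -> sink.
      val topology = Topology(List(src -> split, split -> count, count -> out))

      topology.edges.foreach { case (a, b) => println(s"$a -> $b") }
    }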
When choosing a stream processing system, there are a few points to consider:
Runtime and programming model: the programming model offered by a platform determines many of its capabilities, and it should be expressive enough to handle a variety of application scenarios. This is a fairly important point, and we will return to it below.
Functional primitives: a streaming platform should provide rich functional primitives, such as map or filter, which process a single record and scale easily, as well as harder-to-scale operations such as aggregations over multiple records and cross-stream joins.
State management: most applications need to maintain state in their processing logic, so the platform should provide the ability to store, access, and update state information.
Message delivery guarantees: there are generally three: at most once, at least once, and exactly once. With at most once, each message is delivered zero times or once, so messages may be lost. With at least once, delivery of each message is retried until it succeeds at least once, so messages may be duplicated but never lost. With exactly once, each message is delivered exactly once: neither lost nor duplicated. (A small sketch after this list illustrates the difference.)
Fault tolerance: failures can occur at various levels of a stream processing framework, such as network partitions, disk crashes, or node outages. The framework should be able to recover from all such failures and resume from the last successful state without producing dirty data.
Performance: latency, throughput, and scalability are the most important metrics of a stream processing application.
Platform maturity and adoption: a mature streaming framework offers better odds of support, available libraries, and community Q&A help. Choosing the right platform helps greatly here.
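To illustrate the difference between the delivery guarantees, here is a small self-contained Scala sketch (not tied to any framework): it simulates at-least-once delivery and shows how deduplicating on a unique message ID makes the processing effectively exactly-once.

    // Illustrative sketch: with at-least-once delivery a consumer may
    // see the same message twice; tracking processed message IDs makes
    // the handling effectively exactly-once.
    case class Message(id: Long, payload: String)

    object DeliveryGuarantees extends App {
      // Simulated at-least-once stream: message 2 is delivered twice.
      val delivered = Seq(Message(1, "a"), Message(2, "b"), Message(2, "b"), Message(3, "c"))

      var seen = Set.empty[Long]
      var processedCount = 0

      for (m <- delivered) {
        if (!seen.contains(m.id)) { // deduplicate on a unique message ID
          seen += m.id
          processedCount += 1       // the actual (non-idempotent) side effect
        }
      }

      println(s"delivered=${delivered.size}, processed=$processedCount") // 4 vs 3
    }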
Runtime and programming model. The runtime and programming model are the most important traits of a system, because they define its expressiveness, the operations that are possible, and its future limitations. They therefore determine a system's capabilities and the scenarios it is suited for.
There are two fundamentally different ways to build a streaming system. The first is called native stream processing: every incoming record is processed as soon as it arrives, one record after another.
The second is called micro-batch processing: the input data is divided into small batches at predefined time intervals (typically a few seconds), which then flow through the stream processing system.
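A minimal sketch of the micro-batch idea, with invented timestamps: records are grouped by a fixed interval, and each group is then processed as one small batch, the way a micro-batch runtime would.

    // Illustrative only: divide a timestamped record stream into
    // micro-batches of a fixed interval (here 1000 ms), then process
    // each batch as a unit.
    case class Record(timestampMs: Long, value: String)

    object MicroBatching extends App {
      val records = Seq(
        Record(100, "a"), Record(900, "b"),   // batch 0
        Record(1200, "c"),                    // batch 1
        Record(2500, "d"), Record(2900, "e")) // batch 2

      val intervalMs = 1000L
      val batches = records.groupBy(_.timestampMs / intervalMs)

      for ((batchId, batch) <- batches.toSeq.sortBy(_._1))
        println(s"batch $batchId -> ${batch.map(_.value).mkString(",")}")
    }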
Both approaches have innate advantages and disadvantages. Let's begin with native stream processing. Its advantage is expressiveness: because data is processed immediately upon arrival, these systems achieve much lower latency than micro-batch systems. Besides latency, stateful operations are also easier to implement in native stream processing, as explained in detail later.
A generic native stream processing system pays a high cost for its low latency and fault tolerance, because it must account for every single record. Load balancing is also a problem: for example, if the data being processed is partitioned by key and one key is resource intensive, its partition easily becomes a bottleneck for the job.
Next, micro-batch processing. Decomposing stream computation into a series of short batch jobs inevitably weakens the system's expressiveness. Operations such as state management or joins become harder to implement, because the micro-batch system must operate on entire batches at a time. Moreover, the batch interval couples two things that should not be coupled: a property of the infrastructure and the business logic.
Conversely, fault tolerance and load balancing are very simple to implement in a micro-batch system: the system just sends each batch to a worker node, and if something goes wrong it uses a different replica. Micro-batch processing is also easy to build on top of a native stream processing system.
Programming models generally fall into two types: compositional and declarative. A compositional API provides basic building blocks that must be wired together to create a topology; new components are typically defined by implementing an interface. In contrast, a declarative API defines operations as higher-order functions: we write functional code over abstract types, and the system creates and optimizes the topology for us. Declarative APIs often also provide more advanced operations such as windowing or state management. Sample code will be given shortly.
Mainstream stream processing systems. The range of streaming framework implementations is too large to enumerate, so we select only mainstream solutions that offer a Scala API: Apache Storm, Trident, Spark Streaming, Samza, and Apache Flink are covered in detail. While all of these are stream processing systems, the approaches they take differ considerably. We will not cover proprietary systems such as Google MillWheel or Amazon Kinesis, nor the less widely used Intel GearPump or Apache Apex.
Apache Storm was first developed in 2010 by Nathan Marz and his team at the data analytics company BackType. BackType was later acquired by Twitter, which open-sourced Storm; it became an Apache top-level project in 2014. Storm was undoubtedly the pioneer of large-scale stream processing and has become a de facto industry standard. Storm is a native stream processing system that provides a low-level API. It uses Thrift to define topologies and supports a multi-language protocol, so we can develop in most programming languages, Scala naturally included.
Trident is a higher-level abstraction over Storm, and its biggest feature is processing the stream in the form of batches. Trident simplifies topology building and adds advanced operations such as windowing, aggregations, and state management, which Storm itself does not support. Compared to Storm's at-least-once delivery, Trident provides an exactly-once delivery mechanism. Trident supports Java, Clojure, and Scala.
Spark is currently a very popular batch processing framework whose ecosystem includes Spark SQL, MLlib, and Spark Streaming. Spark's runtime is built for batch processing, so Spark Streaming, added later, implements micro-batching on top of it: a receiver divides the input stream into short batches, which are then processed much like Spark jobs. Spark Streaming provides a high-level declarative API (with support for Scala, Java, and Python).
Samza was initially developed as LinkedIn's stream processing solution and, together with Kafka (which LinkedIn also contributed to the community), forms a key part of its infrastructure. Samza is built heavily on Kafka's log-based philosophy, and the two are tightly coupled. Samza provides a compositional API and of course supports Scala.
Finally, Apache Flink. Flink started quite early as a research project in 2008, but gained attention only recently. Flink is a native stream processing system that provides a high-level API. Like Spark, Flink also provides a batch processing API, but the two rest on completely different foundations: Flink treats batch processing as a special case of stream processing. In Flink all data is viewed as a stream, which is a nicer abstraction because it is closer to the real world.
After this quick introduction to the stream processing systems, the table below summarizes their differences:
Word count. WordCount is to stream processing frameworks what Hello World is to programming languages: a good way to show the differences between them. Let's start with Storm and see how to implement WordCount:
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new Split(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    ...
    Map<String, Integer> counts = new HashMap<String, Integer>();

    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.containsKey(word) ? counts.get(word) + 1 : 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

First, we define the topology: the second line defines a spout as the data source; then comes a bolt that splits the text into words; the next bolt counts the words. The magic numbers 5, 8, and 12 are parallelism hints, the number of independent threads each component runs with across the cluster. The execute method is the actual WordCount bolt implementation; because Storm has no built-in state management, it keeps the counts in a local map.
As described earlier, Trident is a higher-level abstraction over Storm whose biggest feature is processing the stream as batches. Among its other advantages, Trident provides state management, which is useful for the WordCount implementation:
    public static StormTopology buildTopology(LocalDRPC drpc) {
        FixedBatchSpout spout = ...;
        TridentTopology topology = new TridentTopology();
        TridentState wordCounts = topology.newStream("spout1", spout)
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            .groupBy(new Fields("word"))
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        ...
    }

As you can see, the code uses higher-level operations such as each and groupBy, and uses Trident-managed state (persistentAggregate) to store the word counts.
Now it is Apache Spark's turn, with its declarative API. Compared to the previous examples, the code is remarkably simple, with almost no boilerplate. Here is the word count as a simple streaming computation:
    val conf = new SparkConf().setAppName("wordcount")
    val ssc = new StreamingContext(conf, Seconds(1))
    val text = ...
    val counts = text.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()

Every Spark Streaming job has a StreamingContext, the entry point to the streaming functionality. It loads the SparkConf defined on the first line and, more importantly, its second argument sets the batch interval (here 1 second). The flatMap/map/reduceByKey chain is the entire word count: standard functional code, for which Spark defines the topology and its distributed execution. The last two lines start the computation; remember that once a Spark Streaming job is started, its topology can no longer be modified. Next, Apache Samza, another example of a compositional API:
    class WordCountTask extends StreamTask {
      override def process(envelope: IncomingMessageEnvelope,
                           collector: MessageCollector,
                           coordinator: TaskCoordinator) {
        val text = envelope.getMessage.asInstanceOf[String]
        val counts = text.split(" ").foldLeft(Map.empty[String, Int]) {
          (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
        }
        collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "wordcount"), counts))
      }
    }

In Samza, the topology is defined in a properties file (omitted here for brevity), which specifies the task's inputs and outputs as Kafka topics. For word counting, the whole topology is WordCountTask. A component is defined by implementing the StreamTask interface and overriding its process method, whose parameter list provides everything needed to connect to other systems. The foldLeft over the split text is the computation itself: plain Scala code.
The Flink API is strikingly similar to Spark Streaming's, but notice that no batch interval is set in the code:
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements(...)
    val counts = text.flatMap(_.split(" "))
                     .map((_, 1))
                     .groupBy(0)
                     .sum(1)
    counts.print()
    env.execute("word count")

The code is quite straightforward: just a few function calls, and Flink takes care of the distributed computation.
Fault tolerance. Fault tolerance is inherently harder to implement in a stream processing system than in a batch system: when an error occurs in a batch job, we can simply restart the failed part, but a streaming job is hard to recover because many such jobs run 24/7 with data constantly arriving. Another challenge is state consistency: after a restart, replayed records appear, and not all state operations are idempotent. Since fault tolerance is this hard to achieve, let's look at how each framework handles it.
Apache Storm: Storm uses upstream record backup and message acknowledgements to ensure that messages are reprocessed after a failure. The acknowledgement principle: each operator acknowledges to its upstream that it has processed a message; the topology's data source keeps a backup of every record it emits, and the backup is safely discarded only once acknowledgements for all downstream processing of that record have been received. After a failure, if not all acknowledgements were received, the record is replayed from the data source. This guarantees no data loss, but results can be duplicated, i.e., at-least-once delivery.
Storm completes this fault tolerance with a clever trick: tracking the acknowledgements for each source record requires only a few bytes of storage. This pure record-acknowledgement architecture performs well, but it cannot guarantee exactly-once delivery, so application developers have to handle duplicate data themselves. Storm also suffers from low throughput and flow-control problems, because under backpressure the acknowledgement mechanism often misclassifies messages as failed.
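The trick usually attributed to Storm is an XOR-based acker: every tuple in a tuple tree gets a random 64-bit ID, and the acker keeps a single 64-bit value per source record, XOR-ing each ID in once on emit and once on ack; the value returns to zero exactly when everything emitted has been acknowledged. A simplified illustration (not Storm's actual code):

    import scala.util.Random

    // Simplified illustration of Storm-style XOR acking: one 64-bit
    // value per source tuple, regardless of how many downstream tuples
    // the tree contains.
    object XorAcker extends App {
      var ackVal = 0L

      def emit(tupleId: Long): Unit = ackVal ^= tupleId // tuple enters the tree
      def ack(tupleId: Long): Unit  = ackVal ^= tupleId // tuple fully processed

      val ids = Seq.fill(5)(Random.nextLong())
      ids.foreach(emit)
      ids.foreach(ack)

      // XOR-ing every ID in twice cancels out: zero means the whole
      // tuple tree has been acknowledged.
      println(s"tree complete: ${ackVal == 0L}")
    }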
Spark Streaming: Spark Streaming implements micro-batch processing, and its fault tolerance mechanism differs from Storm's. The idea behind micro-batching is quite simple: Spark processes micro-batches on the worker nodes of the cluster, and whenever a micro-batch fails it is simply recomputed. Because micro-batches are immutable and each one is persisted, exactly-once delivery is easy to achieve.
Samza: Samza's approach is completely different from the previous two frameworks. It relies on the durability and offsets of the Kafka messaging system. Samza monitors each task's offset, advancing it as the task processes messages; offsets are checkpointed to persistent storage and restored on failure. The problem is that when recovering from the last checkpointed offset, it is not known which upstream messages after that offset were already processed, so they may be processed again. This is at-least-once delivery.
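A rough sketch of why checkpointed offsets yield at-least-once delivery (invented code, not Samza's API): after a crash the task resumes from the last checkpointed offset, so any messages processed after that checkpoint are replayed.

    // Illustrative only: messages between the last checkpoint and the
    // crash are replayed on recovery, hence at-least-once delivery.
    object OffsetRecovery extends App {
      val log = Vector("m0", "m1", "m2", "m3", "m4") // a Kafka-like partition

      var checkpointedOffset = 0
      var processed = Vector.empty[String]

      def process(from: Int, until: Int, checkpointEvery: Int): Int = {
        var offset = from
        while (offset < until) {
          processed :+= log(offset)
          offset += 1
          if (offset % checkpointEvery == 0) checkpointedOffset = offset
        }
        offset
      }

      process(from = 0, until = 3, checkpointEvery = 2) // crash after m2; checkpoint is at offset 2
      // Restart resumes from the checkpoint, so m2 is processed twice.
      process(from = checkpointedOffset, until = log.size, checkpointEvery = 2)

      println(processed.mkString(",")) // m0,m1,m2,m2,m3,m4
    }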
Apache Flink: Flink's fault tolerance is based on distributed snapshots, which save the state of the stream processing job (this article does not distinguish between Flink checkpoints and snapshots; they are two names for the same mechanism). The mechanism Flink uses to build these snapshots can be described as lightweight asynchronous snapshots of distributed dataflows, built on the Chandy-Lamport algorithm.
In the event of a failure, the system recovers from these checkpoints. Flink injects checkpoint barriers into the data stream (the barrier is the core element of Flink's distributed snapshot mechanism). When the barrier for checkpoint n reaches an operator, the operator aligns the corresponding barriers from all of its input streams: the records that arrived before barrier n belong to snapshot n, and those after it to the next snapshot.
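A heavily simplified sketch of the barrier idea, with invented types and none of Flink's actual implementation: an operator with two input streams processes the records that precede the barriers, and takes a snapshot of its state once the barrier for the same checkpoint has arrived on both inputs.

    // Heavily simplified barrier alignment for one operator with two
    // input streams; not Flink's actual implementation.
    sealed trait Event
    case class Rec(value: Int) extends Event
    case class Barrier(snapshotId: Int) extends Event

    object BarrierAlignment extends App {
      var state = 0                       // operator state: running sum
      var snapshots = Map.empty[Int, Int] // snapshotId -> state copy

      def processAligned(in1: List[Event], in2: List[Event]): Unit = {
        val (pre1, post1) = in1.span { case Barrier(_) => false; case _ => true }
        val (pre2, post2) = in2.span { case Barrier(_) => false; case _ => true }

        // Records before the barrier belong to this snapshot.
        (pre1 ++ pre2).foreach {
          case Rec(v) => state += v
          case _      => ()
        }

        (post1.headOption, post2.headOption) match {
          case (Some(Barrier(n)), Some(Barrier(_))) =>
            snapshots += n -> state // barriers aligned on all inputs: snapshot the state
          case _ => ()
        }
      }

      processAligned(
        List(Rec(1), Rec(2), Barrier(1)),
        List(Rec(3), Barrier(1)))

      println(s"state=$state snapshots=$snapshots") // state=6, snapshot 1 -> 6
    }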
Relative to Storm, Flink's fault tolerance mechanism is therefore more efficient, because it operates on small groups of records rather than acknowledging each record individually. But don't be confused: Flink is still a native stream processing framework, conceptually completely different from Spark Streaming. Flink also provides an exactly-once delivery guarantee.
State management. Most large stream processing applications involve state. In contrast to a stateless operation (one input record in, processing, output out), a stateful operation takes an input record together with state information, processes them, and produces output along with modified state.
We therefore have to manage state information and persist it, and we expect the state to be restored after a failure. Restoration may not be perfect: exactly-once is not always guaranteed, and state updates are sometimes applied multiple times, which is not what we want.
As we know, Storm provides at-least-once delivery. So how can Trident achieve exactly-once semantics on top of it? Conceptually it is simple: just commit each record transactionally. That is obviously inefficient, so Trident commits records in small batches, and the commit can be optimized. Trident defines several state abstractions to achieve exactly-once semantics, as shown in the diagram, though they come with some limitations.
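The idea behind Trident's transactional state can be sketched as follows (invented code, not Trident's API): each batch carries a transaction ID that is stored next to the value, so a replayed batch whose txid has already been applied is skipped rather than counted twice.

    // Illustrative txid-based exactly-once state: replaying a batch
    // with an already-applied transaction ID has no effect.
    object TransactionalState extends App {
      // word -> (lastAppliedTxId, count)
      var state = Map.empty[String, (Long, Int)]

      def applyBatch(txId: Long, words: Seq[String]): Unit =
        for ((word, n) <- words.groupBy(identity).map { case (w, ws) => (w, ws.size) }) {
          val (lastTx, count) = state.getOrElse(word, (-1L, 0))
          if (txId > lastTx) // only apply batches newer than the stored txid
            state += word -> (txId, count + n)
        }

      applyBatch(1, Seq("a", "b", "a"))
      applyBatch(1, Seq("a", "b", "a")) // replayed batch: ignored
      applyBatch(2, Seq("a"))

      println(state) // Map(a -> (2,3), b -> (1,1))
    }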
Spark Streaming is a micro-batch processing system that treats state information as another micro-batched stream. When processing each micro-batch, Spark loads the current state, applies the functional operations to produce the batch's results, and writes out the modified state.
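In Spark Streaming this pattern is exposed through the updateStateByKey operation (which requires checkpointing to be enabled). A hedged sketch of a stateful word count; the socket source and checkpoint path are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch of stateful word counting in Spark Streaming via
    // updateStateByKey; host, port, and checkpoint path are placeholders.
    object StatefulWordCount extends App {
      val conf = new SparkConf().setAppName("StatefulWordCount")
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint("/tmp/checkpoint") // required for stateful operations

      // On each batch, merge the new counts for a word into its running total.
      def updateCount(newCounts: Seq[Int], running: Option[Int]): Option[Int] =
        Some(running.getOrElse(0) + newCounts.sum)

      val text = ssc.socketTextStream("localhost", 9999) // placeholder source
      val counts = text.flatMap(_.split(" "))
        .map(word => (word, 1))
        .updateStateByKey(updateCount)

      counts.print()
      ssc.start()
      ssc.awaitTermination()
    }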
Samza implements state management through Kafka. Samza has true stateful operators: a task can hold state, and a changelog of state updates is pushed to Kafka. If the state needs rebuilding, it can easily be reconstructed from the Kafka topic. To make state access faster, Samza also supports keeping state in a local key-value store, so state does not have to be managed in Kafka alone, as shown in the figure. Unfortunately, Samza only provides at-least-once semantics; exactly-once support is on the roadmap.
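A hedged sketch of what using Samza's local key-value state looks like in the classic low-level API; the store name "word-counts" is an assumption that would have to match the job configuration:

    import org.apache.samza.config.Config
    import org.apache.samza.storage.kv.KeyValueStore
    import org.apache.samza.system.IncomingMessageEnvelope
    import org.apache.samza.task.{InitableTask, MessageCollector, StreamTask, TaskContext, TaskCoordinator}

    // Hedged sketch of Samza's local key-value state (classic low-level
    // API); the store name must match the job config (an assumption here).
    class StatefulWordCountTask extends StreamTask with InitableTask {
      private var store: KeyValueStore[String, java.lang.Integer] = _

      override def init(config: Config, context: TaskContext): Unit =
        store = context.getStore("word-counts").asInstanceOf[KeyValueStore[String, java.lang.Integer]]

      override def process(envelope: IncomingMessageEnvelope,
                           collector: MessageCollector,
                           coordinator: TaskCoordinator): Unit = {
        val text = envelope.getMessage.asInstanceOf[String]
        for (word <- text.split(" ")) {
          val current = Option(store.get(word)).map(_.intValue).getOrElse(0)
          store.put(word, current + 1) // Samza pushes the changelog to Kafka
        }
      }
    }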
Flink provides stateful operations, similar to Samza. There are two kinds of state: user-defined state and window state. As shown in the figure, the first is custom operator state that does not interact with other state; such state can be partitioned or backed by an embedded key-value store [documents 1 and 2]. Of course, Flink also provides exactly-once semantics. The figure below shows a long-running Flink job with three kinds of state.
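A hedged sketch of user-defined keyed state in Flink's Scala DataStream API, using the mapWithState shorthand that the 1.x API offered; host, port, and checkpoint interval are placeholders:

    import org.apache.flink.streaming.api.scala._

    // Hedged sketch of user-defined keyed state in Flink's Scala API
    // (mapWithState existed in the 1.x DataStream API); the socket
    // source host/port are placeholders.
    object FlinkStatefulWordCount extends App {
      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env.enableCheckpointing(5000) // snapshot the state every 5 s

      val text = env.socketTextStream("localhost", 9999) // placeholder source

      val counts = text
        .flatMap(_.split(" "))
        .map(word => (word, 1))
        .keyBy(_._1)
        .mapWithState[(String, Int), Int] { (in, state) =>
          val count = state.getOrElse(0) + in._2
          ((in._1, count), Some(count)) // emit the new total, keep it as state
        }

      counts.print()
      env.execute("stateful word count")
    }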
State management in the word count example. The complete word count code was shown earlier, so here we focus only on the state management part.
Let's look at Trident first:
    public static StormTopology buildTopology(LocalDRPC drpc) {
        FixedBatchSpout spout = ...;
        TridentTopology topology = new TridentTopology();
        TridentState wordCounts = topology.newStream("spout1", spout)
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            .groupBy(new Fields("word"))
            .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        ...
    }

We create the state by calling persistentAggregate, whose Count aggregator stores the word counts in that state. If you want to process data from the state, you have to create a stream from it; as the code shows, this is rather inconvenient to implement.