Distributed stream processing resembles a general-purpose computational model such as MapReduce, but it must respond at millisecond or second latency. These systems typically use DAGs (directed acyclic graphs) to represent the topology of the stream processing.
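As a minimal sketch of the idea, a topology can be represented as an adjacency list of operators and scheduled in dependency order. The operator names below are invented for illustration, not any framework's API.

```python
# A minimal sketch of a stream-processing topology as a DAG.
# Operator names (source, parse, filter, count, sink) are illustrative.
TOPOLOGY = {
    "source": ["parse"],
    "parse": ["filter", "count"],
    "filter": ["sink"],
    "count": ["sink"],
    "sink": [],
}

def topo_sort(graph):
    """Return operators in an order that respects the edges (DFS post-order)."""
    visited, order = set(), []
    def visit(node):
        if node not in visited:
            visited.add(node)
            for downstream in graph.get(node, []):
                visit(downstream)
            order.append(node)  # appended after children -> reversed topo order
    for node in graph:
        visit(node)
    return list(reversed(order))

print(" -> ".join(topo_sort(TOPOLOGY)))
```

A scheduler or optimizer can walk such a graph to place operators on workers; the frameworks discussed below all build some variant of this structure internally.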
Points of Interest
When comparing different systems, the following points are worth considering:
- Runtime and programming model
The programming model a platform provides largely determines its feature set, and it should be expressive enough to cover all likely use cases.
- Functional primitives
A capable processing platform should provide a rich set of functions that operate on individual records, such as map and filter, which are easy to implement and extend. It should also provide cross-record functions such as aggregations, and functions that operate across streams such as joins, although such operations are harder to scale.
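A minimal sketch of these three classes of primitives, using plain Python lists as stand-ins for unbounded streams (the records are made up for illustration):

```python
from collections import Counter

# Finite in-memory "streams"; real frameworks apply the same shapes
# to unbounded input.
clicks = [{"user": "a", "page": "/home"},
          {"user": "b", "page": "/buy"},
          {"user": "a", "page": "/buy"}]
users = {"a": "alice", "b": "bob"}

# 1) Per-record primitives: map and filter look at one record at a time.
buys = [c for c in clicks if c["page"] == "/buy"]   # filter
buy_users = [c["user"] for c in buys]               # map

# 2) Cross-record primitive: an aggregation needs state across records.
clicks_per_user = Counter(c["user"] for c in clicks)

# 3) Cross-stream primitive: a join matches records from two sources by key.
joined = [(users[u], clicks_per_user[u]) for u in clicks_per_user]

print(buy_users, dict(clicks_per_user), joined)
```

Per-record operations parallelize trivially; aggregations and joins are what force a platform to manage state and shuffle data, which is why they are the hard part to scale.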
- State management
- Message delivery guarantees
- At most once
- At least once
- Exactly once
In general, there are three message delivery semantics: at most once, at least once, and exactly once.
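The difference between the three semantics can be sketched with a simulated unreliable channel. `FlakyChannel` and its failure pattern are invented for illustration:

```python
# Sketch of the three delivery semantics over a simulated unreliable channel.
class FlakyChannel:
    """Cycles through three outcomes: ok, message lost, ack lost."""
    def __init__(self):
        self.calls = 0
        self.delivered = []          # what the receiver actually sees

    def send(self, msg):
        self.calls += 1
        mode = self.calls % 3
        if mode == 2:                # message lost in transit, no ack
            return False
        self.delivered.append(msg)   # message arrives...
        return mode == 1             # ...but the ack is lost when mode == 0

def at_most_once(channel, msgs):
    for m in msgs:                   # fire and forget: lost messages stay lost
        channel.send(m)

def at_least_once(channel, msgs):
    for m in msgs:                   # retry until acked: a lost ack causes a
        while not channel.send(m):   # resend, so the receiver sees duplicates
            pass

def exactly_once_receive(delivered):
    seen, out = set(), []            # exactly once = at least once + dedup
    for m in delivered:
        if m not in seen:
            seen.add(m)
            out.append(m)
    return out
```

Running `at_most_once` over `[1, 2, 3]` loses record 2, while `at_least_once` delivers `[1, 2, 2, 3, 3]`; deduplicating on the receiver restores exactly-once behavior. Real systems achieve the same effect with acknowledgements, replayable sources, and idempotent or transactional sinks.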
In a stream processing system, failures can occur at many levels: network partitions, disk errors, or a node that simply hangs. The platform should recover gracefully from such failures and resume processing from the last consistent state without corrupting the results.
In addition, we should consider the platform's ecosystem, the maturity of its community, and whether it is easy to develop against and to operate.
Runtime and Programming Model
The runtime and programming model determine a system's capabilities and the use cases it can support, since they define the expressive power of the system, the operations that can be supported, and its future limitations.
Currently, there are two main ways to build a stream processing system.
1) The first is called native streaming: every input record or event is processed individually, in the order in which it arrives.
Pros: low response latency.
Cons: lower throughput; fault tolerance is more costly because state must be persisted per record, and load balancing is a problem that cannot be ignored.
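Native streaming can be sketched as a loop that emits a result after every single input; the stateful running-count operator below is invented for illustration:

```python
# Sketch of native streaming: each record is handled the moment it arrives,
# so a result is available after every single input.
def native_stream(records, handle):
    emitted = []
    for record in records:        # one record in, one result out, in order
        emitted.append(handle(record))
    return emitted

state = {"count": 0}
def running_count(record):
    """Tiny stateful operator: emit the record with a running count."""
    state["count"] += 1
    return (record, state["count"])

print(native_stream(["a", "b", "c"], running_count))
# -> [('a', 1), ('b', 2), ('c', 3)]
```

Because `state` is updated on every record, a fault-tolerant system would have to checkpoint or log it at record granularity, which is exactly where the per-record overhead mentioned above comes from.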
2) The other approach is called micro-batching: short batches are formed from the input records at a preset time interval, usually every few seconds, and each batch is then processed by the system as a whole.
Pros: fault tolerance and load balancing are easier to implement.
Cons: some operations, such as state management and joins, are harder to implement, because the system has to process an entire batch at a time.
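The batching step itself can be sketched as grouping timestamped records by a fixed interval; the timestamps and the one-second interval are made up for the example:

```python
# Sketch of micro-batching: records are grouped by a fixed time interval and
# each batch is then processed as a whole.
def micro_batches(timestamped, interval):
    """Group (timestamp, record) pairs into batches of `interval` seconds."""
    batches = {}
    for ts, record in timestamped:
        batches.setdefault(int(ts // interval), []).append(record)
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.4, "c"), (3.1, "d")]
print(micro_batches(events, 1.0))   # -> [['a', 'b'], ['c'], ['d']]
```

Note that no result for `"a"` can be emitted before its batch closes at the one-second mark; that waiting time is the latency cost of micro-batching, while processing whole batches is what makes recovery and load balancing simpler.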
As far as the programming model is concerned, APIs can be divided into compositional and declarative.
1) A compositional API provides a set of basic building blocks, such as sources and operators, which developers wire together to form the desired topology. New components can often be created by extending a class or implementing an interface.
2) Operators in a declarative API, on the other hand, tend to be defined as higher-order functions. A declarative programming model lets us write functional code over abstract types while the system optimizes the topology graph as a whole. Declarative APIs also ship high-level, out-of-the-box operators such as windowing and state management.
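The contrast can be sketched with the same word-count task written in both styles. The `Node` class and its methods are invented for the sketch, not any real framework's API:

```python
from collections import Counter

# Compositional style: the developer wires explicit nodes into a topology.
class Node:
    """Illustrative topology node: applies fn and pushes results downstream."""
    def __init__(self, fn, downstream=None):
        self.fn, self.downstream = fn, downstream
    def push(self, item):
        for out in self.fn(item):
            if self.downstream:
                self.downstream.push(out)

counts = {}
sink  = Node(lambda w: counts.update({w: counts.get(w, 0) + 1}) or [])
split = Node(lambda line: line.split(), downstream=sink)
split.push("to be or not to be")

# Declarative style: the same logic as chained higher-order functions;
# the system is free to optimize the whole pipeline before running it.
lines = ["to be or not to be"]
declarative_counts = Counter(w for line in lines for w in line.split())

print(counts, dict(declarative_counts))
```

Both produce the same counts; the difference is who owns the wiring. In the compositional version the developer builds and connects each node, while in the declarative version the pipeline is a single expression the engine can analyze and optimize.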
Apache Streaming Landscape
There is now a wide variety of stream processing frameworks, and this article naturally cannot cover them all. I therefore limit the discussion to streaming frameworks under the Apache umbrella that provide a Scala interface: Storm and its higher-level refinement Trident, the currently very popular Spark, Samza from LinkedIn, and the promising Apache Flink. I think this is a good selection, because although all of these frameworks operate in the stream processing space, their implementation approaches vary widely.
Apache Storm was originally created by Nathan Marz and his team at BackType in 2010. It was later acquired by Twitter and open-sourced, and it became a top-level Apache project in 2014. Storm is without doubt a pioneer of large-scale stream processing and has become a de facto industry standard. Storm is a typical native streaming system and provides a large number of low-level interfaces. In addition, Storm uses Thrift to define topologies, and so provides bindings for a number of other languages.
Trident is a micro-batching layer built on top of Storm. It simplifies topology building and adds features that Storm primitives do not support, such as windowing, aggregations, and state management. Whereas core Storm provides an at-least-once delivery guarantee, Trident achieves exactly-once semantics. Trident offers Java, Clojure, and Scala interfaces.
As we all know, Spark is a very popular batch processing framework with built-in libraries such as Spark SQL and MLlib, and it also offers a stream processing framework, Spark Streaming. Spark's runtime is built for batch processing, so Spark Streaming naturally implements micro-batching: the input stream is cut into micro-batches by a receiver, and each batch is then processed like any other Spark job. Spark provides Java, Python, and Scala interfaces.
Samza was created at LinkedIn as a streaming solution designed to work hand in hand with Kafka, and it is already a key piece of LinkedIn's infrastructure. Samza relies heavily on Kafka's log-based mechanism, and the two combine very well. Samza offers a compositional interface and also supports Scala.
Finally, Flink. Flink is actually a rather old project, first launched in 2008, but it is attracting more and more attention. Flink is also a native streaming system and offers a large number of high-level APIs. Like Spark, Flink provides batch processing as well, but it treats batch as a special case of stream processing. Flink's stance that everything is a stream is arguably the better abstraction, since that is what the data really looks like.
The following table summarizes the features of the frameworks above:
Reference
Apache Stream Processing Framework Comparison