https://www.iteblog.com/archives/1624.html
Do we really need yet another data processing engine? I was very skeptical when I first heard of Flink. The big data field has no shortage of data processing frameworks, yet no single framework fully satisfies every processing requirement. Since the advent of Apache Spark, Spark seems to have become the best framework for solving most of today's problems, so I was strongly skeptical of another framework that solves similar problems.
But out of curiosity, I spent a few weeks trying to understand Flink. At first, looking closely at some of Flink's examples, it felt very much like Spark, and I was inclined to think of Flink as a framework that imitates Spark. As my understanding deepened, however, I found that the APIs reflect some new ideas in Flink, ideas that differ from Spark in noticeable ways. I was fascinated by them, so I spent more time on Flink.
Many of the ideas in Flink, such as its approach to memory management, can also be found in Spark, where they have proven very reliable. So an in-depth look at Flink may help us see where distributed data processing is heading.
In what follows, I'll describe my first impressions of Flink as a Spark developer. I have been working with Spark for more than two years but have only been in contact with Flink for two or three weeks, so there is bound to be some bias; please read this article with a skeptical and critical eye.
Article listing
1 What is Apache Flink
2 Apache Spark vs Apache Flink
  2.1 1. Abstraction
  2.2 2. Memory management
  2.3 3. Language implementation
  2.4 4. API
  2.5 5. Streaming
  2.6 6. SQL interface
  2.7 7. Integration with external data sources
  2.8 8. Iterative processing
  2.9 9. Stream as platform vs Batch as platform
3 Conclusion

What is Apache Flink
Flink is a new big data processing engine whose goal is to unify the processing of data from different sources. That goal sounds very similar to Spark's. Yes, Flink is also trying to solve the same problems Spark is solving: both systems aim to build a unified platform that can run batch, streaming, interactive, graph processing, machine learning, and other applications. So Flink and Spark do not differ much in their goals; the main differences lie in the implementation details, which I'll examine from several angles below.

Apache Spark vs Apache Flink

1. Abstraction
In Spark, we have an RDD for batch processing and a DStream for streaming, but a DStream is internally just a sequence of RDDs, so all data is ultimately represented through the RDD abstraction. In Flink, there is a DataSet for batch processing and a DataStream for streaming. That looks a lot like Spark, but there are differences:
(i) A DataSet is represented at runtime as a runtime plan
In Spark, an RDD is represented at run time as Java objects (the introduction of Tungsten changes this slightly). In Flink, however, a DataSet is represented as a logical plan. That should sound familiar: yes, it is similar to DataFrames in Spark. So in Flink, the DataFrame-like, optimizer-backed API is what you use by default. This kind of optimization does not exist for Spark's RDD.
Flink's DataSet, like Spark's DataFrame, is optimized before it runs. In Spark 1.6, the Dataset API was introduced and may eventually replace the RDD abstraction.
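To make the "optimized before it runs" point concrete, here is a minimal sketch of my own (not from the original article; the input file people.json is an illustrative assumption) using the Spark 1.6-era DataFrame API. Nothing executes until an action is called; explain(true) only prints the plans the optimizer has built:

// Sketch: DataFrame operations build a logical plan that is optimized
// before execution, unlike plain RDD operations.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PlanDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("plan-demo"))
    val sqlContext = new SQLContext(sc)
    // Hypothetical input: any JSON file with "name" and "age" fields.
    val df = sqlContext.read.json("people.json")
    // No job runs here: filter/select only extend the logical plan.
    val adults = df.filter("age > 21").select("name")
    // Prints the parsed, analyzed, optimized, and physical plans.
    adults.explain(true)
    sc.stop()
  }
}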
(ii) DataSet and DataStream are independent APIs
In Spark, all the different APIs, such as DStream and DataFrame, are built on the RDD abstraction. In Flink, however, DataSet and DataStream are two independent abstractions on top of the same common engine, so you cannot combine the two. The Flink community is currently working in this direction (https://issues.apache.org/jira/browse/Flink-2320), but it is hard to predict the final outcome; a small sketch of the separation follows.
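As an illustration of that separation (my own sketch, not the article's code), the two abstractions come from different entry points and have unrelated types, so a DataSet cannot be mixed with a DataStream:

// Sketch: Flink's classic batch and streaming APIs live in separate worlds.
object TwoWorlds {
  // Batch world: ExecutionEnvironment produces DataSet values.
  def batchJob(): Unit = {
    import org.apache.flink.api.scala._
    val env = ExecutionEnvironment.getExecutionEnvironment
    val words: DataSet[String] = env.fromElements("hi", "how are you", "hi")
    words.print()
  }

  // Streaming world: StreamExecutionEnvironment produces DataStream values,
  // a completely separate type; uniting a DataSet with a DataStream
  // simply does not compile.
  def streamJob(): Unit = {
    import org.apache.flink.streaming.api.scala._
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val words: DataStream[String] = env.fromElements("hi", "how are you", "hi")
    words.print()
    env.execute("stream-demo")
  }

  def main(args: Array[String]): Unit = {
    batchJob()
    streamJob()
  }
}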
2. Memory management

Until version 1.5, Spark relied on the JVM's own memory management for data caching, which could easily lead to OOM errors or GC pauses. Starting from 1.5, Spark began to shift to precise control over memory use; this is the Tungsten project.
Flink, from day one, insisted on managing its own memory, which is one of the reasons Spark later took the same route. Besides managing its own memory, Flink also operates directly on binary data. In Spark, starting with 1.5, all DataFrame operations work directly on Tungsten's binary data.
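For reference, here is a small sketch (my addition, not from the article) of how Tungsten-era Spark exposes explicitly managed, off-heap memory; the configuration keys exist in Spark 1.6+, while the size shown is an illustrative assumption:

// Sketch: opting in to Spark's explicitly managed off-heap memory.
import org.apache.spark.{SparkConf, SparkContext}

object OffHeapDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local").setAppName("offheap-demo")
      .set("spark.memory.offHeap.enabled", "true") // manage memory outside the JVM heap
      .set("spark.memory.offHeap.size", "2g")      // illustrative size, tune per workload
    val sc = new SparkContext(conf)
    // ... DataFrame jobs here operate on Tungsten's binary representation ...
    sc.stop()
  }
}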
3. Language implementation

Spark is implemented in Scala and provides programming interfaces for Java, Python, and R. Flink is implemented in Java and, of course, also provides a Scala API.
So from the language point of view, Spark's options are richer. Since I moved to Scala a long time ago, I can't say much about the Java API implementations of either system.

4. API
Both Spark and Flink mimic Scala's collection API, so on the surface the two look alike. Below is WordCount implemented with Spark's RDD API and with Flink's DataSet API, respectively.
Spark WordCount:

import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = new SparkContext("local", "wordCount")
    val data = List("hi", "how are you", "hi")
    val dataSet = env.parallelize(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val sum = mappedWords.reduceByKey(_ + _)
    sum.collect().foreach(println)
  }
}
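The section breaks off after the Spark version; for comparison, a minimal Flink counterpart with the classic DataSet API might look like the following (my reconstruction, hedged: a sketch, not the article's original listing):

// Flink WordCount with the classic batch DataSet API.
import org.apache.flink.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = List("hi", "how are you", "hi")
    val dataSet = env.fromCollection(data)
    val words = dataSet.flatMap(value => value.split("\\s+"))
    val mappedWords = words.map(value => (value, 1))
    val sum = mappedWords.groupBy(0).sum(1) // group on the word field, sum the counts
    sum.print() // print() triggers execution in the DataSet API
  }
}

Note how close the two listings are: both are flatMap/map pipelines in Scala, and the visible difference is only the grouping step (reduceByKey vs groupBy(0).sum(1)) and the entry point.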