About Apache Flink
Apache Flink is a scalable, open-source platform for batch and stream processing. Its core is a dataflow engine that provides data distribution, communication, and fault tolerance for distributed stream data processing, as shown in the following architecture diagram:
The engine contains the following APIs:
1. DataSet API for static data, embedded in Java, Scala, and Python
2. DataStream API for unbounded streams, embedded in Java and Scala, and
3. Table API with a SQL-like expression language, embedded in Java and Scala.
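To give a feel for the style of these APIs, the following plain-Scala sketch shows the kind of aggregation a SQL-like Table API expression describes. It uses only standard Scala collections; the Row type and sample data are made up for illustration, and no Flink calls appear here.

```scala
// Plain-Scala sketch of the aggregation a Table API expression such as
// table.groupBy("word").select("word, frequency.sum") describes.
// Row and the sample data are hypothetical; no Flink APIs are used.
case class Row(word: String, frequency: Int)

val rows = Seq(Row("to", 1), Row("be", 1), Row("to", 1))

// Group by the word column and sum the frequency column.
val aggregated: Map[String, Int] =
  rows.groupBy(_.word).map { case (w, rs) => (w, rs.map(_.frequency).sum) }
```

In Flink itself the same expression runs distributed over a table backed by a DataSet or DataStream rather than an in-memory collection.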
Flink also ships with several libraries built on top of these APIs:
1. FlinkML, a machine learning library
2. Gelly, a graph processing API and library
Flink System Overview
Flink offers data processing APIs in Java and Scala, an optimizing distributed runtime, and custom memory management.
Flink Features
1. Fast. Flink uses in-memory data flow and integrates iterative processing into the runtime, which makes it very fast for data-intensive and iterative computations.
2. Highly reliable and flexible. Flink contains its own memory management, serialization, and type inference components.
3. Elegant and expressive API design, as the following Scala samples show.
WordCount Scala Sample
case class Word(word: String, frequency: Int)

val counts = text.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word")
  .sum("frequency")
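The sample above assumes an input `text: DataSet[String]`. As a rough illustration of what it computes, here is the same logic over plain Scala collections, with made-up input lines and no Flink runtime involved:

```scala
// Plain-collections sketch of the WordCount logic above;
// `lines` stands in for the text DataSet and is made-up data.
case class Word(word: String, frequency: Int)

val lines = Seq("to be or", "not to be")

val counts: Seq[Word] = lines
  .flatMap(line => line.split(" ").map(w => Word(w, 1)))  // emit (word, 1) pairs
  .groupBy(_.word)                                        // group by the word field
  .map { case (w, ws) => Word(w, ws.map(_.frequency).sum) } // sum frequencies per group
  .toSeq
```

The Flink version expresses the same pipeline, but executes it in parallel across the cluster.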
Transitive Closure Scala Sample
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges).where("to").equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}
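What the iteration above computes is the transitive closure of an edge set: in each round, existing paths are extended by one edge, unioned with the old paths, and deduplicated. A plain-Scala fixed-point sketch of the same idea, with made-up edge data and no Flink APIs:

```scala
// Plain-Scala sketch of the transitive-closure iteration above;
// the edge data is made up and no Flink APIs are involved.
case class Path(from: Long, to: Long)

val edges = Set(Path(1L, 2L), Path(2L, 3L), Path(3L, 4L))

// One bulk-iteration step: join current paths with edges, union, dedupe.
def step(paths: Set[Path]): Set[Path] =
  paths ++ (for {
    p <- paths
    e <- edges
    if p.to == e.from
  } yield Path(p.from, e.to))

// Ten iterations, mirroring edges.iterate(10); a Set dedupes like distinct()
val tc = Iterator.iterate(edges)(step).drop(10).next()
```

For this tiny graph the closure converges after two rounds; Flink's `iterate` runs the step as a distributed bulk iteration instead of a local loop.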
4. Compatible with Hadoop; it can run on YARN.
Reference
Apache Flink