Apache Flink: Very reliable, not one bit off

Apache Flink's background
At a high level of abstraction, we can summarize the types of data sets encountered in data processing today and the execution models available for processing them. The two are often confused, but they are in fact different concepts.
Types of data sets
The data sets encountered in data processing today fall into two categories: ① unbounded: infinite data sets, which in practice take the form of fast, continuously arriving streams of data; ② bounded: finite data sets, which are usually immutable, that is, the data set is not updated after it has been produced.
Traditional data processing frameworks tend to abstract real-world data into finite data sets, i.e. batches, but real-world data is in fact mostly unbounded. Here are some examples of unbounded data sets:
Data generated by end users interacting with mobile apps or web apps
Real-time measurements transmitted by physical sensors
Real-time data from financial markets
Machine logs
Types of data processing models
Processing models likewise fall into two categories: ① streaming: data is processed continuously, as it is produced; ② batch: a batch of data is processed within a finite amount of time, and the computing resources are released once processing ends.
Although a mismatched pairing is rarely satisfactory, any processing model can in principle be applied to any type of data set. For example, the batch model has long been used to process unbounded data sets, despite the problems this creates for windowing, state management, and out-of-order data handling.
Flink is built on the streaming model: it processes data continuously as it is produced. This consistency between the data set type and the processing model ensures both accurate and efficient processing.
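To make the streaming model concrete, here is a minimal sketch of a Flink DataStream job in Java that keeps a running word count over an unbounded source; the socket host and port are placeholders assumed for this illustration, not something taken from the original text.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // An unbounded source: lines of text arriving over a socket (host/port are placeholders).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                  @Override
                  public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                      for (String word : line.split("\\s+")) {
                          out.collect(Tuple2.of(word, 1));
                      }
                  }
              })
              .keyBy(t -> t.f0)   // group by word
              .sum(1)             // running count, updated as new data arrives
              .print();

        // The job runs indefinitely, processing records as they are produced.
        env.execute("Streaming word count");
    }
}
```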
The streaming genes of Apache Flink
Flink is an open-source framework for distributed stream processing. ① It guarantees correct results even when data arrives out of order or late. ② It is stateful and fault-tolerant, so it can recover seamlessly from failures while guaranteeing exactly-once application state. ③ It performs well at large scale, with high throughput and low latency on clusters of thousands of nodes.
The benefit of keeping the data set type and the processing model consistent has already been mentioned; the Flink features described below, including state management, out-of-order data handling, and flexible windowing, are all designed and optimized for accurate computation on unbounded data sets.
Exactly-once semantics
Flink provides exactly-once guarantees for stateful computations. Stateful means that an application can maintain an aggregation or summary of the data it has already processed, and Flink's checkpointing mechanism ensures that this application state is restored with exactly-once semantics in the event of a failure.
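A minimal sketch of turning on checkpointing for a job; the 10-second interval is an assumption for illustration, and EXACTLY_ONCE is stated explicitly even though it is the default mode.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all operator state every 10 seconds (interval is illustrative).
        env.enableCheckpointing(10_000L);

        // EXACTLY_ONCE is the default mode; shown explicitly here for clarity.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // ... define sources, transformations, and sinks here, then:
        // env.execute("Checkpointed job");
    }
}
```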
Event Time Semantics
Flink supports event-time semantics for stream processing and windowing. Event time makes it much easier to compute exact results over streams whose events arrive out of order or late.
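As a sketch of how event time is attached to a stream, the snippet below assigns timestamps and watermarks with a bounded out-of-orderness of five seconds; the Event class, its timestampMillis field, and the five-second bound are assumptions made for this example.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class EventTimeSetup {
    // Hypothetical event type used only for this illustration.
    public static class Event {
        public String key;
        public long timestampMillis;
    }

    public static DataStream<Event> withEventTime(DataStream<Event> events) {
        // Declare that events may arrive up to 5 seconds out of order;
        // watermarks then advance with event time rather than arrival time.
        return events.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, recordTimestamp) -> event.timestampMillis));
    }
}
```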
Flexible Windowing
In addition to data-driven windows, Flink supports time-based, count-based, and session windows, and window triggers can be customized to support complex streaming patterns. Flink's windowing makes it possible to model the real-world conditions under which the data was created.
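The sketch below shows the three window types mentioned above on a keyed stream of (key, value) pairs; the window sizes, the session gap, and the tuple layout are illustrative assumptions.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowingExamples {
    // Assumes a keyed stream of (key, value) pairs; sizes and gaps are illustrative.
    public static void examples(KeyedStream<Tuple2<String, Long>, String> keyed) {
        // Time-based: one-minute tumbling windows in event time.
        keyed.window(TumblingEventTimeWindows.of(Time.minutes(1)))
             .sum(1);

        // Count-based: fire once 100 elements have arrived for a key.
        keyed.countWindow(100)
             .sum(1);

        // Session windows: close a window after 30 minutes of inactivity per key.
        keyed.window(EventTimeSessionWindows.withGap(Time.minutes(30)))
             .sum(1);
    }
}
```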
Lightweight fault tolerance
Flink's fault-tolerance mechanism is lightweight, which lets it provide exactly-once guarantees while sustaining high throughput. Flink recovers from failures with zero data loss, and the cost of the mechanism does not undermine Flink's reliability or performance.
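A sketch of pairing checkpointing with an automatic restart strategy, so that a failed job restarts and resumes from its latest checkpoint; the retry count and delay are assumptions for illustration.

```java
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RecoverySetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoints provide the state to restore from after a failure.
        env.enableCheckpointing(10_000L);

        // On failure, restart the job up to 3 times, waiting 10 seconds between attempts.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // ... define sources, transformations, and sinks, then env.execute(...)
    }
}
```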
High throughput and low latency
As noted above, Flink maintains high throughput and low latency even in large-scale deployments running on thousands of nodes.
Savepoint mechanism
Flink's savepoint mechanism provides versioned application state, making it possible to upgrade an application or reprocess historical data without losing state and with only a brief outage.
Distributed Support
Flink can be deployed and run on clusters of thousands of nodes, with support for running on Mesos and YARN.
Batch compatibility in Apache Flink
Flink uses the DataStream API to process unbounded data sets and the DataSet API to process bounded data sets.
Within the Flink framework, a bounded data set can be regarded as a special case of an unbounded data set, and that is exactly how the DataSet API handles it: a bounded data set is treated as a finite stream.
As a result, Flink handles bounded and unbounded data sets in essentially the same way.
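As a sketch of the two APIs side by side, the snippet below runs the same transformation over an unbounded socket source with the DataStream API and over a finite file with the DataSet API; the host, port, and file path are placeholders assumed for this example.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedAndUnbounded {
    // Unbounded: the DataStream API over a continuously produced source (host/port are placeholders).
    public static void unboundedJob() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> lines = env.socketTextStream("localhost", 9999);
        lines.map(String::toUpperCase).print();
        env.execute("Unbounded job");   // runs until cancelled
    }

    // Bounded: the DataSet API over a finite input, handled internally as a finite stream (path is a placeholder).
    public static void boundedJob() throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> lines = env.readTextFile("/path/to/input.txt");
        lines.map(String::toUpperCase).print();   // print() triggers execution and then finishes
    }
}
```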