Flink: Next generation stream computing platform?

Brief Introduction

Flink is a distributed engine built on stream computing. Formerly known as Stratosphere, it began in 2010 as a research project at universities in Germany. After several years of development, and drawing on ideas from other projects in the community, it grew rapidly and entered the Apache incubator in 2014, soon graduating to a top-level project.

Spark's underlying engine is batch-oriented: it supports both batch and stream computation by cutting the stream into small batches. Flink is the opposite: its underlying engine is stream-oriented, and it supports both batch and stream computation on top of that. Flink provides the DataSet and DataStream APIs, with libraries for machine learning and graph computation built on them. Recently, Flink has also been developing a SQL-related API (Table), similar to the DataFrame in Spark.
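As a point of reference, here is a minimal DataStream sketch in the Flink 1.x-era Java API (the class and job names are illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HelloDataStream {
    public static void main(String[] args) throws Exception {
        // The DataStream API builds a streaming pipeline lazily and
        // submits it to the engine on execute().
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("storm", "spark", "flink")
           .filter(s -> s.startsWith("f"))
           .map(String::toUpperCase)
           .print();

        env.execute("hello-datastream");
    }
}
```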

(This article assumes the reader has some background in stream computing and experience with Storm.)


Advantages

1. High throughput and low latency
In simple cases, Flink's throughput can be ten times that of Storm, with latency in the milliseconds (within 100 ms). Storm provides record-by-record real-time computation in a resident-task mode; latency can reach tens of milliseconds or even 10 ms, but throughput is low. Trident divides the data stream into batches, yet it still executes at the record level, so in Storm the effective way to raise throughput is to pack records together. Spark Streaming is different from the resident-task mode: it directly divides the data into a discretized stream, turning stream computation into batch computation. Each batch is scheduled to run, which raises throughput, but latency depends on the fixed batch interval, and the interval cannot be set too small because every batch carries scheduling and related overhead; Spark officially suggests a batch interval of 2-5 s.
Flink also provides a resident-task mode, and to overcome the throughput problem of record-by-record processing, Flink sends data in memory segments. A memory segment defaults to 32 KB; Flink serializes outgoing data into the segment, and when the segment is full or a timeout fires, the segment is sent out. This is effectively packing logic, but unlike Storm, where the user packs records manually, Flink serializes the data directly into managed byte arrays, so the business objects are collected quickly by young GC and do not cause frequent full GC. On the other hand, Flink's high throughput also benefits from its serialization: through the DataStream API, Flink understands the schema of the business data and generates a customized serializer, avoiding expensive Kryo serialization (type information, heavy reflection). Problem: packing introduces some latency (when data volume is small, the segment is flushed on timeout; you can also set the timeout to 0 so that every write flushes the data), and when the topology is very deep, this latency grows linearly with depth. But Flink chains multiple transforms together when no shuffle (fieldsGrouping in Storm) sits between them, so many topologies that are deep in Storm become much shallower in Flink, and topologies are rarely very deep. (Storm 1.0 also did packing-related work, but its idea is not point-to-point packing; it batches in the DisruptorQueue, which can give a 3-10x performance improvement.)
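The flush timeout mentioned above is exposed directly on the execution environment; a minimal sketch (the 10 ms value is just an illustration):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BufferTimeoutExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Flush partially filled network buffers after 10 ms instead of
        // waiting for the 32 KB segment to fill. 0 flushes after every
        // record (lowest latency); -1 flushes only when a buffer is full
        // (highest throughput).
        env.setBufferTimeout(10);

        env.fromElements("a", "b", "c")
           .map(String::toUpperCase)
           .print();

        env.execute("buffer-timeout-demo");
    }
}
```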
2. Exactly-once computing and state management

Exactly-once semantics generally require transactional state. Inspired by the Chandy-Lamport algorithm, Flink's approach to exactly-once is to periodically take a consistent distributed snapshot (a checkpoint) and, after a failure, revert to the previous consistent snapshot. This matches the basic idea of Trident, though the implementations differ. Trident's transactions are divided into two phases: 1. Parallel computing: this phase is pure computation, involves no state changes, and batches are independent of one another. 2. State commit: the state is written to external storage; this phase is completely serial, so the external engine only needs simple incremental key-value updates, with no need for multiple versions (only transactional semantics are discussed here, excluding opaque transactions). In Trident, the master controls the whole computation: it notifies the source to send data before each batch, but it does not know whether any data currently exists, which leads to the empty-batch problem: even with no data, all kinds of coordination messages are still sent. Trident therefore adds a parameter that enforces a minimum interval between batches, but this directly makes data latency depend on that parameter's setting (similar to Spark Streaming).
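In Flink, turning on this checkpointing mechanism is a one-line configuration; a minimal sketch against the 1.x-era API (the host, port, and 30 s interval are illustrative):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Inject a barrier (and thus take a consistent distributed
        // snapshot) every 30 s; EXACTLY_ONCE selects the aligned
        // checkpoint behavior described in this section.
        env.enableCheckpointing(30_000, CheckpointingMode.EXACTLY_ONCE);

        env.socketTextStream("localhost", 9999) // long-running source so checkpoints fire
           .map(String::trim)
           .print();

        env.execute("checkpoint-demo");
    }
}
```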

Flink's transactions: unlike Trident, Flink's master does not control when data is sent; it only controls when the batches are divided. The master injects barriers into the control flow, dividing the source data into batches, and the sources can send data at any time.
The master uses barriers to make each node generate a distributed snapshot (built in full, with multiple versions), and it cleans up versions that are no longer needed according to a certain policy (the recent savepoint feature even lets users manage these distributed snapshot versions like Git). Barrier propagation is similar to the coordination messages in Trident: with n upstream parallel instances and m downstream, a total of n*m messages is needed to propagate a barrier. The detailed process is illustrated in the following figure:

[Figure: barrier propagation through the topology]
But this poses a problem. Suppose a node's upstream has parallel instances A and B, and its downstream node is C. 1. A finishes the 1st batch and sends barrier1 to C, while B has not yet finished the first batch. 2. A finishes the 2nd batch and sends barrier2 to C, while B still has not finished the first batch. 3. C has now received first- and second-batch data from A, but it cannot process the second batch: the first batch is incomplete and cannot yet form a consistent distributed snapshot, so the data arriving after barrier2 has to be blocked. The crux of distributed consistency is therefore data alignment, and Flink relies on the barrier mechanism to achieve this alignment.
[Figure: the four states of barrier alignment]

The figure describes the four states of a node: 1. Alignment begins: the node has received A's barrier, but B's data has not all arrived, so alignment is needed. 2. Data for the next batch arrives from upstream A and is blocked (i.e., buffered). 3. Once the rest of the batch arrives from B, the node starts its checkpoint and sends the barrier downstream. 4. It continues with the alignment of the next batch. Barriers are emitted purely on a time interval and do not delay the data at all; users can set the interval according to their recovery-time requirements. At a checkpoint, Flink snapshots the state and then writes it out asynchronously, affecting normal computation as little as possible.
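To make the alignment logic concrete, here is a minimal single-threaded sketch. It is not Flink's internal code; the class and method names are hypothetical, and it models one node with several upstream input channels:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Sketch of barrier alignment: records from inputs that have already
// delivered the current barrier are buffered until every input has
// delivered it; then the node snapshots and releases the buffer.
class BarrierAligner {
    private final int numInputs;
    private final Set<Integer> blockedInputs = new HashSet<>();
    private final Queue<Object> buffered = new ArrayDeque<>();

    BarrierAligner(int numInputs) {
        this.numInputs = numInputs;
    }

    /** A data record arrives on channel `input`; returns the records
     *  that may be processed now (possibly none). */
    Queue<Object> onRecord(int input, Object record) {
        Queue<Object> ready = new ArrayDeque<>();
        if (blockedInputs.contains(input)) {
            buffered.add(record);  // post-barrier record: hold it back
        } else {
            ready.add(record);     // pre-barrier record: process freely
        }
        return ready;
    }

    /** The barrier arrives on channel `input`; once all inputs are
     *  aligned, snapshot here and release the buffered records. */
    Queue<Object> onBarrier(int input) {
        blockedInputs.add(input);
        if (blockedInputs.size() == numInputs) {
            // All inputs aligned: snapshot state, forward the barrier
            // downstream, then release what was buffered.
            blockedInputs.clear();
            Queue<Object> release = new ArrayDeque<>(buffered);
            buffered.clear();
            return release;
        }
        return new ArrayDeque<>();
    }
}
```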
Problem 1: a failure causes Flink to redeploy the entire job (because it is hard to design barriers to roll back partially), but since Flink uses a threading model, redeployment does not tear down the processes, so the cost is very low (by comparison, Spark Streaming effectively deploys once per batch of computation). Problem 2: the state handling is not flexible; it strongly requires the state key to be the shuffle key, which can be worked around with hacked code. Problem 3: state checkpoints are currently still full snapshots; incremental checkpoints are not yet supported.
3. Progressive backpressure

The data source in stream computing is unstable: a sudden flood of data can bring the whole topology down (out-of-memory errors, etc.), so stream computing needs flow control. Storm controls flow with the spout's topology.max.spout.pending parameter. The problem is that this value is hard for users to set: too small, and the job's throughput never rises (the pipeline never fills); too large, and the topology easily hangs and gets stuck. In its 2014 Storm paper, Twitter describes an auto-tuning mechanism that dynamically adjusts max.spout.pending based on the amount of data acked in the current window. In 2015, Twitter proposed a better solution with Heron: when a downstream node cannot keep up with an upstream node, the source (spout) is made to send less data. But in both Heron and Storm the underlying connections are shared, and using TCP backpressure directly would cause distributed deadlock, so they only implement spout backpressure (notifying the spout directly).
In Flink, each pair of communicating tasks has an independent TCP connection, so Flink can rely directly on TCP's own backpressure. The downside is a large increase in the number of underlying connections. The advantage is reduced oscillation: with spout backpressure, when congestion occurs downstream the spout is notified directly and stops sending data, but the spout's pause takes a long time to be felt downstream, so congestion is slow to clear; and when the backpressure is released, the spout sends a burst of data that congests the downstream again. The result is nodes flipping back and forth between congested and idle states, which seriously hurts throughput. If backpressure can instead be applied directly to the immediate upstream, the impact of such oscillation is much smaller (and finer-grained control reduces it further).
Problem: relying on TCP backpressure requires a dedicated point-to-point connection per task pair, so at high parallelism the n*m connection count becomes too large. Here Flink can lean on its high throughput, which means a job's parallelism does not have to be set so high.
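As a loose analogy (this is not Flink code), hop-by-hop backpressure behaves like a bounded buffer between a producer and a consumer: when the buffer fills, the producer blocks, just as a full TCP window blocks the sender:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model of hop-by-hop backpressure. The bounded queue plays the
// role of the TCP window between two tasks: a slow consumer fills it,
// put() blocks, and the pressure propagates to the producer at once.
public class BackpressureDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(8);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    channel.put(i); // blocks while the "window" is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    int record = channel.take();
                    Thread.sleep(10); // slow consumer: upstream feels it
                    if (record % 25 == 0) {
                        System.out.println("consumed " + record);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}
```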

4. Based on YARN and a threading model

YARN brings easier resource isolation, deployment, and management, plus locality with the storage engines (Hadoop/HBase, etc.). Running a Flink job takes two steps: 1. Create the ApplicationMaster, request resources, and start the processes. 2. Submit the job. Flink thus runs jobs in a threading model: the processes are started once, and when a user submits a job, it is distributed to run on those existing processes. The benefits of this design are mainly two: 1. Job startup is greatly accelerated; if the jar package has already been distributed, startup is at the millisecond level. 2. Flink's vision of batch and streaming sharing a single copy of data (and even queries) relies on jobs sharing data within the JVM. Problem: the threading model makes jobs share a JVM, which also lets them interfere with one another; here Flink leaves the choice to the user, who can run jobs in a single Flink app or in separate apps.
5. Flexible window computation

Three notions of time can be used for window operations in Flink: 1. EventTime: a time field carried in the event itself. 2. ProcessingTime: the system time at the moment the window computation runs. 3. IngestionTime: the time Flink ingests the event (when the source receives it); it can be regarded as a kind of EventTime.
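In the Flink 1.x-era API, choosing between the three is a single setting on the environment; a minimal sketch (ProcessingTime is the default):

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeSelection {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Pick one of the three notions of time described above:
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
        // env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
    }
}
```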
To trigger computation and handle late events, Flink uses the watermark mechanism (a concept from MillWheel/Dataflow). Because the event time of the data at each node in a distributed environment is uncertain, a statistical message is needed to establish the current time; this is the meaning of the watermark, which is used to draw the boundaries of windows. When a node receives a watermark, it can compute and emit output from the window's saved data and then clear the window; late events that arrive afterwards can either be dropped directly or be used to update the historical result (the exact logic is defined by the user). In terms of implementation, each source generates its own watermarks; a downstream node receives the watermarks of all its upstream instances, takes the smallest as its own time, and broadcasts it downstream.
(This is somewhat similar to how barriers propagate.)
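A minimal sketch of a watermark generator in the 1.x-era API; the event type (word, event-time millis, count) and the 5 s out-of-orderness bound are illustrative assumptions:

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Hypothetical event type: (word, event-time millis, count). The emitted
// watermark trails the maximum event time seen so far by 5 s, so events
// up to 5 s out of order still land in the right window.
class MyTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Integer>> {
    MyTimestampExtractor() {
        super(Time.seconds(5)); // allowed out-of-orderness
    }

    @Override
    public long extractTimestamp(Tuple3<String, Long, Integer> event) {
        return event.f1; // the event's own time field (EventTime)
    }
}
```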
Flink's windows are very flexible: users can implement their own window definitions or directly use TumblingWindow, SlidingWindow, and SessionWindow. The trigger that fires the output can be defined based on data, an internal timer, or the watermark, and so can the trigger that cleans up data. Join operations over multiple streams can be performed on a window. Flink uses the state-checkpoint mechanism described above to make the data in a window fault tolerant.
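Putting the pieces together, a minimal tumbling event-time window job, reusing the extractor sketched above (all values are illustrative):

```java
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        env.fromElements(
                Tuple3.of("a", 1_000L, 1),
                Tuple3.of("b", 2_000L, 1),
                Tuple3.of("a", 11_000L, 1))
           .assignTimestampsAndWatermarks(new MyTimestampExtractor())
           .keyBy(0) // group by the word field
           .window(TumblingEventTimeWindows.of(Time.seconds(10)))
           .sum(2)   // count of events per word in each 10 s window
           .print();

        env.execute("tumbling-window-demo");
    }
}
```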
Flink is still maturing, but its model is more advanced than Storm's. For more depth, refer to the official documentation and the source code.
