The data exchange mechanism between tasks in Flink


The data exchange in Flink is built on the following two design principles:

    • The control flow of data exchange (i.e., the message passing used to initiate the exchange) is receiver-initiated, much like in the original MapReduce.
    • The data flow of data exchange (i.e., the data that is eventually transferred over the wire) is abstracted by the notion of an IntermediateResult and is pluggable. This means that the system can support both streaming and batch data transfer on top of the same implementation.

Data exchange involves several objects, including:

    • JobManager, the master node, is responsible for task scheduling, recovery, and coordination, and holds the big picture of a job via the ExecutionGraph data structure.
    • TaskManager, the worker node. A TaskManager (TM) executes many tasks concurrently in threads. Each TM also contains one CommunicationManager (CM, shared between tasks) and one MemoryManager (MM, also shared between tasks). TMs can exchange data with each other via standing TCP connections, which are created when needed.

Note that in Flink it is TaskManagers, not individual tasks, that exchange data over the network. That is, data exchange between tasks that live in the same TM is multiplexed over one and the same network connection, which the TaskManager creates and maintains.
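To illustrate the multiplexing idea, here is a minimal sketch (the class and frame layout are hypothetical, not Flink's actual wire format): several logical channels share one TM-to-TM connection by tagging each frame with a channel id, and the receiving side routes frames back to the right channel.

```java
import java.nio.ByteBuffer;
import java.util.*;

// Hypothetical sketch: many logical channels share one TaskManager-to-TaskManager
// connection by tagging each frame with a channel id.
public class MultiplexSketch {
    // Frame layout: [int channelId][int length][payload bytes]
    static byte[] frame(int channelId, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(8 + payload.length);
        buf.putInt(channelId).putInt(payload.length).put(payload);
        return buf.array();
    }

    // Demultiplexer: routes each decoded frame to the buffer list of its channel.
    static void demux(byte[] wire, Map<Integer, List<byte[]>> channels) {
        ByteBuffer buf = ByteBuffer.wrap(wire);
        while (buf.hasRemaining()) {
            int id = buf.getInt();
            byte[] payload = new byte[buf.getInt()];
            buf.get(payload);
            channels.computeIfAbsent(id, k -> new ArrayList<>()).add(payload);
        }
    }

    public static void main(String[] args) {
        // Data of two tasks interleaved on one shared "connection".
        ByteBuffer wire = ByteBuffer.allocate(64);
        for (byte[] f : List.of(frame(1, new byte[]{10}),
                                frame(2, new byte[]{20, 21}),
                                frame(1, new byte[]{11}))) {
            wire.put(f);
        }
        Map<Integer, List<byte[]>> channels = new HashMap<>();
        demux(Arrays.copyOf(wire.array(), wire.position()), channels);
        System.out.println(channels.get(1).size() + " frames for channel 1"); // 2 frames for channel 1
    }
}
```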

ExecutionGraph: the execution graph is a data structure that contains the "ground truth" about the job computation. It consists of vertices (ExecutionVertex, representing a computation task) and intermediate results (IntermediateResultPartition, representing the data produced by a task). Vertices are linked to the intermediate results they consume via ExecutionEdges (EE).

These are logical data structures that live in the JobManager. They have their runtime equivalents in the TaskManagers, which do the actual data processing. The runtime equivalent of the IntermediateResultPartition is called ResultPartition.

A ResultPartition (RP) represents a chunk of data that a BufferWriter writes to. An RP is a collection of ResultSubpartitions (RSs). This is to distinguish data destined for different receivers, e.g., in the case of a partitioning shuffle for a reduce or a join.

A ResultSubpartition (RS) represents one partition of the data created by an operator, together with the logic for forwarding this data to the receiving operator. The specific implementation of an RS determines the actual data transfer logic, and this plug-in mechanism allows the system to support a variety of data transfer requirements. For example, PipelinedSubpartition is a pipelined implementation that supports streaming data exchange, while SpillableSubpartition is a blocking implementation that supports batch data exchange.
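The difference between the two implementations can be sketched as follows (a simplified model with hypothetical class names, not Flink's actual classes): a pipelined subpartition announces availability as soon as it holds any buffer, while a blocking one becomes consumable only after the producer has finished writing.

```java
import java.util.*;

// Hypothetical sketch: pipelined vs. blocking subpartition availability semantics.
public class SubpartitionSketch {
    interface Subpartition {
        void add(String buffer);
        void finish();
        boolean isAvailableToConsumer();
    }

    // Streaming: consumable as soon as any data exists.
    static class PipelinedSub implements Subpartition {
        final Deque<String> buffers = new ArrayDeque<>();
        public void add(String b) { buffers.add(b); }
        public void finish() { /* no-op: data was consumable all along */ }
        public boolean isAvailableToConsumer() { return !buffers.isEmpty(); }
    }

    // Batch: consumable only after all data has been produced.
    static class BlockingSub implements Subpartition {
        final Deque<String> buffers = new ArrayDeque<>();
        boolean finished = false;
        public void add(String b) { buffers.add(b); }
        public void finish() { finished = true; }
        public boolean isAvailableToConsumer() { return finished; }
    }

    public static void main(String[] args) {
        Subpartition pipelined = new PipelinedSub();
        Subpartition blocking = new BlockingSub();
        pipelined.add("buffer-0");
        blocking.add("buffer-0");
        System.out.println(pipelined.isAvailableToConsumer()); // true
        System.out.println(blocking.isAvailableToConsumer());  // false
        blocking.finish();
        System.out.println(blocking.isAvailableToConsumer());  // true
    }
}
```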

InputGate: the logical equivalent of the RP on the receiving side. It is used to collect and process buffers coming from upstream.

InputChannel: the logical equivalent of the RS on the receiving side. It is used to receive the data of one specific partition.

Buffer: See Memory-management

Serializers and deserializers reliably convert typed records into raw binary data and back, handling records that span multiple buffers.
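A minimal sketch of this spanning behavior (the class and the length-prefix encoding are assumptions for illustration, not Flink's RecordSerializer): a record is length-prefixed, serialized to bytes, and cut into fixed-size buffers; the deserializer reassembles the record across buffer boundaries.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Hypothetical sketch: a record spanning several fixed-size buffers.
public class SpanningSerializerSketch {
    // Serialize one record as [int length][payload], cut into bufferSize-byte chunks.
    static List<byte[]> serialize(String record, int bufferSize) {
        byte[] payload = record.getBytes(StandardCharsets.UTF_8);
        ByteBuffer full = ByteBuffer.allocate(4 + payload.length);
        full.putInt(payload.length).put(payload).flip();
        List<byte[]> buffers = new ArrayList<>();
        while (full.hasRemaining()) {
            byte[] chunk = new byte[Math.min(bufferSize, full.remaining())];
            full.get(chunk);
            buffers.add(chunk);
        }
        return buffers;
    }

    // Reassemble the buffers and restore the typed record.
    static String deserialize(List<byte[]> buffers) {
        int total = buffers.stream().mapToInt(b -> b.length).sum();
        ByteBuffer full = ByteBuffer.allocate(total);
        buffers.forEach(full::put);
        full.flip();
        byte[] payload = new byte[full.getInt()];
        full.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A 13-byte record plus its 4-byte length prefix spans three 8-byte buffers.
        List<byte[]> buffers = serialize("hello, flink!", 8);
        System.out.println(buffers.size());       // 3
        System.out.println(deserialize(buffers)); // hello, flink!
    }
}
```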

Control flow for data exchange

Consider a simple map-reduce job with a parallelism of two. We have two TaskManagers, each running two tasks (one map, one reduce) on two different nodes, and a JobManager running on a third node. We focus on the initiation of the transfer between task M1 and task R2. Data transfers are represented by thick arrows, and messages by thin arrows.

First, M1 produces a ResultPartition, RP1 (arrow 1). When the RP becomes available for consumption (we discuss when this happens below), it informs the JobManager (arrow 2). The JobManager notifies the intended receivers of this partition (tasks R1 and R2) that the partition is ready. If the receivers have not been scheduled yet, this triggers their deployment (arrows 3a, 3b). The receivers then request data from the RP (arrows 4a, 4b). This initiates the data transfer between the tasks (arrows 5a, 5b), which is either local (5a) or goes through the network stack of the TaskManagers (5b).

This process leaves the RP ample latitude in deciding when to inform the JobManager that it is ready. For example, if RP1 waits until it has fully produced its data before notifying the JM (e.g., because it first writes the data to a temporary file), the exchange is roughly equivalent to a batch data exchange as implemented in Hadoop. If, instead, RP1 notifies the JobManager as soon as its first record is produced, we have a streaming data exchange.
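The receiver-initiated control flow above can be sketched as an event trace (all names and message strings here are hypothetical; this only models the ordering of notifications, lazy deployment, and data requests):

```java
import java.util.*;

// Hypothetical sketch of the receiver-initiated control flow: the producer
// notifies the JobManager when a partition is available, the JobManager
// notifies (and if necessary deploys) the consumers, and each consumer then
// requests the data from the producing side.
public class ExchangeControlFlowSketch {
    static List<String> exchange(String partition, List<String> consumers, Set<String> deployed) {
        List<String> trace = new ArrayList<>();
        trace.add("producer: " + partition + " available -> notify JobManager");  // arrows 1-2
        for (String consumer : consumers) {
            if (!deployed.contains(consumer)) {                                   // arrows 3a/3b
                trace.add("JobManager: deploy " + consumer);
                deployed.add(consumer);
            }
            trace.add("JobManager: notify " + consumer + " that " + partition + " is ready");
            trace.add(consumer + ": request " + partition);                       // arrows 4a/4b
            trace.add("producer: stream " + partition + " to " + consumer);       // arrows 5a/5b
        }
        return trace;
    }

    public static void main(String[] args) {
        // R1 is already running; R2 is deployed lazily when RP1 becomes available.
        Set<String> deployed = new HashSet<>(Set.of("R1"));
        for (String step : exchange("RP1", List.of("R1", "R2"), deployed)) {
            System.out.println(step);
        }
    }
}
```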

Transfer of byte buffers between two tasks

The above diagram shows in more detail the entire lifecycle of data as it travels from producer to consumer. Initially, the MapDriver produces records (collected via a Collector) that are passed to a RecordWriter object. RecordWriters contain a set of serializers (RecordSerializer objects), one per consumer task that may consume these records. A ChannelSelector selects one or more serializers to process a record. If the record is broadcast, it is passed to every serializer. If the record is hash-partitioned, the ChannelSelector computes the record's hash value and selects the appropriate serializer.
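The two selection strategies can be sketched as follows (a simplified model; the class name and method signatures are assumptions, not Flink's ChannelSelector interface): broadcast returns every channel, while hash partitioning maps a record's key to exactly one channel.

```java
import java.util.*;
import java.util.stream.IntStream;

// Hypothetical sketch of channel selection: broadcast targets every
// serializer/channel, hash partitioning picks one channel from the key's hash.
public class ChannelSelectorSketch {
    static int[] broadcast(int numChannels) {
        return IntStream.range(0, numChannels).toArray();
    }

    static int[] hashPartition(Object key, int numChannels) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return new int[]{ Math.floorMod(key.hashCode(), numChannels) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(broadcast(4)));          // [0, 1, 2, 3]
        int[] ch = hashPartition("user-42", 4);
        System.out.println("hash-partitioned to channel " + ch[0]); // one channel in [0, 4)
    }
}
```

Records with the same key always hash to the same channel, which is what makes a hash-partitioned shuffle deterministic for grouping.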

The serializers serialize the records into their binary representation and place them in fixed-size buffers (a record can span multiple buffers). These buffers are handed to a BufferWriter and written to a ResultPartition (RP). An RP consists of several subpartitions (ResultSubpartitions, RSs) that collect buffers for specific consumers. In this example, the buffer is destined for the reducer in TaskManager 2, so it is placed in RS2. Since this is the first buffer, RS2 becomes available for consumption (note that this behavior implements a streaming shuffle), and it notifies the JobManager.

The JobManager looks up the consumers of RS2 and notifies TaskManager 2 that a chunk of data is available. The message to TM2 is propagated down to the InputChannel that is supposed to receive this buffer, which in turn notifies RS2 that a network transfer can be initiated. RS2 then hands the buffer over to the network stack of TM1, and both sides transfer the data via Netty. Network connections are long-lived and exist between TaskManagers, not between individual tasks.

Once the buffer is received by TM2, it travels through a similar object hierarchy in reverse, starting at the InputChannel (the receiver-side equivalent of the RS), going into the InputGate (which contains several InputChannels), and finally ending up in a RecordDeserializer, which restores typed records from the buffers and hands them to the receiving task, in this case the ReduceDriver.

Translated from: https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
