Flink Study Notes: 2. Flink Introduction

2. Flink Introduction

Some of you might already be using Apache Spark in your day-to-day work and might be wondering: if I have Spark, why do I need Flink? The question is quite expected and the comparison is natural. Let me try to answer it in brief. The very first thing to understand is that Flink is based on the streaming-first principle, which means it is a true stream processing engine, not a fast batch engine that collects streams as mini-batches. Flink considers batch processing a special case of streaming, whereas it is the other way around in Spark: Spark's stream processing is in fact micro-batch processing.

2.1 History

Flink started as a research project named Stratosphere, with the goal of building a next-generation big data analytics platform at universities in the Berlin area. It was accepted as an Apache incubator project on April 16, 2014.

From version 0.6, Stratosphere was renamed Flink. The latest versions of Flink focus on supporting various features such as batch processing, stream processing, graph processing, machine learning, and so on. Flink 0.7 introduced its most important feature: the streaming API. The initial release had only a Java API; later releases added a Scala API as well. Now let's look at the current architecture of Flink in the next section.

2.2 Architecture

Flink 1.x's architecture consists of various components such as deploy, core processing, and APIs. The following diagram shows the components, APIs, and libraries:

Flink has a layered architecture in which each component is part of a specific layer. Each layer is built on top of the others for clear abstraction. Flink is designed to run on local machines, in a YARN cluster, or on the cloud. The runtime is Flink's core data processing engine, which receives the program through the APIs in the form of a JobGraph. A JobGraph is a simple parallel data flow with a set of tasks that produce and consume data streams.

The DataStream and DataSet APIs are the interfaces programmers use to define a job. JobGraphs are generated by these APIs when the program is compiled. Once compiled, the DataSet API lets an optimizer generate the optimal execution plan, while the DataStream API uses a stream builder for efficient execution plans.
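
To make the distinction concrete, here is a minimal sketch of the two entry points. This is not code from this article; the input path and the socket host/port are placeholders:

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// DataSet API: batch jobs start from an ExecutionEnvironment
val batchEnv = ExecutionEnvironment.getExecutionEnvironment
val dataset = batchEnv.readTextFile("input.txt")            // bounded input

// DataStream API: streaming jobs start from a StreamExecutionEnvironment
val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
val stream = streamEnv.socketTextStream("localhost", 9999)  // unbounded input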

The optimized JobGraph is then submitted to the executors according to the deployment model. You can choose a local, remote, or YARN mode of deployment. If you already have a Hadoop cluster running, it is always better to use the YARN mode of deployment.
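
As a rough sketch, the choice of mode shows up in how the execution environment is created. The host, JAR path, and parallelism below are placeholders, not values from this article; 6123 is Flink's default JobManager RPC port:

import org.apache.flink.api.scala._

// Local mode: runs the job inside the current JVM (handy for development)
val localEnv = ExecutionEnvironment.createLocalEnvironment(2)

// Remote mode: submits the job to a running cluster
val remoteEnv = ExecutionEnvironment.createRemoteEnvironment(
  "jobmanager-host", 6123, "/path/to/your-job.jar")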

2.3 Distributed Execution

Flink's distributed execution consists of two important kinds of processes: masters and workers. When a Flink program is executed, various processes take part in the execution, namely the Job Manager, the Task Manager, and the Job Client.

The following diagram shows the Flink program execution:

The Flink program needs to be submitted to a Job Client. The Job Client then submits the job to the Job Manager. It is the Job Manager's responsibility to orchestrate resource allocation and job execution. The very first thing it does is allocate the required resources. Once resource allocation is done, the task is submitted to the respective Task Manager. On receiving a task, the Task Manager initiates a thread to start the execution. While the execution is in progress, the Task Managers keep reporting state changes to the Job Manager. There can be various states, such as starting the execution, in progress, or finished. Once the job execution is complete, the results are sent back to the client.

2.3.1 Job Manager

The master processes, also known as Job Managers, coordinate and manage the execution of the program. Their main responsibilities include scheduling tasks, managing checkpoints, failure recovery, and so on.

There can be many masters running in parallel, sharing these responsibilities. This helps in achieving high availability. One of the masters needs to be the leader; if the leader node goes down, a standby master node is elected as the new leader.

The Job Manager consists of the following important components: the actor system, the scheduler, and checkpointing.

Flink internally uses the Akka actor system for communication between the Job Managers and the Task Managers.

2.3.2 Actor System

An actor system is a container of actors with various roles. It provides services such as scheduling, configuration, logging, and so on. It also contains a thread pool from which all actors are initiated. All actors reside in a hierarchy: each newly created actor is assigned to a parent. Actors talk to each other using a messaging system, and each actor has its own mailbox from which it reads all its messages. If the actors are local, the messages are shared through shared memory; if the actors are remote, the messages are passed through RPC calls.
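
This is not Flink's internal code, but a minimal Akka sketch of the same idea: an actor with a mailbox, created under a parent, receiving messages. The actor and message names are made up:

import akka.actor.{Actor, ActorSystem, Props}

// A minimal actor: it reads messages from its mailbox one at a time
class Worker extends Actor {
  def receive = {
    case task: String => println(s"executing: $task")
  }
}

val system = ActorSystem("flink-like")                // the container of actors
val worker = system.actorOf(Props[Worker], "worker")  // created under a parent
worker ! "start task"                                 // lands in the mailbox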

Each parent is responsible for supervising its children. If any error happens in a child, the parent gets notified. If an actor can solve its own problem, it can restart its children; if it cannot, it escalates the issue to its own parent:

In Flink, an actor is a container with state and behavior. An actor's thread sequentially keeps processing the messages it receives in its mailbox. The state and behavior are determined by the messages it has received.

2.3.3 Scheduler

Executors in Flink are defined as task slots. Each Task Manager manages one or more task slots. Internally, Flink decides which tasks need to share a slot and which tasks must be placed into a specific slot. It defines this through the SlotSharingGroup and the CoLocationGroup.
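SlotSharingGroup-based placement is exposed directly on the DataStream API. As a sketch, here is how an operator can be pinned to a named slot sharing group; the group name and the socket source are placeholders:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Operators in the same slot sharing group may share a task slot;
// downstream operators inherit the group unless they set their own
env.socketTextStream("localhost", 9999)
  .map(_.toUpperCase).slotSharingGroup("preprocessing")
  .print()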

2.3.4 Checkpointing

Checkpointing is Flink's backbone for providing consistent fault tolerance. It keeps taking consistent snapshots of the distributed data streams and executor states. It is inspired by the Chandy-Lamport algorithm, but has been modified for Flink's tailored requirements.

The fault-tolerance mechanism keeps creating lightweight snapshots of the data flows, so it carries out its function without any significant burden. Generally, the state of the data flow is kept in a configured place such as HDFS.
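
A minimal sketch of turning this on in a streaming program; the checkpoint interval and the HDFS path below are placeholders:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.runtime.state.filesystem.FsStateBackend

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Snapshot the distributed stream state every 5 seconds
env.enableCheckpointing(5000)

// Keep the snapshots in a configured place such as HDFS
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))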

In case of any failure, Flink stops the executors, resets them, and starts executing from the latest available checkpoint.

Stream barriers are core elements of Flink's snapshots. They are injected into the data streams without affecting the flow. Barriers never overtake the records; they group sets of records into a snapshot. Each barrier carries a unique ID. The following diagram shows how the barriers are injected into the data stream for snapshots:

Each snapshot state is reported to the checkpoint coordinator of the Flink Job Manager. While drawing snapshots, Flink handles the alignment of records in order to avoid re-processing the same records in case of any failure. This alignment generally takes some milliseconds, but for some intense applications where even millisecond latency is not acceptable, there is an option to choose low latency over exactly-once record processing. By default, Flink processes each record exactly once. If an application needs low latency and is fine with at-least-once delivery, we can switch off that trigger. This skips the alignment and improves the latency.
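
A sketch of that switch, assuming the standard CheckpointConfig API; the interval is a placeholder:

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(1000)

// Default is EXACTLY_ONCE; AT_LEAST_ONCE skips the alignment step,
// trading possible duplicate processing for lower latency
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE)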

2.3.5 Task Manager

Task Managers are worker nodes that execute the tasks in one or more threads in the JVM. The parallelism of task execution is determined by the task slots available on each Task Manager. Each task slot represents a fixed share of the Task Manager's resources; for example, a Task Manager with four slots allocates 25% of its memory to each slot. One or more threads may run in a task slot. Threads in the same slot share the same JVM, and tasks in the same JVM share TCP connections and heartbeat messages:
2.3.6 Job Client

The Job Client is not an internal part of Flink's program execution, but it is the starting point of the execution. The Job Client is responsible for accepting the program from the user, creating a data flow, and then submitting the data flow to the Job Manager for further execution. Once the execution is complete, the Job Client provides the results back to the user. A data flow is a plan of execution. Consider a very simple word count program:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.readTextFile("input.txt")                      // source
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)                                                     // transformation
counts.writeAsCsv("output.txt", "\n", " ")                    // sink

When a client accepts the program from the user, it transforms it into a data flow. The data flow for the aforementioned program might look like this:

The preceding diagram shows how a program gets transformed into a data flow. Flink data flows are parallel and distributed by default. For parallel data processing, Flink partitions the operators and the streams. Operator partitions are called sub-tasks. Streams can distribute the data in a one-to-one or a redistributed manner.

The data flows directly from the source to the map operators, as there is no need to shuffle the data. But for a GroupBy operation, Flink may need to redistribute the data by keys in order to get correct results:
2.4 Features

In the earlier sections, we tried to understand the Flink architecture and its execution model. Because of its robust architecture, Flink is full of various features.

2.4.1 High Performance

Flink is designed to achieve high performance and low latency. Unlike other streaming frameworks such as Spark, you don't need to do many manual configurations to get the best performance. Flink's pipelined data processing gives better performance than its counterparts.

2.4.2 Exactly-Once Stateful Computation

As we discussed in the previous sections, Flink's distributed checkpoint processing helps to guarantee that each record is processed exactly once. In the case of high-throughput applications, Flink also provides a switch to allow at-least-once processing.

2.4.3 Flexible Streaming Windows

Flink supports data-driven windows. This means we can design a window based on time, counts, or sessions. A window can also be customized, which allows us to detect specific patterns in event streams.
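
A sketch of the three flavors on a keyed stream; the socket source and the window sizes are placeholders, and the exact assigner names assume the Flink 1.x DataStream API:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows

val env = StreamExecutionEnvironment.getExecutionEnvironment
val words = env.socketTextStream("localhost", 9999).map((_, 1))

// Time window: sum the counts every 10 seconds
words.keyBy(0).timeWindow(Time.seconds(10)).sum(1)

// Count window: fire after every 100 elements per key
words.keyBy(0).countWindow(100).sum(1)

// Session window: close a session after 30 seconds of inactivity
words.keyBy(0).window(ProcessingTimeSessionWindows.withGap(Time.seconds(30))).sum(1)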

2.4.4 Fault Tolerance

Flink's distributed, lightweight snapshot mechanism helps achieve a great degree of fault tolerance. It allows Flink to provide high-throughput performance with guaranteed delivery.

2.4.5 Memory Management

Flink is supplied with its own memory management inside the JVM, which makes it independent of Java's default garbage collector. It does memory management efficiently by using hashing, indexing, caching, and sorting.

2.4.6 Optimizer

Flink's batch data processing API is optimized to avoid memory-consuming operations such as shuffle, sort, and so on. It also makes sure that caching is used in order to avoid heavy disk I/O operations.

2.4.7 Stream and Batch in One Platform

Flink provides APIs for both batch and stream data processing, so once you set up the Flink environment, it can host stream and batch processing applications easily. In fact, Flink works on the streaming-first principle and considers batch processing a special case of streaming.

2.4.8 Libraries

Flink has a rich set of libraries for machine learning, graph processing, relational data processing, and so on. Because of its architecture, it is very easy to perform complex event processing and alerting. We will see more about these libraries in subsequent chapters.

2.4.9 Event Time Semantics

Flink supports event time semantics. This helps in processing streams where events arrive out of order. Sometimes events may also arrive late. Flink's architecture allows us to define windows based on time, counts, and sessions, which helps in dealing with such scenarios.
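
A minimal sketch of switching a streaming job to event time, assuming the Flink 1.x time characteristic API:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Windows are then evaluated on the timestamps carried by the events
// themselves (with watermarks), not on their arrival time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)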
