Key method call chain in the job submission process
User-written program logic needs to be submitted to Flink in order to be executed. This article explores how a client program is submitted to Flink. Our analysis is based on the most mainstream usage scenario: the user writes the program logic with the Flink API, packages it into an application package (such as a jar), and then submits it to a target Flink cluster to run.
The Flink API provides different Environment objects for different execution environments; here our analysis is mainly based on the commonly used RemoteStreamEnvironment and RemoteEnvironment.
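To make this scenario concrete, here is a minimal sketch of such a user program. It is illustrative only: the host name, port, jar path and job logic are placeholders, not taken from the analysis below.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExampleJob {
    public static void main(String[] args) throws Exception {
        // createRemoteEnvironment returns a RemoteStreamEnvironment under the hood;
        // the host, port and jar path are placeholders
        StreamExecutionEnvironment env = StreamExecutionEnvironment
                .createRemoteEnvironment("jobmanager-host", 6123, "/path/to/user-program.jar");

        env.fromElements("a", "b", "c")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .print();

        // nothing runs until execute() is called; see the note on lazy evaluation below
        env.execute("example-job");
    }
}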
Earlier we talked about the "lazy evaluation" model that Flink implements: the program only "really" starts running when the execute method is finally called. Therefore, the execute method is our starting point.
Its source code is located in org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.
First, let's take a look at the key method invocation chain that execute triggers:
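Condensed from the methods analyzed below, the chain looks roughly like this (a simplified sketch; intermediate overloads and details are omitted):

RemoteStreamEnvironment.execute()
    -> getStreamGraph()                         // build the StreamGraph of the program
    -> executeRemotely(streamGraph)             // build the user-code class loader and the Client
        -> Client.runBlocking(...)              // several overloads; getJobGraph() yields the JobGraph
            -> runBlocking_1(...)               // creates the LeaderRetrievalService
                -> JobClient.submitJobAndWait(...)   // Akka-based submission, blocks for the result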
Based on this call chain, we will analyze the key methods one by one. Of course, we may temporarily skip some implementation details so that the main path always stays clear.
The getStreamGraph method is used to obtain a StreamGraph instance, which represents the complete topology of the streaming program and contains the information needed to build the JobGraph (the set of "node" abstractions in the graph such as sources and sinks, some virtual mappings, the execution and checkpoint configuration, and so on).
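To show how these two steps fit together, here is a simplified sketch of the shape of execute in RemoteStreamEnvironment; it paraphrases the idea and is not a quote of the Flink source:

@Override
public JobExecutionResult execute(String jobName) throws ProgramInvocationException {
    // assemble the complete topology of the streaming program
    StreamGraph streamGraph = getStreamGraph();
    streamGraph.setJobName(jobName);
    // hand the graph off to the remote submission path analyzed below
    return executeRemotely(streamGraph);
}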
Once the StreamGraph is obtained, remote execution is performed by invoking executeRemotely. This method first creates a class loader for the user code, based on the paths to the user's jar packages and the classpath:
ClassLoader usercodeClassLoader = JobWithJars.buildUserCodeClassLoader(jarFiles, globalClasspaths, getClass().getClassLoader());
The Client object is then built from the configuration (the Client is the internal proxy that actually talks to the JobManager):
Client client;
try {
    client = new Client(configuration);
    client.setPrintStatusDuringExecution(getConfig().isSysoutLoggingEnabled());
}
catch (Exception e) {
    throw new ProgramInvocationException("Cannot establish connection to JobManager: " + e.getMessage(), e);
}
From this point on, the following work is taken over by the Client:
try {
    return client.runBlocking(streamGraph, jarFiles, globalClasspaths, usercodeClassLoader);
}
catch (ProgramInvocationException e) {
    throw e;
}
catch (Exception e) {
    String term = e.getMessage() == null ? "." : (": " + e.getMessage());
    throw new ProgramInvocationException("The program execution failed" + term, e);
}
finally {
    client.shutdown();
}
The client object calls runBlocking to "run" the user program in a blocking fashion and waits for the Job's JobExecutionResult to be returned. When execution completes, the shutdown method is called in the finally block to close the client and release its resources.
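Seen from the caller's side, this blocking behaviour simply means that execute returns only after the job has finished. A small illustrative usage of the returned result (the job name is a placeholder, and env is assumed to be the environment from the earlier example):

JobExecutionResult result = env.execute("example-job");
System.out.println("Job finished in " + result.getNetRuntime() + " ms");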
After runBlocking is called, the call chain jumps into the Client class. The runBlocking method has a number of overloads to accommodate different submission and operating modes. In the current remote execution environment, as the call hops between these overloads, the getJobGraph method is invoked to obtain a JobGraph instance. A JobGraph represents a Flink dataflow program in a form the JobManager can receive and understand. A Job written against Flink's high-level APIs is translated into a JobGraph before being submitted to the JobManager. We will analyze the implementation of getJobGraph later; for now, a very rough sketch of the idea is shown below, and then we move on to the next key method.
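A rough, simplified sketch of the gist for the streaming case (not the exact Flink source): the StreamGraph translates itself into a JobGraph, and the user jars are attached so the cluster can load the user code:

// translate the high-level streaming topology into a JobGraph
JobGraph jobGraph = streamGraph.getJobGraph();
// attach the user's jar files so the cluster can load the user code
for (String jarFile : jarFiles) {
    jobGraph.addJar(new Path(jarFile));
}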
runBlocking_1 is actually one of the overloads of runBlocking; the suffix is added here only to distinguish it from the runBlocking above. In runBlocking_1, LeaderRetrievalUtils is first used to create a LeaderRetrievalService object:
LeaderRetrievalService leaderRetrievalService;
try {
    leaderRetrievalService = LeaderRetrievalUtils.createLeaderRetrievalService(config);
}
catch (Exception e) {
    throw new ProgramInvocationException("Could not create the leader retrieval service.", e);
}
As the name implies, LeaderRetrievalService is the service Flink provides for locating the master (leader) node. Which implementation gets created depends on the Flink configuration, primarily the recovery.mode setting. There are currently two modes: Standalone, for the standalone deployment, and a Zookeeper-based high availability mode. Flink also provides a callback interface, LeaderRetrievalListener, which is notified once the master node's information has been obtained.
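The selection can be pictured roughly as follows. This is only a sketch of the idea; the class and method names used here are assumptions rather than a quote of the Flink source:

// which leader retrieval service is created depends on recovery.mode
RecoveryMode mode = RecoveryMode.fromConfig(configuration);
LeaderRetrievalService leaderRetrievalService;
if (mode == RecoveryMode.ZOOKEEPER) {
    // high availability: the current leader JobManager is looked up in ZooKeeper
    leaderRetrievalService = ZooKeeperUtils.createLeaderRetrievalService(configuration);
} else {
    // standalone: the JobManager address is taken directly from the configuration
    leaderRetrievalService = new StandaloneLeaderRetrievalService(jobManagerAkkaUrl);
}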
Next, JobClient's submitJobAndWait method is called. It submits the JobGraph together with the leader retrieval service (used to locate the JobManager) and waits for the result to be returned:
try {
    this.lastJobID = jobGraph.getJobID();
    return JobClient.submitJobAndWait(actorSystem, leaderRetrievalService, jobGraph,
            timeout, printStatusDuringExecution, classLoader);
}
catch (JobExecutionException e) {
    throw new ProgramInvocationException("The program execution failed: " + e.getMessage(), e);
}
The first parameter of the submitJobAndWait method above, actorSystem, is an ActorSystem instance. It is created when the Client object is constructed and is closed by the Client's shutdown method after the job has been submitted and the result obtained:
public void shutdown() {
    if (!this.actorSystem.isTerminated()) {
        this.actorSystem.shutdown();
        this.actorSystem.awaitTermination();
    }
}
The call to shutdown can be seen in the finally block of the executeRemotely code snippet shown above.
JobClient may make you wonder: what does it have to do with Client, and what is its role? The following diagram helps answer these questions:
The image above shows the relationship between the Client object and several other objects. JobClient plays a bridging role: at the API programming level it bridges synchronous method calls and asynchronous message traffic. More specifically, JobClient can be seen as a "static class" that provides a set of static methods. Here we focus mainly on the submitJobAndWait method mentioned above, which encapsulates the asynchronous Actor-based communication (the concrete communicating actor is a JobClientActor, which is responsible for talking to the JobManager's actor) and returns the result in a blocking fashion. The Client does not need to care how JobClient is implemented internally; it simply calls these static methods.
Invoking JobClient's static method submitJobAndWait triggers the Akka Actor-based message communication that carries out the subsequent submission of the JobGraph. A schematic of the message-based interaction through which JobClient submits a Job is shown below:
In general there are two ActorSystem instances involved: one belongs to the Client, the other to the JobManager. Inside submitJobAndWait, an ActorRef pointing to a JobClientActor is created first:
ActorRef jobClientActor = actorSystem.actorOf(jobClientActorProps);
It then sends a SubmitJobAndWait message carrying the JobGraph instance to jobClientActor. The message is sent using the ask pattern, which indicates that a reply message is required.
The code snippet in which JobClient sends this message to the JobClientActor is as follows:
Future<Object> future = Patterns.ask(jobClientActor,
        new JobClientMessages.SubmitJobAndWait(jobGraph),
        new Timeout(AkkaUtils.INF_TIMEOUT()));

Object answer = Await.result(future, AkkaUtils.INF_TIMEOUT());
JobClient then blocks, waiting for the result of the future. Once the answer comes back, it is parsed to determine whether it represents a successful or a failed Job.
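Conceptually the parsing of the answer looks like the following sketch; the message class names are based on our reading of that Flink version and should be treated as assumptions:

if (answer instanceof JobManagerMessages.JobResultSuccess) {
    // the job finished: unwrap the serialized execution result
    return ((JobManagerMessages.JobResultSuccess) answer).result().toJobExecutionResult(classLoader);
} else if (answer instanceof JobManagerMessages.JobResultFailure) {
    // the job failed: surface the failure cause to the caller
    throw new JobExecutionException(jobGraph.getJobID(), "The job execution failed.",
            ((JobManagerMessages.JobResultFailure) answer).cause());
}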
Summary
At this point, we have combed through the key method call path by which the client submits a streaming job. To highlight the main route and avoid being distracted by too much implementation detail, we have temporarily skipped the interpretation of some important data structures and key concepts. We will analyze them in later articles.