Key method call chain in the job submission process
User-written program logic needs to be submitted to Flink in order to be executed. This article explores how a client program is submitted to Flink. Our analysis is based on the most mainstream usage scenario: the user writes the program logic with the Flink API, packages it into an application package (such as a jar), and then submits it to a target Flink cluster to run.
The Flink API provides different Environment objects for different execution environments; here our analysis is mainly based on the commonly used RemoteStreamEnvironment and RemoteEnvironment.
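To make this scenario concrete, here is a minimal sketch of such a user program. It is illustrative only: the host name, port, jar path and job logic are placeholders, not taken from the analysis below.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExampleJob {
    public static void main(String[] args) throws Exception {
        // createRemoteEnvironment returns a RemoteStreamEnvironment under the hood;
        // the host, port and jar path are placeholders
        StreamExecutionEnvironment env = StreamExecutionEnvironment
                .createRemoteEnvironment("jobmanager-host", 6123, "/path/to/user-program.jar");

        env.fromElements("a", "b", "c")
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();
               }
           })
           .print();

        // nothing runs until execute() is called; see the note on lazy evaluation below
        env.execute("example-job");
    }
}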
Earlier we talked about the "lazy evaluation" model that Flink implements: the program only "really" starts running when the execute method is finally called. Therefore, the execute method is our starting point.
Its source code is located in org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.
First, let's take a look at the key method invocation chain that execute triggers:
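Condensed from the methods analyzed below, the chain looks roughly like this (a simplified sketch; intermediate overloads and details are omitted):

RemoteStreamEnvironment.execute()
    -> getStreamGraph()                         // build the StreamGraph of the program
    -> executeRemotely(streamGraph)             // build the user-code class loader and the Client
        -> Client.runBlocking(...)              // several overloads; getJobGraph() yields the JobGraph
            -> runBlocking_1(...)               // creates the LeaderRetrievalService
                -> JobClient.submitJobAndWait(...)   // Akka-based submission, blocks for the result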
Based on this call chain, we will analyze the key methods one by one. Of course, we may temporarily skip some implementation details so that the main path always stays clear.
The getStreamGraph method is used to obtain a StreamGraph instance, which represents the complete topology of the streaming program and contains the information needed to build the JobGraph (the set of "node" abstractions in the graph such as sources and sinks, some virtual mappings, the execution and checkpoint configuration, and so on).
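To show how these two steps fit together, here is a simplified sketch of the shape of execute in RemoteStreamEnvironment; it paraphrases the idea and is not a quote of the Flink source:

@Override
public JobExecutionResult execute(String jobName) throws ProgramInvocationException {
    // assemble the complete topology of the streaming program
    StreamGraph streamGraph = getStreamGraph();
    streamGraph.setJobName(jobName);
    // hand the graph off to the remote submission path analyzed below
    return executeRemotely(streamGraph);
}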
Once the StreamGraph is obtained, remote execution is performed by invoking executeRemotely. This method first creates a class loader for the user code, based on the paths to the user's jar packages and the classpath:
ClassLoader usercodeClassLoader = JobWithJars.buildUserCodeClassLoader(jarFiles, globalClasspaths, getClass().getClassLoader());
The Client object is then built from the configuration (the Client is the internal proxy that actually talks to the JobManager):
Client client;
try {
    client = new Client(configuration);
    client.setPrintStatusDuringExecution(getConfig().isSysoutLoggingEnabled());
}
catch (Exception e) {
    throw new ProgramInvocationException("Cannot establish connection to JobManager: " + e.getMessage(), e);
}
From this point on, the following work is taken over by the Client:
try {
    return client.runBlocking(streamGraph, jarFiles, globalClasspaths, usercodeClassLoader);
}
catch (ProgramInvocationException e) {
    throw e;
}
catch (Exception e) {
    String term = e.getMessage() == null ? "." : (": " + e.getMessage());
    throw new ProgramInvocationException("The program execution failed" + term, e);
}
finally {
    client.shutdown();
}
The client object calls runBlocking to "run" the user program in a blocking fashion and waits for the Job's JobExecutionResult to be returned. When execution completes, the shutdown method is called in the finally block to close the client and release its resources.
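Seen from the caller's side, this blocking behaviour simply means that execute returns only after the job has finished. A small illustrative usage of the returned result (the job name is a placeholder, and env is assumed to be the environment from the earlier example):

JobExecutionResult result = env.execute("example-job");
System.out.println("Job finished in " + result.getNetRuntime() + " ms");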
After runBlocking is called, the call chain jumps into the Client class. The runBlocking method has a number of overloads to accommodate different submission and operating modes. In the current remote execution environment, as the call hops between these overloads, the getJobGraph method is invoked to obtain a JobGraph instance. A JobGraph represents a Flink dataflow program in a form the JobManager can receive and understand. A Job written against Flink's high-level APIs is translated into a JobGraph before being submitted to the JobManager. We will analyze the implementation of getJobGraph later; for now, a very rough sketch of the idea is shown below, and then we move on to the next key method.
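A rough, simplified sketch of the gist for the streaming case (not the exact Flink source): the StreamGraph translates itself into a JobGraph, and the user jars are attached so the cluster can load the user code:

// translate the high-level streaming topology into a JobGraph
JobGraph jobGraph = streamGraph.getJobGraph();
// attach the user's jar files so the cluster can load the user code
for (String jarFile : jarFiles) {
    jobGraph.addJar(new Path(jarFile));
}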
runBlocking_1 is actually one of the overloads of runBlocking; the suffix is added here only to distinguish it from the runBlocking above. In runBlocking_1, LeaderRetrievalUtils is first used to create a LeaderRetrievalService object:
LeaderRetrievalService leaderRetrievalService;
try {
    leaderRetrievalService = LeaderRetrievalUtils.createLeaderRetrievalService(config);
}
catch (Exception e) {
    throw new ProgramInvocationException("Could not create the leader retrieval service.", e);
}
As the name implies, LeaderRetrievalService is the service Flink provides for locating the master (leader) node. Which implementation gets created depends on the Flink configuration, primarily the recovery.mode setting. There are currently two modes: Standalone, for the standalone deployment, and a Zookeeper-based high availability mode. Flink also provides a callback interface, LeaderRetrievalListener, which is notified once the master node's information has been obtained.
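The selection can be pictured roughly as follows. This is only a sketch of the idea; the class and method names used here are assumptions rather than a quote of the Flink source:

// which leader retrieval service is created depends on recovery.mode
RecoveryMode mode = RecoveryMode.fromConfig(configuration);
LeaderRetrievalService leaderRetrievalService;
if (mode == RecoveryMode.ZOOKEEPER) {
    // high availability: the current leader JobManager is looked up in ZooKeeper
    leaderRetrievalService = ZooKeeperUtils.createLeaderRetrievalService(configuration);
} else {
    // standalone: the JobManager address is taken directly from the configuration
    leaderRetrievalService = new StandaloneLeaderRetrievalService(jobManagerAkkaUrl);
}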
Next, JobClient's submitJobAndWait method is called. It submits the JobGraph together with the leader retrieval service (used to locate the JobManager) and waits for the result to be returned:
try {
    this.lastJobID = jobGraph.getJobID();
    return JobClient.submitJobAndWait(actorSystem, leaderRetrievalService, jobGraph,
            timeout, printStatusDuringExecution, classLoader);
}
catch (JobExecutionException e) {
    throw new ProgramInvocationException("The program execution failed: " + e.getMessage(), e);
}
The first parameter of the submitJobAndWait method above, actorSystem, is an ActorSystem instance. It is created when the Client object is constructed and is closed by the Client's shutdown method after the job has been submitted and the result obtained:
public void shutdown() {
    if (!this.actorSystem.isTerminated()) {
        this.actorSystem.shutdown();
        this.actorSystem.awaitTermination();
    }
}
The call to shutdown can be seen in the finally block of the executeRemotely code snippet shown above.
JobClient may make you wonder: what does it have to do with Client, and what is its role? The following diagram helps answer these questions:
The image above shows the relationship between the Client object and several other objects. JobClient plays a bridging role: at the API programming level it bridges synchronous method calls and asynchronous message traffic. More specifically, JobClient can be seen as a "static class" that provides a set of static methods. Here we focus mainly on the submitJobAndWait method mentioned above, which encapsulates the asynchronous Actor-based communication (the concrete communicating actor is a JobClientActor, which is responsible for talking to the JobManager's actor) and returns the result in a blocking fashion. The Client does not need to care how JobClient is implemented internally; it simply calls these static methods.
Invoking JobClient's static method submitJobAndWait triggers the Akka Actor-based message communication that carries out the subsequent submission of the JobGraph. A schematic of the message-based interaction through which JobClient submits a Job is shown below:
In general there are two ActorSystem instances involved: one belongs to the Client, the other to the JobManager. Inside submitJobAndWait, an ActorRef pointing to a JobClientActor is created first:
ActorRef jobClientActor = actorSystem.actorOf(jobClientActorProps);
It then sends a SubmitJobAndWait message carrying the JobGraph instance to jobClientActor. The message is sent using the ask pattern, which indicates that a reply message is required.
The code snippet in which JobClient sends this message to the JobClientActor is as follows:
Future<Object> future = Patterns.ask(jobClientActor,
        new JobClientMessages.SubmitJobAndWait(jobGraph),
        new Timeout(AkkaUtils.INF_TIMEOUT()));

Object answer = Await.result(future, AkkaUtils.INF_TIMEOUT());
JobClient then blocks, waiting for the result of the future. Once the answer comes back, it is parsed to determine whether it represents a successful or a failed Job.
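Conceptually the parsing of the answer looks like the following sketch; the message class names are based on our reading of that Flink version and should be treated as assumptions:

if (answer instanceof JobManagerMessages.JobResultSuccess) {
    // the job finished: unwrap the serialized execution result
    return ((JobManagerMessages.JobResultSuccess) answer).result().toJobExecutionResult(classLoader);
} else if (answer instanceof JobManagerMessages.JobResultFailure) {
    // the job failed: surface the failure cause to the caller
    throw new JobExecutionException(jobGraph.getJobID(), "The job execution failed.",
            ((JobManagerMessages.JobResultFailure) answer).cause());
}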
Summary
At this point, we have combed through the key method call path by which the client submits a streaming job. To highlight the main route and avoid being distracted by too much implementation detail, we have temporarily skipped the interpretation of some important data structures and key concepts. We will analyze them in later articles.