Flink on YARN Mode: Startup Process Source Code Analysis


This article is published in the NetEase Cloud community with the authorization of its author, Yue Meng.



For an overview of the Flink on YARN startup process, see the previous article, "Flink on Yarn Startup Process". This article walks through the implementation from the source code perspective. Some details may be wrong; corrections are welcome. Thank you.

--> 1. Start the YARN session from the command line.

bin/yarn-session.sh -n 3 -jm 1024 -nm 1024 -st
Let's take a look at the startup script.

  $JAVA_RUN $JVM_ARGS -classpath "$CC_CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_CONF_DIR:$YARN_CONF_DIR" $log_setting org.apache.flink.yarn.cli.FlinkYarnSessionCli -j "$FLINK_LIB_DIR"/flink-dist*.jar "$@"

The script mainly launches the main class **org.apache.flink.yarn.cli.FlinkYarnSessionCli**. "$@" expands to the arguments passed to the script, here "-n 3 -jm 1024 -nm 1024 -st".

1. FlinkYarnSessionCli Startup Process Analysis

First, let's look at the main function.

  public static void main(String[] args) {
      FlinkYarnSessionCli cli = new FlinkYarnSessionCli("", ""); // no prefix for the YARN session
      System.exit(cli.run(args));
  }

The main function constructs a FlinkYarnSessionCli and then calls its run method. Only the main flow of the code is covered here.

public int run(String[] args)

  • 1. Parse the command-line parameters.
    cmd = parser.parse(options, args)
  • 2. Determine the execution mode from the command-line parameters.
    # First, check whether the command line contains -q (query mode, which displays the available YARN resources)





# Second, check whether the -id parameter is present (attach to a running YARN session)


This shows what interactive mode looks like. It accepts two commands, help and stop. Entering stop makes all processes belonging to the application exit.


# Third, the normal mode




**This branch constructs a YarnClusterDescriptor, calls its deploy method to start the cluster, and then writes the JobManager and web UI addresses to the out file. In detached mode, the yarn-session client exits on its own once the cluster has started. Otherwise it enters interactive mode, through which we can control the application.**
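
The three branches of run can be summarized in a small sketch. This is not the Flink source: `SessionModeSketch`, `SessionMode`, `selectMode` and `afterDeploy` are hypothetical names introduced for illustration, and only the `-q` and `-id` flags discussed above plus the `-d` (detached) option of yarn-session.sh are modeled.

```java
import java.util.Arrays;
import java.util.List;

class SessionModeSketch {

    /** The three execution modes described above (illustrative names). */
    enum SessionMode { QUERY, ATTACH, DEPLOY }

    /** Checks the flags in the same order as the text: -q first, then -id, else deploy. */
    static SessionMode selectMode(List<String> args) {
        if (args.contains("-q")) {
            return SessionMode.QUERY;   // only report available YARN resources
        }
        if (args.contains("-id")) {
            return SessionMode.ATTACH;  // attach to an already running session
        }
        return SessionMode.DEPLOY;      // normal mode: deploy a new cluster
    }

    /** After a successful deploy: a detached client exits, otherwise it goes interactive. */
    static String afterDeploy(List<String> args) {
        return args.contains("-d") ? "exit" : "interactive";
    }

    public static void main(String[] args) {
        // The arguments from the command line shown at the start of the article.
        List<String> cli = Arrays.asList("-n", "3", "-jm", "1024", "-nm", "1024", "-st");
        System.out.println(selectMode(cli) + " -> " + afterDeploy(cli));
    }
}
```

With the arguments from the start of the article, neither `-q` nor `-id` is present, so the sketch picks the deploy branch, and because `-d` is absent the client would go on to interactive mode.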


Next, let's look at how the YarnClusterDescriptor is constructed.

----------------- ** 1 Create YarnClusterDescriptor ** ----------------------

The code simply creates a new YarnClusterDescriptor object, sets on it the dependency jar path, the configuration file path, and configuration parameters such as the number of TaskManagers, and returns the object.

------------ ** 2 YarnClusterDescriptor deploy ** -------------------------

Because YarnClusterDescriptor does not override the deploy method, the call goes directly to the deploy method of its parent class, which ultimately calls the deployInternal method.

Next, let's look at the deployInternal method. The process is summarized briefly below; for the detailed code, refer to the source on GitHub and the code shown afterwards.

  • Check whether the deploy preconditions are met, for example that the configuration file and jar paths are not empty.

  • Obtain the YARN client, which is used to communicate with the RM.

  • Add the dynamic configuration properties to the configuration (conf) object, parsing them into key-value pairs.

  • Obtain the HDFS FileSystem, which is used to upload the local jars and configuration files to HDFS.

  • Check whether the resources requested for the JobManager and TaskManager reach the minimum allocation of a single YARN container. If a requested value is smaller than that minimum, the container minimum is used to initialize the JobManager and TaskManager.

  • Create an application through the YARN client, which returns the GetNewApplicationResponse object from the RPC exchange with the RM.

  • Through the GetNewApplicationResponse object, get the maximum resources the RM can allocate to this application. If that maximum cannot satisfy jobManagerMemoryMb or taskManagerMemoryMb, an error is raised. Then calculate the total resources required by the JobManager and all TaskManagers (jobManagerMemoryMb + taskManagerMemoryMb * taskManagerCount), calculate the total free resources in the RM, and check whether the free resources cover that total. If they do not, the YARN session may still start first, with the TaskManagers waiting for resources before they start. YARN first allocates a container on one NM to the JobManager, and the TaskManagers are then started on other NMs.

  • Set the launch context used to start the ApplicationMaster. This mainly sets the Java home, main class, JVM parameters, and log file configuration. The main class of the ApplicationMaster is **YarnApplicationMasterRunner**.

  protected Class<?> getApplicationMasterClass() {
      return YarnApplicationMasterRunner.class;
  }

  • Set the ApplicationSubmissionContext and obtain the applicationId.

  • Set the HDFS path required by the session, then take the local jar packages and configuration files and upload them to HDFS.

  • Set the token information for AM startup, the jars and configuration files that the AM must download from HDFS at startup, and the classpath of the ApplicationMaster, Flink, and other processes. These are not covered in detail.

  • Set a hook function that cleans up the files uploaded to HDFS and the locally downloaded dependency files created during deploy.

  • **Key point: submit the application to the RM.** The last known application state is initialized to NEW, and the application is then monitored in a loop. If the state is no longer NEW, the current state is printed. If the state is RUNNING, the loop exits. If the application is in some other abnormal state, a YarnDeploymentException is thrown, which the calling layer catches and handles. Otherwise the state is checked again after a pause of some milliseconds.

  • Once deploy succeeds, the hook function deletes the temporary files, for example dependency jar packages and configuration files. The returned YarnClusterClient object contains important attributes such as the YarnClusterDescriptor and the ApplicationReport.

**After a successful deploy, the client enters interactive mode. The most important step in runInteractiveCli is constructing an ApplicationClient actor that communicates with the JobManager actor. However, when it sends messages such as RegisterInfoMessageListener and UnRegisterInfoMessageListener, the JobManager actor passes them on to the Flink ResourceManager actor with the forward method, so the JobManager acts as a proxy for the Flink ResourceManager. Because these two messages are forwarded, the sender is still the ApplicationClient actor, and the Flink ResourceManager actor can reply directly to the ApplicationClient.**

------------ ** 3 Main code walkthrough ** ------------
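
As a minimal sketch of the resource checks that deployInternal performs (raising requests to the container minimum, rejecting requests above the per-container maximum, and comparing jobManagerMemoryMb + taskManagerMemoryMb * taskManagerCount with the free memory), the arithmetic can be written as below. This is not the Flink source: the class name, method names, the "OK"/"WARN"/"ERROR" return values, and all numbers are assumptions made for illustration.

```java
class DeployResourceCheckSketch {

    /** Raises a request to YARN's minimum container size if it is smaller. */
    static int clampToMinimum(int requestedMb, int yarnMinAllocationMb) {
        return Math.max(requestedMb, yarnMinAllocationMb);
    }

    /** Total memory needed: one JobManager plus all TaskManagers. */
    static int totalRequiredMb(int jobManagerMb, int taskManagerMb, int taskManagerCount) {
        return jobManagerMb + taskManagerMb * taskManagerCount;
    }

    /**
     * Returns "ERROR" if a single container could never be granted,
     * "WARN" if the cluster currently lacks enough free memory
     * (the session may still start and TaskManagers wait), and "OK" otherwise.
     */
    static String check(int jmMb, int tmMb, int tmCount,
                        int minAllocMb, int maxContainerMb, int freeClusterMb) {
        int jm = clampToMinimum(jmMb, minAllocMb);
        int tm = clampToMinimum(tmMb, minAllocMb);
        if (jm > maxContainerMb || tm > maxContainerMb) {
            return "ERROR";
        }
        if (totalRequiredMb(jm, tm, tmCount) > freeClusterMb) {
            return "WARN";
        }
        return "OK";
    }

    public static void main(String[] args) {
        // One JobManager and three TaskManagers of 1024 MB each (made-up cluster limits).
        System.out.println(totalRequiredMb(1024, 1024, 3));           // prints 4096
        System.out.println(check(1024, 1024, 3, 1024, 8192, 3000));   // prints WARN
    }
}
```

The "WARN" case corresponds to the situation described above in which the session may start first while the TaskManagers wait for resources.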
[Code screenshots]

*** ---- Communication code between ApplicationClient and the JobManager actor ---- ***

[Code screenshots]

2. YarnApplicationMasterRunner Startup Process Analysis

**The RM first allocates a container on an NM to start YarnApplicationMasterRunner. Let's look at how it works.** Execution first enters the main function, which constructs a YarnApplicationMasterRunner object and directly calls its run method.

Main steps of the run method:

  • Get the UGI (UserGroupInformation) of the current user and the remote UGI.

  • Pass the tokens in the current user's UGI to the remote UGI, for data and service access.

  • Run runApplicationMaster inside the remote UGI to start the ApplicationMaster.

[Code screenshot]

Main flow of runApplicationMaster. The comments in the source are clear, so only the important points are picked out here:

  • 1) Load and parse/validate all configurations.

  • 2) Start the actor system. This tries to start the actor system that hosts the JobManager.

  • 3) Generate the configuration for the TaskManagers. This includes the JobManager address, the timeout for TaskManager registration, and the number of slots.
**The most important step here is constructing the ContainerLaunchContext of the TaskManager, which contains the command that starts the TaskManager. The main class is YarnTaskManager.**

  • 4) Start the actors and components in this order:

    1) JobManager & Archive (in non-HA case, the leader service takes this): start the JobManager actor. The main class here is **YarnJobManager**.

    2) Web Monitor (we need its port to register): start the web monitoring page and create a LeaderRetrievalService object. This object is mainly used when starting TaskManagers: it tells each TaskManager the JobManager's Akka URL so that the TaskManager can register with the JobManager.

    3) The Resource Master for YARN: start the YarnFlinkResourceManager actor, which manages Flink's container resources, including requesting and releasing them.

    4) Process reapers for the JobManager and Resource Master.

**The focus here is how YarnApplicationMasterRunner uses YarnFlinkResourceManager to request containers and start TaskManagers. This part is relatively complicated, and I traced through the YARN code to understand it.**

[Figure: Inheritance relationship of YarnFlinkResourceManager]

The figure shows that YarnFlinkResourceManager is actually an actor. The runApplicationMaster method uses the following code to start this actor.

  Props resourceMasterProps = YarnFlinkResourceManager.createActorProps(
      getResourceManagerClass(), // YarnFlinkResourceManager
      config,
      yarnConfig,
      leaderRetriever,
      appMasterHostname,
      webMonitorURL,
      taskManagerParameters,
      taskManagerContext,
      numInitialTaskManagers,
      LOG);
  ActorRef resourceMaster = actorSystem.actorOf(resourceMasterProps); // start the YarnFlinkResourceManager actor

Next, let's look at the YarnFlinkResourceManager constructor. Three member variables are the important ones here.

  // YARN's RM invokes the callbacks of this object when containers are requested. The handler holds only the
  // actor ref, so it can communicate with the actor from within a callback.
  /** Callback handler for the asynchronous resourceManagerClient */
  private YarnResourceManagerCallbackHandler resourceManagerCallbackHandler;

  // The client through which the AM communicates with the RM. The resourceManagerClient object holds the
  // resourceManagerCallbackHandler.
  /** Client to communicate with the Resource Manager (YARN's master) */
  private AMRMClientAsync<AMRMClient.ContainerRequest> resourceManagerClient;

  // Client for communication between the AM and the NMs.
  /** Client to communicate with the Node Manager and launch TaskManager processes */
  private NMClient nodeManagerClient;
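
To illustrate why it is enough for the callback handler to hold only an actor reference, here is a minimal sketch in plain Java, without Akka or the YARN client libraries. The mailbox-based `ResourceManagerActor`, the `CallbackHandler`, and the `ContainersAllocated` message are simplified stand-ins written for this example; they are assumptions and not the Flink classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

class CallbackForwardingSketch {

    /** Message sent from the handler to the actor (stand-in for a containers-allocated message). */
    static final class ContainersAllocated {
        final List<String> containerIds;
        ContainersAllocated(List<String> containerIds) { this.containerIds = containerIds; }
    }

    /** Stand-in for the resource manager actor: a mailbox plus a message loop. */
    static final class ResourceManagerActor {
        final ConcurrentLinkedQueue<Object> mailbox = new ConcurrentLinkedQueue<>();
        final List<String> launched = new ArrayList<>();

        /** Asynchronous send: only enqueues the message. */
        void tell(Object message) { mailbox.add(message); }

        /** Drains the mailbox and handles each message in turn. */
        void processMailbox() {
            Object msg;
            while ((msg = mailbox.poll()) != null) {
                if (msg instanceof ContainersAllocated) {
                    for (String id : ((ContainersAllocated) msg).containerIds) {
                        // Real code would ask the NodeManager to start this container.
                        launched.add(id);
                    }
                }
            }
        }
    }

    /** Stand-in for the callback handler: its only state is the actor reference. */
    static final class CallbackHandler {
        private final ResourceManagerActor actor;
        CallbackHandler(ResourceManagerActor actor) { this.actor = actor; }

        /** Called from the client's callback thread; it just forwards to the actor. */
        void onContainersAllocated(List<String> containerIds) {
            actor.tell(new ContainersAllocated(containerIds));
        }
    }

    /** Wires handler and actor together and returns the containers the actor launched. */
    static List<String> run(List<String> allocated) {
        ResourceManagerActor actor = new ResourceManagerActor();
        CallbackHandler handler = new CallbackHandler(actor);
        handler.onContainersAllocated(allocated);
        actor.processMailbox();
        return actor.launched;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("container_01", "container_02")));
    }
}
```

The design point is that the callback thread never touches the actor's state directly. It only enqueues a message, and the actor applies the change when it processes its mailbox.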

When YarnFlinkResourceManager starts, the preStart method is executed first. If the class does not implement it itself, the preStart method of its parent class FlinkResourceManager is executed, which calls the initialize method of YarnFlinkResourceManager.

**In the initialize method:**

resourceManagerClient.start()
  --> AMRMClientAsyncImpl.serviceStart()
  --> CallbackHandlerThread.start() (daemon thread)
  --> YarnResourceManagerCallbackHandler.onContainersAllocated(allocated)
  --> yarnFrameworkMaster.tell(new ContainersAllocated(containers), ActorRef.noSender()) (yarnFrameworkMaster is the ActorRef of YarnFlinkResourceManager)
  --> YarnFlinkResourceManager.containersAllocated
  --> NMClient.startContainer(container, taskManagerLaunchContext), which notifies each NM to start its container.

[Code screenshots]

**At this point, the main flow of YarnApplicationMasterRunner is complete.**
**The remaining details are not covered here. Next, let's look at the YarnTaskManager part.**

3. YarnTaskManager Startup Process Analysis

**Continuing from above: nodeManagerClient.startContainer(container, taskManagerLaunchContext) notifies the NM to start the container. Based on the startup information in taskManagerLaunchContext, the NM downloads from HDFS the jars and configuration files that YarnTaskManager depends on at startup (container_tokens, default_container_executor.sh, flink-conf.yaml, flink.jar, launch_container.sh, lib, log4j.properties, logback.xml). A shell then executes launch_container.sh, which finally starts the YarnTaskManager process with java -cp. When the process starts, the run method of YarnTaskManager is executed first. The TaskManager obtains the Akka address of the JobManager and sends it a registration message. After receiving the message and registering the TaskManager successfully, the JobManager sends an acknowledgement (ACK) message back. The TaskManager then constructs the member variables that do the actual work, based on its configuration and the information returned by the JobManager. The flow is:**

YarnTaskManagerRunner.runYarnTaskManager(args, classOf[YarnTaskManager])
  --> TaskManager.selectNetworkInterfaceAndRunTaskManager(configuration, resourceId, taskManager)
  --> TaskManager.runTaskManager
  --> TaskManager.startTaskManagerComponentsAndActor
  --> actorSystem.actorOf(tmProps, actorName)
  --> TaskManager.preStart
  --> StandaloneLeaderRetrievalService.start(taskManager)
  --> TaskManager.notifyLeaderAddress
  --> TaskManager.Submit
  --> TaskManager.triggerTaskManagerRegistration()
  --> TaskManager.handleRegistrationMessage
  --> InstanceManager.registerTaskManager
  --> JobManager sends a message to the TaskManager
  --> TaskManager.Submit

[Code screenshots, including associateWithJobManager]
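
The registration handshake described above can be sketched in plain Java. This is a simplified model, not the Flink actors: the `JobManager` and `TaskManager` classes here, the direct method call in place of Akka messaging, and the reply strings are assumptions chosen for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class RegistrationHandshakeSketch {

    /** Stand-in for the JobManager side: keeps the set of registered TaskManagers. */
    static final class JobManager {
        private final Set<String> registered = new LinkedHashSet<>();

        /** Handles a registration request and returns the reply message. */
        String handleRegistration(String taskManagerId) {
            boolean isNew = registered.add(taskManagerId);
            return isNew ? "AcknowledgeRegistration" : "AlreadyRegistered";
        }

        int registeredCount() { return registered.size(); }
    }

    /** Stand-in for the TaskManager side: registers, then marks itself connected. */
    static final class TaskManager {
        final String id;
        boolean connected = false;

        TaskManager(String id) { this.id = id; }

        /** Sends the registration and reacts to the JobManager's reply. */
        void register(JobManager jobManager) {
            String reply = jobManager.handleRegistration(id);
            if (reply.equals("AcknowledgeRegistration") || reply.equals("AlreadyRegistered")) {
                // The real TaskManager would now build its working components
                // from its configuration and the information in the reply.
                connected = true;
            }
        }
    }

    public static void main(String[] args) {
        JobManager jm = new JobManager();
        TaskManager tm = new TaskManager("taskmanager-1");
        tm.register(jm);
        System.out.println(tm.connected + ", registered=" + jm.registeredCount());
    }
}
```

In the real system the request and the reply are asynchronous actor messages, so the TaskManager handles the acknowledgement in its message handler rather than as a return value.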




Link: https://www.jianshu.com/p/8a3177095072

