How do I submit a spark task in a Java app?


Recently a few friends on GitHub showed interest in my streaming monitoring project (teddy), so I decided to clean up and optimize the code so as not to embarrass myself. The first thing I wanted to rework was one of its ugliest parts: task submission.

This post is based on Spark 2.2. Before reading on and trying it yourself, make sure you have:

    1. A server with Spark and YARN configured
    2. The ability to submit tasks normally with spark-submit --master yarn xxxx
Old version

In the old version, task submission was done by starting a local process that executed the spark-submit xxx script. One of the key issues with submitting a Spark task this way is getting the application ID, because that ID is what lets you track the task's status. If your resource management framework is YARN, you know that every running task has an application_id, and that the ID is generated according to the rule:

application_<timestamp>_<number>

In older versions of Spark the ID could be specified manually via the spark.app.id parameter, but in the newer code this is no longer possible. The ID is read directly from the scheduler backend's applicationId() method, and how that method is implemented depends on the concrete backend class. On YARN it is implemented by YarnClusterSchedulerBackend; see the corresponding source link for details.

If you are interested, you can trace where the real application_id ultimately comes from inside hadoop-yarn; it is generated together with a class called ContainerId.

The bottom line: if you want to customize the ID, forget it!

So, out of better ideas, I figured I would simply wait for the application to be created and then write the ID into the database directly. How?

    1. Generate a custom ID in advance and pass it as a parameter to the Spark application;
    2. Once Spark is initialized, obtain the corresponding application_id and tracking URL via the SparkContext;
    3. Have the driver connect to the database and insert the association between the two (a minimal sketch follows below).
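
Here is a minimal sketch of that workaround. The JDBC URL, the table name (app_mapping), and the convention of passing the custom ID as the first argument are all hypothetical, purely for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.sql.SparkSession;

public class RegisterAppId {
    public static void main(String[] args) throws Exception {
        String customId = args[0];  // the ID we generated ourselves and passed in

        SparkSession spark = SparkSession.builder().getOrCreate();
        // once Spark is initialized, the real YARN application_id is available
        String applicationId = spark.sparkContext().applicationId();

        // write the association custom-id -> application_id to a (hypothetical) table
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/monitor", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO app_mapping (custom_id, application_id) VALUES (?, ?)")) {
            ps.setString(1, customId);
            ps.setString(2, applicationId);
            ps.executeUpdate();
        }

        // ... the actual Spark job would continue here ...
        spark.close();
    }
}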

New version

Thanks to how fast information travels in the internet age, I saw some friends chatting in a group, learned about SparkLauncher, and looked into it. It can submit Spark tasks from Java code in two modes:

    1. new SparkLauncher().launch() starts a process directly, with the same effect as before (a small sketch follows below)
    2. new SparkLauncher().startApplication(listeners) returns a SparkAppHandle and optionally takes listeners
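
For completeness, a minimal sketch of the first mode; the paths and parameters are just placeholders borrowed from the example later in this post:

import org.apache.spark.launcher.SparkLauncher;

public class SimpleLaunch {
    public static void main(String[] args) throws Exception {
        // launch() simply forks a spark-submit process, just like the old approach
        Process process = new SparkLauncher()
                .setSparkHome("/var/lib/hadoop-hdfs/app/spark")
                .setMaster("yarn")
                .setAppResource("/home/xinghailong/launcher/launcher_test.jar")
                .setMainClass("HelloWorld")
                .launch();

        int exitCode = process.waitFor();  // block until spark-submit exits
        System.out.println("spark-submit exited with code " + exitCode);
    }
}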

The second mode is of course preferred, because it has many advantages:

    1. Built-in output redirection (stdout and stderr, with support for writing to a file), a super cool feature (see the sketch after this list)
    2. You can plug in custom listeners that react when info or state changes (not what I need)
    3. The returned SparkAppHandle supports pause, stop, disconnect, getting the app ID, getting the state, and more. This is what I want!
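
As an illustration of the first point, here is a minimal sketch of redirecting the launcher's child process output to files; the file paths are just placeholders:

import java.io.File;

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class RedirectLaunch {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/var/lib/hadoop-hdfs/app/spark")
                .setMaster("yarn")
                .setAppResource("/home/xinghailong/launcher/launcher_test.jar")
                .setMainClass("HelloWorld")
                // send the spark-submit child process output to files instead of the console
                .redirectOutput(new File("/tmp/spark-launcher.out"))
                .redirectError(new File("/tmp/spark-launcher.err"))
                .startApplication();

        System.out.println("initial state: " + handle.getState());
    }
}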

Step by step, the code

Start by creating a basic Spark program:

import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;

public class HelloWorld {
    public static void main(String[] args) throws InterruptedException {
        SparkSession spark = SparkSession
                .builder()
                //.master("yarn")
                //.appName("hello-wrold")
                //.config("spark.some.config.option", "some-value")
                .getOrCreate();

        List<Person> persons = new ArrayList<>();
        persons.add(new Person("zhangsan", 22, "male"));
        persons.add(new Person("lisi", 25, "male"));
        persons.add(new Person("wangwu", 23, "female"));

        spark.createDataFrame(persons, Person.class).show(false);

        spark.close();
    }
}
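
The Person class used above is not shown in the original post; presumably it is a plain JavaBean along these lines (the field names are an assumption based on the values used):

import java.io.Serializable;

public class Person implements Serializable {
    private String name;
    private int age;
    private String sex;

    public Person() {}

    public Person(String name, int age, String sex) {
        this.name = name;
        this.age = age;
        this.sex = sex;
    }

    // getters/setters are needed so createDataFrame can infer the schema
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
    public String getSex() { return sex; }
    public void setSex(String sex) { this.sex = sex; }
}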

Then create the Launcher class, which uses SparkLauncher to submit it:

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

import java.io.IOException;

public class Launcher {

    public static void main(String[] args) throws IOException {
        SparkAppHandle handler = new SparkLauncher()
                .setAppName("hello-world")
                .setSparkHome(args[0])
                .setMaster(args[1])
                .setConf("spark.driver.memory", "2g")
                .setConf("spark.executor.memory", "1g")
                .setConf("spark.executor.cores", "3")
                .setAppResource("/home/xinghailong/launcher/launcher_test.jar")
                .setMainClass("HelloWorld")
                .addAppArgs("I come from Launcher")
                .setDeployMode("cluster")
                .startApplication(new SparkAppHandle.Listener() {
                    @Override
                    public void stateChanged(SparkAppHandle handle) {
                        System.out.println("********** state changed **********");
                    }

                    @Override
                    public void infoChanged(SparkAppHandle handle) {
                        System.out.println("********** info changed **********");
                    }
                });

        while (!"FINISHED".equalsIgnoreCase(handler.getState().toString())
                && !"FAILED".equalsIgnoreCase(handler.getState().toString())) {
            System.out.println("id    " + handler.getAppId());
            System.out.println("state " + handler.getState());
            try {
                Thread.sleep(10000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}
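
One note on the polling loop: instead of comparing state names as strings, the handle's state can be checked directly via the State enum. A sketch of an equivalent loop, assuming the same handler variable as above:

// equivalent wait loop using SparkAppHandle.State instead of string comparison
while (!handler.getState().isFinal()) {
    System.out.println("id    " + handler.getAppId());
    System.out.println("state " + handler.getState());
    try {
        Thread.sleep(10000);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}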

Then package the project. For the packaging process you can refer to my previous blog post:
http://www.cnblogs.com/xing901022/p/7891867.html

After packaging, upload the jar to the server where Spark is deployed. Since our Launcher class references SparkLauncher, the spark-launcher jar also needs to be uploaded to the server.

[[email protected] launcher]$ ls
launcher_test.jar  spark-launcher_2.11-2.2.0.jar
[[email protected] launcher]$ pwd
/home/xinghailong/launcher

Since SparkLauncher needs SPARK_HOME to be specified, if your machine can already run spark-submit, just look inside the spark-submit script to find where SPARK_HOME points:

[[email protected] launcher]$ which spark2-submit
/var/lib/hadoop-hdfs/bin/spark2-submit

In the last few lines you can see:

export SPARK2_HOME=/var/lib/hadoop-hdfs/app/spark
# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0
exec "${SPARK2_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

To sum up, what we need is:

    1. A custom jar containing both the Spark application and the Launcher class
    2. The SparkLauncher jar, spark-launcher_2.11-2.2.0.jar (choose the version matching your own Spark)
    3. The path of the directory that holds these jars
    4. The directory pointed to by the SPARK_HOME environment variable

Then execute the command to start the test:

java -Djava.ext.dirs=/home/xinghailong/launcher -cp launcher_test.jar Launcher /var/lib/hadoop-hdfs/app/spark yarn

Description

    1. -Djava.ext.dirs sets the directory from which Java loads extension classes, here the directory containing our jars
    2. Two arguments are passed in: one is SPARK_HOME, the other is the launch mode (yarn)
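
Note that the java.ext.dirs extension mechanism was removed in Java 9, so on newer JDKs an equivalent invocation would put the jars on the classpath instead; a rough sketch using a classpath wildcard:

java -cp "/home/xinghailong/launcher/*" Launcher /var/lib/hadoop-hdfs/app/spark yarn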

Watching the output, we can see that it started and ran successfully:

id    null
state UNKNOWN
Mar 10, 2018 12:00:52 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 18/03/10 12:00:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
********** state changed **********
... a large heap of jar-copying logs omitted ...
********** info changed **********
********** state changed **********
Mar 10, 2018 12:00:55 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 18/03/10 12:00:55 INFO yarn.Client: Application report for application_1518263195995_37615 (state: ACCEPTED)
... a heap of redirected logs omitted ... application_1518263195995_37615 (state: ACCEPTED)
id    application_1518263195995_37615
state SUBMITTED
Mar 10, 2018 12:01:00 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 18/03/10 12:01:00 INFO yarn.Client: Application application_1518263195995_37615 (state: RUNNING)
********** state changed **********
... a bunch of redirected logs omitted ... INFO: user: hdfs
********** state changed **********
Mar 10, 2018 12:01:08 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 18/03/10 12:01:08 INFO util.ShutdownHookManager: Shutdown hook called
Mar 10, 2018 12:01:08 PM org.apache.spark.launcher.OutputRedirector redirect
INFO: 18/03/10 12:01:08 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-f07e0213-61fa-4710-90f5-2fd2030e0701
Summary

This achieves submitting Spark tasks from a Java application, while also giving us the application_id and state we need to track and locate the task.

