[Spark] Spark Application Deployment Tool: spark-submit

1. Introduction

The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface, so you do not have to configure your application specifically for each one.

2. Syntax

xiaosi@yoona:~/opt/spark-2.1.0-bin-hadoop2.7$ spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
3. Bundling Application Dependencies

If your code depends on other projects, you need to package them with your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating the assembly jar, list Spark and Hadoop as provided dependencies; these do not need to be bundled because they are provided by the cluster manager at runtime. Once you have an assembled jar, you can call the bin/spark-submit script as shown below, passing your jar as an argument.
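
As a minimal sketch (assuming the sbt-assembly plugin and a hypothetical project; the class name, master URL and jar path are placeholders), building and submitting the uber jar might look like this:

  # Build the assembly jar with sbt-assembly (Spark/Hadoop marked as "provided" in the build)
  sbt assembly

  # Submit the resulting uber jar
  ./bin/spark-submit \
    --class com.example.MyApp \
    --master spark://host:7077 \
    target/scala-2.11/my-app-assembly-1.0.jar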

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files, we recommend packaging them into a .zip or .egg file.

4. Starting the Application with spark-submit

Once the user application is packaged, you can use the bin/spark-submit script to launch it. This script takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes that Spark supports:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some commonly used options are:
  --class: the entry point of your application (for example com.sjf.open.spark.Java.JavaWordCount, the fully qualified name including the package)
  --master: the master URL of the cluster (for example spark://23.195.26.187:7077)
  --deploy-mode: whether to deploy the driver on the worker nodes (cluster) or locally as an external client (client)
  application-jar: path to the jar containing your application and all its dependencies. The URL must be globally visible inside your cluster, for example an hdfs:// path or a file:// path that is present on all nodes.
  application-arguments: arguments passed to the main method of your main class, if any
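
As a concrete sketch of the template above (the jar path, input path and memory setting are hypothetical placeholders; the class name and master URL are the samples from the option list):

  ./bin/spark-submit \
    --class com.sjf.open.spark.Java.JavaWordCount \
    --master spark://23.195.26.187:7077 \
    --deploy-mode client \
    --conf spark.executor.memory=2g \
    /path/to/word-count-jar-with-dependencies.jar \
    hdfs:///input/words.txt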

If your application is submitted from a machine far away from the worker machines (for example, locally on your laptop), it is common to use cluster mode to minimize network latency between the driver and the executors. Currently, standalone mode does not support cluster mode for Python applications.

For Python applications, simply pass a .py file in the place of <application-jar> instead of a jar, and add Python .zip, .egg or .py files to the search path with the --py-files argument.
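
As a minimal sketch (assuming a hypothetical main.py plus a deps.zip that bundles its helper modules; all file names, arguments and the master URL are placeholders):

  ./bin/spark-submit \
    --master spark://207.184.161.138:7077 \
    --py-files deps.zip \
    main.py arg1 arg2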

There are a few options available that are specific to the cluster manager being used. For example, with a Spark standalone cluster in cluster deploy mode, you can specify the --supervise option to make sure that the driver is automatically restarted if it fails with a non-zero exit code. To enumerate all available options for spark-submit, run it with --help. Here are a few examples of common options:

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000
5. Master URLs

The master URL passed to spark can be in the following format:

Master URL           Description
local                Run Spark locally with a single worker thread.
local[K]             Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*]             Run Spark locally with as many worker threads as there are logical cores on your machine.
spark://HOST:PORT    Connect to the given Spark standalone cluster master. The port must be the one the master is configured to use, 7077 by default.
mesos://HOST:PORT    Connect to the given Mesos cluster. The port must be the one the master is configured to use, 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit, use --deploy-mode cluster.
yarn                 Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
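
For illustration, a couple of hypothetical invocations using these master URL formats (host names, ZooKeeper addresses and jar paths are placeholders):

  # Local mode, using all logical cores
  ./bin/spark-submit --master local[*] \
    --class org.apache.spark.examples.SparkPi /path/to/examples.jar 100

  # Mesos cluster whose masters are coordinated through ZooKeeper
  ./bin/spark-submit --master mesos://zk://zk1:2181,zk2:2181/mesos --deploy-mode cluster \
    --class org.apache.spark.examples.SparkPi /path/to/examples.jar 100
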
6. Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, Spark reads options from the conf/spark-defaults.conf file in the Spark directory. For more detail, see the section on loading default configurations.

Loading default Spark configurations this way can remove the need for certain flags on the spark-submit command line. For example, if the spark.master property is set in the default configuration file, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
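
As a minimal sketch (the property values below are hypothetical examples, not recommendations), a defaults file and the correspondingly shortened submit command might look like this:

  # conf/spark-defaults.conf (hypothetical contents)
  # spark.master            spark://207.184.161.138:7077
  # spark.executor.memory   4g

  # With spark.master set in the defaults file, --master can be omitted;
  # a --conf flag on the command line would still override the value from the file.
  ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    /path/to/examples.jar \
    100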

If you are ever unsure where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.

7. Advanced Dependency Management

When using spark-submit, the application jar along with any jars included in the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths. Directory expansion does not work with --jars.

Spark uses the following URL schemes to allow different strategies for disseminating jars:
  file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver's HTTP server.
  hdfs:, http:, https:, ftp: - As you would expect, these pull down files and jars from the given URI.
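
As a hypothetical sketch combining these schemes (the class name, host names and jar paths are placeholders):

  ./bin/spark-submit \
    --class com.example.MyApp \
    --master yarn \
    --jars /opt/libs/local-dep.jar,hdfs:///libs/shared-dep.jar,http://repo.example.com/extra-dep.jar \
    my-app.jar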
