Submitting applications
The `spark-submit` script in Spark's `bin` directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a single interface, so you don't have to configure your application specially for each one.
Packaging Your Application's Dependencies
If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both SBT and Maven have assembly plugins. When creating the assembly jar, list Spark and Hadoop as `provided` dependencies; they do not need to be bundled because they are already provided by the cluster manager at runtime. Once you have an assembled jar, you can call the `bin/spark-submit` script as shown below while passing your jar.
For Python, you can use the `--py-files` argument of `spark-submit` to add `.py`, `.zip` and `.egg` files to be distributed with your application. If you depend on multiple Python files, we recommend packaging them into a single `.zip` or `.egg` file.
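For example, here is a minimal sketch of shipping helper modules with a Python application; the file and module names are hypothetical:

```bash
# Bundle the (hypothetical) helper package into a single archive
zip -r deps.zip mypackage/

# Submit the main script and ship the archive alongside it
./bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  my_app.py
```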
Launching Applications with spark-submit
Once a user application is bundled, it can be launched using the `bin/spark-submit` script. This script takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes that Spark provides:
```bash
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```
Some of the commonly used options are:
- `--class`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
- `--master`: The master URL for the cluster (e.g. `spark://23.195.26.187:7077`)
- `--deploy-mode`: Whether to deploy your driver on the worker nodes (`cluster`) or locally as an external client (`client`) (default: `client`)
- `--conf`: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (see the example after this list).
- `application-jar`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance, an `hdfs://` path or a `file://` path that is present on all nodes.
- `application-arguments`: Arguments passed to the main method of your main class, if any.
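As an illustration of the quoting rule for `--conf`, an invocation that sets a value containing spaces might look like this sketch:

```bash
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  /path/to/examples.jar 100
```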
A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (for example, the master node in a standalone EC2 cluster). In this setup, `client` mode is appropriate. In `client` mode, the driver is launched directly within the `spark-submit` process, which acts as a client to the cluster. The input and output of the application are attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. the Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (for example, locally on your laptop), it is common to use `cluster` mode to minimize network latency between the driver and the executors. Currently, standalone mode does not support cluster mode for Python applications.
For Python applications, simply pass a `.py` file in the place of `<application-jar>` instead of a JAR, and add Python `.zip`, `.egg` or `.py` files to the search path with `--py-files`.
There are a few options available that are specific to the cluster manager being used. For example, with a Spark standalone cluster in `cluster` deploy mode, you can also specify `--supervise` to make sure that the driver is automatically restarted if it fails with a non-zero exit code. To enumerate all such options available to `spark-submit`, run it with `--help`. Here are a few examples of common options:
```bash
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000
```
Master URLs
The master URL passed to Spark can be in one of the following formats (an example follows the table):

| Master URL | Meaning |
| --- | --- |
| `local` | Run Spark locally with one worker thread (i.e. no parallelism at all). |
| `local[K]` | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). |
| `local[K,F]` | Run Spark locally with K worker threads and F maxFailures (see `spark.task.maxFailures` for an explanation of this variable). |
| `local[*]` | Run Spark locally with as many worker threads as logical cores on your machine. |
| `local[*,F]` | Run Spark locally with as many worker threads as logical cores on your machine, allowing up to F failures. |
| `spark://HOST:PORT` | Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default. |
| `spark://HOST1:PORT1,HOST2:PORT2` | Connect to the given Spark standalone cluster with standby masters with ZooKeeper. The list must contain all the master hosts in the high-availability cluster set up with ZooKeeper. The port must be whichever one each master is configured to use, which is 7077 by default. |
| `mesos://HOST:PORT` | Connect to the given Mesos cluster. The port must be whichever one you have configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use `mesos://zk://...`. To submit with `--deploy-mode cluster`, the HOST:PORT should be configured to connect to the MesosClusterDispatcher. |
| `yarn` | Connect to a YARN cluster in `client` or `cluster` mode depending on the value of `--deploy-mode`. The cluster location will be found based on the `HADOOP_CONF_DIR` or `YARN_CONF_DIR` variable. |
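As a small illustration of passing a master URL from the table above, the following sketch reuses the SparkPi example and runs it locally with two worker threads, tolerating up to four task failures (`local[2,4]`):

```bash
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[2,4]" \
  /path/to/examples.jar 100
```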
Loading Configuration from a File
The `spark-submit` script can load default Spark configuration values from a properties file and pass them on to your application. By default, it will read options from `conf/spark-defaults.conf` in the Spark directory. For more detail, see the section on loading default configurations.
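For instance, a `conf/spark-defaults.conf` file might contain entries like the following; the values shown here are purely illustrative:

```
# conf/spark-defaults.conf (illustrative values)
spark.master            spark://207.184.161.138:7077
spark.executor.memory   4g
spark.serializer        org.apache.spark.serializer.KryoSerializer
```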
Loading default Spark configurations this way can obviate the need for certain flags to `spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the `--master` flag from `spark-submit`. In general, configuration values explicitly set on a `SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the defaults file.
If you are ever unclear where configuration values are coming from, you can print out fine-grained debugging information by running `spark-submit` with the `--verbose` option.
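For example, reusing the SparkPi example from above, a verbose run might look like this sketch:

```bash
./bin/spark-submit \
  --verbose \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  /path/to/examples.jar 100
```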
Advanced Dependency Management
When using `spark-submit`, the application jar along with any jars included with the `--jars` option will be automatically transferred to the cluster. URLs supplied after `--jars` must be separated by commas. That list is included in the driver and executor classpaths. Directory expansion does not work with `--jars`.
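For instance, here is a sketch of shipping two extra jars alongside an application; the class and jar names are hypothetical:

```bash
# URLs after --jars are comma-separated; directories are not expanded
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  /path/to/my-app.jar
```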
Spark uses the following URL scheme to allow different strategies for disseminating jars:

- `file:` - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and every executor pulls the files from the driver HTTP server.
- `hdfs:`, `http:`, `https:`, `ftp:` - these pull down files and JARs from the URI as expected.
- `local:` - a URI starting with `local:/` is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and it works well for large files/JARs that are pushed to each worker or shared via NFS, GlusterFS, etc.
Note that JARs and files are copied to the working directory of each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With Spark on YARN, cleanup is handled automatically. With Spark standalone, automatic cleanup can be configured with the `spark.worker.cleanup.appDataTtl` property.
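As a sketch, on a standalone worker this cleanup could be configured through `SPARK_WORKER_OPTS` in `conf/spark-env.sh`; the TTL value below is illustrative, and `spark.worker.cleanup.enabled` must also be turned on for the TTL to take effect:

```bash
# conf/spark-env.sh on each standalone worker (illustrative values)
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"
```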
Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with `--packages`. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the `--repositories` flag. (Note that credentials for password-protected repositories can in some cases be supplied in the repository URI, for example `https://user:password@host/...`. Be careful when supplying credentials this way.) These commands can be used with `pyspark`, `spark-shell` and `spark-submit` to include Spark Packages. For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries to executors.
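For example, here is a sketch of pulling in an external package by its Maven coordinates; the coordinate and repository URL below are illustrative:

```bash
# Include an external package by groupId:artifactId:version and add an extra resolver
./bin/spark-shell \
  --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 \
  --repositories https://repos.example.com/maven
```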
More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.
This page is part of the ApacheCN spark-doc-zh translation project.
Original address: http://spark.apachecn.org/docs/cn/2.2.0/submitting-applications.html
Website: http://spark.apachecn.org/
GitHub: https://github.com/apachecn/spark-doc-zh