Apache Spark 2.2.0 Chinese Documentation - Submitting Applications | ApacheCN

Submitting applications

The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a single interface, so you don't have to configure your application specially for each cluster manager.

Packaging Application Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; they do not need to be bundled because they are provided by the cluster manager at runtime. Once you have an assembled jar, you can call the bin/spark-submit script (shown below) and pass your jar.
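
For example, a typical sbt workflow might look like the following sketch. It assumes an sbt project with the sbt-assembly plugin configured and Spark/Hadoop marked as provided; the main class, jar name and master URL are hypothetical placeholders.

# Build an assembly jar containing the application code and its non-provided
# dependencies, then submit it. Names and paths below are illustrative only.
sbt assembly
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  target/scala-2.11/my-app-assembly-0.1.jar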

For Python, you can use the --py-files argument of spark-submit to add .py, .zip and .egg files to be distributed with your application. If you depend on multiple Python files, we recommend packaging them into a single .zip or .egg.
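
For instance, a minimal sketch of bundling helper modules into a zip and shipping them with a job; the directory, zip and script names here are hypothetical.

# Package local Python modules into a zip and distribute it to the executors.
zip -r deps.zip mypackage/
./bin/spark-submit \
  --master local[4] \
  --py-files deps.zip \
  my_script.py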

Launching Applications with spark-submit

Once a user application is packaged, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes that Spark supports:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the most popular options are:

    • --class : The entry point for your application (e.g. org.apache.spark.examples.SparkPi).
    • --master : The master URL for the cluster (e.g. spark://23.195.26.187:7077).
    • --deploy-mode : Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client).
    • --conf : Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (see the example after this list).
    • application-jar : Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
    • application-arguments : Arguments passed to the main method of your main class, if any.
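
As an illustration of the --conf flag, the sketch below sets one plain property and one property whose value contains spaces; the specific property values are only an example.

# Pass configuration properties on the command line; quote values with spaces.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  /path/to/examples.jar 100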

A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (for example, the master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console, so this mode is especially suitable for applications that involve a REPL (e.g. the Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (for example, locally on your laptop), it is common to use cluster mode to minimize network latency between the driver and the executors. Currently, standalone mode does not support cluster mode for Python applications.

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

There are also a few options that are specific to the cluster manager being used. For example, with a Spark standalone cluster in cluster deploy mode, you can specify --supervise to make sure that the driver is automatically restarted if it fails with a non-zero exit code. To enumerate all the options available to spark-submit, run it with --help. Here are a few examples of common options:

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \  # can be client for client mode
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000
Master URLs

The master URL that is passed to Spark can use one of the following formats:

    • local : Run Spark locally with one worker thread (i.e. no parallelism at all).
    • local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
    • local[K,F] : Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable).
    • local[*] : Run Spark locally with as many worker threads as logical cores on your machine.
    • local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine, and F maxFailures.
    • spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
    • spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters with ZooKeeper. The list must contain all the master hosts in the high-availability cluster set up with ZooKeeper. The port must be whichever one each master is configured to use, which is 7077 by default (see the example after this list).
    • mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. For a Mesos cluster using ZooKeeper, use mesos://zk://... instead. To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
    • yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
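
For example, a sketch of submitting against a ZooKeeper-backed standalone cluster with two masters; the master host names are hypothetical, and the application can fail over between the listed masters.

# Submit to a standalone cluster with standby masters; "master1" and "master2"
# are placeholder host names.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://master1:7077,master2:7077 \
  /path/to/examples.jar 1000
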
Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it reads options from conf/spark-defaults.conf in the Spark directory. For more detail, see the section on loading default configurations.

Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
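
For instance, a minimal sketch: with spark.master set in conf/spark-defaults.conf, the --master flag can be left off the command line (the property values shown are illustrative).

# conf/spark-defaults.conf might contain, for example:
#   spark.master            spark://207.184.161.138:7077
#   spark.executor.memory   4g
# --master can then be omitted; an explicit SparkConf setting or a
# command-line flag would still override these defaults.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  /path/to/examples.jar 1000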

If you are ever unclear where configuration settings are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.
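
For example, a quick sketch:

# Print fine-grained debug output, including where configuration values
# were loaded from.
./bin/spark-submit --verbose \
  --class org.apache.spark.examples.SparkPi \
  --master local[4] \
  /path/to/examples.jar 100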

Advanced Dependency Management

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included on the classpaths of the driver and executors. Directory expansion does not work with --jars.
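
For example, a sketch with two extra jars; the class name, jar names and paths are hypothetical.

# Ship two additional jars to the driver and executors; the list is
# comma-separated and directories are not allowed.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --jars /path/to/dep1.jar,/path/to/dep2.jar \
  /path/to/my-app.jar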

Spark uses the following URL schemes to allow different strategies for disseminating jars:

    • file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls these files from the driver HTTP server.
    • hdfs:, http:, https:, ftp: - These pull down files and JARs from the URI as expected.
    • local: - A URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO is incurred, and it works well for large files/JARs that have been pushed to each worker or shared via NFS, GlusterFS, etc. (an example combining these schemes follows this list).
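
For instance, a sketch combining two of the schemes above; all paths, host names and class names are hypothetical.

# The hdfs: jar is fetched by the executors as needed, while the local: jar
# is expected to already exist at that path on every worker node.
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://207.184.161.138:7077 \
  --jars hdfs://namenode:8020/libs/dep1.jar,local:/opt/libs/dep2.jar \
  /path/to/my-app.jar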

Note that those JARs and files are copied to the working directory of each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With Spark on YARN, cleanup is handled automatically; with Spark standalone, automatic cleanup can be configured via the spark.worker.cleanup.appDataTtl property.
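
In standalone mode these are worker-side settings; a minimal sketch, assuming they are passed to the workers through SPARK_WORKER_OPTS in conf/spark-env.sh (the values shown are illustrative).

# Enable periodic cleanup of finished applications' work directories and
# keep their data for 7 days (604800 seconds) before it becomes eligible.
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"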

Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with --packages. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the --repositories flag. (Note that for repositories with password protection, credentials can in some cases be supplied in the repository URI, for example https://user:password@host/.... Be careful when supplying credentials this way.) These commands can be used with pyspark, spark-shell, and spark-submit to include Spark Packages. For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to the executors.
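
For example, a sketch pulling a dependency by its Maven coordinate from an additional repository; the coordinate, repository URL and script name are hypothetical placeholders.

# Resolve com.example:mylib:1.0 and its transitive dependencies at submit
# time, consulting an extra repository alongside the defaults.
./bin/spark-submit \
  --packages com.example:mylib:1.0 \
  --repositories https://repo.example.com/maven \
  my_script.py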

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution and how to monitor and debug applications.

This translation is maintained by the ApacheCN community (apachecn/spark-doc-zh).

Original address: http://spark.apachecn.org/docs/cn/2.2.0/submitting-applications.html
Website: http://spark.apachecn.org/
GitHub: https://github.com/apachecn/spark-doc-zh (a Star would be much appreciated!)
