Original link: http://blog.csdn.net/book_mmicky/article/details/25714545
As Spark applications become more widespread, the need for a deployment tool that supports multiple cluster managers has become increasingly urgent. Spark 1.0.0 starts to address this problem: it provides an easy-to-use application deployment tool, bin/spark-submit, which enables quick deployment of Spark applications on local, Standalone, YARN, and Mesos.
1: Usage instructions
Go to the $SPARK_HOME directory and run bin/spark-submit --help to get help for this command.

[email protected]:/app/hadoop/spark100$ bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app options]
Options:
  --master MASTER_URL          spark://host:port, mesos://host:port, yarn, or local
  --deploy-mode DEPLOY_MODE    where the driver runs: client runs it on the local machine, cluster runs it inside the cluster
  --class CLASS_NAME           the class of the application package to run
  --name NAME                  application name
  --jars JARS                  comma-separated list of local jar packages to put on the driver and executor classpaths
  --py-files PY_FILES          comma-separated list of .zip, .egg, .py files to place on the PYTHONPATH of Python applications
  --files FILES                comma-separated list of files to be placed in the working directory of each executor
  --properties-file FILE       file from which to load application properties, default is conf/spark-defaults.conf
  --driver-memory MEM          driver memory size, default 512M
  --driver-java-options        extra Java options for the driver
  --driver-library-path        extra library path entries to pass to the driver
  --driver-class-path          driver classpath; jar packages added with --jars are automatically included in the classpath
  --executor-memory MEM        executor memory size, default 1G

Spark standalone with cluster deploy mode only:
  --driver-cores NUM           number of cores used by the driver, default is 1
  --supervise                  if set, the driver is restarted automatically when it fails

Spark standalone and Mesos only:
  --total-executor-cores NUM   total number of cores used by all executors

YARN only:
  --executor-cores NUM         number of cores per executor, default is 1
  --queue QUEUE_NAME           the YARN queue the application is submitted to, default is "default"
  --num-executors NUM          number of executors to start, default is 2
  --archives ARCHIVES          comma-separated list of archives to be extracted into the working directory of each executor
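Before looking at the individual options, here is a minimal, hedged example of the syntax. It reuses the jar and class names (week2.jar, WordCount1) that appear later in this article; the input-path argument and the HDFS URI prefix are assumptions for illustration only.

```bash
# Minimal sketch of the spark-submit syntax, run from $SPARK_HOME.
# week2.jar and WordCount1 are the names used later in this article;
# the input-path argument is an illustrative assumption.
bin/spark-submit \
  --master local[2] \
  --class WordCount1 \
  --driver-memory 512m \
  week2.jar \
  hdfs:///dataguru/data/SogouQ1.txt
```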
A few points in the spark-submit help information above deserve emphasis:
- Regarding --master and --deploy-mode: under normal circumstances you do not need to configure --deploy-mode; it is enough to configure --master with one of the values in the table below. Using something like --master spark://host:port --deploy-mode cluster will submit the driver to the cluster and then kill the worker.
| Master URL | Meaning |
| --- | --- |
| local | run the Spark application locally with 1 worker thread |
| local[K] | run the Spark application locally with K worker threads |
| local[*] | run the Spark application locally with as many worker threads as there are logical cores on the machine |
| spark://HOST:PORT | connect to a Spark standalone cluster and run the Spark application on that cluster |
| mesos://HOST:PORT | connect to a Mesos cluster and run the Spark application on that cluster |
| yarn-client | connect to a YARN cluster in client mode; the cluster location is defined by the environment variable HADOOP_CONF_DIR, and the driver runs on the client |
| yarn-cluster | connect to a YARN cluster in cluster mode; the cluster location is defined by the environment variable HADOOP_CONF_DIR, and the driver also runs inside the cluster |
- If you use --properties-file, the properties defined in that file do not have to be given again on the spark-submit command line. For example, if spark.master is defined in conf/spark-defaults.conf, you can omit --master (see the example after this list). The priority of Spark properties is: SparkConf in the code > command-line parameters > file configuration. See "Spark 1.0.0 property configuration" for details.
- Unlike previous versions, Spark 1.0.0 automatically ships the application jar itself and the jar packages listed in the --jars option to the cluster.
- Spark uses the following URI schemes to handle file propagation:
  - file: — absolute paths and file:/ URIs are served by the driver's HTTP file server, and each executor pulls the file back from the driver.
  - hdfs:, http:, https:, ftp: — executors pull the file directly from the given URL.
  - local: — the file already exists locally on every executor (for example on an NFS share), so nothing needs to be pulled back.
- If you need to see where the configuration options come from, turn on the --verbose option to generate more detailed run-time information for reference, as in the sketch below.
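To make the --properties-file and --verbose points concrete, here is a hedged sketch; the property values shown are illustrative assumptions, not the real configuration of this cluster.

```bash
# Assumed contents of conf/spark-defaults.conf (illustrative values only):
#   spark.master           spark://host:port
#   spark.executor.memory  1g

# Because spark.master is defined in the properties file, --master can be
# omitted on the command line; a SparkConf set inside the code would still
# override both the command line and the file.
bin/spark-submit --class WordCount1 week2.jar hdfs:///dataguru/data/SogouQ1.txt

# --verbose prints where each configuration option finally comes from.
bin/spark-submit --verbose --master local[*] \
  --class WordCount1 week2.jar hdfs:///dataguru/data/SogouQ1.txt
```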
2: Test environment
- The test programs come from "Developing Spark 1.0.0 applications with IntelliJ IDEA"; the two classes WordCount1 and WordCount2 will be tested.
- The test data comes from the Sogou user query log (SogouQ); see "Rapid setup of the Spark 1.0.0 development environment". This data set is not ideal for this particular test, but its full version is large enough that part of it can be split off for testing, and since later examples also use it, it is adopted here. For the experiments, 100,000 lines (SogouQ1.txt) and 200,000 lines (SogouQ2.txt) were extracted respectively; a possible way to produce such a split is sketched below.
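The original article does not show how the two files were produced; a minimal sketch, assuming the decoded full log is available as SogouQ.txt, could look like this:

```bash
# SogouQ.txt is an assumed name for the decoded full Sogou query log.
head -n 100000 SogouQ.txt > SogouQ1.txt
head -n 200000 SogouQ.txt > SogouQ2.txt
wc -l SogouQ1.txt SogouQ2.txt   # sanity-check the line counts
```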
3: Preparation
A: The cluster
Switch to user hadoop and start the virtual cluster built in "Rapid setup of the Spark 1.0.0 development environment".
[[email protected] ~]$ su - hadoop
[[email protected] ~]$ cd /app/hadoop/hadoop220
[[email protected] hadoop220]$ sbin/start-all.sh
[[email protected] hadoop220]$ cd ../spark100/
[[email protected] spark100]$ sbin/start-all.sh
B: The client
On the client, switch to user hadoop and go to the /app/hadoop/spark100 directory, upload the experimental data to the Hadoop cluster, and then copy over the jar package generated in "Developing Spark 1.0.0 applications with IntelliJ IDEA".
[email protected]:~/data$ su - hadoop
[email protected]:~$ cd /app/hadoop/hadoop220
[email protected]:/app/hadoop/hadoop220$ bin/hdfs dfs -mkdir -p /dataguru/data
[email protected]:/app/hadoop/hadoop220$ bin/hdfs dfs -put /home/mmicky/data/SogouQ1.txt /dataguru/data/
[email protected]:/app/hadoop/hadoop220$ bin/hdfs dfs -put /home/mmicky/data/SogouQ2.txt /dataguru/data/

Check the block distribution of SogouQ1.txt, which will be used later in the data locality analysis:
[email protected]:/app/hadoop/hadoop220$ bin/hdfs fsck /dataguru/data/SogouQ1.txt -files -blocks -locations -racks
Connecting to namenode via http://hadoop1:50070
FSCK started by hadoop (auth:SIMPLE) from /192.168.1.111 for path /dataguru/data/SogouQ1.txt at Sat Jun 14 03:47:39 CST 2014
/dataguru/data/SogouQ1.txt 108750574 bytes, 1 block(s): OK
0. BP-1801429707-192.168.1.171-1400957381096:blk_1073741835_1011 len=108750574 repl=1 [/default-rack/192.168.1.171:50010]

Check the block distribution of SogouQ2.txt, which will also be used in later data locality analysis:
[email protected]:/app/hadoop/hadoop220$ bin/hdfs fsck /dataguru/data/SogouQ2.txt -files -blocks -locations -racks
Connecting to namenode via http://hadoop1:50070
FSCK started by hadoop (auth:SIMPLE) from /192.168.1.111 for path /dataguru/data/SogouQ2.txt at Sat Jun 14 03:48:07 CST 2014
/dataguru/data/SogouQ2.txt 217441417 bytes, 2 block(s): OK
0. BP-1801429707-192.168.1.171-1400957381096:blk_1073741836_1012 len=134217728 repl=1 [/default-rack/192.168.1.173:50010]
1. BP-1801429707-192.168.1.171-1400957381096:blk_1073741837_1013 len=83223689 repl=1 [/default-rack/192.168.1.172:50010]
Switch to the Spark directory and copy the jar package over:
[email protected]:/app/hadoop/hadoop220$ cd ../spark100
[email protected]:/app/hadoop/spark100$ cp /home/mmicky/ideaprojects/week2/out/artifacts/week2/week2.jar .
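Before submitting, it can be worth confirming that the expected classes are actually inside the copied jar. This check is not part of the original article, just a small hedged sanity step:

```bash
# Optional sanity check (not from the original article): confirm that the
# WordCount classes are packaged in week2.jar.
jar tf week2.jar | grep -i wordcount
```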
4: The experiments
The commands for several experimental cases are given below; the concrete running architecture of a few of these examples is analysed in "Spark 1.0.0 on Standalone running architecture instance analysis".
There are a few things to keep in mind when submitting a Spark application with spark-submit:
- When a client outside the cluster deploys a Spark application to Spark standalone, password-free SSH login between the client and the Spark standalone cluster must be set up in advance (see the sketch after this list).
- When deploying a Spark application to YARN, pay attention to the size of --executor-memory: the executor memory plus the memory the container needs (the default is 1G) must not exceed the memory available to the NodeManager, otherwise no container will be allocated to run the executor.
- For the deployment of Python programs, refer to "Python implementations of Spark 1.0.0 programming" and "Spark 1.0.0 on YARN mode deployment".
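The concrete commands for each experimental case are not reproduced in this text, so the sketch below only illustrates what the invocations might look like in the environment described above. The master URL spark://hadoop1:7077, the ssh target, and the unqualified class name WordCount1 are assumptions; substitute the actual values of your own cluster.

```bash
# Hedged sketch only; hostnames, ports and memory sizes are assumptions.

# Password-free SSH from the client to the standalone cluster (see the note above).
ssh-keygen -t rsa              # only if no key pair exists yet
ssh-copy-id hadoop@hadoop1     # repeat for every node of the standalone cluster

# Run WordCount1 on the standalone cluster in client mode.
bin/spark-submit --master spark://hadoop1:7077 \
  --executor-memory 1g \
  --class WordCount1 week2.jar hdfs:///dataguru/data/SogouQ1.txt

# Run the same program on YARN; keep --executor-memory plus the 1G container
# overhead below the NodeManager's available memory.
bin/spark-submit --master yarn-client \
  --executor-memory 1g --num-executors 2 \
  --class WordCount1 week2.jar hdfs:///dataguru/data/SogouQ2.txt
```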