Apache Spark in Practice 6 -- Spark-submit FAQs and Their Solutions

Reprinting without the author's consent is prohibited.

Overview

After you have written a standalone Spark application, you need to submit it to a Spark cluster, and spark-submit is generally the tool used for this. What do you need to pay attention to when using spark-submit?

This article attempts to give a brief summary.

Spark-defaults.conf

First, be clear about the scope of spark-defaults.conf: the spark-defaults.conf edited on the machine where the driver runs affects the applications submitted from that driver, as well as the executors that provide compute resources for those applications.

You only need to edit the file on the machine where the driver resides; you do not need to edit it on the machines where the workers or the master run.
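
If you are ever unsure which configuration values a submitted application actually picks up, spark-submit accepts a --verbose flag that prints the properties it has parsed. A quick check might look like the following, where the class name and jar path are placeholders:

$SPARK_HOME/bin/spark-submit --verbose --class com.example.MyApp /path/to/my-app.jar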

Here is a practical example of spark-defaults.conf:

spark.executor.extraJavaOptions   -XX:MaxPermSize=896m
spark.executor.memory             5g
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.cores.max                   32
spark.shuffle.manager             SORT
spark.driver.memory               2g

The above configuration indicates that each executor started to provide compute resources for the application needs 5g of heap memory.

It is important to note that if a worker joins the cluster advertising only 4g of memory, that worker cannot provide any executor for the above application: 4g < 5g, so the minimum resource requirement cannot be met.
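
If the workers really do have only 4g, one workaround, sketched here with placeholder class and jar names, is to lower the per-executor memory for a single submission on the command line instead of editing spark-defaults.conf; values passed to spark-submit take precedence over spark-defaults.conf:

$SPARK_HOME/bin/spark-submit --master spark://master:7077 --executor-memory 3g --class com.example.MyApp /path/to/my-app.jar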

spark-env.sh

The most important thing in spark-env.sh is to specify the IP addresses. If you are running the master, you need to specify SPARK_MASTER_IP; if you are going to run a driver or a worker, you need to specify SPARK_LOCAL_IP, and it must be the IP address of the local machine, otherwise the process will not start.

An example configuration is as follows:

export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1

Start the Spark cluster

The first step is to start the master:

$SPARK_HOME/sbin/start-master.sh

The second step is to start the worker:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077

Replace master with the IP address of the machine where the master actually runs.

If you want to run multiple workers on a single machine (primarily for testing purposes), you need to specify --webui-port when starting the second and subsequent workers, otherwise you will get an error saying the port is already occupied. Start the second worker with 8083, the third with 8084, and so on.

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077 --webui-port 8083

Starting workers this way is convenient only for testing. The formal way is to use $SPARK_HOME/sbin/start-slaves.sh to start a number of workers at once, but since that requires SSH to be configured it is more trouble, so I take the simple route here.

When starting workers with $SPARK_HOME/sbin/start-slaves.sh, there is an implicit assumption that $SPARK_HOME is the same directory on every machine.
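
If you do want the formal route, a minimal sketch is to list the worker hosts in conf/slaves and run the script from the master. This assumes passwordless SSH from the master to each host, and the host names below are placeholders:

# $SPARK_HOME/conf/slaves -- one worker host per line
worker-host-1
worker-host-2

# then, on the master:
$SPARK_HOME/sbin/start-slaves.sh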

Use the same user name and user group to start the master and the workers, or the executors will report an error that the connection cannot be established after they start.

In actual use I ran into a "No route to host" error. At first I thought the network was misconfigured, but after ruling out network causes it suddenly occurred to me that the master and workers might have been started with different user names and user groups; after using the same user name/user group, the problem disappeared.

Spark-submit

Once the Spark cluster is up and running, the next problem is how to submit an application to the cluster to run.

spark-submit is used to submit and run a Spark application, and the biggest source of confusion when using this command is how to specify the dependency packages that the application requires.

First, take a look at the spark-submit help output:

$SPARK_HOME/bin/spark-submit --help

There are several options that can be used to specify the libraries you depend on:

    • --driver-class-path: packages the driver depends on; multiple packages are separated by a colon (:)
    • --jars: packages required by both the driver and the executors; multiple packages are separated by commas (,)

For the sake of simplicity, all dependencies are specified with --jars, and the run command is as follows:

$SPARK_HOME/bin/spark-submit --class <application class name> --master spark://master:7077 --jars <dependent library files> <Spark application jar>

A reminder: the files uploaded to the workers in this way need to be cleaned up manually at regular intervals, otherwise they will take up a lot of disk space.
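
If manual cleanup becomes tedious, standalone workers can also be configured to remove old application directories themselves. A minimal sketch, assuming a Spark version that supports the worker cleanup options, added to spark-env.sh on each worker:

# clean up stopped applications' work directories every 30 minutes, removing data older than 7 days
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"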

Question 1

Spark stores intermediate results in the /tmp directory while computing. Linux now supports tmpfs, which effectively mounts the /tmp directory in memory.

This leads to a problem: when the intermediate results grow too large, the /tmp directory fills up and the following error occurs:

No space left on device

The workaround is to stop mounting /tmp as tmpfs, which is done by modifying /etc/fstab.
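
Another option, instead of changing the system mount, is to point Spark's scratch space at a directory on a disk with enough room. A minimal sketch using the spark.local.dir property in spark-defaults.conf, where /data/spark-tmp is a hypothetical directory:

spark.local.dir    /data/spark-tmp

Depending on the Spark version, the SPARK_LOCAL_DIRS environment variable in spark-env.sh can serve the same purpose.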

Question 2

Sometimes you may encounter the error java.lang.OutOfMemoryError: unable to create new native thread, which can have more than one cause.

One common case is not really due to insufficient memory, but to exceeding the maximum allowed number of open file handles or the maximum number of processes.

The troubleshooting step is to check the number of file handles allowed to be open and the maximum number of processes; if the values are too low, use ulimit to increase them, then try again to see whether the problem is resolved.

ulimit -a

Increase the maximum number of processes:

ulimit -u 65535

Increase the maximum number of open file handles:

ulimit -n 65535
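
Note that ulimit changes made this way only last for the current shell session. One common way to make them persistent on Linux, sketched here with a hypothetical user name spark, is to add entries to /etc/security/limits.conf and log in again:

spark    soft    nofile    65535
spark    hard    nofile    65535
spark    soft    nproc     65535
spark    hard    nproc     65535
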
Spark-shell

The above describes how to solve the dependent-library problem when submitting a Spark application with spark-submit; what about spark-shell?

With spark-shell, use the --driver-class-path option to specify the jar files you depend on; note that if multiple jar files follow --driver-class-path, they are separated by a colon (:).
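
A minimal example, where the jar paths are placeholders:

$SPARK_HOME/bin/spark-shell --driver-class-path /path/to/dep1.jar:/path/to/dep2.jar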

Summary

This content has also been published by the author on CSDN as part of the series "Using Spark + Cassandra to build a high-performance data analysis platform."
