Cause analysis and resolution of a Spark cluster that cannot be stopped

Today I wanted to stop the Spark cluster, but found that the Spark processes could not be stopped when executing stop-all.sh. It reported:

no org.apache.spark.deploy.master.Master to stop

no org.apache.spark.deploy.worker.Worker to stop

I checked some material online and then read through stop-all.sh, stop-master.sh, stop-slaves.sh, spark-daemon.sh, spark-daemons.sh, and related scripts, and found that the culprit is most likely the $SPARK_PID_DIR environment variable.

1. Cause Analysis

I am running a Hadoop 2.6.0 + Spark 1.1.0 + YARN cluster. Spark, Hadoop, and YARN are all stopped by way of xxx.pid files. Take Spark's stop-master.sh as an example.
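Its stop statement boils down to a single call into spark-daemon.sh, passing the stop action, the Master class, and an instance number. A sketch based on Spark 1.1.0's sbin/stop-master.sh (exact surrounding lines may differ):

sbin=`dirname "$0"`
sbin=`cd "$sbin"; pwd`

"$sbin"/spark-daemon.sh stop org.apache.spark.deploy.master.Master 1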

Then look at how spark-daemon.sh handles the stop action.
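Roughly, the stop branch reads the PID from the PID file and kills that process; when the file is missing it prints the "no ... to stop" message seen above. A sketch of the stop case in the Spark 1.x script (reconstructed, so details may vary):

(stop)
    if [ -f $pid ]; then
      # PID file exists: check that the process is alive, then kill it
      if kill -0 `cat $pid` > /dev/null 2>&1; then
        echo stopping $command
        kill `cat $pid`
      else
        echo no $command to stop
      fi
    else
      # No PID file: this is the message we were seeing
      echo no $command to stop
    fi
    ;;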

The file stored in $SPARK_PID_DIR records the PID of the process to stop. By default, $SPARK_PID_DIR is the system's /tmp directory.
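From the Spark 1.x source, the default is applied near the top of spark-daemon.sh, essentially:

if [ "$SPARK_PID_DIR" = "" ]; then
  SPARK_PID_DIR=/tmp    # falls back to /tmp, which the OS cleans periodically
fi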

The system clears the contents of /tmp at intervals. Sure enough, a look under /tmp showed no PID files for the relevant processes. This is what left stop-all.sh unable to stop the cluster.

2. Stop the Spark cluster

Fearing that force-killing the Spark processes with kill might corrupt the cluster, I decided instead to restore the PID files under /tmp and then stop the cluster with stop-all.sh.

Analyzing the spark-daemon.sh script shows the PID file naming rule:

pid=$SPARK_PID_DIR/spark-$SPARK_IDENT_STRING-$command-$instance.pid

where:

$SPARK_PID_DIR is /tmp

$SPARK_IDENT_STRING is the login user $USER; the username in my cluster is cdahdp

$command is a parameter passed when calling spark-daemon.sh; there are two:

org.apache.spark.deploy.master.Master

org.apache.spark.deploy.worker.Worker

$instance is also a parameter passed when calling spark-daemon.sh; in my cluster it is 1

So the PID file names are as follows:

/tmp/spark-cdahdp-org.apache.spark.deploy.master.Master-1.pid

/tmp/spark-cdahdp-org.apache.spark.deploy.worker.Worker-1.pid

Check the PIDs of the relevant processes with jps.
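On the master node the output looks something like this (the PIDs below are made up for illustration; yours will differ):

$ jps
10112 Master
10335 Worker
12001 Jps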

Write each PID into its corresponding PID file.
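For example, with the hypothetical PIDs from the jps output above (run the Worker line on each node that hosts a Worker):

echo 10112 > /tmp/spark-cdahdp-org.apache.spark.deploy.master.Master-1.pid
echo 10335 > /tmp/spark-cdahdp-org.apache.spark.deploy.worker.Worker-1.pid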

After that, calling Spark's stop-all.sh stops the Spark cluster normally.

3. Stop the Hadoop and YARN clusters

The same thing happens when stopping the Hadoop and YARN clusters by calling stop-all.sh: Hadoop- and YARN-related processes such as the NameNode, SecondaryNameNode, DataNode, NodeManager, and ResourceManager cannot be stopped because their PID files cannot be found. The analysis is the same as for Spark; only the PID file names differ.

The PID file naming rule for Hadoop:

pid=$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid

PID file names:

/tmp/hadoop-cdahdp-namenode.pid

/tmp/hadoop-cdahdp-secondarynamenode.pid

/tmp/hadoop-cdahdp-datanode.pid

The PID file naming rule for YARN:

pid=$YARN_PID_DIR/yarn-$YARN_IDENT_STRING-$command.pid

PID file names:

/tmp/yarn-cdahdp-resourcemanager.pid

/tmp/yarn-cdahdp-nodemanager.pid

After restoring these PID files, stop-all.sh can stop the Hadoop and YARN processes.
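A small shell sketch that automates the restore on one node, assuming jps lists each daemon by its short class name and the login user is cdahdp (adjust both to your cluster):

# Rebuild missing Hadoop/YARN PID files from jps output (sketch; user cdahdp assumed)
for proc in NameNode SecondaryNameNode DataNode; do
  pid=$(jps | awk -v p="$proc" '$2 == p {print $1}')
  name=$(echo "$proc" | tr '[:upper:]' '[:lower:]')   # NameNode -> namenode
  [ -n "$pid" ] && echo "$pid" > /tmp/hadoop-cdahdp-$name.pid
done
for proc in ResourceManager NodeManager; do
  pid=$(jps | awk -v p="$proc" '$2 == p {print $1}')
  name=$(echo "$proc" | tr '[:upper:]' '[:lower:]')
  [ -n "$pid" ] && echo "$pid" > /tmp/yarn-cdahdp-$name.pid
done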

4. Permanent solution

To eradicate this problem, simply set $SPARK_PID_DIR, $HADOOP_PID_DIR, and $YARN_PID_DIR on all nodes in the cluster to a directory that is not periodically cleaned.

Modify hadoop-env.sh, add:

export HADOOP_PID_DIR=/home/ap/cdahdp/app/pids

Modify yarn-env.sh, add:

export YARN_PID_DIR=/home/ap/cdahdp/app/pids

Modify spark-env.sh, add:

export SPARK_PID_DIR=/home/ap/cdahdp/app/pids

After restarting the cluster, look at the /home/ap/cdahdp/app/pids directory.
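On a node that runs all of the daemons, it should contain files along these lines (which files appear depends on the roles running on that node):

$ ls /home/ap/cdahdp/app/pids
hadoop-cdahdp-datanode.pid
hadoop-cdahdp-namenode.pid
hadoop-cdahdp-secondarynamenode.pid
spark-cdahdp-org.apache.spark.deploy.master.Master-1.pid
spark-cdahdp-org.apache.spark.deploy.worker.Worker-1.pid
yarn-cdahdp-nodemanager.pid
yarn-cdahdp-resourcemanager.pid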
