Apache Spark in Practice 6: Cleaning Up Temporary Files in Standalone Deployment Mode


Guiding Questions

1. In standalone deployment mode, which temporary directories and files are created during a Spark run?
2. How many sub-modes does standalone deployment have?
3. What is the difference between client mode and cluster mode?





Overview

In standalone deployment mode, which temporary directories and files are created during a Spark run, and when are they cleaned up? This article answers these questions thoroughly. From the point of view of resource usage, a process consumes four kinds of resources: CPU, memory, disk, and network. After the process exits, CPU, memory, and network are reclaimed by the operating system, but the temporary files generated during the run are not; if the process does not clean them up before exiting, they are left behind and waste storage space.

Third-party dependencies at deployment time

Before getting to the specific questions, recall that standalone deployment is divided into client mode and cluster mode. In client mode, the driver and the client run in the same JVM rather than being started by a worker, and that JVM process does not exit until the results of the Spark application have been returned. In cluster mode, the driver is launched by a worker, and the client exits as soon as it has confirmed that the Spark application was submitted to the cluster successfully; it does not wait for the application's results to come back. (The original post illustrates both modes with deployment diagrams of Spark accessing Cassandra.)

From the deployment diagram, the file dependencies of each JVM process at startup are as follows:
    • The master process is the simplest: apart from the Spark jar packages, it has no third-party library dependencies.
    • The driver and the executors are both likely to have third-party package dependencies at run time. Taking them separately:
      • The driver is the simpler case: when the application is submitted, spark-submit specifies where the required jar files are to be read from.
      • The executor is started by the worker, so the worker needs to download the jar files the executor needs at startup; the question is where to download them from.

To solve the problem of the jars the executor depends on at startup, the driver starts an HttpFileServer to hold the third-party jar packages, and the worker then fetches them from this HttpFileServer. For this to work, the HttpFileServer needs to create its own directory, and the worker needs to create a corresponding directory as well. The directory created by the HttpFileServer is set up in the create function of SparkEnv.scala. Spark also generates a folder for each submitted application; by default it lives under the $SPARK_HOME/work directory and holds both the third-party library dependencies downloaded from the HttpFileServer and the logs produced by the executors at run time.
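From the application's side, this file server is what serves the jars and files registered through the SparkContext (or passed with the --jars option of spark-submit). A minimal sketch from spark-shell, where the two paths are purely hypothetical examples:

    // Register a jar with the driver; in standalone mode the workers later
    // download it from the driver's file server into their work directory.
    sc.addJar("/path/to/mylib.jar")
    // Plain files added with addFile are distributed the same way.
    sc.addFile("/path/to/lookup-table.csv")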

Experiment 1

Run spark-shell and see which directories are newly created under /tmp.

    # $SPARK_HOME/bin/spark-shell

Four new spark-shell-related folders appear under /tmp:
    • a spark-<random id> directory for the driver itself
    • a tmp directory created by the driver
    • the directory created by the HttpFileServer
    • a spark-local-* directory, which holds the shuffle output and cache contents generated while the executors run
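From inside the shell you can also dump the driver's effective configuration, which (depending on the Spark version) records things such as the file server URI and the local scratch directory. A quick, version-dependent check:

    // Prints the driver's runtime configuration; look for entries such as
    // spark.fileserver.uri and spark.local.dir (exact keys vary by Spark version).
    sc.getConf.getAll.foreach(println)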

Temporary files generated during the run

While the executors run they generate shuffle output, and if an RDD is cached its contents may be spilled to disk. All of this means a folder is needed to hold these files: the spark-local-* directories mentioned above store the temporary files generated while the executors run. The two simple experiments below show how the contents of a spark-local-* directory change.

Experiment 2: without RDD cache






After entering spark-shell, run:

    spark-shell> sc.textFile("README.md").flatMap(l => l.split(" ")).map(w => (w, 1)).reduceByKey(_ + _).foreach(println)
The above instruction produces two different stages, so shuffle output is generated (the detailed reason is not covered here). If you are using Spark 1.2.x, you will see index files appear in the spark-local-* directory.
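To see where the stage boundary, and hence the shuffle, comes from, you can print the RDD lineage. A quick check, assuming the same word-count pipeline as above:

    // reduceByKey introduces a ShuffledRDD, which splits the job into two stages.
    val counts = sc.textFile("README.md")
                   .flatMap(l => l.split(" "))
                   .map(w => (w, 1))
                   .reduceByKey(_ + _)
    println(counts.toDebugString)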

Experiment 3: with RDD cache

After entering spark-shell, run:




    spark-shell> import org.apache.spark.storage.StorageLevel
    spark-shell> val rdd1 = sc.textFile("README.md").flatMap(l => l.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    spark-shell> rdd1.persist(StorageLevel.MEMORY_AND_DISK_SER)
    spark-shell> rdd1.foreach(println)


After the above instructions execute, not only index files but also rdd_* files are generated; these rdd_* files hold the cached contents.
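You can confirm from inside the shell that part of rdd1 is indeed kept on disk. A quick check, assuming the rdd1 defined in Experiment 3 (the exact output format varies by Spark version):

    // Shows the storage level chosen for rdd1 (memory and disk, serialized).
    println(rdd1.getStorageLevel)
    // Lists the cached RDDs together with their memory and disk footprint.
    sc.getRDDStorageInfo.foreach(println)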

Configuration Items

You can change the default storage locations by setting configuration values in $SPARK_HOME/conf/spark-env.sh:
    • SPARK_WORKER_DIR specifies the work directory; the default is the $SPARK_HOME/work subdirectory.
    • SPARK_LOCAL_DIRS specifies the directory for the temporary files generated while executors run; the default is /tmp. Because /tmp may be mounted as tmpfs, it is recommended to change this to another directory in a real deployment.
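For a single application the scratch location can also be set programmatically through the spark.local.dir property. A minimal sketch, assuming a writable /data/spark-tmp directory on every node; note that SPARK_LOCAL_DIRS set in the worker's environment takes precedence over this property in standalone mode:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.local.dir is the per-application counterpart of SPARK_LOCAL_DIRS;
    // /data/spark-tmp is only an example path and must exist on every node.
    val conf = new SparkConf()
      .setAppName("ScratchDirDemo")
      .set("spark.local.dir", "/data/spark-tmp")
    val sc = new SparkContext(conf)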

Cleanup of files

So when are the temporary files generated above actually deleted? The first instinct is probably "when the Spark application finishes running", but intuition is not always right. The folders created under SPARK_LOCAL_DIRS are indeed cleaned up automatically when the application exits, but if you watch carefully you will notice files such as *_cache and *_lock in the SPARK_LOCAL_DIRS directory that are not removed automatically. This is a bug that is fixed in Spark 1.3; for details see SPARK-4323 (https://issues.apache.org/jira/browse/SPARK-4323). The *_cache files under SPARK_LOCAL_DIRS come from a patch introduced to avoid downloading the same third-party dependencies multiple times when several executors on one machine run the same application, as described in JIRA case SPARK-2713 (https://issues.apache.org/jira/browse/SPARK-2713); for the code, see the fetchFile function in org.apache.spark.util.Utils. If you are already running an affected version, how do you clean these files up? Just delete them outright: remove all cache and lock files that have not been touched for more than a day. There are no side effects; at worst the executors download the dependencies again.

    find $SPARK_LOCAL_DIRS -maxdepth 1 -type f -mtime +1 -exec rm -- {} \;

The app-<timestamp>-<seqid> folders under the SPARK_WORKER_DIR directory are not cleaned up automatically by default.

Is there an option that automatically removes the folders of applications that have stopped running? There is.

Add the following to spark-env.sh:

    SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
Note that the official documentation says these folders are deleted regardless of whether the application is still running. That is inaccurate: only the folders of applications that have stopped are deleted. I have submitted a corresponding PR to fix the wording.

Experiment 4

Write a simple word count, submit it in standalone cluster mode, and watch how the contents of $SPARK_LOCAL_DIRS change.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    object HelloApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("HelloApp")
        val sc = new SparkContext(conf)

        // Use the README.md shipped with the Spark distribution as input.
        val fileName = sys.env("SPARK_HOME") + "/README.md"
        val rdd1 = sc.textFile(fileName).flatMap(l => l.split(" ")).map(w => (w, 1))
        rdd1.reduceByKey(_ + _).foreach(println)

        // Keep the application alive for a while so the temporary files can be inspected.
        var i: Int = 0
        while (i < 10) {
          Thread.sleep(10000)
          i = i + 1
        }
      }
    }
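To produce the HelloApp.jar used in the submission below, the program can be packaged with sbt. A minimal build.sbt sketch, assuming Scala 2.10 and a Spark 1.2.x cluster (adjust the versions to match your installation):

    name := "HelloApp"

    version := "0.1"

    scalaVersion := "2.10.4"

    // "provided" keeps spark-core out of the application jar, since the cluster supplies it.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"

Running sbt package then produces the application jar under target/scala-2.10/; rename it or point spark-submit at the generated file.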

Submit and run:

    spark-submit --class HelloApp --master spark://127.0.0.1:7077 --deploy-mode cluster HelloApp.jar

Summary

Through several simple experiments, this article has looked at how temporary files are created and removed in standalone mode, in the hope of making it easier to understand how Spark acquires and releases disk resources. Spark deployment involves a great many configuration items; going through them is much easier if you first group them into categories, such as CPU, memory, network, security, disk, and Akka related settings.










Related articles

Spark in Practice 1: KafkaWordCount
http://www.aboutyun.com/thread-9580-1-1.html

Spark in Practice 2: Installing and using the Spark Cassandra Connector
http://www.aboutyun.com/thread-9582-1-1.html

Spark in Practice 3: Using Spark to import a JSON file into Cassandra
http://www.aboutyun.com/thread-9583-1-1.html

Apache Spark in Practice 4: Installing and using SparkR
http://www.aboutyun.com/thread-10082-1-1.html

Apache Spark in Practice 5: Common spark-submit problems and their solutions
http://www.aboutyun.com/thread-10083-1-1.html

