Spark Memory Parameter Tuning


Original address: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

--

In the conclusion to this series, learn how resource tuning, parallelism, and data representation affect Spark job performance.

In this post, we'll finish what we started in "How to Tune Your Apache Spark Jobs (Part 1)". I'll try to cover pretty much everything you could care to know about making a Spark program run fast. In particular, you'll learn about resource tuning, or configuring Spark to take advantage of everything the cluster has to offer. Then we'll move to tuning parallelism, the most difficult as well as most important parameter in job performance. Finally, you'll learn about representing the data itself, in the on-disk form which Spark will read (spoiler alert: use Apache Avro or Apache Parquet) as well as the in-memory format it takes as it's cached or moves through the system.

Tuning Resource Allocation

The Spark user list is a litany of questions to the effect of "I have a 500-node cluster, but when I run my application, I see only two tasks executing at a time. Halp." Given the number of parameters that control Spark's resource utilization, these questions aren't unfair, but in this section you'll learn how to squeeze every last bit of juice out of your cluster. The recommendations and configurations here differ a little bit between Spark's cluster managers (YARN, Mesos, and Spark Standalone), but we're going to focus only on YARN, which Cloudera recommends to all users.

For some background on what it looks like to run Spark on YARN, check out my post on this topic.

The two main resources that Spark (and YARN) think about are CPU and memory. Disk and network I/O, of course, play a part in Spark performance as well, but neither Spark nor YARN currently does anything to actively manage them.

Every Spark executor in an application has the same fixed number of cores and same fixed heap size. The number of cores can be specified with the --executor-cores flag when invoking spark-submit, spark-shell, and pyspark from the command line, or by setting the spark.executor.cores property in the spark-defaults.conf file or on a SparkConf object. Similarly, the heap size can be controlled with the --executor-memory flag or the spark.executor.memory property. The cores property controls the number of concurrent tasks an executor can run: --executor-cores 5 means that each executor can run a maximum of five tasks at the same time. The memory property impacts the amount of data Spark can cache, as well as the maximum sizes of the shuffle data structures used for grouping, aggregations, and joins.
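To make the two mechanisms concrete, here's a minimal sketch using Spark's Scala API; the app name and values are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Command-line equivalent:
//   spark-submit --executor-cores 5 --executor-memory 19G ...
val conf = new SparkConf()
  .setAppName("tuning-example")         // hypothetical application name
  .set("spark.executor.cores", "5")     // up to five concurrent tasks per executor
  .set("spark.executor.memory", "19g")  // fixed heap size for every executor
val sc = new SparkContext(conf)
```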

The --num-executors command-line flag or spark.executor.instances configuration property controls the number of executors requested. Starting in CDH 5.4/Spark 1.3, you will be able to avoid setting this property by turning on dynamic allocation with the spark.dynamicAllocation.enabled property. Dynamic allocation enables a Spark application to request executors when there is a backlog of pending tasks and free up executors when idle.
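As a minimal sketch of turning dynamic allocation on (the executor bounds here are illustrative assumptions; on YARN the external shuffle service must also be enabled):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")   // illustrative lower bound
  .set("spark.dynamicAllocation.maxExecutors", "20")  // illustrative upper bound
  .set("spark.shuffle.service.enabled", "true")       // lets executors be released without losing shuffle data
```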

It's also important to think about whether the resources requested by Spark will fit into what YARN has available. The relevant YARN properties are:

- yarn.nodemanager.resource.memory-mb controls the maximum sum of memory used by the containers on each node.
- yarn.nodemanager.resource.cpu-vcores controls the maximum sum of cores used by the containers on each node.

Asking for five executor cores will result in a request to YARN for five virtual cores. The memory requested from YARN is a little more complex, for a couple of reasons:

- --executor-memory/spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers.
- The value of the spark.yarn.executor.memoryOverhead property is added to the executor memory to determine the full memory request to YARN for each executor. It defaults to max(384, 0.07 * spark.executor.memory).
- YARN may round the requested memory up a little. YARN's yarn.scheduler.minimum-allocation-mb and yarn.scheduler.increment-allocation-mb properties control the minimum and increment request values respectively.
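To see how the pieces combine, here's a sketch of the default request arithmetic; the function name is mine, not a Spark API:

```scala
// Full memory request YARN sees for one executor, before YARN's own rounding:
// the executor heap plus max(384, 0.07 * heap) of overhead, all in megabytes.
def fullYarnRequestMb(executorMemoryMb: Long): Long =
  executorMemoryMb + math.max(384L, (0.07 * executorMemoryMb).toLong)

// e.g. a 21GB heap: 21504 + 1505 = 23009 MB requested (YARN may round up).
println(fullYarnRequestMb(21 * 1024))
```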

The following diagram (not to scale with defaults) shows the hierarchy of memory properties in Spark and YARN:

[Diagram: hierarchy of Spark and YARN memory properties]

And if that weren't enough to think about, a few final concerns when sizing Spark executors:

- The application master, which is a non-executor container with the special capability of requesting containers from YARN, takes up resources of its own that must be budgeted in. In yarn-client mode, it defaults to 1024MB and one vcore. In yarn-cluster mode, the application master runs the driver, so it's often useful to bolster its resources with the --driver-memory and --driver-cores properties.
- Running executors with too much memory often results in excessive garbage collection delays. 64GB is a rough guess at a good upper limit for a single executor.
- I've noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it's good to keep the number of cores per executor below that number.
- Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM. For example, broadcast variables need to be replicated once on each executor, so many small executors will result in many more copies of the data.

To hopefully make all of this a little more concrete, here's a worked example of configuring a Spark app to use as much of the cluster as possible: imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.

The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:

- 63GB plus the executor memory overhead won't fit within the 63GB capacity of the NodeManagers.
- The application master will take up a core on one of the nodes, meaning that there won't be room for a 15-core executor on that node.
- 15 cores per executor can lead to bad HDFS I/O throughput.

A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?

- This config results in three executors on all nodes except for the one with the AM, which will have two executors.
- --executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 - 1.47 ~ 19 (see the sketch below).
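That derivation is easy to check; this is just a sketch of the arithmetic, not a Spark API:

```scala
val nodeCapacityGb   = 63                          // usable memory per NodeManager
val executorsPerNode = 3
val rawGb      = nodeCapacityGb / executorsPerNode // 21 GB per executor slot
val overheadGb = math.max(0.384, 0.07 * rawGb)     // default overhead estimate, in GB
val heapGb     = (rawGb - overheadGb).toInt        // ~19 GB for --executor-memory
println(heapGb)                                    // 19
```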

Tuning Parallelism

Spark, as you have likely figured out by this point, is a parallel processing engine. What is maybe less obvious is that Spark is not a "magic" parallel processing engine, and is limited in its ability to figure out the optimal amount of parallelism. Every Spark stage has a number of tasks, each of which processes data sequentially. In tuning Spark jobs, this number is probably the single most important parameter in determining performance.

How is this number determined? The way Spark groups RDDs into stages is described in the previous post. (As a quick reminder, transformations like repartition and reduceByKey induce stage boundaries.) The number of tasks in a stage is the same as the number of partitions in the last RDD in the stage. The number of partitions in an RDD is the same as the number of partitions in the RDD on which it depends, with a couple of exceptions: the coalesce transformation allows creating an RDD with fewer partitions than its parent RDD, the union transformation creates an RDD with the sum of its parents' number of partitions, and cartesian creates an RDD with their product.
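These rules are easy to verify directly; a minimal sketch, assuming an existing SparkContext named sc and illustrative partition counts:

```scala
val a = sc.parallelize(1 to 100, 8)      // 8 partitions
val b = sc.parallelize(1 to 100, 4)      // 4 partitions

println(a.coalesce(2).partitions.size)   // 2: fewer partitions than the parent
println(a.union(b).partitions.size)      // 12: the sum of the parents' counts
println(a.cartesian(b).partitions.size)  // 32: the product of the parents' counts
```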

What about RDDs with no parents? RDDs produced by textFile or hadoopFile have their partitions determined by the underlying MapReduce InputFormat that's used. Typically there is a partition for each HDFS block being read. Partitions for RDDs produced by parallelize come from the parameter given by the user, or spark.default.parallelism if none is given.

To determine the number of partitions in an RDD, you can always call rdd.partitions().size().
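For example, assuming an existing SparkContext named sc (the HDFS path below is hypothetical):

```scala
val lines = sc.textFile("hdfs:///user/example/input") // hypothetical input path
println(lines.partitions.size)  // typically one partition per HDFS block read

val nums = sc.parallelize(1 to 1000, 10)
println(nums.partitions.size)   // 10, the value passed by the user
```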

The primary concern is that the number of tasks will be too small. If there are fewer tasks than slots available to run them in, the stage won't be taking advantage of all the CPU available.

A small number of tasks also means that more memory pressure is placed on any aggregation operations that occur in each task. Any join, cogroup, or *ByKey operation involves holding objects in hashmaps or in-memory buffers to group or sort. join, cogroup, and groupByKey use these data structures in the tasks for the stages that are on the fetching side of the shuffles they trigger. reduceByKey and aggregateByKey use data structures in the tasks for the stages on both sides of the shuffles they trigger.

When the records destined for these aggregation operations do not easily fit in memory, some mayhem can ensue. First, holding many records in these data structures puts pressure on garbage collection, which can lead to pauses down the line. Second, when the records do not fit in memory, Spark will spill them to disk, which causes disk I/O and sorting. This overhead during large shuffles is probably the number one cause of job stalls I have seen at Cloudera customers.

So how do you increase the number of partitions? If the stage in question is reading from Hadoop, your options are:

- Use the repartition transformation, which will trigger a shuffle (see the sketch after this list).
- Configure your InputFormat to create more splits.
- Write the input data out to HDFS with a smaller block size.
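Here's a minimal sketch of the first option, assuming an existing SparkContext named sc; the path and the target count of 200 are illustrative:

```scala
val input = sc.textFile("hdfs:///user/example/input") // hypothetical input path
val wider = input.repartition(200) // shuffles the data into 200 partitions
println(wider.partitions.size)     // 200
```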

If the stage is getting its input from another stage, the transformation that triggered the stage boundary will accept a numPartitions argument, such as the one shown below.
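A minimal sketch, where rdd1 and the count of 500 are illustrative placeholders:

```scala
val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))) // placeholder pair RDD
val rdd2 = rdd1.reduceByKey(_ + _, numPartitions = 500)      // explicit parallelism for the new stage
```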
