Tuning Resource Allocation
Questions like "I have a 500-node cluster, but my application only runs two tasks at a time" come up frequently on the Spark user mailing list. Given the number of parameters that control Spark's resource usage, such questions are not unreasonable. In this post you will learn how to squeeze every last bit of resource out of your cluster. The recommended configuration depends on the cluster manager (YARN, Mesos, or Spark Standalone); we will focus on YARN, the deployment mode recommended by Cloudera.
The two main resources that Spark (and YARN) are concerned with are CPU and memory. Disk and network I/O certainly affect Spark performance as well, but neither Spark nor YARN can actively manage them at the moment.
In a Spark application, each executor has a fixed number of cores and a fixed heap size. The number of cores can be specified with the --executor-cores flag when invoking spark-submit, spark-shell, or pyspark, or by setting the spark.executor.cores property in the spark-defaults.conf file or on a SparkConf object. Similarly, the heap size can be controlled with the --executor-memory flag or the spark.executor.memory property. The cores property controls the number of concurrent tasks an executor can run; --executor-cores 5 means that at most five tasks can run in each executor at the same time. The memory property determines how much data Spark can cache, as well as the maximum size of the shuffle data structures used for grouping, aggregation, and joins.
The --num-executors command-line flag or the spark.executor.instances configuration property controls the number of executors requested. Starting with CDH 5.4 / Spark 1.3, you can avoid setting this property altogether by turning on dynamic allocation with the spark.dynamicAllocation.enabled property. Dynamic allocation lets a Spark application request executors when it has a backlog of pending tasks and release them when they sit idle.
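As a concrete sketch, the same settings can also be made programmatically on a SparkConf; the application name and the specific values below are placeholders for illustration, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Roughly equivalent to:
//   spark-submit --executor-cores 5 --executor-memory 8g --num-executors 10 ...
val conf = new SparkConf()
  .setAppName("resource-allocation-sketch")   // hypothetical application name
  .set("spark.executor.cores", "5")           // concurrent tasks per executor
  .set("spark.executor.memory", "8g")         // executor heap size (placeholder value)
  .set("spark.executor.instances", "10")      // fixed executor count (placeholder value)

// Alternatively, drop spark.executor.instances and enable dynamic allocation:
//   conf.set("spark.dynamicAllocation.enabled", "true")
//   conf.set("spark.shuffle.service.enabled", "true")  // dynamic allocation on YARN needs the external shuffle service

val sc = new SparkContext(conf)
```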
How the resources requested by Spark fit into the resources YARN has available is also an important consideration. The relevant YARN properties are:
- yarn.nodemanager.resource.memory-mb controls the maximum total memory that containers can use on each node;
- yarn.nodemanager.resource.cpu-vcores controls the maximum total number of cores that containers can use on each node.
Asking for five executor cores results in a request to YARN for five virtual cores. Requesting memory from YARN is a little more complex, for a couple of reasons:
- --executor-memory/spark.executor.memory controls the executor heap size, but the JVM also uses some memory off the heap, for example for interned Strings and direct byte buffers. The spark.yarn.executor.memoryOverhead property is added to the executor memory to determine the full memory request to YARN for each executor; it defaults to max(384, 0.07 * spark.executor.memory).
- YARN may round the requested memory up slightly: the yarn.scheduler.minimum-allocation-mb and yarn.scheduler.increment-allocation-mb properties control the minimum value and the increment of the request, respectively.
[Figure: the Spark on YARN memory hierarchy]
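Putting the two rules above together, the memory actually granted for each executor can be estimated with a small sketch. The overhead formula follows the default mentioned above; the minimum and increment values are assumptions used only for illustration:

```scala
// Estimate the memory YARN actually grants for one executor.
// Inputs are illustrative; read the real values from your configuration.
val executorMemoryMb      = 19 * 1024        // --executor-memory 19g
val memoryOverheadMb      = math.max(384, (0.07 * executorMemoryMb).toInt)  // spark.yarn.executor.memoryOverhead default
val minimumAllocationMb   = 1024             // yarn.scheduler.minimum-allocation-mb (assumed)
val incrementAllocationMb = 512              // yarn.scheduler.increment-allocation-mb (assumed)

val requestedMb = executorMemoryMb + memoryOverheadMb
// YARN rounds the request up to the minimum, then to a multiple of the increment.
val grantedMb = math.max(minimumAllocationMb,
  math.ceil(requestedMb.toDouble / incrementAllocationMb).toInt * incrementAllocationMb)

println(s"requested: $requestedMb MB, granted by YARN: $grantedMb MB")
```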
If all that were not enough to think about, here are a few final concerns when sizing Spark executors:
- The application master, a non-executor container with the special capability of requesting containers from YARN, takes up resources of its own that must be budgeted for. In yarn-client mode it defaults to 1024MB and one vcore. In yarn-cluster mode the application master also runs the driver, so it is often worth bolstering its resources with the --driver-memory and --driver-cores flags.
- Running executors with too much memory often results in excessive garbage-collection pauses; 64GB is a reasonable upper limit for the size of a single executor's heap.
- The HDFS client has trouble with large numbers of concurrent threads. A rough estimate is that at most five parallel tasks per executor can saturate HDFS write throughput.
- Running tiny executors (with a single core and only enough memory to run one task, for example) throws away the benefits of running multiple tasks in a single JVM. For example, broadcast variables have to be replicated once per executor, so many small executors means many more copies of the data.
To make all of this a little more concrete, here is a worked example of a configuration that uses as much of the cluster as possible. Suppose a cluster has six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should then be set to 63 * 1024 = 64512 (MB) and 15, respectively. We avoid handing 100% of the resources over to YARN containers because each node also needs some resources for the OS and the Hadoop daemons; in this case we leave one core and 1GB of memory for those processes. Cloudera Manager can calculate and configure these properties automatically.
The likely first impulse would be the configuration --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this configuration falls short, because:
- 63GB plus the executor memory overhead does not fit within the 63GB capacity of the NodeManagers;
- The application master takes up a core on one of the nodes, so there is no room for a 15-core executor on that node;
- 15 cores per executor can hurt HDFS I/O throughput.
A better configuration would be --num-executors 17 --executor-cores 5 --executor-memory 19G, because:
- This configuration results in three executors on every node, except the node running the application master, which runs two executors;
- --executor-memory was derived as (63GB per node / 3 executors per node) = 21GB, and 21 * (1 - 0.07) ≈ 19GB, which leaves room for the memory overhead.
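The arithmetic behind these numbers can be written out as a short sketch; the 0.07 overhead fraction is the default discussed earlier, and the node sizes come from this example:

```scala
// Worked numbers for the 6-node example: 63GB and 15 cores usable per NodeManager.
val memoryPerNodeGb  = 63
val executorsPerNode = 3
val overheadFraction = 0.07   // spark.yarn.executor.memoryOverhead default fraction

val memoryPerExecutorGb = memoryPerNodeGb / executorsPerNode                    // 21
val heapPerExecutorGb   = (memoryPerExecutorGb * (1 - overheadFraction)).toInt  // ~19

val numNodes     = 6
val numExecutors = numNodes * executorsPerNode - 1   // leave one slot for the application master => 17

println(s"--num-executors $numExecutors --executor-cores 5 --executor-memory ${heapPerExecutorGb}G")
```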
Tuning Parallelism
Spark, as you likely know, is a data-parallel processing engine. But Spark cannot magically parallelize every computation, nor can it find the optimal degree of parallelism on its own. Every Spark stage consists of a number of tasks, each of which processes data sequentially. When tuning a Spark job, the number of tasks is probably the single most important parameter determining performance.
So how is this number determined? A previous post described how Spark groups RDDs into stages. The number of tasks in a stage is the same as the number of partitions in the last RDD of the stage. The number of partitions in an RDD is the same as the number of partitions in the RDD it depends on, with a few exceptions: the coalesce transformation can create an RDD with fewer partitions than its parent, the union transformation creates an RDD with the sum of its parents' partition counts, and cartesian creates an RDD whose partition count is the product of its parents'.
What about RDDs with no parent? RDDs produced by textFile or hadoopFile have their partition count determined by the underlying MapReduce InputFormat; typically there is one partition for each HDFS block read. RDDs produced by parallelize have their partition count specified by the user, or by spark.default.parallelism if none is given.
To find the number of partitions in an RDD, call rdd.partitions().size().
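A small sketch of these rules, assuming a SparkContext named sc and a hypothetical input path:

```scala
// Partition counts propagate from parent RDDs, with a few exceptions.
val lines = sc.textFile("hdfs:///data/input.txt")   // hypothetical path; roughly one partition per HDFS block
println(s"input partitions: ${lines.partitions.size}")

val fewer = lines.coalesce(2)                       // coalesce produces fewer partitions
val both  = lines.union(fewer)                      // union: sum of the parents' partition counts
println(s"coalesced: ${fewer.partitions.size}, union: ${both.partitions.size}")

val manual = sc.parallelize(1 to 1000, 8)           // partition count given explicitly by the user
println(s"parallelize: ${manual.partitions.size}")  // 8
```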
The most common concern here is that the number of tasks will be too small. If there are fewer tasks than the slots available to run them, the stage cannot take advantage of all the available CPU.
Too few tasks also means more memory pressure on any aggregation operation each task performs. Any join, cogroup, or *ByKey operation builds a hash map or an in-memory buffer for grouping or sorting. join, cogroup, and groupByKey use these data structures on the fetch side of the shuffle; reduceByKey and aggregateByKey use them on both sides of the shuffle.
Problems appear when the records headed for these aggregations do not fit comfortably in memory. First, holding a large number of records in these data structures increases pressure on the garbage collector and can lead to pauses. Second, when the records do not fit in memory, Spark spills them to disk, which incurs disk I/O and sorting. Among Cloudera customers, this is probably the number one cause of slow Spark jobs.
So how do you increase the number of partitions? If the stage in question is reading from Hadoop, you have the following options:
- Use the repartition transformation, which triggers a shuffle (see the sketch after this list);
- Configure your InputFormat to create more (smaller) splits;
- Write the input data to HDFS with a smaller block size.
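As a rough sketch of the first option, plus the related trick of asking textFile for a minimum number of input splits (the path and counts are placeholders):

```scala
// Ask textFile for more input splits up front (a minimum-partitions hint to the InputFormat).
val input = sc.textFile("hdfs:///data/big-input", 500)   // hypothetical path

// Or repartition after reading; note that this triggers a shuffle.
val repartitioned = input.repartition(500)
```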
If the stage in question is getting its input from another stage, the transformation that triggers the stage boundary accepts a numPartitions argument that sets the partition count of the resulting RDD.
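For example, a minimal sketch in which rdd1 stands in for the pair RDD coming out of the previous stage and X is the value to tune:

```scala
// rdd1 is a stand-in for the pair RDD produced by the previous stage.
val rdd1 = sc.textFile("hdfs:///data/words").map(word => (word, 1))  // hypothetical input
val X    = 200                                    // placeholder; choosing X is discussed below
val rdd2 = rdd1.reduceByKey(_ + _, X)             // the second argument sets numPartitions
```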
So what value should X take? The most direct way to tune it is by experiment: take the partition count of the parent RDD and keep multiplying it by 1.5 until performance stops improving.
There is also a more principled way of computing X, but it is difficult to apply because some of the quantities involved are hard to measure. It is described here not because it is recommended for routine use, but because it helps in understanding what is going on. The main goal is to run enough tasks that the data destined for each task fits in the memory available to that task.
The memory available to each task is (spark.executor.memory * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction) / spark.executor.cores. The default values of memoryFraction and safetyFraction are 0.2 and 0.8, respectively.
The in-memory size of all the shuffle data is harder to determine. The closest heuristic is to find the ratio between the Shuffle Spill (Memory) and Shuffle Spill (Disk) metrics for a stage that has already run, and multiply the total shuffle write by that ratio. This gets somewhat more complicated when the stage is performing a reduction.
Then round up a bit, because more partitions is usually better than fewer.
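Putting the pieces together, the heuristic can be sketched roughly as follows; the metric values are made-up numbers that you would instead read from the Spark UI for a previous run of the stage:

```scala
// Per-task memory from the formula above (defaults: memoryFraction = 0.2, safetyFraction = 0.8).
val executorMemoryBytes = 19L * 1024 * 1024 * 1024   // spark.executor.memory = 19g (example)
val executorCores       = 5                          // spark.executor.cores
val memoryPerTask       = executorMemoryBytes * 0.2 * 0.8 / executorCores

// Metrics observed for a previous run of the stage (illustrative values).
val shuffleSpillMemory = 40L * 1024 * 1024 * 1024    // "Shuffle Spill (Memory)"
val shuffleSpillDisk   = 10L * 1024 * 1024 * 1024    // "Shuffle Spill (Disk)"
val totalShuffleWrite  = 25L * 1024 * 1024 * 1024    // total shuffle write for the stage

// Estimate the in-memory size of the shuffled data, then the number of tasks needed
// so that each task's share fits in its memory budget. Round up generously.
val inMemoryShuffleSize = totalShuffleWrite * (shuffleSpillMemory.toDouble / shuffleSpillDisk)
val X = math.ceil(inMemoryShuffleSize / memoryPerTask).toInt
println(s"suggested numPartitions: $X")
```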
When in doubt, err on the side of more tasks (and thus more partitions). This advice is in contrast to the recommendations for MapReduce, which call for a conservative number of tasks, because the cost of starting a task is much higher in MapReduce than in Spark.
Slimming Down Your Data Structures
Data flows through Spark as records. A record has two representations: a deserialized Java object representation and a serialized binary representation. In general, Spark uses the deserialized representation for records in memory and the serialized representation for records stored on disk or transferred over the network. There is also planned work to store serialized records in memory.
The spark.serializer property controls how records are converted between these two representations. The Kryo serializer, org.apache.spark.serializer.KryoSerializer, is the recommended choice. Unfortunately it is not the default, because of instabilities in Kryo in earlier versions of Spark and a desire not to break compatibility, but the Kryo serializer should be the first choice in all cases.
How often your records flip between these two representations, and how much space they occupy in each, has a large impact on an application's performance. It is worth auditing the types of data being passed around and looking for places where some fat can be trimmed.
Bloated deserialized records cause data to spill to disk more often and reduce the number of records that can be cached in memory. The official Spark tuning guide has a section on slimming these down.
Bloated serialized records result in more disk and network I/O, as well as fewer serialized records that can be cached in memory. The main action item here is to make sure that every custom class you define and pass around is registered via the SparkConf#registerKryoClasses API.
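A minimal sketch of switching to Kryo and registering custom classes; the record types and the application name are invented for illustration:

```scala
import org.apache.spark.SparkConf

// Hypothetical user-defined record types that get shuffled or cached.
case class ClickEvent(userId: Long, url: String)
case class UserProfile(userId: Long, country: String)

val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[ClickEvent], classOf[UserProfile]))
```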
Data Format
Whenever you have a choice about how your data is stored on disk, use an extensible binary format such as Avro, Parquet, Thrift, or Protobuf. To be clear, when people talk about using Avro, Thrift, or Protobuf on Hadoop, they mean that each record is an Avro/Thrift/Protobuf struct stored in a sequence file, rather than JSON.
Every time you are tempted to store lots of data as JSON, just give up on the idea...
"Reprinted from: http://blog.csdn.net/u012102306/article/details/51700664"
"Reprint" Apache Spark Jobs Performance Tuning (ii)