Spark Performance Tuning Guide-Basics

Source: Internet
Author: User
Tags: bulk insert, class operator, map class

Objective

In the field of big data computing, Spark has become one of the most popular computing platforms. Spark covers offline batch processing, SQL processing, streaming/real-time computing, machine learning, graph computing, and many other types of computation, with a wide range of applications and broad prospects. Within the author's company, many engineers have already tried Spark in a variety of projects. Most of them, including the author, initially started experimenting with Spark for a simple reason: to make big data computing jobs run faster and with higher performance.

However, developing high-performance big data computing jobs with Spark is not that simple. If a Spark job is not tuned properly, it may run slowly and fail to live up to Spark's reputation as a fast big data computing engine. Therefore, to use Spark well, you must tune it for performance.

Spark performance tuning consists of many parts; adjusting a few parameters is not enough to improve job performance immediately. We need to analyze the Spark job comprehensively, based on the specific business scenario and data characteristics, and then adjust and optimize it in many dimensions to achieve the best performance.

Based on previous experience developing Spark jobs and accumulated practice, the author has summarized a set of performance optimization schemes for Spark jobs. The whole scheme is divided into several parts: development tuning, resource tuning, data skew tuning, and shuffle tuning. Development tuning and resource tuning cover basic principles that every Spark job needs to be aware of and follow; they are the foundation of a high-performance Spark job. Data skew tuning is a complete set of solutions for dealing with data skew in Spark jobs. Shuffle tuning is aimed at readers who have a deeper understanding of Spark internals, and mainly explains how to tune the shuffle process and its details.

This article is the basic part of the Spark performance optimization guide and focuses on development tuning and resource tuning.

Overview of development tuning

The first step of Spark performance optimization is to pay attention to and apply some basic performance optimization principles while developing the Spark job. Development tuning means understanding the following basic Spark development principles, including RDD lineage design, rational use of operators, and optimization of special operations. During development, these principles should be kept in mind at all times and applied flexibly to your own Spark jobs according to the specific business and application scenario.

Principle one: Avoid creating duplicate RDDs

In general, when we develop a Spark job, we first create an initial RDD based on a data source (such as a Hive table or an HDFS file), perform an operator operation on that RDD to get the next RDD, and so on, until we have computed the final result we need. In this process, multiple RDDs are linked by different operators (such as map, reduce, etc.); this "RDD string" is the RDD lineage, that is, the "RDD kinship chain."

We should note during development: for one and the same piece of data, create only one RDD; do not create multiple RDDs to represent the same data.

When Spark beginners start developing Spark jobs, or when experienced engineers develop Spark jobs with an extremely long RDD lineage, they may forget that they have already created an RDD for a given piece of data and create another RDD representing the same data. This means the Spark job repeats the work of creating multiple RDDs for the same data, which increases the performance overhead of the job.

A simple example
// We need to perform a map operation and then a reduce operation on the HDFS file "hello.txt".
// In other words, two operator operations need to be executed on one piece of data.

// Wrong approach: create multiple RDDs when executing multiple operator operations on the same data.
// Here the textFile method is executed twice, creating two RDDs for the same HDFS file,
// and then an operator operation is executed on each RDD separately.
// In this case, Spark has to load the contents of hello.txt from HDFS twice and create two separate RDDs;
// the performance cost of loading the HDFS file and creating an RDD the second time is clearly wasted.
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd1.map(...)
val rdd2 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd2.reduce(...)

// Correct approach: when executing multiple operator operations on one piece of data, use only one RDD.
// This is clearly better than the previous version, because we create only one RDD for the same data
// and then execute multiple operator operations on that single RDD.
// Note, however, that the optimization is not finished yet: since two operator operations are executed on rdd1,
// the second reduce operation still recomputes rdd1's data from the source, so there is still the cost of repeated computation.
// To solve this completely, this must be combined with "Principle three: Persist RDDs that are used multiple times",
// so that an RDD used multiple times is computed only once.
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd1.map(...)
rdd1.reduce(...)

Principle two: Reuse the same RDD as much as possible

In addition to avoiding the creation of multiple RDDs for exactly the same piece of data during development, reuse one RDD as much as possible when performing operator operations on data that overlaps. For example, suppose one RDD's data is in key-value format and another's is in single-value format, and the value data of the two RDDs is exactly the same. In that case we can just use the key-value RDD, because it already contains the other RDD's data. For multiple RDDs whose data overlaps like this, we should reuse a single RDD as much as possible, so that the number of RDDs, and therefore the number of operator executions, is minimized.

A simple example
// Wrong approach.

// There is an RDD in <Long, String> format, namely rdd1.
// Then, due to business needs, a map operation is executed on rdd1 to create rdd2,
// but the data in rdd2 is just the value part of rdd1; in other words, rdd2 is a subset of rdd1.
JavaPairRDD<Long, String> rdd1 = ...
JavaRDD<String> rdd2 = rdd1.map(...)

// Different operator operations are executed on rdd1 and rdd2 separately.
rdd1.reduceByKey(...)
rdd2.map(...)

// Correct approach.

// In the case above, the only real difference between rdd1 and rdd2 is the data format;
// the data in rdd2 is simply a subset of rdd1, yet two RDDs were created and an operator operation was executed on each.
// Executing the map operator on rdd1 just to create rdd2 adds one extra operator operation and extra performance overhead.
// In this situation the same RDD can be reused entirely:
// we can use rdd1 for both the reduceByKey operation and the map operation.
// In the second map operation, just use tuple._2 of each record, which is the value part of rdd1.
JavaPairRDD<Long, String> rdd1 = ...
rdd1.reduceByKey(...)
rdd1.map(tuple._2...)

// Compared with the first approach, the second clearly saves the cost of computing rdd2.
// But the optimization is not finished yet: we still execute two operator operations on rdd1, so rdd1 is still computed twice.
// Therefore this must also be combined with "Principle three: Persist RDDs that are used multiple times",
// so that an RDD used multiple times is computed only once.

Principle three: Persist RDDs that are used multiple times

If you have already performed multiple operator operations on a single RDD in your Spark code, congratulations: you have implemented the first step of Spark job optimization, reusing an RDD as much as possible. On this basis comes the second step of optimization: ensuring that when multiple operator operations are performed on an RDD, the RDD itself is computed only once.

Spark's default behavior when executing multiple operators on one RDD is this: every time you perform an operator operation on the RDD, it is recomputed from the source, and then your operator is executed on it. The performance of this approach is very poor.

For this reason, our recommendation is to persist any RDD that is used multiple times. Spark will then save the RDD's data to memory or disk according to your persistence strategy. Every subsequent operator operation on that RDD will fetch the persisted data directly from memory or disk and execute the operator, instead of recomputing the RDD from the source before executing the operator.

Code example: persisting an RDD that is used multiple times
// To persist an RDD, simply call cache() or persist() on it.

// Correct approach.
// The cache() method means: try to persist all data in the RDD to memory in non-serialized form.
// When two operator operations are then executed on rdd1, rdd1 is computed from the source only once, during the first map operation.
// When the reduce operation is executed afterwards, the data is fetched directly from memory; the RDD is not recomputed.
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt").cache()
rdd1.map(...)
rdd1.reduce(...)

// The persist() method means: manually choose a persistence level and persist the data in the specified way.
// For example, StorageLevel.MEMORY_AND_DISK_SER means: persist to memory first when memory is sufficient,
// and persist to disk files when memory is insufficient.
// The _SER suffix means the RDD data is stored in serialized form: each partition of the RDD is serialized
// into one large byte array and then persisted to memory or disk.
// Serialization reduces the memory/disk footprint of the persisted data, which avoids having persisted data
// occupy too much memory and cause frequent GC.
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt").persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd1.map(...)
rdd1.reduce(...)

For the persist() method, different persistence levels can be chosen for different business scenarios.

Spark's persistence levels

MEMORY_ONLY: Stores data in memory as unserialized Java objects. If memory is not enough to hold all the data, some of the data may not be persisted; the next time an operator operation is performed on this RDD, the data that was not persisted has to be recomputed from the source. This is the default persistence level, and it is the one actually used by the cache() method.
MEMORY_AND_DISK: Tries to store data in memory first, as unserialized Java objects. If memory is not enough to hold all the data, the remaining data is written to disk files, and the data persisted on disk is read back the next time an operator is executed on this RDD.
MEMORY_ONLY_SER: Basically the same as MEMORY_ONLY. The only difference is that the data in the RDD is serialized: each partition of the RDD is serialized into a byte array. This saves more memory and prevents persisted data from occupying too much memory and causing frequent GC.
MEMORY_AND_DISK_SER: Basically the same as MEMORY_AND_DISK. The only difference is that the data in the RDD is serialized: each partition of the RDD is serialized into a byte array. This saves more memory and prevents persisted data from occupying too much memory and causing frequent GC.
DISK_ONLY: Writes all data to disk files as unserialized Java objects.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, and so on: For any of the levels above, adding the _2 suffix makes a copy of every piece of persisted data and stores the replica on another node. This replica-based persistence mechanism is mainly used for fault tolerance: if a node goes down and the persisted data in its memory or on its disk is lost, subsequent computations on this RDD can still use the replica on another node; without a replica, the data can only be recomputed from the source.
How to choose the most appropriate persistence strategy
    • By default, MEMORY_ONLY delivers the highest performance, but only if your memory is large enough to hold all the data of the entire RDD. Because there is no serialization or deserialization, that part of the performance overhead is avoided; subsequent operator operations on this RDD work on pure in-memory data, do not need to read data from disk files, and do not need to copy the data and transmit it to other nodes. Note, however, that in a real production environment the scenarios where this level can be used directly are limited: if the RDD contains a lot of data (say, billions of records), using this persistence level directly can cause an OOM (out of memory) exception in the JVM.

    • If a memory overflow occurs when using the MEMORY_ONLY level, it is recommended to try the MEMORY_ONLY_SER level. This level serializes the RDD data and then stores it in memory; each partition is then just one byte array, which greatly reduces the number of objects and the memory footprint. Compared with MEMORY_ONLY, the extra performance overhead of this level is mainly the cost of serialization and deserialization. Subsequent operators still operate on pure in-memory data, so overall performance remains relatively high. The problem that may still occur is that if the amount of data in the RDD is too large, an OOM exception can still happen.

    • If no pure-memory level works, it is recommended to use the MEMORY_AND_DISK_SER policy rather than MEMORY_AND_DISK. Reaching this point means the RDD's data volume is very large and memory cannot hold it all. Serialized data is smaller, which saves both memory and disk space. This policy still tries to cache data in memory first, and only the data that does not fit in the memory cache is written to disk.

    • It is generally not recommended to use DISK_ONLY or the levels with the _2 suffix. Reading and writing data entirely from disk files causes a dramatic drop in performance; sometimes it is faster to simply recompute the whole RDD. For the _2 levels, all data must be replicated and sent to other nodes; data replication and network transmission incur a large performance overhead, so unless the job requires high availability these levels are not recommended. (A short code sketch of this selection order follows below.)
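
As a minimal sketch of that selection order, assuming a hypothetical rdd that is reused several times (this code is illustrative and not from the original article):

import org.apache.spark.storage.StorageLevel

// Preferred order when an RDD is reused several times:
//   1) MEMORY_ONLY         - fastest, but only if the whole RDD fits in memory (this is what cache() uses).
//   2) MEMORY_ONLY_SER     - serialized in memory; smaller footprint at some CPU cost.
//   3) MEMORY_AND_DISK_SER - serialized; whatever does not fit in memory spills to disk.
// DISK_ONLY and the *_2 replicated levels are generally avoided unless high availability is required.
val persisted = rdd.persist(StorageLevel.MEMORY_ONLY_SER)

// If MEMORY_ONLY_SER still causes OOM, fall back to:
// rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)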

Principle four: Try to avoid using shuffle-class operators

If possible, try to avoid using shuffle-class operators. When a Spark job runs, the shuffle process is where the most performance is consumed. The shuffle process, simply put, pulls the same key, which is distributed across multiple nodes in the cluster, onto one node to perform aggregation or join operations. Operators such as reduceByKey and join trigger shuffle operations.

During the shuffle process, each node first writes identical keys to local disk files, and then other nodes pull those keys from the disk files of each node over the network. Moreover, when the same keys are pulled to one node for aggregation, a single node may have to process so many keys that memory is insufficient and data spills to disk files. Therefore, the shuffle process can involve a large number of disk read/write IO operations as well as network data transfer. Disk IO and network data transfer are the main reasons why shuffle performs poorly.

Therefore, in our development process we should avoid reduceByKey, join, distinct, repartition and other shuffle operators wherever possible, and instead use non-shuffle operators of the map class. A Spark job with no shuffle operations, or with only a few, can significantly reduce performance overhead.

Code example: a join implemented with Broadcast plus map
// A traditional join operation leads to a shuffle operation,
// because the same keys from both RDDs must be pulled over the network to one node,
// where a single task performs the join.
val rdd3 = rdd1.join(rdd2)

// A Broadcast + map join does not cause a shuffle operation.
// Use Broadcast to turn the RDD with the smaller data volume into a broadcast variable.
val rdd2Data = rdd2.collect()
val rdd2DataBroadcast = sc.broadcast(rdd2Data)

// Inside the rdd1.map operator, all of rdd2's data can be obtained from rdd2DataBroadcast.
// Then iterate over it; if a record in rdd2 has the same key as the current record of rdd1, the two can be joined.
// The current rdd1 record and the joinable rdd2 record can then be concatenated in whatever form is needed (String or Tuple).
val rdd3 = rdd1.map(rdd2DataBroadcast...)

// Note: the above is recommended only when rdd2's data volume is relatively small (for example, a few hundred MB, or one or two GB),
// because a full copy of rdd2's data will reside in the memory of every Executor.

Principle five: Use shuffle operations with map-side pre-aggregation

If a shuffle operation has to be used because of business needs and cannot be replaced with a map-class operator, then try to use operators that perform map-side pre-aggregation.

So-called map-side pre-aggregation means aggregating the same keys locally on each node, similar to the local combiner in MapReduce. After map-side pre-aggregation, each node holds only one record per key locally, because multiple records with the same key have already been aggregated. When other nodes then pull the same keys from all nodes, the amount of data that needs to be pulled is greatly reduced, which reduces disk IO and network transfer overhead. In general, it is recommended to replace the groupByKey operator with reduceByKey or aggregateByKey where possible, because reduceByKey and aggregateByKey use user-defined functions to pre-aggregate the same keys locally on each node, whereas groupByKey does not pre-aggregate at all: the full amount of data is distributed and transferred among the nodes of the cluster, giving relatively poor performance.

A typical example is word count implemented with groupByKey versus reduceByKey. With groupByKey there is no local aggregation, and all the data is transferred between the cluster nodes; with reduceByKey, the data with the same key on each node is pre-aggregated locally and only then transferred to other nodes for global aggregation, as shown in the sketch below.
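
A minimal word-count sketch of the two approaches, assuming a hypothetical words: RDD[String] (this code is illustrative; the original article illustrates the difference with diagrams instead):

// groupByKey: every (word, 1) pair is shuffled across the network,
// and the counts are only summed after the full groups arrive on the receiving node.
val countsWithGroup = words
  .map(word => (word, 1))
  .groupByKey()
  .map { case (word, ones) => (word, ones.sum) }

// reduceByKey: pairs with the same word are summed locally on each node first
// (map-side pre-aggregation), so far less data crosses the network during the shuffle.
val countsWithReduce = words
  .map(word => (word, 1))
  .reduceByKey(_ + _)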

Principle six: Use high-performance operators

Besides the optimization principles for shuffle-related operators, other operators also have corresponding optimization principles.

Use reduceByKey/aggregateByKey instead of groupByKey

See "Principle Five: Using map-side pre-aggregated shuffle operations" for details.

Use mapPartitions instead of ordinary map

With mapPartitions-class operators, one function call processes all the data of a partition rather than one record per call, so performance is relatively higher. Sometimes, however, mapPartitions causes OOM (out of memory) problems: because a single function call has to handle all the data of a partition, if memory is insufficient and garbage collection cannot reclaim enough objects, an OOM exception is likely. So use this class of operators with caution.
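
A minimal sketch of the substitution; rdd and the expensiveSetup/transform helpers are hypothetical placeholders, not part of the original article:

// Ordinary map: the setup work is paid once for every single record.
val mapped = rdd.map { record =>
  val helper = expensiveSetup()   // e.g. a parser or formatter created per record
  helper.transform(record)
}

// mapPartitions: the function is called once per partition and receives an iterator over
// all of that partition's records, so the setup cost is paid only once per partition.
// Note that the whole partition is handled in one call, which is where the OOM risk comes from.
val mappedByPartition = rdd.mapPartitions { records =>
  val helper = expensiveSetup()
  records.map(record => helper.transform(record))
}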

Use foreachPartition instead of foreach

The principle is similar to "use mapPartitions instead of map": one function call processes all the data of a partition rather than one record per call. In practice, foreachPartition-class operators have proven very helpful for improving performance. For example, suppose all the data in an RDD needs to be written to MySQL inside a foreach function. With an ordinary foreach operator the function is called once per record, and each call may create a database connection, so connections are frequently created and destroyed and performance is very low. With the foreachPartition operator, one partition's data is processed at a time, so only one database connection needs to be created per partition, followed by a batch insert, which gives much higher performance. In practice, for roughly 10,000 records written to MySQL, performance can improve by 30% or more.
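
A minimal sketch of the batch-write pattern, assuming a hypothetical wordCounts: RDD[(String, Long)]; the JDBC URL, credentials, and table/column names are placeholders, not from the original article:

import java.sql.DriverManager

// One connection and one prepared statement per partition; records are added to a batch
// and flushed once, instead of opening a connection for every single record.
wordCounts.foreachPartition { records =>
  val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/test", "user", "password")
  val stmt = conn.prepareStatement("INSERT INTO word_count (word, cnt) VALUES (?, ?)")
  try {
    records.foreach { case (word, count) =>
      stmt.setString(1, word)
      stmt.setLong(2, count)
      stmt.addBatch()
    }
    stmt.executeBatch()  // single bulk insert for the whole partition
  } finally {
    stmt.close()
    conn.close()
  }
}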

Perform a coalesce operation after filter

Usually, after performing a filter operation on an RDD that filters out a large portion of the data (for example, more than 30%), it is recommended to use the coalesce operator to manually reduce the number of partitions in the RDD, compressing the data into fewer partitions. After the filter, every partition of the RDD has had a lot of data filtered out; if the subsequent computation continues as-is, each task processes only a small amount of data in its partition, which wastes resources, and the more tasks there are at that point, the slower things may become. By reducing the number of partitions with coalesce and compressing the RDD's data into fewer partitions, all partitions can be processed with fewer tasks. In some scenarios, this helps improve performance.
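
A short sketch of the idea; the keep predicate and the target partition count of 300 are illustrative assumptions only:

// Suppose the filter discards a large share of every partition's data,
// leaving many partitions that each hold only a little data.
val filtered = rdd.filter(record => keep(record))

// Compact the surviving data into fewer partitions so that fewer tasks each process
// a reasonable amount of data. coalesce avoids a full shuffle by default.
val compacted = filtered.coalesce(300)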

Use repartitionAndSortWithinPartitions to replace repartition plus a sort-class operation

repartitionAndSortWithinPartitions is an operator recommended on the Spark website. The official recommendation is that if you need to sort within partitions after repartitioning, you should use the repartitionAndSortWithinPartitions operator directly, because it can repartition through the shuffle and sort at the same time. Performing the shuffle and the sort together will likely perform better than shuffling first and then sorting.
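
A minimal sketch of the two variants, assuming a hypothetical pairRdd: RDD[(String, Int)] and an illustrative partition count of 100:

import org.apache.spark.HashPartitioner

// Less efficient: repartition first, then sort each partition in a separate pass.
val repartitioned = pairRdd.repartition(100)
val sortedAfterwards = repartitioned.mapPartitions(iter => iter.toArray.sortBy(_._1).iterator)

// Recommended: push the sort into the shuffle, so repartitioning and in-partition sorting happen in one pass.
val repartitionedAndSorted = pairRdd.repartitionAndSortWithinPartitions(new HashPartitioner(100))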

Principle seven: Broadcast large variables

Sometimes during development you will encounter scenarios where an external variable needs to be used inside an operator function (especially a large variable, such as a large collection of 100 MB or more). In that case, you should use Spark's broadcast feature to improve performance.

When an external variable is used in an operator function, by default Spark makes multiple copies of the variable and sends them over the network to the tasks, so each task holds its own copy. If the variable itself is large (say 100 MB, or even 1 GB), the overhead of transmitting a large number of variable copies over the network, together with the frequent GC caused by the copies occupying too much memory in each node's Executor, can greatly degrade performance.

For this reason, if the external variable you use is large, it is recommended to broadcast it using Spark's broadcast feature. A broadcast variable is guaranteed to reside only once in each Executor's memory, and the tasks in that Executor share that single copy. This greatly reduces the number of variable copies, reducing the performance overhead of network transmission as well as Executor memory usage and GC frequency.

Code example for broadcasting a large variable
// The following code uses an external variable inside the operator function.
// Nothing special is done here, so each task gets its own copy of list1.
val list1 = ...
rdd1.map(list1...)

// The following code wraps list1 in a Broadcast-type broadcast variable.
// When the broadcast variable is used inside the operator function, the task first checks whether
// a copy of the variable already exists in the memory of the Executor it is running in.
// If so, it is used directly; if not, a copy is pulled remotely from the Driver or another Executor
// node and placed in the local Executor's memory.
// Only one copy of the broadcast variable resides in each Executor's memory.
val list1 = ...
val list1Broadcast = sc.broadcast(list1)
rdd1.map(list1Broadcast...)

Principle eight: Use Kryo to optimize serialization performance

In Spark, there are three main places involved in serialization:

    • When an external variable is used in an operator function, the variable is serialized and then transmitted over the network (see "Principle seven: Broadcast large variables").
    • When a custom type is used as the generic type of an RDD (for example, JavaRDD<Student>, where Student is a custom type), all objects of the custom type are serialized. Therefore, in this case, the custom class must implement the Serializable interface.
    • When a serialized persistence level (such as MEMORY_ONLY_SER) is used, Spark serializes each partition of the RDD into one large byte array.

For these three places where serialization occurs, we can optimize serialization and deserialization performance by using the Kryo serialization library. By default, Spark uses Java's serialization mechanism, the ObjectOutputStream/ObjectInputStream API, for serialization and deserialization. However, Spark also supports the Kryo serialization library, whose performance is much higher than that of the Java serialization library; according to the official documentation, the Kryo serialization mechanism is roughly 10 times faster than the Java serialization mechanism. The reason Spark does not use Kryo as its default serialization library is that Kryo works best when all custom types that need to be serialized are registered, which is cumbersome for developers.

The following is a code example of using Kryo. You just set the serializer class and then register the custom types that need to be serialized (such as the types of external variables used in operator functions, custom types used as RDD generic types, and so on):

// Create the SparkConf object.
val conf = new SparkConf().setMaster(...).setAppName(...)
// Set the serializer to KryoSerializer.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Register the custom types to be serialized.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

Principle nine: Optimize data structures

In Java, three kinds of types consume relatively more memory:

    • Objects: every Java object carries extra information such as an object header and references, so it takes up more memory.
    • Strings: every string contains a character array internally, along with extra information such as its length.
    • Collection types, such as HashMap and LinkedList: collection types typically use inner classes to wrap the collection elements, such as Map.Entry.

Therefore, Spark officially recommends that in Spark code, and especially in code inside operator functions, you try not to use the three kinds of data structures above: use strings instead of objects, use primitive types (such as int and long) instead of strings, and use arrays instead of collection types. This reduces the memory footprint as much as possible, which lowers GC frequency and improves performance.

However, in the author's coding practice this principle is not easy to follow, because code maintainability also has to be considered. If a piece of code has no object abstraction at all and everything is done with string concatenation, subsequent maintenance and modification of that code is undoubtedly a huge disaster. Similarly, if every operation is implemented with arrays and no collection types such as HashMap or LinkedList are used, it is also a great challenge for coding effort and code maintainability. Therefore, it is recommended to use less memory-intensive data structures where possible and appropriate, but only on the premise that code maintainability is guaranteed.
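
A small illustrative contrast of the idea; the Record class, its fields, and the parallel-array layout are hypothetical examples, not from the original article:

// Object-heavy representation: one object per record plus String fields,
// which adds object headers, references and per-String overhead, increasing GC pressure.
case class Record(userId: String, itemId: String, score: String)
val recordsAsObjects: Seq[Record] = Seq(Record("1", "100", "3"))

// Leaner representation: primitive values packed into parallel arrays,
// trading some readability for a much smaller memory footprint.
val userIds: Array[Long] = Array(1L)
val itemIds: Array[Long] = Array(100L)
val scores:  Array[Int]  = Array(3)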

Overview of resource tuning

After the Spark job has been developed, it is time to configure appropriate resources for it. Spark's resource parameters can all be set as arguments in the spark-submit command. Many Spark beginners do not know which parameters to set or how to set them, and in the end they set them arbitrarily or not at all. Unreasonable resource parameter settings may mean that cluster resources are not fully used and the job runs extremely slowly, or that the requested resources are too large, the queue cannot provide enough of them, and all kinds of exceptions occur. In either case, the Spark job runs inefficiently or cannot run at all. Therefore, we must clearly understand the resource usage principles of Spark jobs, know which resource parameters can be set for a Spark job run, and know how to set appropriate values.

Basic operating principles of a Spark job

After we submit a Spark job with spark-submit, the job starts a corresponding Driver process. Depending on the deployment mode (deploy-mode) you use, the Driver process may be started locally or on a worker node in the cluster. The Driver process itself occupies a certain amount of memory and CPU cores according to the parameters we set. The first thing the Driver process does is request the resources needed to run the Spark job from the cluster manager (which can be a Spark Standalone cluster or another resource management cluster; the author's company uses YARN as the resource management cluster). The resources here refer to Executor processes. The YARN cluster manager starts a certain number of Executor processes on the worker nodes according to the resource parameters we set for the Spark job, and each Executor process occupies a certain amount of memory and CPU cores.

After obtaining the resources needed to execute the job, the Driver process starts scheduling and executing the job code we wrote. The Driver process splits our Spark job code into multiple stages, each of which executes a portion of the code and creates a batch of tasks, and then assigns those tasks to the individual Executor processes for execution. A task is the smallest unit of computation; all tasks execute exactly the same computation logic (that is, a fragment of the code we wrote ourselves), they simply process different data. After all the tasks of one stage have finished, the intermediate results are written to local disk files on each node, and then the Driver schedules the next stage to run. The input data for the tasks of the next stage is the intermediate output of the previous stage. This repeats until all the code logic we wrote has been executed and all the data has been processed to produce the result we want.

Spark divides stages based on shuffle-class operators. If a shuffle-class operator (such as reduceByKey or join) is executed in our code, a stage boundary is drawn at that operator. It can be roughly understood that the code before the shuffle operator belongs to one stage, and the shuffle operator itself together with the code after it belongs to the next stage. So when a stage starts executing, each of its tasks may need to pull, over the network, all of the keys it is responsible for from the nodes where the previous stage's tasks ran, and then perform aggregation on all the identical keys it has pulled using the operator function we wrote (such as the function passed to reduceByKey()). This process is the shuffle.

When we perform a persistence operation such as cache/persist in the code, the data computed by each task is saved to the memory of the Executor process or to the disk of the node it runs on, depending on the persistence level we choose.

So an Executor's memory is mainly divided into three parts: the first part is used when tasks execute the code we wrote, and accounts for 20% of the total Executor memory by default; the second part is used when tasks pull the output of the previous stage's tasks through the shuffle process and perform operations such as aggregation, and also accounts for 20% of the total Executor memory by default; the third part is used for RDD persistence, and accounts for 60% of the total Executor memory by default.

The execution speed of tasks is directly related to the number of CPU cores in each Executor process. A CPU core can execute only one thread at a time, and the multiple tasks assigned to each Executor process run as multiple threads, one thread per task. If the number of CPU cores is sufficient and the number of assigned tasks is reasonable, these task threads can generally be executed quickly and efficiently.

The above is a description of the basic operating principles of a Spark job. Understanding these basic principles is the fundamental premise of resource parameter tuning.

Resource parameter tuning

Once you understand the fundamentals of how a Spark job runs, the resource-related parameters are easy to understand. So-called Spark resource parameter tuning is, in essence, optimizing the efficiency of resource use at the various places where resources are consumed while Spark runs, by adjusting various parameters, thereby improving the performance of the Spark job. The following parameters are the main resource parameters in Spark. Each corresponds to a part of the job's operating principles, and we also give reference values for tuning.

num-executors
    • Parameter description: This parameter sets the total number of Executor processes used to execute the Spark job. When the Driver requests resources from the YARN cluster manager, the YARN cluster manager starts the corresponding number of Executor processes on the cluster's worker nodes according to your setting. This parameter is very important; if it is not set, only a small number of Executor processes will be started by default, and your Spark job will run very slowly.
    • Parameter tuning recommendations: Setting about 50 to 100 Executor processes for each Spark job is generally appropriate; setting too few or too many Executor processes is not good. With too few, cluster resources cannot be fully used; with too many, most queues cannot provide sufficient resources.
executor-memory
    • Parameter description: This parameter sets the memory of each Executor process. The size of Executor memory often directly determines the performance of the Spark job, and it is also directly related to the common JVM OOM exceptions.
    • Parameter tuning recommendations: 4 GB to 8 GB of memory per Executor process is generally appropriate. But this is only a reference value; the specific setting depends on the resource queue of your department. Check the maximum memory limit of your team's resource queue: num-executors multiplied by executor-memory represents the total amount of memory your Spark job requests (that is, the sum across all Executor processes), and this total cannot exceed the queue's maximum memory. In addition, if you share the resource queue with others on your team, it is best that the total memory you request does not exceed 1/3 to 1/2 of the queue's total memory, to avoid your own Spark job occupying all the queue's resources and preventing your colleagues' jobs from running.
executor-cores
    • Parameter description: This parameter sets the number of CPU cores for each Executor process. It determines each Executor process's ability to execute task threads in parallel, because each CPU core can execute only one task thread at a time. The more CPU cores an Executor process has, the faster it can finish all the task threads assigned to it.
    • Parameter tuning recommendations: Setting the number of Executor CPU cores to one to four is generally appropriate. This also depends on your department's resource queue: check the maximum CPU core limit of your resource queue, and then, based on the number of Executors you set, determine how many CPU cores each Executor process can be assigned. Likewise, if you share the queue with others, it is appropriate that num-executors * executor-cores not exceed about 1/3 to 1/2 of the queue's total CPU cores, to avoid affecting other colleagues' jobs.
driver-memory
    • Parameter description: This parameter sets the memory of the Driver process.
    • Parameter tuning recommendations: Driver memory usually does not need to be set, or setting it to around 1 GB should be enough. The only thing to note is that if you need to use the collect operator to pull all of an RDD's data to the Driver for processing, you must make sure the Driver memory is large enough; otherwise an OOM (out of memory) problem will occur.
spark.default.parallelism
    • Parameter description: This parameter sets the default number of tasks for each stage. This parameter is extremely important; if it is not set, it may directly affect your Spark job's performance.
    • Parameter tuning recommendations: 500 to 1000 is a suitable default number of tasks for a Spark job. A mistake many people make is not setting this parameter at all, in which case Spark sets the number of tasks based on the number of blocks in the underlying HDFS files, with one task per HDFS block by default. Generally speaking, this default number is too small (for example, a few dozen tasks), and if the number of tasks is too small, the Executor parameters you set earlier are wasted. Imagine: no matter how many Executor processes, how much memory, and how many CPUs you have, if there are only 1 or 10 tasks, then 90% of the Executor processes may have no task to execute at all, which is a waste of resources. The principle recommended on the Spark website is to set this parameter based on num-executors * executor-cores; for example, if the total number of Executor CPU cores is 300, then setting 1000 tasks is fine, and the resources of the Spark cluster can then be fully exploited.
spark.storage.memoryFraction
    • Parameter description: This parameter sets the proportion of Executor memory that persisted RDD data can occupy; the default is 0.6. That is, by default 60% of Executor memory can be used to hold persisted RDD data. Depending on the persistence level you choose, data that does not fit in memory is either not persisted or written to disk.
    • Parameter tuning recommendations: If the Spark job has many RDD persistence operations, this value can be increased appropriately to ensure that persisted data fits in memory, avoiding the situation where insufficient memory means the data cannot all be cached and some of it has to be written to disk, reducing performance. However, if the Spark job has many shuffle operations and few persistence operations, it is better to decrease this value. In addition, if the job runs slowly because of frequent GC (the job's GC can be observed through the Spark Web UI), meaning tasks do not have enough memory to execute user code, it is likewise recommended to lower this value.
spark.shuffle.memoryFraction
    • Parameter description: This parameter sets the proportion of Executor memory that can be used when tasks pull the output of the previous stage's tasks during the shuffle process and perform aggregation; the default is 0.2. That is, by default an Executor uses only 20% of its memory for this operation. If the shuffle aggregation finds that the memory used exceeds this 20% limit, the excess data is spilled to disk files, which greatly degrades performance.
    • Parameter tuning recommendations: If the Spark job has few RDD persistence operations and many shuffle operations, it is recommended to reduce the memory proportion for persistence and increase the proportion for shuffle operations, to avoid the situation where there is not enough memory when the shuffle process handles too much data and data has to be spilled to disk, reducing performance. In addition, if the job runs slowly because of frequent GC, meaning tasks do not have enough memory to execute user code, it is likewise recommended to lower this value.

There is no fixed value for tuning these resource parameters. You need to set the parameters above reasonably based on your actual situation (including the number of shuffle operations in the Spark job, the number of RDD persistence operations, and the job GC shown in the Spark Web UI), with reference to the principles and tuning recommendations presented in this article.

Resource parameter reference example

Here is an example of a spark-submit command that you can refer to and adjust according to your own situation:

./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 100 \
  --executor-memory 6G \
  --executor-cores 4 \
  --driver-memory 1G \
  --conf spark.default.parallelism=1000 \
  --conf spark.storage.memoryFraction=0.5

Some final words

According to practical experience, after applying the development tuning and resource tuning explained in this basic article, most Spark jobs can generally run with fairly high performance, enough to meet our needs. However, in different production environments and project contexts, you may encounter other, more difficult problems (such as various forms of data skew), and you may face higher performance requirements. To meet these challenges, more advanced techniques are needed. In the follow-up article, "Spark Performance Optimization Guide - Advanced", we will explain data skew tuning and shuffle tuning in detail.

