Spark Performance Tuning


First: Improve the degree of parallelism

The degree of parallelism is the number of tasks in each stage of the Spark job; it represents how parallel the job is at each stage. What happens if you do not tune it and the parallelism stays low? Suppose you have already allocated as many resources to the Spark job as the cluster or YARN queue allows in the spark-submit script, for example 50 executors, each with 10 GB of memory and 3 CPU cores. But the number of tasks is not set, or is set to a small value such as 100. With 50 executors and 3 CPU cores each, any stage of your application can run 150 tasks in parallel. With only 100 tasks, each executor is assigned on average 2 tasks, so only 100 tasks run at the same time, each executor runs only 2 tasks in parallel, and the remaining CPU core on every executor is wasted.

A reasonable degree of parallelism should be large enough to make full use of your cluster resources. In the example above, the cluster has 150 CPU cores in total and can run 150 tasks in parallel, so the application's parallelism should be set to at least 150. That way 150 tasks execute in parallel, and each task also has less data to process: with 150 GB of data to process, 100 tasks means 1.5 GB per task, while 150 tasks running in parallel means roughly 1 GB per task.

1. Set the number of tasks to at least the total number of CPU cores available to the Spark application (ideally, with 150 CPU cores, allocate 150 tasks that run together and finish at roughly the same time).
2. The official recommendation is to set the number of tasks to 2~3 times the application's total CPU cores, e.g. with 150 CPU cores, set roughly 300~500 tasks. In reality, unlike the ideal case, some tasks run faster (finished in, say, 50 s) and some run slower (a minute and a half), so if the task count exactly equals the core count, resources may still be wasted: with 150 tasks, when the first 10 finish, the remaining 140 are still running while 10 CPU cores sit idle. If the number of tasks is 2~3 times the number of cores, then as soon as one task finishes another immediately fills the core, keeping the cores busy as much as possible and maximizing the efficiency and speed of the Spark job.
3. How do you set the degree of parallelism of a Spark application? Via spark.default.parallelism:

SparkConf conf = new SparkConf();
conf.set("spark.default.parallelism", "500");
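As a minimal sketch of putting this into a Java driver (the class name and HDFS path are illustrative placeholders, not from the original job): note that spark.default.parallelism mainly governs shuffle operators, while the partition count of an RDD read from HDFS follows the input splits and can be raised explicitly with repartition().

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelismExample {
    public static void main(String[] args) {
        // Roughly 2~3 tasks per available CPU core; 500 matches the 150-core example above.
        SparkConf conf = new SparkConf()
                .setAppName("ParallelismExample")
                .set("spark.default.parallelism", "500");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Default parallelism applies to shuffle operators such as reduceByKey and join;
        // for a file-based RDD, repartition() raises the partition (task) count directly.
        JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input").repartition(500);
        System.out.println("partitions: " + lines.getNumPartitions());

        sc.stop();
    }
}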

Second: Resource allocation optimization

The resources Spark allocates are mainly the number of executors, the CPU cores per executor, the memory per executor, and the driver memory. In our production environment, the Spark job is submitted with the spark-submit shell script, in which the corresponding parameters are adjusted:

/usr/local/spark/bin/spark-submit \
--class cn.spark.sparktest.core.WordCountCluster \
--num-executors 3 \
--driver-memory 100m \
--executor-memory 100m \
--executor-cores 3 \
/usr/local/Sparktest-0.0.1-snapshot-jar-with-dependencies.jar

Here --num-executors configures the number of executors, --driver-memory the driver memory (less impact), --executor-memory the memory size of each executor, and --executor-cores the number of CPU cores per executor.
First of all, understand your machine's resources: how much memory and how many CPU cores are available. Based on the actual situation, set the resources you can use as large as possible (number of executors, from dozens to hundreds; executor memory; executor CPU cores).

In Spark standalone mode, if each machine has 4 GB of usable memory and 2 CPU cores and there are 20 machines, you can set 20 executors, each with 4 GB of memory and 2 CPU cores. In YARN mode, size the job to the resource queue it is submitted to: if the queue has 500 GB of memory and 100 CPU cores, you can set 50 executors, each with 2 CPU cores and 10 GB of memory.

After the resources are adjusted, SparkContext, DAGScheduler and TaskScheduler cut our operators into a large number of tasks and submit them to the application's executors.

Increasing the CPU cores per executor also increases the parallel execution capacity. With 20 executors of 2 CPU cores each, 40 tasks can run in parallel; raise each executor to 5 CPU cores and 100 tasks can run in parallel, so execution speed increases by 2.5 times.

If the number of executors is small, few tasks can run in parallel, which means the application's parallel execution capacity is weak. For example, with 3 executors of 2 CPU cores each, only 6 tasks run concurrently; when those 6 finish, the next batch of 6 starts. Increasing the number of executors means many more tasks can run in parallel: instead of 6, perhaps 10, 20 or even 100, so the parallel capacity is several times to dozens of times higher than before, and performance (execution speed) can improve correspondingly.

Increasing the memory per executor improves performance in three ways:
1. If RDDs need to be cached, more memory means more data can be cached and less data is written to disk, or none at all, reducing disk I/O.
2. For shuffle operations, the reduce side needs memory to hold the pulled data and perform aggregation. If memory is insufficient, it spills to disk; with more executor memory, less data is written to disk, reducing disk I/O and improving performance.
3. During task execution many objects may be created. If memory is small, the JVM heap fills up frequently, causing frequent garbage collection (minor GC and full GC), which is very slow. With more memory, GC happens less often, the slowness is avoided, and execution becomes faster.

Third: RDD persistence or caching

When an operator is first executed on RDD2 to get RDD3, the computation starts from RDD1: the HDFS file is read, the operator on RDD1 produces RDD2, and RDD2 is then computed to get RDD3.

By default, when operators are executed several times on an RDD to obtain different RDDs, that RDD and all of its parent RDDs are recomputed from scratch each time: read HDFS -> RDD1 -> RDD2 -> RDD4.
This situation absolutely must be avoided; once an RDD is computed repeatedly, performance drops sharply.

For example, if HDFS -> RDD1 -> RDD2 takes 15 minutes, running it twice turns that into 30 minutes.

Another case: starting from one RDD, several different RDDs are derived even though the operators and computation logic are actually exactly the same; through human negligence the same computation is performed several times, producing multiple RDDs.

Therefore, it is recommended that the following methods be used to optimize:

First, RDD architecture refactoring and optimization
Try to reuse RDDs: RDDs that are largely the same can be extracted into a common RDD, which is then used repeatedly by the later RDD computations.

Second, the common RDD must be persisted

Persistence means the RDD's data is cached in memory and/or on disk (in the BlockManager). No matter how many more times the RDD is used in computations, its persisted data is taken directly from memory or disk; a copy of the data is simply fetched.

Third, when persisting, serialization can be used

If the data is persisted purely in memory, it may consume too much memory and lead to an OOM (out of memory) error.

When pure memory cannot hold all of the common RDD's data, the preferred option is serialized storage in pure memory: the data of each RDD partition is serialized into one large byte array, a single object, and after serialization the memory footprint is greatly reduced.

The only disadvantage of serialization is that the data must be deserialized when it is read.

If serialized pure-memory storage still causes an OOM, the only remaining options involve disk: first memory + disk in the normal way (no serialization), then memory + disk with serialization.

Fourth, for high data reliability, and when memory is sufficient, the double-replica mechanism can be used when persisting

With a single persisted replica, if the machine holding it goes down, the replica is lost and the RDD still has to be recomputed. With the double-replica mechanism, each persisted unit of data is stored as two replicas, the second placed on another node; if one replica is lost, no recomputation is needed and the other replica is used. Use this approach only when your memory resources are extremely plentiful.

/**
 * Persistence is very simple: call persist() on the RDD and pass in a persistence level.
 * If it is persist(StorageLevel.MEMORY_ONLY()), pure memory with no serialization, the cache() method can be used instead.
 * StorageLevel.MEMORY_ONLY_SER(): second choice
 * StorageLevel.MEMORY_AND_DISK(): third choice
 * StorageLevel.MEMORY_AND_DISK_SER(): fourth choice
 * StorageLevel.DISK_ONLY(): fifth choice
 *
 * If memory is plentiful and the high-reliability double-replica mechanism is wanted,
 * choose a level with the _2 suffix, e.g. StorageLevel.MEMORY_ONLY_2().
 */
sessionid2actionRDD = sessionid2actionRDD.persist(StorageLevel.MEMORY_ONLY());
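As a minimal, self-contained sketch of the pattern described above (the HDFS path, filter logic, and variable names are illustrative assumptions, not taken from the original job), the common RDD is persisted once and then reused by two downstream actions without recomputing its lineage:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PersistExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Common RDD: read once from HDFS and persist it, so the lineage
        // (HDFS -> RDD1 -> RDD2) is not recomputed for every downstream action.
        JavaRDD<String> commonRdd = sc.textFile("hdfs:///path/to/sessions")
                .filter(line -> !line.isEmpty())
                .persist(StorageLevel.MEMORY_ONLY_SER()); // serialized to shrink the memory footprint

        // Two downstream computations reuse the persisted data.
        long total = commonRdd.count();
        long longLines = commonRdd.filter(line -> line.length() > 100).count();

        System.out.println("total=" + total + ", longLines=" + longLines);
        sc.stop();
    }
}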

Fourth: Using broadcast variables

A broadcast variable is created with SparkContext's broadcast() method, passing in the variable you want to broadcast:

final Broadcast<Map<String, Map<String, IntList>>> broadcast = sc.broadcast(fastutilDateHourExtractMap);

To use it, call value()/getValue() on the Broadcast object directly, which returns the map that was wrapped earlier:

Map<String, Map<String, IntList>> dateHourExtractMap = broadcast.value();
Take, for example, a randomly extracted map of about 1 MB; that is still small. But suppose you read some dimension data from a table, say the information for all commodity categories, use it in an operator function, and it is 100 MB. With 1000 tasks, that means 100 GB of data transferred over the network, and the cluster instantly consumes 100 GB of memory because of it.

By default, when an operator executed by tasks uses an external variable, every task acquires its own copy of that variable. What are the drawbacks, and under what circumstances does this hurt performance? A map itself is not that small: its unit of storage is the entry, and entries may be stored as chained lists, so a map is a relatively memory-hungry data format.

Suppose the map is 1 MB. You have tuned everything before this particularly well: the resources are in place and the degree of parallelism matches them, giving 1000 tasks, and a large number of tasks really do run in parallel. Each of these tasks uses the 1 MB map, so the map is first copied 1000 times and the copies are transferred over the network to each task. That is 1 GB of data over the network. The network transmission overhead is not encouraging; it may consume a noticeable fraction of your Spark job's total running time.

The map copies shipped to each task are also memory-intensive. One 1 MB map is really small, but 1000 of them spread across your cluster consume 1 GB of memory in one swoop. The impact on performance: this unnecessary memory consumption means that when you persist RDDs to memory, they may no longer fit completely, some data has to be written to disk, and subsequent operations pay for it in disk I/O.

And when your tasks create objects, they may find there is not enough heap memory to hold them all, causing frequent garbage collection (GC). GC stops the worker threads, so Spark pauses its work for a moment each time; frequent GC has a considerable impact on the speed of the Spark job.

So if a task uses a large variable (1 MB ~ 100 MB), knowing that this hurts performance, how do we fix it? Broadcast the large variable instead of using it directly. The benefit of a broadcast variable is that there is not one copy per task, but only one copy per executor on each node, which greatly reduces the number of copies of the variable. At the start, the broadcast variable has one copy on the driver. When a task runs and wants to use the broadcast variable's data, it first tries to find a copy in the BlockManager of its local executor; if it is not there, the BlockManager fetches a copy from the remote driver, or possibly from the BlockManager of a nearby node's executor, and saves it in the local BlockManager. (The BlockManager is responsible for managing an executor's data in memory and on disk.) From then on, tasks on that executor use the local copy in the BlockManager directly.

For example, with 50 executors, 1000 tasks and a 10 MB map: by default there are 1000 copies, one per task, which is 10 GB of network transfer and 10 GB of memory consumed in the cluster. With a broadcast variable there are 50 copies, one per executor, which is 500 MB of data to transfer; and the copies are not necessarily all pulled from the driver, some are pulled from the BlockManager of the nearest executor, so network transfer is much faster, and only 500 MB of memory is consumed.
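A minimal sketch of the pattern in a Java driver (the lookup map, its contents, and the names are illustrative assumptions): the dimension data is broadcast once and read inside the operator through value().

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("BroadcastExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Dimension data built on the driver, e.g. commodity category names.
        Map<String, String> categoryInfo = new HashMap<>();
        categoryInfo.put("1", "electronics");
        categoryInfo.put("2", "books");

        // Broadcast once: one copy per executor instead of one copy per task.
        Broadcast<Map<String, String>> categoryInfoBroadcast = sc.broadcast(categoryInfo);

        JavaRDD<String> categoryIds = sc.parallelize(Arrays.asList("1", "2", "1"));

        // Inside the operator, read the broadcast value rather than the driver-side map.
        JavaRDD<String> categoryNames = categoryIds.map(
                id -> categoryInfoBroadcast.value().getOrDefault(id, "unknown"));

        categoryNames.collect().forEach(System.out::println);
        sc.stop();
    }
}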

Fifth: Using Kryo serialization

To enable it, set the spark.serializer property in SparkConf to the org.apache.spark.serializer.KryoSerializer class, and register the custom classes you use that need to be serialized by Kryo via registerKryoClasses():

new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(new Class[]{CategorySortKey.class});

The reason Kryo is not used as the default serialization library is mainly that Kryo requires you to register your custom classes to achieve its best performance (for example, when you use an object of an external custom type in an operator function, you are required to register that class; otherwise Kryo will not reach its best performance).
When a serialized persistence level is used and each RDD partition is serialized into a large byte array, Kryo makes the serialization faster and the result smaller. By default, Spark serializes data internally with the Java serialization mechanism, ObjectOutputStream/ObjectInputStream, the object input/output stream mechanism. The advantage of this default is that it is easy to handle and requires nothing manual from us, beyond the requirement that the variables used in operators implement the Serializable interface. The disadvantage is that it is inefficient: serialization is slow, and the serialized data occupies relatively much memory. Spark also supports the Kryo serialization mechanism, which is faster than the default Java serialization and produces smaller data, roughly 1/10 of what the Java mechanism produces. As a result, Kryo lets less data travel over the network, and the memory consumed in the cluster is greatly reduced. When shuffle operations run between stages, tasks pull large amounts of data over the network from tasks on other nodes; that data is serialized for transfer, and Kryo can be used there too.
Once enabled, the Kryo serialization mechanism takes effect in several places:
1. External variables used in operator functions: with Kryo, network transmission performance improves and memory usage and consumption in the cluster are reduced.
2. RDDs persisted with a serialized level such as StorageLevel.MEMORY_ONLY_SER: memory consumption is reduced, and the less memory the persisted RDD occupies, the more is left for the objects created during task execution, so memory does not fill up as often and GC occurs less frequently.
3. Shuffle: the performance of network transmission is optimized.
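A minimal sketch of enabling Kryo in a Java driver, assuming a custom CategorySortKey class like the one mentioned above (the class body here is only an illustrative stand-in):

import java.io.Serializable;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class KryoExample {
    // Illustrative custom type; in a real job this is whatever custom class
    // appears in your operator functions, shuffled data, or persisted RDDs.
    public static class CategorySortKey implements Serializable {
        public long clickCount;
        public long orderCount;
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("KryoExample")
                // Switch from Java serialization to Kryo.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // Register custom classes so Kryo can reach its best performance.
                .registerKryoClasses(new Class<?>[]{CategorySortKey.class});

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build RDDs, persist with StorageLevel.MEMORY_ONLY_SER(), shuffle, etc.
        sc.stop();
    }
}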

Sixth: Data locality

Locality levels

PROCESS_LOCAL: process-local. Code and data are in the same process, that is, in the same executor; the task computing the data is run by that executor and the data sits in the executor's BlockManager. Performance is best.
NODE_LOCAL: node-local. Code and data are on the same node; for example, the data is an HDFS block on the node and the task runs in an executor on that node, or the data and the task are in different executors on the same node. Data has to be transferred between processes.
NO_PREF: for the task it makes no difference where the data comes from; no location is better or worse.
RACK_LOCAL: rack-local. The data and the task are on two different nodes in the same rack; data has to be transmitted over the network between nodes.
ANY: the data and the task may be anywhere in the cluster, not even in the same rack; performance is worst.


spark.locality.wait, default is 3s

On the driver, Spark calculates which piece of data, that is, which partition of the RDD, each task will compute. Before Spark's task assignment algorithm assigns the tasks of each stage of the application, its first preference is to place each task exactly on the node that holds the data it is to compute, so that no data has to be transferred over the network.

However, a task may not get the chance to be assigned to the node where its data resides, because that node's compute resources and capacity may already be fully used. In that case, Spark usually waits for a while, 3s by default (not an absolute value; it waits separately at the different locality levels). If waiting does not help in the end, it chooses a somewhat poorer locality level, for example assigning the task to a node relatively close to the node holding the data it needs, and computing there.

In that second case, data usually has to be transferred: the task asks the BlockManager on its own node for the data, the BlockManager finds the data is not local and, through a getRemote() call, obtains it via the TransferService (the network data transfer component) from the BlockManager on the node where the data resides, and the data is transmitted back to the node where the task runs.

Naturally, we do not want anything like the second case. The best situation is the task and its data on one node, fetching the data directly from the local executor's BlockManager, purely from memory or with a little disk I/O. Once data has to be transferred over the network, performance really does degrade; heavy network transmission and disk I/O are the killers of performance.

When should this parameter be adjusted?

Observe the Spark job's run logs. It is recommended to use client mode first during testing, so the complete log can be seen directly on the local machine.
The log shows lines like starting task ..., PROCESS_LOCAL or NODE_LOCAL; observe the data locality level of the majority of tasks.

If it is mostly PROCESS_LOCAL, nothing needs to be adjusted.
If you find that many of the levels are NODE_LOCAL or ANY, it is worth adjusting the data locality wait time.
Adjust repeatedly: after each adjustment, run the job again and observe the log
to see whether the locality level of most tasks has improved and whether the runtime of the whole Spark job has shortened.

But be careful not to put the cart before the horse: if the locality level improves but the Spark job's runtime increases because of all the extra waiting, then the parameter should not be adjusted after all.

spark.locality.wait, default is 3s; it can be changed to 6s or 10s

By default, the following 3 wait times are the same as the one above, all 3s:
spark.locality.wait.process
spark.locality.wait.node
spark.locality.wait.rack


new SparkConf()
    .set("spark.locality.wait", "10");

