Contents:
1. Basic issues to think about in Spark performance optimization;
2. CPU and memory;
3. Degree of parallelism and tasks;
4. Network;
========== Liaoliang daily Big Data quotes ============
Liaoliang daily Big Data quotes Spark 0080 (2016.1.26 in Shenzhen): If CPU usage in Spark is not high enough, consider allocating more executors to the current program, or adding more worker instances, to fully exploit the multicore potential.
Liaoliang daily Big Data quotes Spark 0079 (2016.1.26 in Shenzhen): It is important to set the number of partitions appropriately: too few partitions may cause OOM and frequent GC because each partition holds too much data, while too many partitions can make execution inefficient because each partition holds too little data.
Liaoliang daily Big Data quotes Spark 0078 (2016.1.23 in Shenzhen): One way to raise Spark hardware usage, especially CPU usage, is to increase executor parallelism; but with too many executors, the memory directly allocated to each executor shrinks sharply, less work can be done in memory, disk-based operations grow, and performance degrades.
Liaoliang daily Big Data quotes Spark 0077 (2016.1.23 in Shenzhen): If a Spark job tends to overflow memory, another effective approach is to reduce the number of parallel executors, so that each executor is allocated more memory; each task then has more memory to work with and the risk of OOM falls.
Liaoliang daily Big Data quotes Spark 0076 (2016.1.23 in Shenzhen): When handling Spark jobs, if memory overflow occurs easily, an effective approach is to increase task parallelism, so that each task processes a smaller partition of data, which lowers the likelihood of OOM.
Liaoliang daily Big Data quotes Spark 0075 (2016.1.23 in Shenzhen): When handling Spark jobs, if some tasks run particularly slowly, another option is to increase the number of parallel executors; each executor then takes a smaller share of the computing resources, which improves overall hardware utilization.
Liaoliang daily Big Data quotes Spark 0074 (2016.1.23 in Shenzhen): When handling Spark jobs, if some tasks run particularly slowly, consider increasing the degree of parallelism and reducing the amount of data per partition to improve execution efficiency.
Liaoliang daily Big Data quotes Spark 0073 (2016.1.23 in Shenzhen): If a Spark job has to process very many small files, you can reduce the number of partitions with coalesce; this reduces the number of parallel tasks and the cost of creating them, and thus improves hardware utilization.
Liaoliang daily Big Data quotes Spark 0072 (2016.1.22 in Shenzhen): By default, Spark's executors occupy as many cores as possible on the current machine. One benefit is that the parallelism of the computation is maximized and the number of task batches a job must run is reduced; one risk is that if each task consumes a lot of memory, frequent spill-over to disk or a greater risk of OOM follows.
Liaoliang daily Big Data quotes Spark 0071 (2016.1.22 in Shenzhen): A Spark cluster has only one worker per host by default, and each worker by default assigns only one executor to the current application to execute tasks. However, by configuring spark-env.sh you can run several workers on each host, and each worker can host several executors.
Liaoliang daily Big Data quotes Spark 0070 (2016.1.22 in Shenzhen): Inside a Spark stage is a set of tasks that perform exactly the same computation as a distributed parallel operation over different data, and the computation inside a stage proceeds in a pipelined manner. Shuffles are produced only between different stages.
Liaoliang daily Big Data quotes Spark 0069 (2016.1.21 in Shenzhen): In Spark, consider using SSDs on the worker nodes, and saving the workers' shuffle data to a RAM disk, to greatly improve performance.
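A minimal sketch, assuming a standalone cluster, of how the executor, core, and memory trade-offs in the quotes above are usually expressed through configuration; every concrete number here is an illustrative assumption to be tuned, not a recommendation:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative resource settings for a standalone cluster; all numbers are assumptions.
val conf = new SparkConf()
  .setAppName("ResourceTuningSketch")
  .set("spark.executor.memory", "4g")       // memory per executor; more memory per executor lowers OOM risk (quote 0077)
  .set("spark.executor.cores", "2")         // cores per executor; more executors per worker raises CPU usage (quote 0080)
  .set("spark.cores.max", "16")             // total cores this application may take across the cluster
  .set("spark.default.parallelism", "48")   // roughly 2-3 tasks per allocated core

val sc = new SparkContext(conf)

Raising spark.executor.memory while lowering the executor count follows quote 0077; raising spark.cores.max or adding worker instances follows quote 0080.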
========== Spark Performance Optimization Core Cornerstones ============
1. Spark uses a master-slaves model for resource management and task execution management;
1) Resource management: Master-Workers; a single machine can run multiple Workers;
2) Task execution: Driver-Executors. When a single machine runs multiple workers, each worker by default assigns one executor to the currently running application, but the configuration can be changed so that each worker assigns several executors to the current application. When the program runs it is divided into several stages (there is no shuffle inside a stage; a new stage is cut whenever a shuffle is encountered), each stage contains several tasks that share the same processing logic but process different data, and the tasks are assigned to executors for execution;
2. In Spark, consider using solid-state disks on the worker nodes, and saving the workers' shuffle data to a RAM disk, to greatly improve performance;
3. By default, Spark's executors occupy as many cores as possible on the current machine. One benefit is that the parallelism of the computation is maximized and the number of task batches a job must run is reduced, but one risk is that if each task consumes a lot of memory, frequent spill-over to disk or a greater risk of OOM follows;
4. If you find that a machine frequently hits OOM, one option is to reduce the degree of parallelism; fewer tasks then run concurrently in the same memory space, each occupies less memory, and the likelihood of OOM drops;
5. If a Spark job has to process very many small files, you can reduce the number of partitions with coalesce; this reduces the number of parallel tasks and avoids launching too many tasks, thereby improving hardware utilization;
6. When handling a Spark job, if some tasks are found to run very slowly, consider increasing the task parallelism and reducing the amount of data per partition to improve execution efficiency;
7. When handling a Spark job, if some tasks run particularly slowly, another approach is to increase the number of parallel executors, so that each executor is allocated a smaller share of the computing resources, which can improve overall hardware utilization;
8. When handling a Spark job, if memory overflow occurs easily, another effective approach is to reduce the number of parallel executors, so that each executor can be allocated more memory; each task then has more memory available and the risk of OOM is reduced;
9. One way to improve Spark hardware usage, especially CPU utilization, is to increase executor parallelism, but with too many executors the memory directly allocated to each executor shrinks sharply, less work can be done in memory, disk-based operations grow, and performance worsens;
10. It is very important to set the number of partitions appropriately: too few partitions may cause OOM and frequent GC because each partition holds too much data, while too many partitions may make execution inefficient because each partition holds too little data;
11. If CPU usage in Spark is not high enough, consider allocating more executors to the current program, or adding more worker instances, to fully exploit the multicore potential;
12. The actual parallelism of a Spark job at execution time is determined by the input data and the memory allocated to each executor; as a simple rule of thumb, each core can be assigned two or three tasks;
13. By default, 60% of an executor's memory is used as the RDD cache and 40% as working space for object creation; this is controlled by spark.storage.memoryFraction, whose default is 0.6 (see the sketch after this list);
14. Suppose the input data is compressed and four tasks each read a 128 MB split that doubles in size when decompressed; the memory required is then 4 * 2 * 128 MB, which may overflow memory, and this cannot be judged just from the file sizes visible on HDFS;
15. GC should generally not take more than 2% of CPU time (note: the Tungsten project aims to solve GC as one of its core issues); otherwise you may see errors such as OutOfMemoryError: GC overhead limit exceeded;
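A minimal sketch, assuming Spark 1.x, of the cache/compute memory split named in point 13 together with the rough sizing arithmetic of point 14; the values are illustrative assumptions:

import org.apache.spark.SparkConf

// Spark 1.x setting from point 13: the fraction of executor heap reserved for cached RDDs;
// the remainder is working space for task objects. Values here are illustrative.
val conf = new SparkConf()
  .setAppName("MemoryFractionSketch")
  .set("spark.executor.memory", "2g")
  .set("spark.storage.memoryFraction", "0.6")   // default 0.6: roughly 60% cache, 40% object creation

// Rough sizing check from point 14: four 128 MB compressed splits that double when decompressed
// need about 4 * 2 * 128 MB = 1 GB of working memory in a single batch of tasks.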
Summary: The effect of performance tuning is often temporary. For example, adding executors to the current application may improve performance at first (such as higher CPU usage), but as the number of executors keeps growing, performance may degrade!!! This is because with more and more executors, each executor is allocated less and less memory, tasks have less memory available during execution, frequent spill-over to disk sets in, and performance naturally worsens;
Examples:
1. The job runs slowly and CPU usage is not very high: consider increasing the degree of parallelism or the number of partitions, which in effect raises CPU utilization;
2. If OOM occurs, a single partition is usually too large: consider increasing the number of partitions;
3. The resources on one machine are limited; starting too many executors on a single machine also carries OOM risk, whereas fewer executors let the memory allocated to each task grow, which also reduces the risk of OOM;
4. A large number of small files leads to inefficiency: consider reducing the number of file partitions, for example with coalesce (see the sketch after this list);
5. In any case, localize the data: do not put HDFS in one cluster and Spark in another, that is just silly. If there really is no other way, insert Tachyon in the middle;
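A minimal sketch of example 4 above: collapsing many small input files into fewer partitions with coalesce. The input path and the target partition count are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("CoalesceSketch"))

// Reading a directory of many small files yields roughly one partition per file/split.
val raw = sc.textFile("hdfs:///data/many-small-files/*")      // hypothetical input path
println(s"partitions before: ${raw.partitions.length}")

// coalesce shrinks the partition count without a shuffle, so far fewer tasks are launched.
val compacted = raw.coalesce(64)
println(s"partitions after: ${compacted.partitions.length}")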
What affects the CPU most is the degree of parallelism.
Data is rarely persisted to disk; tests have found that when data is persisted to disk, it is sometimes faster to recompute it instead.
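A small sketch of the trade-off just described, using Spark's standard storage levels; the input path is an assumption:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("PersistSketch"))
val lines = sc.textFile("hdfs:///data/input")                 // hypothetical input path

// MEMORY_ONLY: partitions that do not fit in memory are simply recomputed from lineage when needed.
val hot = lines.map(_.length).persist(StorageLevel.MEMORY_ONLY)

// MEMORY_AND_DISK: overflow partitions are written to local disk; as noted above,
// recomputing them is sometimes faster than reading them back, so measure before choosing this.
val spilled = lines.map(_.length).persist(StorageLevel.MEMORY_AND_DISK)

println(hot.count())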
Before official delivery, the application should go through online commissioning for at least one week to optimize its performance.
The number of cores actually used is determined by the input data and the executors at run time.
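A quick way to see the parallelism the runtime actually chose, following the note above and the 2-3 tasks-per-core rule of thumb from point 12; this assumes a spark-shell session where sc is already defined, and the path and numbers are illustrative:

val data = sc.textFile("hdfs:///data/input")                      // hypothetical input path
println(s"input-derived partitions: ${data.partitions.length}")   // roughly one per HDFS split by default

// If this is far below 2-3 tasks per allocated core (point 12), widen it explicitly.
val widened = data.repartition(96)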
========== Spark Performance Optimization Moves ============
1. Broadcast: if a task uses a static large object of more than 20 KB during its run, it is generally worth considering a broadcast. For example, when a large table joins a small table, you can broadcast the small table out, so that the large table only has to wait quietly on its own nodes for the small table to arrive; this can eliminate the shuffle and may improve performance by dozens of times;
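A minimal sketch of the broadcast pattern in point 1: the small table is collected and broadcast, and the large table joins it map-side so no shuffle is needed. The paths, table layout, and names are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("BroadcastJoinSketch"))

// Small dimension table: collect it to the driver once and broadcast it to every executor.
val smallTable: Map[Int, String] = sc.textFile("hdfs:///dim/small")        // hypothetical path
  .map { line => val Array(k, v) = line.split(","); (k.toInt, v) }
  .collectAsMap().toMap
val broadcastSmall = sc.broadcast(smallTable)

// Large fact table: join map-side by looking up the broadcast map, so no shuffle is produced.
val largeTable = sc.textFile("hdfs:///fact/large")                         // hypothetical path
  .map { line => val Array(k, v) = line.split(","); (k.toInt, v) }

val joined = largeTable.flatMap { case (k, v) =>
  broadcastSmall.value.get(k).map(dim => (k, (v, dim)))
}
joined.take(5).foreach(println)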