Original link: Spark Streaming performance tuning
Spark Streaming provides an efficient and convenient streaming mode, but in some scenarios the default configuration is not optimal; external data may not even be processed in real time, and then we need to adjust the defaults. Because scenarios and data volumes differ, there is no universally correct configuration (otherwise the Spark Streaming developers would not expose so many parameters but would simply hard-code the values); the settings must be chosen according to your data volume and scenario. The advice below is just that, advice: not every tuning step has to be tried on your program, and a good configuration is found by gradual experimentation.
1. Set a reasonable batch processing time (batchDuration)
When constructing a StreamingContext, we pass in a parameter that sets the time interval for Spark Streaming's batch processing. Spark submits a job every batchDuration; if a job's processing time exceeds batchDuration, subsequent jobs cannot be submitted on time, more and more jobs get delayed as time passes, and eventually the whole streaming job is blocked. That indirectly makes real-time processing impossible, which is certainly not what we want.
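A minimal sketch of where this parameter is set (the application name and the 10-second interval are illustrative, not recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingTuning")
    // batchDuration: a job is submitted for each 10-second batch;
    // Milliseconds(500) is about as small as is practical (see below)
    val ssc = new StreamingContext(conf, Seconds(10))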
In addition, although batchDuration can be specified down to the millisecond, experience tells us that setting it too small burdens the whole streaming application with frequent job submissions, so try not to set it below 500ms. In many cases 500ms already performs well.
So how do you pick a good value? Start with a fairly large value (such as 10s); if you see each job finishing quickly after submission, keep reducing the value until the streaming job can only just finish processing the previous batch of data in time. That value is the best one.
2. Increase Job parallelism
We need to make full use of the cluster's resources and spread tasks across different nodes as much as possible; this both utilizes the cluster fully and lets data be processed promptly. For example, when using Spark Streaming to receive data from Kafka, we can set up one receiver per Kafka partition, which balances the load so the data is processed in a timely manner (for how to read Kafka with Spark Streaming, see the Spark Streaming and Kafka integrated Development Guide (i) and the Spark Streaming and Kafka integrated Development Guide (ii)). A sketch of this pattern follows.
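This sketch uses the old receiver-based KafkaUtils API; the ZooKeeper quorum, consumer group, topic name, and partition count are all placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numPartitions = 4 // assumed number of partitions in the topic
    val kafkaStreams = (1 to numPartitions).map { _ =>
      // each createStream call starts its own receiver
      KafkaUtils.createStream(ssc, "zk1:2181", "tuning-group", Map("myTopic" -> 1))
    }
    // merge the per-receiver streams back into a single DStream
    val unified = ssc.union(kafkaStreams)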
You can also pass an explicit degree-of-parallelism parameter to shuffle operations such as reduceByKey() and join().
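For example, continuing from the unified stream above (the partition count of 8 is illustrative; match it to your cluster's cores):

    val wordCounts = unified
      .flatMap { case (_, line) => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _, 8) // 8 reduce tasks instead of the default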
3. Use Kryo serialization
By default Spark uses the serialization built into Java. Although it can handle any class that implements java.io.Serializable, its performance is poor, and if serialization becomes a bottleneck you can switch to the Kryo serializer (for how to use Kryo in Spark, see Customize the Kryo serialization input and output API in Spark). Efficiently serialized data can also improve GC behavior.
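A sketch of enabling Kryo on the SparkConf from section 1; MyRecord stands in for whatever classes your streams actually carry:

    // hypothetical record type used only for illustration
    case class MyRecord(id: Long, payload: String)

    val conf = new SparkConf()
      .setAppName("StreamingTuning")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))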
4. Cache data that needs to be used frequently
For some commonly used data, we can explicitly call rdd.cache() to cache it, which speeds up processing but requires more memory.
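The same applies to DStreams; caching pays off when one stream feeds several downstream computations. A sketch, reusing the unified stream from section 2:

    // cache() marks the underlying RDDs of the DStream for reuse
    val words = unified.flatMap { case (_, line) => line.split(" ") }.cache()
    words.filter(_.startsWith("error")).print() // first use
    words.count().print()                       // second use, served from cache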
5. Eliminate unwanted data
Over time, some data is no longer needed, yet it stays cached in memory and consumes our precious memory resources. We can set spark.cleaner.ttl to a reasonable value to purge it, but this value must not be too small: if data needed by later computations is purged, it will cause unnecessary trouble. Better still, we can set the option spark.streaming.unpersist to true (the default is true) to unpersist RDDs more intelligently. With this configuration the system identifies the RDDs that do not need to stay resident and unpersists them, which reduces the memory usage of Spark's RDDs and may also improve garbage-collection behavior.
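Both knobs live on the SparkConf. A sketch (note that spark.cleaner.ttl takes seconds and was removed in later Spark releases, so check that your version still supports it):

    conf.set("spark.cleaner.ttl", "3600")         // purge data older than 1 hour
    conf.set("spark.streaming.unpersist", "true") // already the default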
6. Set up reasonable GC
GC is one of the hardest parts of a program to get right, and unreasonable GC behavior has a great impact on performance. In a cluster environment we can use the concurrent mark-sweep (CMS) garbage collector; although it consumes more resources, we still recommend enabling it. It can be configured as follows:
    spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC
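This line can go in spark-defaults.conf or be passed via spark-submit's --conf flag; equivalently, as a sketch, it can be set on the SparkConf in code (executor JVM options must be in place before the executors launch, so this belongs with the application's initial configuration):

    conf.set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")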
For more ways to configure GC behavior, refer to articles on Java garbage collection; the topic is not covered in detail here.
7. Set a reasonable amount of CPU resources
In many cases a streaming program does not need much memory, but it does need a lot of CPU. CPU usage in a streaming program falls into two broad categories: (1) receiving data and (2) processing data. We need to allocate enough CPU so that both receiving and processing have the resources to handle the data in a timely and efficient manner.
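One point worth remembering: each running receiver occupies one core, so the cores given to the application must exceed the number of receivers, or nothing is left for processing. A sketch, assuming the four receivers from section 2 and a standalone/Mesos cluster (spark.cores.max does not apply on YARN, where --executor-cores and --num-executors play this role):

    // 4 cores busy receiving, 4 cores free for processing the batches
    conf.set("spark.cores.max", "8")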