Problem 1: The number of reduce tasks is inappropriate
Solution:
Adjust the default to match the actual workload by modifying the parameter spark.default.parallelism. A common rule of thumb is to set the number of reduce tasks to 2-3 times the number of cores. If the number is too large, it creates many small tasks and increases task-launch overhead; if it is too small, each task handles too much data and the job runs slowly. So tune the reduce task count via spark.default.parallelism.
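A minimal sketch of the setting above. The cluster size (50 executors with 4 cores each) is an illustrative assumption; substitute your own numbers:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed cluster shape: 50 executors x 4 cores (illustrative only).
val totalCores = 50 * 4

// Rule of thumb from the text: parallelism = 2-3x total cores.
val conf = new SparkConf()
  .setAppName("parallelism-tuning")
  .set("spark.default.parallelism", (totalCores * 3).toString)
val sc = new SparkContext(conf)

// The same value can also be passed per shuffle operation, e.g.:
// rdd.reduceByKey(_ + _, totalCores * 3)
```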
Problem 2: Long disk I/O time during shuffle
Solution:
Set spark.local.dir to directories on multiple disks, preferably fast ones, so that shuffle I/O is spread across devices and shuffle performance improves through added I/O bandwidth;
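A sketch of spreading shuffle spill directories across disks. The mount paths are assumptions; point them at directories on physically different devices:

```scala
import org.apache.spark.SparkConf

// Comma-separated list of local dirs on different disks (paths assumed).
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")
```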
Problem 3: Large numbers of map and reduce tasks produce too many small shuffle files
Solution:
Merge shuffle intermediate files by setting spark.shuffle.consolidateFiles to true; the number of shuffle files then scales with the number of reduce tasks rather than with map tasks x reduce tasks;
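A sketch of enabling consolidation (this flag applies to the older hash-based shuffle; it was removed in later Spark releases, where sort-based shuffle is the default):

```scala
import org.apache.spark.SparkConf

// Merge per-map-task shuffle outputs into per-core file groups.
val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")
```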
Problem 4: Long serialization time and large serialized results
Solution:
By default Spark uses the JDK's built-in ObjectOutputStream, which produces large serialized output and long CPU processing times. You can set spark.serializer to org.apache.spark.serializer.KryoSerializer instead.
In addition, if a large object must be shared across many tasks, it is best to distribute it as a broadcast variable rather than shipping it inside each task's closure.
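The two fixes above, sketched together. The `loadLookup` helper and the HDFS path are hypothetical placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch to Kryo serialization.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Hypothetical loader for a large read-only lookup table.
def loadLookup(): Map[String, Int] = ???

// Broadcast once instead of serializing the table into every task closure.
val lookupBc = sc.broadcast(loadLookup())

val rdd = sc.textFile("hdfs:///input")           // path is an assumption
val resolved = rdd.map(line => lookupBc.value.getOrElse(line, -1))
```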
Problem 5: Processing a single record is expensive
Solution:
Replace map with mapPartitions: mapPartitions runs once per partition, so per-record setup costs are paid only once per partition, whereas map is invoked for every record in the partition;
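A sketch of amortizing expensive setup with mapPartitions. `createConnection` and `lookup` are hypothetical helpers standing in for any costly per-task resource:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical expensive resource and per-record operation.
def createConnection(): java.sql.Connection = ???
def lookup(conn: java.sql.Connection, r: String): String = ???

def enrich(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { records =>
    val conn = createConnection()   // opened once per partition, not per record
    records.map(r => lookup(conn, r))
  }

// The map equivalent would pay the setup cost for every single record:
// rdd.map { r => val conn = createConnection(); lookup(conn, r) }
```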
Problem 6: collect is slow when the result set is large
Solution:
In the source, collect pulls all results into a single in-memory array on the driver. For large outputs, write the results directly to the distributed file system instead, then inspect the files there;
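A sketch of writing to HDFS instead of collecting to the driver. The output path is an assumption:

```scala
import org.apache.spark.rdd.RDD

def persistResults(results: RDD[String]): Unit = {
  // results.collect() would materialize everything in one driver-side array.
  // Writing to the distributed file system keeps the output on the executors:
  results.saveAsTextFile("hdfs:///output/results")   // path is an assumption
}

// Inspect afterwards from the shell, e.g.:
//   hdfs dfs -cat /output/results/part-* | head
```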
Problem 7: Skewed task execution speed
Solution:
If the data is skewed, the partition key is usually poorly chosen; consider a different partitioning scheme, or add an intermediate aggregation step to reduce the skewed data before the final shuffle. If the skew is on the worker side, for example executors on certain nodes run slowly, set spark.speculation=true so that slow-running tasks are speculatively re-launched on other nodes;
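A sketch of the speculation setting. The tuning knobs shown are optional, and the values are the illustrative defaults:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  // Optional knobs (values shown are defaults, for illustration):
  .set("spark.speculation.multiplier", "1.5")  // how much slower than median counts as "slow"
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks done before checking
```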
Problem 8: Many empty or tiny tasks are generated after a multi-step RDD pipeline
Solution:
Use coalesce or repartition to reduce the number of partitions in the RDD;
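A sketch of compacting partitions after heavy filtering. The target of 100 partitions is an illustrative value:

```scala
import org.apache.spark.rdd.RDD

def compact(bigRdd: RDD[String]): RDD[String] = {
  val filtered = bigRdd.filter(_.nonEmpty)  // may leave many near-empty partitions
  filtered.coalesce(100)                    // merge without a shuffle (100 is illustrative)
  // Use filtered.repartition(100) instead when a full shuffle is acceptable
  // and you want the data rebalanced evenly.
}
```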
Problem 9: Spark Streaming throughput is low
Solution:
Increase spark.streaming.concurrentJobs so that more than one streaming job can run at a time;
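A sketch of the setting. The value 4 is illustrative; tune it against the available cores and remember that concurrent jobs may process batches out of order:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.concurrentJobs", "4")  // 4 is an illustrative value
```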
Problem 10: Spark Streaming suddenly slows down, with frequent task delays and blocking
Solution:
This happens when the job launch interval (batch interval) is set too short, so each job cannot finish before the next one starts; in other words, the batch windows are spaced too densely;
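A sketch of choosing a batch interval. The 10-second value is illustrative; pick an interval larger than the typical per-batch processing time reported in the streaming UI:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-interval")

// Batch interval must exceed typical processing time per batch,
// otherwise batches queue up and tasks are delayed. 10s is illustrative.
val ssc = new StreamingContext(conf, Seconds(10))
```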
Spark Core Secrets #14: Ten major performance-optimization problems in Spark and their solutions