This blog walks through common Spark problems and their solutions.
1. Control the size of the reduce buffer to avoid OOM
During shuffle, the reduce task does not wait until the map task has written all of its data to disk before pulling it. Instead, as soon as the map side has written some data, the reduce task pulls a small portion of it and immediately proceeds with subsequent operations such as aggregation and applying operator functions.
How much data the reduce task can pull at a time is determined by the reduce-side pull buffer: the pulled data is first placed in this buffer and only then processed. The default size of the buffer is 48MB.
The reduce task pulls and computes at the same time, so it does not necessarily fill the full 48MB on every pull; most of the time it pulls a portion of the data and processes it.
Increasing the reduce-side buffer can reduce the number of pulls and improve shuffle performance. However, when the map side produces a very large amount of data very quickly, every task on the reduce side may fill its buffer to the 48MB limit while pulling. Combined with the large number of objects created by the aggregation code running on the reduce side, this can easily lead to memory overflow, that is, OOM.
If memory overflows on the reduce side, consider reducing the size of the reduce-side pull buffer, for example to 12MB.
This kind of problem has occurred in real production environments, and the fix is a typical case of trading performance for successful execution. With a smaller reduce-side pull buffer, OOM becomes less likely, but in exchange the reduce side has to pull more times, which adds network transfer overhead and reduces performance.
Note: first make sure the job can run at all, and only then consider optimizing performance.
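For reference, a minimal sketch of lowering this buffer, assuming the buffer described above is the one controlled by spark.reducer.maxSizeInFlight (default 48m); the 12m value simply mirrors the example above:
val conf = new SparkConf()
  // Shrink the reduce-side pull buffer from the default 48m to 12m,
  // trading some shuffle performance for a lower risk of OOM
  .set("spark.reducer.maxSizeInFlight", "12m")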
2. Shuffle file pull failure caused by JVM GC
In Spark jobs, a shuffle file not found error sometimes occurs. It is a very common error, and in some cases simply re-running the job makes it disappear.
The likely cause is that during the shuffle, a task in a later stage tries to pull data from the Executor that ran the corresponding task in the previous stage, but that Executor happens to be performing GC. A GC stops all work inside the Executor, including the BlockManager and the Netty-based network communication, so the pulling task waits a long time without receiving data and eventually reports a shuffle file not found error. On the second run the GC is no longer in progress, so the error does not occur again.
This problem can be mitigated by adjusting two parameters: the number of retries for pulling data on the reduce side and the wait interval between retries. Increasing both values gives the reduce side more retry attempts and a longer wait after each failure, so a GC on the other end has time to finish.
val conf = new SparkConf()
  // Retry a failed shuffle fetch up to 60 times instead of the default 3
  .set("spark.shuffle.io.maxRetries", "60")
  // Wait 60s between retries instead of the default 5s, so a long GC has time to finish
  .set("spark.shuffle.io.retryWait", "60s")
3. Solve various serialization errors
When a Spark job reports an error during execution and the message contains words such as Serializable, the error is probably caused by a serialization problem.
Three points deserve attention with respect to serialization:
A custom class used as the element type of an RDD must be serializable;
External custom variables referenced inside operator functions must be serializable;
Third-party types that do not support serialization, such as Connection, must not be used as RDD element types or inside operator functions.
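As a rough illustration of these three points (a sketch with made-up class names, assuming sc is an existing SparkContext; the commented-out Connection lines only indicate where such a resource would be created):
// Element type of the RDD: a case class is Serializable by default
case class Record(id: Int, value: String)

// External variable captured inside an operator function: mark it Serializable
class Multiplier(val factor: Int) extends Serializable
val multiplier = new Multiplier(10)

val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))

// Fine: both the element type and the captured variable are serializable
val scaled = records.map(r => r.id * multiplier.factor)

// Do not capture a non-serializable Connection from the driver; create it
// inside the operator instead, for example once per partition:
records.foreachPartition { iter =>
  // e.g. val conn = DriverManager.getConnection(...)  // built on the executor, never serialized
  iter.foreach(record => println(record))  // placeholder: real code would write record via conn
  // conn.close()
}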
4. Solve the problem caused by the operator function returning NULL
Some operator functions must return a value, but in certain cases there is nothing meaningful to return. Returning NULL directly in those cases causes an error, such as a Scala.Math(NULL) exception.
If you hit such a case and do not want to return a real value, you can handle it as follows (a short sketch follows the list):
Return a special value instead of NULL, such as -1;
After the operator produces the RDD, apply a filter operation to drop the records whose value is -1;
After filtering, call the coalesce operator to compact the partitions and optimize the job.
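A minimal sketch of this pattern (the input data and the parsing logic are made up for illustration; sc is assumed to be an existing SparkContext):
val raw = sc.parallelize(Seq("1", "2", "bad", "4"))

// Return a special value such as -1 instead of NULL when there is nothing to return
val parsed = raw.map { s =>
  try s.toInt catch { case _: NumberFormatException => -1 }
}

// Filter out the special value afterwards
val valid = parsed.filter(_ != -1)

// After filtering, coalesce to compact the now-sparser partitions
val compacted = valid.coalesce(2)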
5. Solve the problem of network card traffic surge caused by YARN-CLIENT mode
In YARN-client mode, the Driver runs on the local (submitting) machine. It is responsible for all task scheduling and therefore needs to communicate frequently with the many Executors on the YARN cluster.
Suppose there are 100 Executors and 1,000 tasks, so each Executor is assigned 10 tasks. The Driver then communicates frequently with all 1,000 tasks running on the Executors; the volume of communication data is large and the communication frequency is very high. As a result, the network card traffic of the local machine may surge during the Spark job because of this frequent, heavy network communication.
Note that YARN-client mode should only be used in test environments, and the reason for using it is that it shows detailed, complete log information locally. By reading the logs you can pin down problems in the program and avoid failures in the production environment.
In a production environment, YARN-cluster mode must be used. In that mode the Driver runs inside the cluster, so the local machine's network card traffic does not surge; if network communication problems do arise in YARN-cluster mode, they are a matter for the operations team to handle.
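For reference, a sketch of submitting in YARN-cluster mode via spark-submit (the main class and jar path are placeholders):
# Placeholders: substitute your own main class and application jar
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkJob \
  /path/to/my-spark-job.jar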
6. Solve the problem that the JVM stack in YARN-CLUSTER mode cannot be executed due to memory overflow
When a Spark job contains SparkSQL, it may run in YARN-client mode yet fail to be submitted in YARN-cluster mode (an OOM error is reported).
In YARN-client mode the Driver runs on the local machine, and the JVM PermGen settings Spark uses come from the local spark-class file, where the permanent generation is set to 128MB; that is enough. In YARN-cluster mode the Driver runs on a node of the YARN cluster with unconfigured default settings, where the PermGen size is only 82MB.
SparkSQL has to perform complex work internally, such as SQL semantic analysis and syntax tree transformation. If the SQL statement itself is also complicated, this is likely to cost performance and memory, and PermGen usage in particular grows.
So if PermGen usage exceeds 82MB but stays below 128MB, the job can run in YARN-client mode but not in YARN-cluster mode.
To solve this, increase the PermGen capacity by setting the relevant parameter in the spark-submit script:
--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"
This sets the Driver's permanent generation to an initial size of 128MB and a maximum of 256MB, which avoids the problem described above.
7. Solve the JVM stack memory overflow caused by SparkSQL
When a SparkSQL statement contains hundreds or thousands of or keywords, a JVM stack memory overflow may occur on the Driver side.
A JVM stack overflow is basically caused by too many nested method calls, that is, very deep recursion that exceeds the JVM's stack depth limit. (Our guess is that when a SparkSQL statement contains a large number of or clauses, the SQL parsing steps, such as building the syntax tree or generating the execution plan, handle or recursively, so a huge number of or clauses produces very deep recursion.)
In this case it is recommended to split the single SQL statement into several statements and keep each one under roughly 100 clauses. Based on tests in real production environments, keeping the number of or keywords per statement within 100 usually avoids the JVM stack overflow.
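As an illustration, a rough sketch of the splitting idea (spark is assumed to be an existing SparkSession, and the table and column names are made up): rather than one statement with a huge chain of or conditions, run several smaller statements and union their results.
// Hypothetical ids that would otherwise become thousands of chained "or" conditions
val ids = (1 to 1000).toList

// Split into groups of at most 100 and run one SQL statement per group
val partialResults = ids.grouped(100).map { group =>
  val condition = group.map(id => s"id = $id").mkString(" or ")
  spark.sql(s"select * from events where $condition")
}

// Union the partial results back into a single DataFrame
val result = partialResults.reduce(_ union _)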
8. Persistence and the use of checkpoint
Spark persistence works fine in most cases, but cached data can sometimes be lost. If that happens, the lost data has to be recomputed and then cached again before it can be used. To guard against this kind of loss, you can checkpoint the RDD, which means persisting its data to a fault-tolerant file system such as HDFS.
After an RDD has been both cached and checkpointed, if the cache turns out to be lost, Spark first checks whether the checkpoint data exists; if it does, the checkpoint data is used and nothing is recomputed. In other words, a checkpoint can be regarded as a safety net for the cache: if the cache fails, the checkpoint data is used instead.
The advantage of checkpointing is improved reliability: when the cache has a problem, the data does not have to be recomputed. The disadvantage is that checkpointing writes the data to a file system such as HDFS, which costs a noticeable amount of performance.
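A minimal sketch of combining cache and checkpoint (the HDFS paths are placeholders; sc is assumed to be an existing SparkContext):
// Directory on a fault-tolerant file system where checkpoint data will be written
sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")

val important = sc.textFile("hdfs://namenode:8020/data/input").map(_.length)

// Cache first so the checkpoint job can reuse the cached data instead of recomputing it
important.cache()
// Mark the RDD for checkpointing; the data is written out when an action runs
important.checkpoint()

important.count()  // triggers the computation and then the checkpoint write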
That concludes this post.