Recently, while upgrading a framework, I found that a streaming computation job would hit a "GC overhead limit exceeded" error after running for a while.
The error clearly points to insufficient memory, but the memory allocated at the start seemed sufficient. So I tried various memory optimizations, such as moving variable definitions out of loop bodies, but that only delayed the error a little.
The real crux of the problem was still not found.
Later, on further analysis, I suspected that some variable was holding memory that was not being released in time.
There were a few places in the code where DataFrames were cached, but Spark is supposed to have a mechanism that automatically releases and cleans up these caches.
As a test, I manually added unpersist calls to release the memory, deployed the change, and the problem disappeared.
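To illustrate the pattern, here is a minimal, self-contained sketch (the socket source, column name, and counts are hypothetical, not the actual job): a DataFrame built inside each micro-batch is cached because it is reused for several actions, then released explicitly with unpersist() before the batch ends, instead of waiting for Spark's automatic cleanup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CacheReleaseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-release-sketch").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    // Hypothetical input source, standing in for the real stream.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      import spark.implicits._
      // Cached because it is reused by several actions within this batch.
      val df = rdd.toDF("line").cache()

      val total = df.count()
      val nonEmpty = df.filter($"line" =!= "").count()
      println(s"total=$total nonEmpty=$nonEmpty")

      // Release the cached blocks right away instead of relying on LRU eviction,
      // which in a streaming job may lag behind the rate at which batches arrive.
      df.unpersist()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The key point is simply that the unpersist() call sits at the end of the same foreachRDD block that created the cache, so each batch cleans up after itself.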
So this really was a memory problem after all.
Taking a closer look at the official documentation:
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
It seems this automatic mechanism kicks in too late for streaming computations, which leads to the error. A deep pit indeed.
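For completeness, unpersist also takes a blocking flag; a small hypothetical sketch of using it so the cached blocks are actually gone before the next round of work starts (the RDD and its contents are made up for illustration):

```scala
import org.apache.spark.SparkContext

object BlockingUnpersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate()

    // Persist an RDD, use it, then release it synchronously.
    val rdd = sc.parallelize(1 to 1000).cache()
    println(rdd.count())

    // blocking = true waits until the cached blocks are actually removed
    // before returning, rather than scheduling the removal asynchronously.
    rdd.unpersist(blocking = true)
  }
}
```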