8.4.2 Critical performance considerations for memory management
Spark uses memory in several different ways, and understanding and tuning its memory usage can help you optimize your application. Inside each executor process, memory serves a handful of distinct purposes:
RDD storage
    When you call persist() or cache() on an RDD, its partitions are stored in memory buffers. Spark limits the amount of memory used for caching to a fraction of the total JVM heap, set by spark.storage.memoryFraction. If the limit is exceeded, older partitions are evicted from memory.

Shuffle and aggregation buffers
    When performing shuffle operations, Spark creates intermediate buffers for storing shuffle output data. These buffers hold the intermediate results of aggregations, as well as data that is output directly as part of the shuffle. Spark attempts to limit the memory used by these buffers to a fraction of the total heap, set by spark.shuffle.memoryFraction.

User code
    Spark executes arbitrary user code, so user functions can themselves require large amounts of memory; for instance, an application might allocate large arrays or other objects. User code has access to everything left in the JVM heap after the space for RDD storage and shuffle buffers has been allocated.
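Both fractions are ordinary configuration properties. As a minimal sketch (assuming a Spark 1.x application, where these settings apply; the values are illustrative, not recommendations), they can be set on the SparkConf when the application starts:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only: give shuffle buffers more room
    // and RDD caching a little less.
    val conf = new SparkConf()
      .setAppName("MemoryTuningSketch")
      .set("spark.storage.memoryFraction", "0.5")  // default is 0.6
      .set("spark.shuffle.memoryFraction", "0.3")  // default is 0.2
    val sc = new SparkContext(conf)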
By default, Spark uses 60% of the heap for RDD storage, 20% for shuffle memory, and leaves the remaining 20% for user programs. Users can adjust these fractions to get better performance: if user code allocates a large number of objects, shrinking the RDD storage and shuffle regions can help avoid running out of memory.

Beyond adjusting the memory regions, you can improve certain workloads by changing caching behavior. Spark's default cache() operation persists data at the MEMORY_ONLY storage level. This means that if there is not enough space to cache new RDD partitions, old partitions are simply dropped and recomputed the next time they are needed. It is sometimes better to call persist() with the MEMORY_AND_DISK storage level instead: at this level, partitions that do not fit in memory are written to disk and read back from disk when they are needed again. Reading partitions back from disk is often much cheaper than recomputing them and leads to more predictable performance, so this setting is especially useful when recomputing the RDD's partitions is expensive.

A second improvement on the default caching policy is to cache serialized objects rather than raw Java objects, using the MEMORY_ONLY_SER or MEMORY_AND_DISK_SER storage levels. Caching serialized objects makes the caching process somewhat slower, since serializing the objects takes time, but it can substantially reduce the JVM's garbage collection time, because many individual records can now be stored as a single serialized buffer. The cost of garbage collection scales with the number of objects on the heap, not with the number of bytes of data, and this caching method collapses many objects into one large buffer. Consider this option if you need to cache large amounts of data as objects, or if you notice long garbage collection pauses; these pause times appear in the GC time column for each task in the application's web UI.
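As an illustration of these storage levels, here is a short sketch; the input path and the parsing step are hypothetical, and sc is the SparkContext from the configuration sketch above:

    import org.apache.spark.storage.StorageLevel

    // Hypothetical, expensive-to-recompute RDD; the path is illustrative.
    val parsed = sc.textFile("hdfs:///path/to/input")
                   .map(line => line.split(','))

    // MEMORY_AND_DISK: partitions that do not fit in memory are written
    // to disk and read back when needed, rather than dropped and recomputed.
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    // An RDD's storage level can be set only once; to cache serialized
    // objects instead, a separate RDD would use, for example:
    //   other.persist(StorageLevel.MEMORY_ONLY_SER)

Note that each RDD can carry only one storage level, so choose the level per RDD based on whether recomputation cost or garbage collection pressure is the bigger concern.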