The MetadataCleaner object contains a timer that periodically cleans up the following metadata:
MAP_OUTPUT_TRACKER: metadata about map task output
SPARK_CONTEXT: the RDDs in persistentRdds
HTTP_BROADCAST: metadata of HTTP broadcast
BLOCK_MANAGER: the data stored in the BlockManager
SHUFFLE_BLOCK_MANAGER: the output data of shuffle
BROADCAST_VARS: metadata of Torrent broadcast
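The timer-based cleanup can be pictured as a map whose entries carry an insertion timestamp, plus a scheduled task that drops entries older than a time-to-live. The following is a minimal sketch of that idea only, not Spark's MetadataCleaner: the class TimedMetadataStore and its methods put, get, clearOlderThan, and startCleaner are hypothetical names chosen for illustration.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a timestamped map with a periodic cleaner (not Spark's actual code).
final class TimedMetadataStore<K, V> {
    // Each value is stored together with the time it was inserted.
    private static final class Stamped<T> {
        final T value;
        final long insertedAtMillis;
        Stamped(T value, long insertedAtMillis) {
            this.value = value;
            this.insertedAtMillis = insertedAtMillis;
        }
    }

    private final Map<K, Stamped<V>> entries = new ConcurrentHashMap<>();

    void put(K key, V value, long nowMillis) {
        entries.put(key, new Stamped<>(value, nowMillis));
    }

    V get(K key) {
        Stamped<V> s = entries.get(key);
        return s == null ? null : s.value;
    }

    int size() {
        return entries.size();
    }

    // Remove every entry inserted before thresholdMillis and return how many were removed.
    int clearOlderThan(long thresholdMillis) {
        int removed = 0;
        for (Iterator<Stamped<V>> it = entries.values().iterator(); it.hasNext();) {
            if (it.next().insertedAtMillis < thresholdMillis) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // Start a daemon timer that periodically drops entries older than ttlMillis.
    ScheduledExecutorService startCleaner(long ttlMillis, long periodMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "metadata-cleaner-sketch");
            t.setDaemon(true);
            return t;
        });
        timer.scheduleAtFixedRate(
            () -> clearOlderThan(System.currentTimeMillis() - ttlMillis),
            periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return timer;
    }
}
```

In this sketch, clearOlderThan takes the threshold as a parameter so that the expiry logic can be exercised without waiting for the timer; startCleaner only wires the same call to a fixed-rate schedule.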
ContextCleaner cleans up the real data. It maintains weak references to RDDs, shuffles, broadcasts, accumulators, and checkpoints. When one of these objects becomes unreachable, its reference is inserted into the referenceQueue, and a separate thread processes the entries in that queue:
RDD: the RDD's data is ultimately deleted from the memoryStore and diskStore of the BlockManager on every node.
shuffle: the mapStatuses information for the shuffleId is deleted on the driver, and the data files and index files of all partitions of that shuffleId are deleted on all nodes.
broadcast: the broadcast data is ultimately deleted from the memoryStore and diskStore of the BlockManager on every node.
Checkpoint: the files for the rddId are removed from the checkpointDir directory.

Take RDDs as an example of why this is useful. By default an RDD is not cached, so after it has been computed, the next use has to recompute it. To avoid the cost of recomputation you must cache the RDD; that much is well known. But when is a cached RDD released? This is where the weak references mentioned above come in. When we call persist to cache an RDD, it calls registerRDDForCleanup(this), which registers the RDD with a weak reference. Once the RDD becomes unreachable, its reference is automatically inserted into the referenceQueue at the next GC, and the cleaning thread takes the doCleanupRDD branch. Because the cached data may be in memory or on disk, this guarantees that once GC runs, the real data of an unreachable RDD is released from the BlockManager.

Now consider when an RDD becomes unreachable. To free memory for use elsewhere, cached RDD data also needs to be cleared automatically on a schedule, in addition to manual unpersist calls. This is what MetadataCleaner's SPARK_CONTEXT type does: it periodically cleans up the expired entries in persistentRdds, which in effect is the same as calling unpersist. Once an entry has been cleaned up, there is no strong reference to that cached RDD.
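The weak-reference mechanism described above can be sketched with the JVM's WeakReference and ReferenceQueue classes. This is an illustrative sketch under stated assumptions, not Spark's ContextCleaner: the names WeakCleanerSketch, CleanupRef, registerForCleanup, and drainOnce are hypothetical, and the cleanup action merely records an id where a real cleaner would remove blocks from the BlockManager.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the weak-reference cleanup pattern (not Spark's actual code).
final class WeakCleanerSketch {
    // A weak reference that remembers which cached data to drop (here just an id).
    static final class CleanupRef extends WeakReference<Object> {
        final int rddId;
        CleanupRef(Object referent, int rddId, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.rddId = rddId;
        }
    }

    private final ReferenceQueue<Object> referenceQueue = new ReferenceQueue<>();
    // Strong references to the CleanupRef objects themselves, so they survive until processed.
    private final Set<CleanupRef> pending = ConcurrentHashMap.newKeySet();
    // Records which ids were cleaned; stands in for deleting the cached blocks.
    final List<Integer> cleanedIds = new ArrayList<>();

    // Register an object so that cleanup runs after it becomes unreachable.
    CleanupRef registerForCleanup(Object rdd, int rddId) {
        CleanupRef ref = new CleanupRef(rdd, rddId, referenceQueue);
        pending.add(ref);
        return ref;
    }

    // Drain the queue once and return the number of references handled.
    // A long-running cleaner would instead loop on referenceQueue.remove(timeout) in a daemon thread.
    int drainOnce() {
        int handled = 0;
        Reference<?> r;
        while ((r = referenceQueue.poll()) != null) {
            CleanupRef ref = (CleanupRef) r;
            pending.remove(ref);
            cleanedIds.add(ref.rddId);
            handled++;
        }
        return handled;
    }
}
```

Because GC timing is not deterministic, a quick way to try the sketch is to call enqueue() on the returned CleanupRef, which puts the reference on the queue the same way the collector would, and then call drainOnce(). The registered id should then appear in cleanedIds.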