Spark Cache Method

Source: Internet
Author: User
Keywords: spark, spark cache, spark cache method
Spark cache cleaning mechanism:

A timer inside the MetadataCleaner object periodically cleans up the following metadata:

MAP_OUTPUT_TRACKER: map task output metadata
SPARK_CONTEXT: RDDs held in persistentRdds
HTTP_BROADCAST: metadata of HTTP broadcasts
BLOCK_MANAGER: data stored in the BlockManager
SHUFFLE_BLOCK_MANAGER: shuffle output data
BROADCAST_VARS: broadcast variable metadata
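The timer-driven cleanup above can be sketched as a map whose entries are timestamped on insert, plus a daemon timer that periodically drops entries older than a TTL. This is a minimal illustration of the pattern, not Spark's actual API; names such as TimestampedMap, MetadataCleanerSketch, and clearOldValues are chosen here for clarity.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.ConcurrentHashMap;

// Map that remembers when each entry was inserted (illustrative sketch).
class TimestampedMap<K, V> {
    private static final class Entry<V> {
        final V value;
        final long timestamp;
        Entry(V value) { this.value = value; this.timestamp = System.currentTimeMillis(); }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();

    void put(K key, V value) { map.put(key, new Entry<>(value)); }

    V get(K key) {
        Entry<V> e = map.get(key);
        return e == null ? null : e.value;
    }

    int size() { return map.size(); }

    // Remove every entry whose insertion timestamp is older than cutoffMs.
    void clearOldValues(long cutoffMs) {
        map.entrySet().removeIf(e -> e.getValue().timestamp < cutoffMs);
    }
}

// Periodic cleaner: every periodMs, drop entries older than ttlMs.
class MetadataCleanerSketch {
    private final Timer timer = new Timer("metadata-cleaner", true); // daemon thread

    MetadataCleanerSketch(TimestampedMap<?, ?> target, long ttlMs, long periodMs) {
        timer.schedule(new TimerTask() {
            @Override public void run() {
                target.clearOldValues(System.currentTimeMillis() - ttlMs);
            }
        }, periodMs, periodMs);
    }

    void cancel() { timer.cancel(); }
}
```

An entry that is still being read can thus disappear between two accesses, which is why, as described below, the actual data release for cached RDDs goes through a separate reference-based mechanism rather than this timer alone.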

ContextCleaner cleans up the actual data. ContextCleaner maintains a weak reference for each RDD, shuffle, broadcast, accumulator, and checkpoint. When the referenced object becomes unreachable, its reference is inserted into a referenceQueue, and a dedicated thread processes the objects in this queue:

RDD: delete the RDD's data from the memoryStore and diskStore of each node's BlockManager.
shuffle: delete the mapStatuses entry for the shuffleId on the driver; delete the data files and index files of every partition of that shuffleId on all nodes.
broadcast: delete the broadcast data from the memoryStore and diskStore of each node's BlockManager.
checkpoint: clean up the files for the rddId under the checkpointDir directory.

Take RDD as an example to explain the benefit of this design. By default an RDD is not cached: once a computation finishes, the next use must recompute it. To avoid that recomputation overhead, the RDD must be cached. But when is a cached RDD released? This is where the weak references above come in. When we call persist to cache an RDD, it calls registerRDDForCleanup(this), which registers the RDD itself under a weak reference. Once the RDD becomes unreachable, at the next GC the RDD object is inserted into the referenceQueue and the cleanup thread takes the doCleanupRDD branch. Because the RDD's data may reside in memory or on disk, this guarantees that when GC collects an unreachable RDD, the actual RDD data held in the BlockManager is released as well.

Consider further: when does an RDD become unreachable? To free memory for use elsewhere, besides a manual unpersist, cached RDD data also needs to be cleared on a schedule. This is what MetadataCleaner's SPARK_CONTEXT task does: it periodically removes expired entries from persistentRdds, which has the same effect as unpersist. Once an entry is removed, no strong reference to the cached RDD remains.
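The weak-reference mechanism described above can be illustrated with a small sketch. This is not Spark's actual ContextCleaner; the class and method names (CleanerSketch, register, drainAndClean) are made up for illustration of the ReferenceQueue pattern it relies on.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

// Each registered object gets a WeakReference tied to a shared ReferenceQueue.
// Once the object is only weakly reachable, the GC enqueues its reference,
// and a cleanup pass drains the queue and releases the associated data.
class CleanerSketch<T> {
    private final ReferenceQueue<T> queue = new ReferenceQueue<>();
    // The WeakReference objects themselves must stay strongly reachable
    // (Spark keeps them in a buffer for the same reason); otherwise they
    // could be collected before ever being enqueued.
    private final List<WeakReference<T>> refs = new ArrayList<>();

    // Register an object for cleanup; returns the reference so the caller
    // can observe it.
    WeakReference<T> register(T obj) {
        WeakReference<T> ref = new WeakReference<>(obj, queue);
        refs.add(ref);
        return ref;
    }

    // Drain the queue, returning how many unreachable objects were "cleaned".
    // Spark's cleaning thread dispatches at this point to doCleanupRDD,
    // doCleanupShuffle, doCleanupBroadcast, and so on.
    int drainAndClean() {
        int cleaned = 0;
        Reference<? extends T> ref;
        while ((ref = queue.poll()) != null) {
            refs.remove(ref);
            cleaned++;
        }
        return cleaned;
    }
}
```

Note that nothing is cleaned while a strong reference to the object still exists; only after the last strong reference is dropped (by unpersist, or by the SPARK_CONTEXT timer expiring the persistentRdds entry) does GC enqueue the reference and allow the data to be released.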