Spark Technology Insider: Storage Module Overall Architecture


The storage module is responsible for all storage in the Spark computation process, both disk-based and memory-based. In ordinary programming, the user works with RDDs and can persist an RDD's data by calling org.apache.spark.rdd.RDD#cache; the persistence itself is carried out by the storage module. The data produced during the shuffle process is also managed by the storage module. It can be said that the RDD implements the user's logic, while the storage module manages the user's data. This chapter explains the implementation of the storage module.

1.1 Module Overall Architecture

org.apache.spark.storage.BlockManager is the primary class through which the storage module interacts with other modules; it provides the interface for reading and writing blocks. A block here corresponds to a partition of an RDD: each partition maps to one block. Each such block is identified by a unique block ID (org.apache.spark.storage.RDDBlockId) whose name has the format "rdd_" + rddId + "_" + partitionId.
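As a minimal sketch of this naming scheme (the real class is org.apache.spark.storage.RDDBlockId; the demo object below is only illustrative):

```scala
// Minimal sketch of the RDD block naming scheme described above.
// The real class is org.apache.spark.storage.RDDBlockId.
case class RDDBlockId(rddId: Int, splitIndex: Int) {
  // Follows the "rdd_" + rddId + "_" + partitionId format.
  def name: String = "rdd_" + rddId + "_" + splitIndex
}

object BlockIdDemo {
  def main(args: Array[String]): Unit = {
    // Partition 7 of RDD 3 corresponds to one block.
    val id = RDDBlockId(rddId = 3, splitIndex = 7)
    println(id.name) // rdd_3_7
  }
}
```

Because the name is derived purely from the RDD ID and partition index, any component of the storage module can compute it without a central lookup.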

A BlockManager runs on the driver and on each executor. The BlockManager on the driver is responsible for managing the blocks of the entire job, while the BlockManager on each executor manages the blocks on that executor, reports their status to the driver's BlockManager, and receives commands from it.
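A highly simplified model of this driver-side bookkeeping might look like the following. All names here are hypothetical; the real implementation (BlockManagerMasterActor) is actor-based and handles many more message types:

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the driver-side bookkeeping:
// executors report which blocks they hold, and the driver can answer
// "which executors have block X?" queries.
object BlockMasterModel {
  // blockName -> set of executor IDs that currently hold the block
  private val locations = mutable.Map.empty[String, mutable.Set[String]]

  // Called when an executor's BlockManager reports a block it has stored.
  def reportBlock(executorId: String, blockName: String): Unit =
    locations.getOrElseUpdate(blockName, mutable.Set.empty) += executorId

  // Called when the driver orders a block removed from an executor.
  def removeBlock(executorId: String, blockName: String): Unit =
    locations.get(blockName).foreach(_ -= executorId)

  def getLocations(blockName: String): Set[String] =
    locations.getOrElse(blockName, mutable.Set.empty).toSet
}
```

The point of the sketch is the division of labor: executors hold the data, while the driver holds only the metadata needed to locate and manage it.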


Functional descriptions of the major classes:

1) org.apache.spark.storage.BlockManager: Provides the interface between the storage module and other modules, and manages the storage module.

2) org.apache.spark.storage.BlockManagerMaster: The block-management interface class; its work is done mainly by calling org.apache.spark.storage.BlockManagerMasterActor.

3) org.apache.spark.storage.BlockManagerMasterActor: An actor on the driver node, responsible for tracking the block information of all slave nodes.

4) org.apache.spark.storage.BlockManagerSlaveActor: Runs on every node and receives commands from org.apache.spark.storage.BlockManagerMasterActor, such as deleting the data of an RDD, deleting a block, deleting shuffle data, or returning the status of certain blocks.

5) org.apache.spark.storage.BlockManagerSource: Responsible for collecting the metrics of the storage module, including the maximum memory, the remaining memory, the memory used, and the disk space used. These are obtained by invoking the getStorageStatus interface of org.apache.spark.storage.BlockManagerMaster.

6) org.apache.spark.storage.BlockObjectWriter: An abstract class for writing JVM objects directly to an external storage system. Note that it does not support concurrent writes.

7) org.apache.spark.storage.DiskBlockObjectWriter: Supports writing directly to a file on disk, including appending to the file. It is an implementation of org.apache.spark.storage.BlockObjectWriter. The following classes currently use it when they need to spill data to disk:

a) org.apache.spark.util.collection.ExternalSorter

b) org.apache.spark.shuffle.FileShuffleBlockManager

8) org.apache.spark.storage.DiskBlockManager: Manages and maintains the mapping between logical blocks and the physical blocks stored on disk. In general, a logical block is mapped to a physical file whose name is generated from its BlockId. These physical files are hashed into different subdirectories under spark.local.dir (or the directories set by SPARK_LOCAL_DIRS).

9) org.apache.spark.storage.BlockStore: An abstract class for storing blocks. Its current implementations are:

a) org.apache.spark.storage.DiskStore

b) org.apache.spark.storage.MemoryStore

c) org.apache.spark.storage.TachyonStore

10) org.apache.spark.storage.DiskStore: Implements storing blocks on disk. Writing to disk is done through org.apache.spark.storage.DiskBlockObjectWriter.

11) org.apache.spark.storage.MemoryStore: Implements storing blocks in memory.

12) org.apache.spark.storage.TachyonStore: Implements storing blocks on Tachyon.

13) org.apache.spark.storage.TachyonBlockManager: Manages and maintains the mapping between logical blocks and files on the Tachyon file system. Its role is similar to that of org.apache.spark.storage.DiskBlockManager.

14) org.apache.spark.storage.ShuffleBlockFetcherIterator: Implements the logic for fetching shuffle blocks, including reading local blocks and issuing network requests to read blocks on other nodes. For the concrete implementation, refer to the chapter on the shuffle module.
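The directory hashing performed by DiskBlockManager (described in item 8 above) can be sketched as follows. The helper names are illustrative, not Spark's actual code; the sub-directory count mirrors a common default of 64:

```scala
import java.io.File

// Illustrative sketch of how a DiskBlockManager-style component maps a
// logical block name to a physical file spread across local directories.
// Spark's real code lives in org.apache.spark.storage.DiskBlockManager.
object DiskLayoutSketch {
  val subDirsPerLocalDir = 64

  // Non-negative hash of the block name (guards against Int.MinValue,
  // whose absolute value overflows).
  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h != Int.MinValue) math.abs(h) else 0
  }

  // Pick a local dir and a sub-dir deterministically from the block name,
  // so the same block always resolves to the same file with no lookup table.
  def getFile(localDirs: Seq[String], blockName: String): File = {
    val hash = nonNegativeHash(blockName)
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    new File(new File(localDirs(dirId), "%02x".format(subDirId)), blockName)
  }
}
```

Because the file location is a pure function of the block name and the configured directories, executors can read and write blocks without consulting any central index, and spreading files over many sub-directories avoids putting thousands of files in a single directory.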

