Spark Technology Insider: Storage Module Overall Architecture


The storage module is responsible for all storage in the Spark computation process, both disk-based and memory-based. In ordinary programming, the user works with RDDs and can persist an RDD's data by calling org.apache.spark.rdd.RDD#cache; the persistence itself is carried out by the storage module. The data produced during the shuffle process is also managed by the storage module. It can be said that the RDD implements the user's logic, while the storage module manages the user's data. This chapter explains the implementation of the storage module.

1.1 Module Overall Architecture

org.apache.spark.storage.BlockManager is the primary class through which the storage module interacts with other modules; it provides the interface for reading and writing blocks. A block here corresponds to a partition of an RDD: each partition maps to one block. Each such block is identified by a unique block ID (org.apache.spark.storage.RDDBlockId) whose name has the format "rdd_" + rddId + "_" + partitionId.
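As a minimal sketch of this naming scheme (the real class is org.apache.spark.storage.RDDBlockId; the demo object below is only illustrative):

```scala
// Minimal sketch of the RDD block naming scheme described above.
// The real class is org.apache.spark.storage.RDDBlockId.
case class RDDBlockId(rddId: Int, splitIndex: Int) {
  // Follows the "rdd_" + rddId + "_" + partitionId format.
  def name: String = "rdd_" + rddId + "_" + splitIndex
}

object BlockIdDemo {
  def main(args: Array[String]): Unit = {
    // Partition 7 of RDD 3 corresponds to one block.
    val id = RDDBlockId(rddId = 3, splitIndex = 7)
    println(id.name) // rdd_3_7
  }
}
```

Because the name is derived purely from the RDD ID and partition index, any component of the storage module can compute it without a central lookup.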

A BlockManager runs on the driver and on each executor. The BlockManager on the driver is responsible for managing the blocks of the entire job, while the BlockManager on each executor manages the blocks on that executor, reports their status to the driver's BlockManager, and receives commands from it.
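A highly simplified model of this driver-side bookkeeping might look like the following. All names here are hypothetical; the real implementation (BlockManagerMasterActor) is actor-based and handles many more message types:

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the driver-side bookkeeping:
// executors report which blocks they hold, and the driver can answer
// "which executors have block X?" queries.
object BlockMasterModel {
  // blockName -> set of executor IDs that currently hold the block
  private val locations = mutable.Map.empty[String, mutable.Set[String]]

  // Called when an executor's BlockManager reports a block it has stored.
  def reportBlock(executorId: String, blockName: String): Unit =
    locations.getOrElseUpdate(blockName, mutable.Set.empty) += executorId

  // Called when the driver orders a block removed from an executor.
  def removeBlock(executorId: String, blockName: String): Unit =
    locations.get(blockName).foreach(_ -= executorId)

  def getLocations(blockName: String): Set[String] =
    locations.getOrElse(blockName, mutable.Set.empty).toSet
}
```

The point of the sketch is the division of labor: executors hold the data, while the driver holds only the metadata needed to locate and manage it.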


Functional descriptions of the major classes:

1) org.apache.spark.storage.BlockManager: Provides the interface between the storage module and other modules, and manages the storage module.

2) org.apache.spark.storage.BlockManagerMaster: The block-management interface class; its work is done mainly by calling org.apache.spark.storage.BlockManagerMasterActor.

3) org.apache.spark.storage.BlockManagerMasterActor: An actor on the driver node, responsible for tracking the block information of all slave nodes.

4) org.apache.spark.storage.BlockManagerSlaveActor: Runs on every node and receives commands from org.apache.spark.storage.BlockManagerMasterActor, such as deleting the data of an RDD, deleting a block, deleting shuffle data, or returning the status of certain blocks.

5) org.apache.spark.storage.BlockManagerSource: Responsible for collecting the metrics of the storage module, including the maximum memory, the remaining memory, the memory used, and the disk space used. These are obtained by invoking the getStorageStatus interface of org.apache.spark.storage.BlockManagerMaster.

6) org.apache.spark.storage.BlockObjectWriter: An abstract class for writing JVM objects directly to an external storage system. Note that it does not support concurrent writes.

7) org.apache.spark.storage.DiskBlockObjectWriter: Supports writing directly to a file on disk, including appending to the file. It is an implementation of org.apache.spark.storage.BlockObjectWriter. The following classes currently use it when they need to spill data to disk:

a) org.apache.spark.util.collection.ExternalSorter

b) org.apache.spark.shuffle.FileShuffleBlockManager

8) org.apache.spark.storage.DiskBlockManager: Manages and maintains the mapping between logical blocks and the physical blocks stored on disk. In general, a logical block is mapped to a physical file whose name is generated from its BlockId. These physical files are hashed into different subdirectories under spark.local.dir (or the directories set by SPARK_LOCAL_DIRS).

9) org.apache.spark.storage.BlockStore: An abstract class for storing blocks. Its current implementations are:

a) org.apache.spark.storage.DiskStore

b) org.apache.spark.storage.MemoryStore

c) org.apache.spark.storage.TachyonStore

10) org.apache.spark.storage.DiskStore: Implements storing blocks on disk. Writing to disk is done through org.apache.spark.storage.DiskBlockObjectWriter.

11) org.apache.spark.storage.MemoryStore: Implements storing blocks in memory.

12) org.apache.spark.storage.TachyonStore: Implements storing blocks on Tachyon.

13) org.apache.spark.storage.TachyonBlockManager: Manages and maintains the mapping between logical blocks and files on the Tachyon file system. Its role is similar to that of org.apache.spark.storage.DiskBlockManager.

14) org.apache.spark.storage.ShuffleBlockFetcherIterator: Implements the logic for fetching shuffle blocks, including reading local blocks and issuing network requests to read blocks on other nodes. For the concrete implementation, refer to the chapter on the shuffle module.
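The directory hashing performed by DiskBlockManager (described in item 8 above) can be sketched as follows. The helper names are illustrative, not Spark's actual code; the sub-directory count mirrors a common default of 64:

```scala
import java.io.File

// Illustrative sketch of how a DiskBlockManager-style component maps a
// logical block name to a physical file spread across local directories.
// Spark's real code lives in org.apache.spark.storage.DiskBlockManager.
object DiskLayoutSketch {
  val subDirsPerLocalDir = 64

  // Non-negative hash of the block name (guards against Int.MinValue,
  // whose absolute value overflows).
  def nonNegativeHash(s: String): Int = {
    val h = s.hashCode
    if (h != Int.MinValue) math.abs(h) else 0
  }

  // Pick a local dir and a sub-dir deterministically from the block name,
  // so the same block always resolves to the same file with no lookup table.
  def getFile(localDirs: Seq[String], blockName: String): File = {
    val hash = nonNegativeHash(blockName)
    val dirId = hash % localDirs.length
    val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
    new File(new File(localDirs(dirId), "%02x".format(subDirId)), blockName)
  }
}
```

Because the file location is a pure function of the block name and the configured directories, executors can read and write blocks without consulting any central index, and spreading files over many sub-directories avoids putting thousands of files in a single directory.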

