Introduction
One reason Spark is much faster than Hadoop is that intermediate results are cached in memory rather than written directly to disk. This article analyzes the composition of the storage subsystem in Spark and, using data writing and data reading as examples, describes how the components of the storage subsystem interact.
Storage subsystem Overview
The main modules of the Spark storage subsystem and the relationships between them are as follows.
- CacheManager: An RDD obtains data through CacheManager and also stores its computed results through CacheManager.
- BlockManager: CacheManager relies on the BlockManager interface for reading and writing data. BlockManager decides whether data is fetched from MemoryStore or from DiskStore.
- MemoryStore: Stores data in, or reads data from, memory.
- DiskStore: Writes data to, or reads data from, disk.
- BlockManagerWorker: Writing data to the local MemoryStore or DiskStore is a synchronous operation. For fault tolerance, the data must also be copied to other compute nodes so that it can be recovered if it is lost. This replication is performed asynchronously, and BlockManagerWorker handles it.
- ConnectionManager: Establishes connections with other compute nodes and sends and receives data.
- BlockManagerMaster: Note that this module runs only on the executor that hosts the driver application. It records which slave worker stores each block ID. For example, suppose an RDD task running on machine A needs block ID 3, but machine A does not hold that block. The slave worker then uses BlockManager to ask BlockManagerMaster where the data is stored, and uses ConnectionManager to fetch it. For details, see "Remote reading".
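The bookkeeping role described for BlockManagerMaster can be illustrated with a minimal sketch. The class `BlockLocationRegistry` and its methods are hypothetical names invented for this illustration and are not part of Spark; block IDs and node names are plain strings here.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the bookkeeping done by BlockManagerMaster:
// it records which nodes hold which block, never the data itself.
final class BlockLocationRegistry {
  private val locations = mutable.Map.empty[String, Set[String]]

  // Called when a node reports that it now stores a block.
  def register(blockId: String, nodeId: String): Unit = synchronized {
    locations(blockId) = locations.getOrElse(blockId, Set.empty[String]) + nodeId
  }

  // Called by a node that needs a block it does not hold locally.
  def lookup(blockId: String): Set[String] = synchronized {
    locations.getOrElse(blockId, Set.empty[String])
  }
}
```

In the machine A example, machine A would call `lookup` for block 3, receive the set of nodes that registered it, and then fetch the data from one of them over the network.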
Supported operations
Because BlockManager performs the actual storage control, its public API is used here to illustrate the supported operations.
- put: Writes data.
- get: Reads data.
- removeRdd: Deletes data. Once the entire job has completed, all intermediate results can be deleted.
Startup Process Analysis
The modules above are created by SparkEnv. The creation is done in SparkEnv.create.
    val blockManagerMaster = new BlockManagerMaster(registerOrLookup(
      "BlockManagerMaster",
      new BlockManagerMasterActor(isLocal, conf)), conf)
    val blockManager = new BlockManager(executorId, actorSystem, blockManagerMaster, serializer, conf)
    val connectionManager = blockManager.connectionManager
    val broadcastManager = new BroadcastManager(isDriver, conf)
    val cacheManager = new CacheManager(blockManager)
This code is easy to misread: it looks as if a BlockManagerMasterActor is created on every cluster node. In fact, it is not. Look carefully at the implementation of registerOrLookup: if the current node is the driver, the actor is created; otherwise, a connection to the actor on the driver is established.
    def registerOrLookup(name: String, newActor: => Actor): ActorRef = {
      if (isDriver) {
        // Only the driver evaluates newActor and creates the actor
        logInfo("Registering " + name)
        actorSystem.actorOf(Props(newActor), name = name)
      } else {
        // Every other node looks up the actor that lives on the driver
        val driverHost: String = conf.get("spark.driver.host", "localhost")
        val driverPort: Int = conf.getInt("spark.driver.port", 7077)
        Utils.checkHost(driverHost, "Expected hostname")
        val url = s"akka.tcp://spark@$driverHost:$driverPort/user/$name"
        val timeout = AkkaUtils.lookupTimeout(conf)
        logInfo(s"Connecting to $name: $url")
        Await.result(actorSystem.actorSelection(url).resolveOne(timeout), timeout)
      }
    }
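The reason no actor is created on non-driver nodes is that newActor is a by-name parameter (`=> Actor`), so the `new BlockManagerMasterActor(...)` expression is evaluated only if the function body uses it. The following simplified sketch shows this behaviour in isolation. It is an illustration under stated assumptions, not Spark code: strings stand in for actor references, isDriver is passed as an argument, and `makeActor` and `created` are hypothetical helpers.

```scala
object RegisterOrLookupSketch {
  // Counts how many times the "actor" factory actually ran.
  var created = 0

  // Hypothetical factory standing in for `new BlockManagerMasterActor(...)`.
  def makeActor(): String = { created += 1; "local-actor" }

  // Simplified version of the pattern: newActor is passed by name,
  // so it is evaluated only in the branch that refers to it.
  def registerOrLookup(isDriver: Boolean, name: String, newActor: => String): String =
    if (isDriver) newActor          // evaluated here, on the driver only
    else s"remote-ref-to-$name"     // newActor is never evaluated
}
```

Calling `registerOrLookup(false, "BlockManagerMaster", makeActor())` leaves `created` at 0, whereas the same call with `true` runs the factory once.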
One of the main actions during initialization is that BlockManager registers itself with BlockManagerMaster.
Data Writing Process Analysis
Brief Data Writing Process
- RDD.iterator is the entry point for interaction with the storage subsystem.
- CacheManager.getOrCompute calls the put interface of BlockManager to write data.
- Data is first written to MemoryStore, that is, to memory. If MemoryStore is full, data that is used infrequently is written to disk.
- BlockManagerMaster is notified that new data has been written, and the metadata is saved in BlockManagerMaster.
- The written data is synchronized to other slave workers. Generally, data written on the local machine is backed up on one other machine, that is, the replica number is 1.
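The memory-first step of the write path can be sketched as a small two-tier store. This is a hypothetical illustration, not Spark's MemoryStore or DiskStore: the class `TieredStore` is invented here, a second map stands in for the disk, capacity is counted in entries rather than bytes, and "used infrequently" is approximated by least-recently-used eviction, which is an assumption of this sketch.

```scala
import scala.collection.mutable

// Hypothetical two-tier store: a bounded in-memory map that spills its
// least recently used entry into a second map standing in for the disk.
final class TieredStore(maxInMemory: Int) {
  private val memory = mutable.LinkedHashMap.empty[String, Array[Byte]]
  private val disk = mutable.Map.empty[String, Array[Byte]]

  def put(blockId: String, data: Array[Byte]): Unit = {
    disk.remove(blockId)
    memory.remove(blockId)
    memory.put(blockId, data)              // most recently used goes last
    while (memory.size > maxInMemory) {
      val (oldId, oldData) = memory.head   // least recently used comes first
      memory.remove(oldId)
      disk.put(oldId, oldData)             // spill instead of dropping
    }
  }

  def get(blockId: String): Option[Array[Byte]] =
    memory.remove(blockId) match {
      case Some(data) =>
        memory.put(blockId, data)          // refresh recency on a memory hit
        Some(data)
      case None => disk.get(blockId)       // fall back to the "disk" tier
    }

  def inMemory(blockId: String): Boolean = memory.contains(blockId)
  def onDisk(blockId: String): Boolean = disk.contains(blockId)
}
```

With a capacity of two, writing blocks a and b, reading a, and then writing c spills b, because b is then the least recently used entry.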
Serialization or not
The content written can be either serialized bytes or non-serialized values. This is a good place to become familiar with the Either, Left, and Right types in Scala.
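As a minimal sketch of how Either can model "values or bytes", the example below defines a hypothetical `Payload` alias and a `describe` helper. Both names are assumptions made for this illustration; the choice of Left for values and Right for bytes is likewise only a convention of the sketch.

```scala
import java.nio.ByteBuffer

object EitherSketch {
  // Hypothetical payload type: Left carries non-serialized values,
  // Right carries already-serialized bytes.
  type Payload = Either[Iterator[Any], ByteBuffer]

  // Pattern matching forces the caller to handle both cases.
  def describe(payload: Payload): String = payload match {
    case Left(values) => s"values: ${values.size} element(s)"
    case Right(bytes) => s"bytes: ${bytes.remaining()} byte(s)"
  }
}
```

A caller wraps whatever it has, for example `Left(Iterator(1, 2, 3))` or `Right(ByteBuffer.wrap(bytes))`, and the receiving code decides per case whether serialization is still needed.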
Data Reading Process Analysis
    def get(blockId: BlockId): Option[Iterator[Any]] = {
      // Try the local stores first
      val local = getLocal(blockId)
      if (local.isDefined) {
        logInfo("Found block %s locally".format(blockId))
        return local
      }
      // Fall back to fetching the block from a remote node
      val remote = getRemote(blockId)
      if (remote.isDefined) {
        logInfo("Found block %s remotely".format(blockId))
        return remote
      }
      None
    }
Local read
First, check whether the required block exists in the MemoryStore or DiskStore of the local machine. If it does not, initiate a remote fetch.
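The "local first, then remote" rule can also be expressed with Option.orElse, whose argument is evaluated only when the first lookup is empty. The sketch below is a hypothetical illustration: the maps, the block IDs, and the `remoteLookups` counter are invented here, and strings stand in for block data.

```scala
object ReadPathSketch {
  // Hypothetical stand-ins for the local stores and for a remote node.
  private val localBlocks = Map("rdd_0_0" -> "local data")
  private val remoteBlocks = Map("rdd_0_1" -> "remote data")

  // Counts remote lookups so the fallback behaviour can be observed.
  var remoteLookups = 0

  def getLocal(blockId: String): Option[String] = localBlocks.get(blockId)

  def getRemote(blockId: String): Option[String] = {
    remoteLookups += 1
    remoteBlocks.get(blockId)
  }

  // orElse takes its argument by name, so getRemote runs only on a local miss.
  def get(blockId: String): Option[String] =
    getLocal(blockId).orElse(getRemote(blockId))
}
```

A block that is found locally never triggers a remote lookup, which matters because the remote path involves network I/O.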
Remote reading
The call path for a remote fetch is getRemote -> doGetRemote. The most important step in doGetRemote is the call to BlockManagerWorker.syncGetBlock, which fetches the data from the remote node.
    def syncGetBlock(msg: GetBlock, toConnManagerId: ConnectionManagerId): ByteBuffer = {
      val blockManager = blockManagerWorker.blockManager
      val connectionManager = blockManager.connectionManager
      val blockMessage = BlockMessage.fromGetBlock(msg)
      val blockMessageArray = new BlockMessageArray(blockMessage)
      // Blocks until the remote side has answered
      val responseMessage = connectionManager.sendMessageReliablySync(
        toConnManagerId, blockMessageArray.toBufferMessage)
      responseMessage match {
        case Some(message) => {
          val bufferMessage = message.asInstanceOf[BufferMessage]
          logDebug("Response message received " + bufferMessage)
          BlockMessageArray.fromBufferMessage(bufferMessage).foreach(blockMessage => {
            logDebug("Found " + blockMessage)
            return blockMessage.getData
          })
        }
        case None => logDebug("No response message received")
      }
      null
    }
The most interesting part of the code above is sendMessageReliablySync. Reading remote data is undoubtedly an asynchronous I/O operation, so how can the code here be written as if it were synchronous? In other words, how does the sender learn that the response from the recipient has arrived?
Don't worry. Continue with the definition of sendMessageReliably, on which sendMessageReliablySync relies.
    def sendMessageReliably(connectionManagerId: ConnectionManagerId, message: Message)
        : Future[Option[Message]] = {
      val promise = Promise[Option[Message]]
      // The completion handler fulfils the promise once the ack has arrived
      val status = new MessageStatus(
        message, connectionManagerId, s => promise.success(s.ackMessage))
      messageStatuses.synchronized {
        messageStatuses += ((message.id, status))
      }
      sendMessage(connectionManagerId, message)
      promise.future
    }
If I say the secret is right here, you will probably say I am talking nonsense, but it really is here. Did you notice the keywords Promise and Future?
When the promise is completed, the future yields s.ackMessage. Let's see where this ackMessage is written. Take a look at this code snippet from ConnectionManager.handleMessage:
    case bufferMessage: BufferMessage => {
      if (authEnabled) {
        val res = handleAuthentication(connection, bufferMessage)
        if (res == true) {
          // message was security negotiation so skip the rest
          logDebug("After handleAuth result was true, returning")
          return
        }
      }
      if (bufferMessage.hasAckId) {
        // Look up and remove the pending status recorded by sendMessageReliably
        val sentMessageStatus = messageStatuses.synchronized {
          messageStatuses.get(bufferMessage.ackId) match {
            case Some(status) => {
              messageStatuses -= bufferMessage.ackId
              status
            }
            case None => {
              throw new Exception("Could not find reference for received ack message " +
                message.id)
              null
            }
          }
        }
        sentMessageStatus.synchronized {
          sentMessageStatus.ackMessage = Some(message)
          sentMessageStatus.attempted = true
          sentMessageStatus.acked = true
          sentMessageStatus.markDone()
        }
        // ...
Note that sentMessageStatus.markDone invokes the promise.success callback defined in sendMessageReliably. Take a look at the definition of MessageStatus.
    class MessageStatus(
        val message: Message,
        val connectionManagerId: ConnectionManagerId,
        completionHandler: MessageStatus => Unit) {

      var ackMessage: Option[Message] = None
      var attempted = false
      var acked = false

      // Runs the handler supplied at construction time
      def markDone() { completionHandler(this) }
    }
The call relationship should now be clear. Future and Promise in Scala are still a little difficult to understand.
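To make the Promise and Future interplay easier to follow, here is a self-contained miniature of the same pattern. It is a sketch under stated assumptions rather than Spark code: the class `AckTable` and its methods `send`, `onAck`, and `sendSync` are hypothetical, message IDs are plain integers, and replies are strings.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

// Hypothetical miniature of the ack mechanism: a table of pending requests
// whose entries complete a Promise when the matching ack arrives.
final class AckTable {
  private val pending = mutable.Map.empty[Int, Promise[Option[String]]]

  // Sender side: record the request and hand back a Future for the reply.
  def send(messageId: Int): Future[Option[String]] = {
    val promise = Promise[Option[String]]()
    pending.synchronized { pending += (messageId -> promise) }
    promise.future
  }

  // Receiver side: called (typically from an I/O thread) when an ack comes in.
  def onAck(messageId: Int, reply: String): Unit = {
    val promise = pending.synchronized { pending.remove(messageId) }
    promise.foreach(_.success(Some(reply)))
  }

  // Synchronous facade: block the caller until the ack completes the Future.
  // Another thread must call onAck, otherwise this times out.
  def sendSync(messageId: Int, timeout: FiniteDuration): Option[String] =
    Await.result(send(messageId), timeout)
}
```

The Promise is the write side, held by the code that receives the ack, and the Future is the read side, held by the caller. Waiting on the Future with Await.result is what makes an asynchronous exchange look like a synchronous call.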
Tachyonstore
The latest Spark source code introduces TachyonStore into the storage subsystem. Tachyon implements the HDFS file system interface in memory. Its main purpose is to use memory as the data persistence layer as much as possible and so avoid excessive disk reads and writes.
For more information about this module, see http://www.meetup.com/spark-users/events/117307472/
Summary
One small concern: in the Spark storage subsystem, the communication module carries heartbeat detection messages, data synchronization messages, and data retrieval traffic. If possible, the NIC used for heartbeat detection should be separated from the one used for data synchronization and data retrieval to improve reliability.