HDFS Centralized Cache Management: Principles and Code Analysis (Reprint)


Original address: http://yanbohappy.sinaapp.com/?p=468

Hadoop 2.3.0 has been released, and its biggest highlight is HDFS centralized cache management. This feature helps improve the execution efficiency and real-time performance of Hadoop and the applications built on top of it. This article explores the feature from three perspectives: principles, architecture, and code analysis.

What problems does it mainly solve?

1. Users can specify, according to their own application logic, that frequently used data or data belonging to high-priority tasks should stay resident in memory instead of being evicted to disk. For example, in a data warehouse built with Hive or Impala, the fact table is frequently joined with other tables. Keeping the fact table resident in memory ensures that DataNodes do not evict its data when memory is tight, and it also helps meet the SLAs of mixed workloads.

2. The centralized cache is managed by the NameNode, so HDFS clients (such as MapReduce or Impala) can schedule tasks according to the distribution of the cache and achieve memory locality.

3. HDFS previously relied solely on the DataNode OS buffer cache. Not only was the location of cached data never exposed to upper-layer applications for task-scheduling optimization, but cache could also be wasted. For example, a block with three replicas stored on three DataNodes may end up in the OS buffer cache of all three DataNodes; from HDFS's global point of view, the same block then occupies three copies of cache, which wastes memory.

4. It speeds up HDFS client reads. In the past, the NameNode answered read requests by choosing which DataNode to read from based only on topological distance; now cache locality is taken into account as well. When the HDFS client is on the same DataNode that has cached the block it wants to read, the data can be read directly from memory via zero-copy read, skipping disk I/O and checksum verification.

5. The cache is unaffected when a caching DataNode goes down, a block is moved, or the cluster restarts, because the cache is managed centrally by the NameNode and persisted to the fsimage and edit log. If a DataNode caching a block goes down, the NameNode tells another DataNode that stores a replica of that block to cache it in memory.

Basic concepts

Cache directive: represents a file or directory to be cached in memory.
Cache pool: used to manage a group of cache directives, similar to a namespace. It also uses UNIX-style read/write/execute permission management. Command example:

hdfs cacheadmin -addDirective -path /user/hive/warehouse/fact.db/city -pool financial -replication 1

The command above places the file city on HDFS (which is actually a Hive fact table) into the financial cache pool of the HDFS centralized cache, and requests that only one replica of the file be cached.
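
The same directive can also be added programmatically through the HDFS Java API (DistributedFileSystem.addCachePool / addCacheDirective). The following is a minimal sketch; the NameNode URI hdfs://namenode:8020 is a placeholder, and the pool-creation call can be skipped if the financial pool already exists.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class AddCacheDirectiveExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder cluster address; normally fs.defaultFS already points at it.
    DistributedFileSystem dfs = (DistributedFileSystem)
        FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // Create the cache pool once (equivalent to: hdfs cacheadmin -addPool financial).
    // Skip this call if the pool already exists.
    dfs.addCachePool(new CachePoolInfo("financial"));

    // Equivalent to the cacheadmin -addDirective command above:
    // cache one replica of the fact table under the "financial" pool.
    long directiveId = dfs.addCacheDirective(
        new CacheDirectiveInfo.Builder()
            .setPath(new Path("/user/hive/warehouse/fact.db/city"))
            .setPool("financial")
            .setReplication((short) 1)
            .build());
    System.out.println("Added cache directive with id " + directiveId);
  }
}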

System Architecture and principles

Users can explicitly place a file or directory on HDFS into the HDFS centralized cache through the hdfs cacheadmin command line or the HDFS API. The centralized cache consists of off-heap memory distributed across the DataNodes and is managed centrally by the NameNode. Each DataNode maps the HDFS blocks stored in its local disk files into off-heap memory with mmap and locks them there with mlock.

When a file is read, the DFSClient sends a getBlockLocations RPC request to the NameNode. The NameNode returns a list of LocatedBlock objects to the DFSClient; each LocatedBlock carries both the DataNodes holding replicas of the block and the DataNodes that have cached the block. The cached copies can be thought of as high-speed replicas layered on top of the usual three on-disk replicas.
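
On the client side this cache information also surfaces in the public block-location API. The sketch below assumes a Hadoop release that includes centralized caching, where BlockLocation exposes the caching DataNodes via getCachedHosts(); the file path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CachedLocationsExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical file path used only for illustration.
    Path file = new Path("/user/hive/warehouse/fact.db/city/part-00000");
    FileStatus status = fs.getFileStatus(file);

    // getFileBlockLocations goes through getBlockLocations on the NameNode.
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset " + loc.getOffset()
          + " replicas on " + String.join(",", loc.getHosts())
          + " cached on " + String.join(",", loc.getCachedHosts()));
    }
  }
}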

Note: the difference between the centralized cache and the distributed cache:

The distributed cache distributes files to each DataNode's local disk and does not clean them up immediately after they are used; instead, a dedicated thread cleans them up periodically according to file-size and file-count limits. In essence, the distributed cache only provides disk locality, whereas the centralized cache provides memory locality.

Implementation logic and code analysis

The HDFS centralized cache involves multiple operations whose processing logic is very similar. To keep things simple, this section takes the addDirective operation as an example.

1. NameNode processing logic

Among the main internal components of the NameNode, the CacheManager in FSNamesystem is the core of the centralized cache on the NameNode side. Just as BlockManager is responsible for managing the block replicas distributed across the DataNodes, CacheManager is responsible for managing the block caches distributed across the DataNodes.

The DFSClient sends an RPC named addCacheDirective to the NameNode; the corresponding interface is defined in the ClientNamenodeProtocol.proto file.

After the NameNode receives this RPC, it first wraps the path to be cached into a CacheDirective and adds it to the directivesByPath map managed by CacheManager. At this point the corresponding file or directory has not yet been cached in memory.

After CacheManager adds the new CacheDirective, it triggers CacheReplicationMonitor.rescan() to scan for the blocks that DataNodes need to be told to cache and add them to the CacheReplicationMonitor.cachedBlocks map. This rescan operation is also triggered when the NameNode starts and at regular intervals while the NameNode is running.

The main logic of the Rescan () function is as follows:

rescanCacheDirectives() -> rescanFile(): iterates over every directive waiting to be cached (stored in CacheManager.directivesByPath) and adds every block belonging to each such directive to the CacheReplicationMonitor.cachedBlocks collection.

rescanCachedBlockMap(): calls CacheReplicationMonitor.addNewPendingCached() to choose a suitable DataNode for each block waiting to be cached (typically, among the DataNodes holding the block's three replicas, the one with the most remaining cache memory) and adds the block to the pendingCached list of the corresponding DatanodeDescriptor.
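
The following is a hypothetical, heavily simplified sketch of that two-phase flow. The classes and fields are invented for illustration only and do not match Hadoop's internal CacheReplicationMonitor code.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the two-phase rescan described above.
class RescanSketch {
  static class Block {}
  static class Directive { List<Block> blocks = new ArrayList<>(); }
  static class DataNodeState {
    long freeCacheMemory;
    Set<Block> replicas = new HashSet<>();
    List<Block> pendingCached = new ArrayList<>();
  }

  List<Directive> directivesByPath = new ArrayList<>();
  Set<Block> cachedBlocks = new HashSet<>();
  List<DataNodeState> dataNodes = new ArrayList<>();

  void rescan() {
    // Phase 1 (rescanCacheDirectives/rescanFile): collect every block that
    // belongs to a directive waiting to be cached.
    for (Directive d : directivesByPath) {
      cachedBlocks.addAll(d.blocks);
    }
    // Phase 2 (rescanCachedBlockMap/addNewPendingCached): for each block, pick
    // the replica-holding DataNode with the most free cache memory and queue
    // the block on that node's pendingCached list.
    for (Block b : cachedBlocks) {
      dataNodes.stream()
          .filter(dn -> dn.replicas.contains(b))
          .max(Comparator.comparingLong(dn -> dn.freeCacheMemory))
          .ifPresent(dn -> dn.pendingCached.add(b));
    }
  }
}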

2. RPC logic between the NameNode and the DataNode

A DataNode periodically sends a heartbeat RPC to the NameNode to show that it is still alive. It also periodically sends a block report (every 6 hours by default) and a cache report (every 10 seconds by default) to the NameNode to synchronize the state of its blocks and its cache.

Each time the NameNode processes a heartbeat RPC from a DataNode, it checks whether that DataNode's pendingCached list is empty. If it is not, the NameNode sends a DatanodeProtocol.DNA_CACHE command to that DataNode, instructing it to cache the corresponding block replicas.

3. DataNode processing logic

Turning to the main internal components of the DataNode: when the DataNode starts, it checks whether dfs.datanode.max.locked.memory exceeds the OS limit, and it does not lock the memory space reserved for the cache in advance.
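
The relevant DataNode settings normally live in hdfs-site.xml; the sketch below just shows the keys involved, with illustrative values. Per the Hadoop documentation, dfs.datanode.max.locked.memory defaults to 0 (which disables in-memory caching) and must not exceed the OS memlock limit (ulimit -l), or the DataNode aborts on startup.

import org.apache.hadoop.conf.Configuration;

public class DataNodeCacheConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Upper bound, in bytes, on memory this DataNode may mlock for cached blocks.
    // Must stay within "ulimit -l"; 0 (the default) disables caching entirely.
    conf.setLong("dfs.datanode.max.locked.memory", 2L * 1024 * 1024 * 1024);

    // How often the DataNode sends its cache report to the NameNode
    // (10 seconds by default, as mentioned above).
    conf.setLong("dfs.cachereport.intervalMsec", 10_000L);

    System.out.println(conf.get("dfs.datanode.max.locked.memory"));
  }
}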

Each block pool on the DataNode has a BPServiceActor thread that sends heartbeats to the NameNode, receives the responses, and processes them. If a command received in the NameNode's RPC response is DatanodeProtocol.DNA_CACHE, it calls FsDatasetImpl.cacheBlock() to cache the corresponding block in memory.

This function first finds the corresponding FsVolumeImpl from the blockId carried in the RPC (because the CacheExecutor thread that performs the cache operation is bound to a specific FsVolumeImpl). It then calls FsDatasetCache.cacheBlock(), which wraps the block in a MappableBlock, adds it to the mappableBlockMap for management, and then submits an asynchronous CachingTask to the CacheExecutor thread pool of the corresponding FsVolumeImpl (the caching process runs asynchronously).

FsDatasetCache has a member mappableBlockMap (a HashMap) that manages all of this DataNode's MappableBlocks and their states (caching/cached/uncaching). "Which blocks this DataNode has cached in memory" is kept only as soft state (just like the NameNode's block map): it is reported to the NameNode through heartbeats and cache reports and is not persisted to the DataNode's local disk.

CachingTask logic: it calls MappableBlock.load(), which maps the corresponding block from the DataNode's local disk into memory via mmap, locks that memory with mlock, and checks the integrity of the mapped block with a checksum. A DFSClient with memory locality can then read the in-memory block directly through zero-copy without having to verify it again.
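
Plain Java cannot call mlock directly, so the sketch below only illustrates the mmap-and-verify idea in a simplified form: it maps the block file read-only, faults its pages into memory, and computes a single CRC32 as a stand-in for the real per-chunk checksum verification against the block's .meta file. The real MappableBlock.load() additionally locks the mapping via JNI (NativeIO), which is omitted here.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.zip.CRC32;

// Simplified stand-in for what MappableBlock.load() does conceptually.
public class MapBlockSketch {
  public static MappedByteBuffer mapAndVerify(String blockFile) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile(blockFile, "r");
         FileChannel channel = raf.getChannel()) {
      MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      CRC32 crc = new CRC32();
      byte[] chunk = new byte[64 * 1024];
      while (buf.hasRemaining()) {
        int n = Math.min(chunk.length, buf.remaining());
        buf.get(chunk, 0, n);      // reading faults the pages into memory
        crc.update(chunk, 0, n);   // stand-in for real checksum verification
      }
      buf.rewind();
      // The mapping stays valid after the channel is closed; real code would
      // also mlock it so the pages cannot be swapped out.
      return buf;
    }
  }
}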

4. DFSClient read logic

There are three main types of HDFS reads: network I/O read, short-circuit read, and zero-copy read. Network I/O read is the traditional HDFS read path, in which data is transferred over the network between the DFSClient and the DataNode holding the block.

When the DFSClient and the block it is reading are on the same DataNode, the DFSClient can bypass network I/O and read the data directly from the local disk; this is called a short-circuit read. The current HDFS implementation of short-circuit read obtains the file descriptor of the block file on the DataNode disk through shared memory (which is more secure than passing the file path), and then builds a local disk input stream directly on that file descriptor, so the current short-circuit read is still not a zero-copy read.

Adding the centralized cache does not change HDFS's read interfaces. When the DFSClient obtains a LocatedBlock over RPC, it now contains a few extra members indicating which DataNodes have cached the block in memory. If the DFSClient is on a DataNode that has cached the block, read efficiency can be greatly improved by zero-copy read. Even if the block is uncached while it is being read, the read simply degrades to a local disk read and still returns the same data.
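
Applications can opt into zero-copy reads through FSDataInputStream's enhanced ByteBuffer read API, which was added around the same Hadoop release as centralized caching. The sketch below is illustrative: the file path is hypothetical, and SKIP_CHECKSUMS is reasonable here only because cached replicas were already checksummed when the DataNode mapped and locked them.

import java.nio.ByteBuffer;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ZeroCopyReadExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    ElasticByteBufferPool pool = new ElasticByteBufferPool();

    // Hypothetical file path used only for illustration.
    try (FSDataInputStream in = fs.open(new Path("/user/hive/warehouse/fact.db/city/part-00000"))) {
      // Ask for up to 1 MB; skipping checksums is acceptable for cached blocks
      // because they were verified when they were cached.
      ByteBuffer buf = in.read(pool, 1024 * 1024, EnumSet.of(ReadOption.SKIP_CHECKSUMS));
      if (buf != null) {
        System.out.println("zero-copy read returned " + buf.remaining() + " bytes");
        in.releaseBuffer(buf);  // always hand the buffer back to the pool
      }
    }
  }
}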

Impact on upper-level applications

For a directory on HDFs that has been adddirective cached, the newly added file will be automatically cached if a new file is added to the directory. This is useful for Hive/impala-style applications.

HBase in-memory table: the HFiles of an HBase table can be placed directly into the centralized cache, which significantly improves HBase read performance and reduces read-request latency.

Comparison with Spark RDD: reads and writes between multiple RDDs can happen entirely in memory, with errors handled by recomputation. A block in the HDFS centralized cache must first have been written to disk before it can be explicitly cached in memory; that is, only reads can be served from the cache, not writes.

The current centralized cache does not cache whatever the DFSClient happens to read; the DFSClient must explicitly specify what to cache, for how long, and what to evict. There is no LRU-style replacement policy yet, so when memory runs short the client has to explicitly remove the corresponding directive and let the data fall back to disk.
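
A minimal sketch of that explicit lifecycle management is shown below, assuming fs.defaultFS points at an HDFS cluster (so the FileSystem can be cast to DistributedFileSystem) and that passing a null filter to listCacheDirectives means "list everything".

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveEntry;

public class EvictDirectiveExample {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    // List all directives and drop them explicitly; without an LRU policy,
    // eviction is always the client's responsibility.
    RemoteIterator<CacheDirectiveEntry> it = dfs.listCacheDirectives(null);
    while (it.hasNext()) {
      CacheDirectiveEntry entry = it.next();
      System.out.println("removing directive " + entry.getInfo().getId()
          + " for " + entry.getInfo().getPath());
      dfs.removeCacheDirective(entry.getInfo().getId());
    }
  }
}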

It is not yet integrated with YARN, so the memory used by the DataNode for the cache and the memory used by the NodeManager still have to be adjusted manually.

Reference documents

http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html

https://issues.apache.org/jira/browse/HDFS-4949

