Hadoop 2.3.0 has been released, and its biggest highlight is HDFS centralized cache management. This feature can significantly improve the execution efficiency and latency of Hadoop itself and of upper-layer applications. This article discusses the feature from three angles: principles, architecture, and code analysis.
What problems does it solve
Users can, according to their own application logic, designate frequently accessed data or data belonging to high-priority jobs to be kept resident in memory rather than evicted to disk. For example, in a data warehouse built on Hive or Impala, fact tables are frequently JOINed with other tables, so it clearly makes sense to keep the fact tables resident in memory so that DataNodes do not evict them when memory is tight. This helps guarantee SLAs for mixed workloads.
Because the centralized cache is managed by the NameNode, HDFS clients (such as MapReduce and Impala) can schedule tasks according to the distribution of cached blocks and thereby achieve memory locality.
HDFS originally relied on the OS buffer cache of each DataNode, which not only failed to expose the cache distribution to upper-layer applications for task-scheduling optimization, but could also waste cache space. For example, the three replicas of a block are stored on three different DataNodes; if the block is accessed on all three nodes, it may end up in the OS buffer cache of all three DataNodes at the same time. From the perspective of HDFS, the same block is then cached three times, wasting memory.
It speeds up HDFS client reads. Previously, when handling a read request, the NameNode chose which DataNode to read from based only on network topology; now the cache can also be taken into account. When the HDFS client and the cached block are on the same DataNode, the client can read the data directly from memory via zero-copy read, skipping disk I/O, checksum verification, and other steps.
The cache survives DataNode failures, block moves, and cluster restarts, because cache directives are managed centrally by the NameNode and persisted to the FSImage and EditLog. If a DataNode holding a cached block goes down, the NameNode instructs another DataNode that stores a replica of the block to cache it in memory.
Basic concepts
Cache directive: represents a file or directory to be cached in memory.
Cache pool: used to manage a group of cache directives, similar to a namespace, and governed by UNIX-style read/write/execute permissions. Command example:
hdfs cacheadmin -addDirective -path /user/hive/warehouse/fact.db/city -pool financial -replication 1
The command above places the HDFS file city (actually a Hive fact table) into the cache pool named financial in the HDFS centralized cache, and requests that only one replica of the file be cached.
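The same thing can also be done programmatically. The following is a minimal sketch using the cache-management calls on DistributedFileSystem (assuming Hadoop 2.3+ client libraries; the pool name and path mirror the command above, and error handling is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class AddCacheDirectiveExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes fs.defaultFS points at the target HDFS cluster.
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Create the cache pool (equivalent to: hdfs cacheadmin -addPool financial).
        dfs.addCachePool(new CachePoolInfo("financial"));

        // Add a directive caching one replica of the fact table in that pool.
        long directiveId = dfs.addCacheDirective(
            new CacheDirectiveInfo.Builder()
                .setPath(new Path("/user/hive/warehouse/fact.db/city"))
                .setPool("financial")
                .setReplication((short) 1)
                .build());
        System.out.println("Added cache directive with id " + directiveId);
    }
}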
System architecture and principles
A user can explicitly place a file or directory on HDFS into the centralized cache through the hdfs cacheadmin command line or the HDFS API. The centralized cache consists of off-heap memory distributed across the DataNodes and is centrally managed by the NameNode. Each DataNode uses mmap/mlock to map HDFS blocks stored as files on disk into off-heap memory and lock them there.
When reading a file, the DFSClient sends a getBlockLocations RPC to the NameNode, which returns a list of LocatedBlock objects. Each LocatedBlock records both the DataNodes that store the block's replicas and the DataNodes that have the block cached. The cached copies can be thought of as additional high-speed replicas on top of the usual three on-disk replicas.
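Upper-layer schedulers can obtain this information through the public FileSystem API. The snippet below is a hedged sketch assuming Hadoop 2.3+, where BlockLocation exposes both the replica hosts and the hosts that have the block cached; the path is the one from the earlier example:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CachedLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/hive/warehouse/fact.db/city"));
        // One BlockLocation per block of the file; a scheduler could prefer
        // the cached hosts to get memory locality.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("replica hosts: " + Arrays.toString(loc.getHosts())
                + ", cached hosts: " + Arrays.toString(loc.getCachedHosts()));
        }
    }
}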
Note: The difference between centralized and distributed cache:
The distributed cache copies files to the local disks of the individual nodes; the files are not deleted immediately after use, but are cleaned up periodically by a dedicated thread according to limits on total size and number of files. In essence, the distributed cache only achieves disk locality, while the centralized cache achieves memory locality.
Implementation logic and code analysis
HDFS centralized cache management involves several operations whose processing logic is very similar. To keep things simple, this section uses addDirective as the example.
1. NameNode processing logic
The main components inside the NameNode are shown in the figure below. FSNamesystem contains a CacheManager, which is the core component of centralized caching on the NameNode side. Just as the BlockManager manages the block replicas distributed across the DataNodes, the CacheManager manages the block caches distributed across the DataNodes.
The DFSClient sends the NameNode an RPC named addCacheDirective; the corresponding interface is defined in the ClientNamenodeProtocol.proto file.
When the NameNode receives this RPC, it first wraps the path to be cached into a CacheDirective and adds it to the directivesByPath map managed by the CacheManager. At this point the corresponding file or directory has not yet been cached in memory.
Whenever the CacheManager adds a new CacheDirective, it triggers CacheReplicationMonitor.rescan(), which scans for the blocks that DataNodes need to be told to cache and adds them to the CacheReplicationMonitor.cachedBlocks map. The rescan is also triggered when the NameNode starts and at regular intervals while the NameNode is running.
The main logic of the rescan() function is as follows (a simplified sketch follows the list):
rescanCacheDirectives() -> rescanFile(): traverse all cache directives (stored in CacheManager.directivesByPath) and add every block covered by each directive to the CacheReplicationMonitor.cachedBlocks collection.
rescanCachedBlockMap(): call CacheReplicationMonitor.addNewPendingCached() for every block waiting to be cached; this chooses a suitable DataNode to cache the block (usually the one with the most remaining cache memory among the three DataNodes holding its replicas) and adds the block to the pendingCached list of the corresponding DatanodeDescriptor.
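To make the two phases concrete, here is a heavily simplified sketch in the spirit of the logic above. It is not the actual Hadoop source; Directive, Block, and DataNode are illustrative stand-ins for CacheDirective, CachedBlock, and DatanodeDescriptor:

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in types for the real Hadoop classes.
class Directive { List<Block> blocks() { return Collections.emptyList(); } }
class Block { }
class DataNode {
    long remainingCacheCapacity;                   // free cache memory on this node
    List<Block> replicas = new ArrayList<>();      // blocks stored on this node
    List<Block> pendingCached = new ArrayList<>(); // blocks queued for caching
}

class RescanSketch {
    Collection<Directive> directivesByPath = new ArrayList<>(); // kept by the CacheManager
    Set<Block> cachedBlocks = new HashSet<>();                  // blocks that should be cached
    Collection<DataNode> dataNodes = new ArrayList<>();

    void rescan() {
        // Phase 1 (rescanCacheDirectives/rescanFile): expand every directive
        // into the blocks it covers.
        for (Directive d : directivesByPath) {
            cachedBlocks.addAll(d.blocks());
        }
        // Phase 2 (rescanCachedBlockMap/addNewPendingCached): for each block,
        // pick the replica holder with the most free cache memory and queue
        // the block on its pendingCached list.
        for (Block b : cachedBlocks) {
            dataNodes.stream()
                .filter(dn -> dn.replicas.contains(b))
                .max(Comparator.comparingLong(dn -> dn.remainingCacheCapacity))
                .ifPresent(dn -> dn.pendingCached.add(b));
        }
    }
}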
2. NameNode and DataNode RPC logic
Each DataNode periodically sends a heartbeat RPC to the NameNode to indicate that it is still alive. It also periodically sends a block report (every 6 hours by default) and a cache report (every 10 seconds by default) to the NameNode to synchronize the state of its blocks and its cache.
Whenever the NameNode processes a heartbeat RPC from a DataNode, it checks that DataNode's pendingCached list; if the list is not empty, it sends a DatanodeProtocol.DNA_CACHE command to that DataNode, telling it to cache the corresponding block replicas.
3. DataNode processing logic
The main components inside the DataNode are shown in the figure below. When the DataNode starts, it only checks whether dfs.datanode.max.locked.memory exceeds the OS limit on locked memory (ulimit -l); it does not lock or reserve the cache memory up front.
On the DataNode, each block pool has a BPServiceActor thread that sends heartbeats to the NameNode and receives and processes the responses. If the command received from the NameNode is DatanodeProtocol.DNA_CACHE, it calls FsDatasetImpl.cacheBlock() to cache the corresponding block in memory.
This function first uses the blockId passed in the RPC to find the corresponding FsVolumeImpl (because the cacheExecutor thread pool that performs the actual caching is bound to each FsVolumeImpl). It then calls FsDatasetCache.cacheBlock(), which wraps the block into a MappableBlock, adds it to the mappableBlockMap for unified management, and submits a CachingTask to the cacheExecutor thread pool of the corresponding FsVolumeImpl, so the caching itself is executed asynchronously.
FsDatasetCache has a member mappableBlockMap (a HashMap) that tracks all MappableBlocks on this DataNode together with their states (caching / cached / uncaching). The information about which blocks are cached in memory is kept only as soft state on the DataNode (just like the NameNode's block map): it is reported to the NameNode and is not persisted to the DataNode's local disk.
The logic of CachingTask: call MappableBlock.load(), which maps the block's file on the DataNode's local disk into memory via mmap, locks the mapped memory with mlock, and then verifies the checksum of the in-memory block to guarantee its integrity. This allows a memory-local DFSClient to read the block directly via zero-copy without having to verify it again.
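As a rough illustration of what load() does (not the actual Hadoop implementation, which pins pages with a native mlock call via NativeIO), the sketch below maps a block file with FileChannel.map and uses MappedByteBuffer.load() as a stand-in for pinning the pages; the integrity check is reduced to a simple CRC32 over the mapped bytes instead of the real per-chunk checksum verification against the block's .meta file:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.zip.CRC32;

public class MmapBlockSketch {
    // Map a block file into off-heap memory and verify its integrity.
    static MappedByteBuffer loadBlock(String blockFile) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(blockFile, "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Stand-in for mlock: ask the OS to bring the pages into physical
            // memory. The real DataNode locks them with mlock through JNI.
            buf.load();

            // Simplified integrity check: a CRC32 over the mapped bytes.
            CRC32 crc = new CRC32();
            byte[] chunk = new byte[64 * 1024];
            while (buf.hasRemaining()) {
                int n = Math.min(chunk.length, buf.remaining());
                buf.get(chunk, 0, n);
                crc.update(chunk, 0, n);
            }
            buf.rewind();
            System.out.println("block " + blockFile + " crc32=" + crc.getValue());
            return buf;
        }
    }
}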
4. DFSClient read logic
HDFS has three main read paths: network I/O read -> short-circuit read -> zero-copy read. Network I/O read is the traditional HDFS read, in which the DFSClient establishes a network connection with the DataNode holding the block and transfers the data over it.
When the DFSClient is on the same DataNode as the block it wants to read, it can bypass network I/O and read the data directly from the local disk; this is called short-circuit read. The current short-circuit read is implemented by obtaining, through shared memory, the file descriptor of the block file on the DataNode's disk (which is more secure than passing the file path) and then opening a local disk input stream directly on that file descriptor, so the current short-circuit read is itself a form of zero-copy read.
Adding the centralized cache does not change the HDFS read interface. When the DFSClient obtains a LocatedBlock via RPC, the object carries an additional member indicating which DataNodes have the block cached in memory. If the DFSClient is running on a DataNode that has the block cached, it can read via zero-copy and greatly improve read efficiency. Even if the block is uncached while it is being read, the read simply degrades to a local disk read and the data can still be fetched.
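For reference, the zero-copy read path is exposed through the ByteBuffer read API on FSDataInputStream introduced alongside this feature. The following is a hedged sketch assuming Hadoop 2.3+ client libraries; SKIP_CHECKSUMS is reasonable here because cached blocks were already checksummed when they were loaded into memory:

import java.nio.ByteBuffer;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ByteBufferPool;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ZeroCopyReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ByteBufferPool pool = new ElasticByteBufferPool();
        try (FSDataInputStream in = fs.open(new Path("/user/hive/warehouse/fact.db/city"))) {
            ByteBuffer buf;
            // Each call returns a (possibly memory-mapped) buffer, or null at EOF.
            while ((buf = in.read(pool, 4 * 1024 * 1024,
                                  EnumSet.of(ReadOption.SKIP_CHECKSUMS))) != null) {
                // ... process buf.remaining() bytes ...
                in.releaseBuffer(buf);  // return the buffer so the mapping can be reused
            }
        }
    }
}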
Impact on upper-layer applications
After a directory on HDFS is cached via addDirective, any file later added to that directory is automatically cached as well. This is very useful for Hive/Impala applications.
HBase in-memory tables: the HFiles of an HBase table can be placed directly into the centralized cache, which significantly improves HBase read performance and reduces read request latency.
Differences from Spark RDDs: reads and writes across multiple RDDs can happen entirely in memory, with lost data recomputed on failure. In the HDFS centralized cache, a block must first be written to disk before it can be explicitly cached in memory; in other words, only reads can be cached, not writes.
The current centralized cache does not automatically cache whatever the DFSClient reads; the user must explicitly specify what to cache, for how long, and what to evict. There is no replacement policy such as LRU: when memory is insufficient, the client has to explicitly remove the corresponding directives to free cache space (see the sketch below).
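Eviction is therefore a manual step. Below is a minimal sketch of explicit eviction through the same DistributedFileSystem API used earlier (the pool name is the one from the example above; assumes fs.defaultFS points at the HDFS cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveEntry;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;

public class RemoveCacheDirectiveExample {
    public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());

        // List the directives in the "financial" pool and remove them explicitly,
        // which lets the DataNodes unlock and unmap the cached blocks.
        CacheDirectiveInfo filter =
            new CacheDirectiveInfo.Builder().setPool("financial").build();
        RemoteIterator<CacheDirectiveEntry> it = dfs.listCacheDirectives(filter);
        while (it.hasNext()) {
            CacheDirectiveEntry entry = it.next();
            dfs.removeCacheDirective(entry.getInfo().getId());
        }
    }
}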
There is no integration with YARN yet; users have to balance the memory reserved for the DataNode cache against the memory used by the NodeManager themselves.
References
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
https://issues.apache.org/jira/browse/HDFS-4949
About the Author
Liang Weibo holds a master's degree in computer science from Beijing University of Aeronautics and Astronautics and is a senior engineer at Meituan. He previously worked and studied at France Telecom, Baidu, and VMware. In recent years he has focused on Hadoop, HBase, Impala, and data mining. Weibo: @DataScientist.