Hadoop 2.3.0 has been released, and its biggest highlight is HDFS centralized cache management. This feature can improve the execution efficiency and latency of Hadoop itself and of the applications built on top of it. This article explores it from three angles: principles, architecture, and code analysis.
What are the main issues that have been solved
Users can, according to their own application logic, designate frequently used data or data belonging to high-priority jobs and keep it resident in memory so that it is not evicted to disk. For example, in a data warehouse built on Hive or Impala, fact tables are frequently joined against other tables; it clearly pays to keep the fact tables resident in memory so that DataNodes do not evict that data when memory is tight. This also helps enforce SLAs for mixed workloads.
Because the centralized cache is managed uniformly by the NameNode, HDFS clients (such as MapReduce and Impala) can schedule tasks according to where blocks are cached and thereby achieve memory-locality.
HDFS previously relied solely on the OS buffer cache of each DataNode. Not only did this keep the distribution of cached blocks invisible to upper-layer applications that could use it to optimize task scheduling, it could also waste cache. For example, a block with three replicas stored on three DataNodes may end up in the OS buffer cache of all three at the same time; from the point of view of HDFS as a whole, three copies of the same block then sit in memory, which wastes resources.
It speeds up HDFS client reads. Previously the NameNode decided which DataNode to read from based only on topological distance; now cache state is also taken into account. When the HDFS client and the block it wants to read sit on the same DataNode and the block is cached there, the client can read directly from memory via zero-copy, skipping disk I/O, checksum verification, and other overhead (a code sketch follows after this list).
The cache survives DataNode failures, block movement, and cluster restarts, because cache directives are managed uniformly by the NameNode and persisted to the fsimage and edit log. If a DataNode that caches a block goes down, the NameNode instructs another DataNode holding that replica to cache it into memory.
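As an illustration of the zero-copy read path mentioned above, here is a minimal sketch, assuming Hadoop 2.3+ and a client running on a DataNode that has the block cached (the path simply reuses the fact-table example below). If no cached, mmapped replica is available, the call falls back to an ordinary copying read served from the supplied buffer pool.

    import java.nio.ByteBuffer;
    import java.util.EnumSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.ReadOption;
    import org.apache.hadoop.io.ElasticByteBufferPool;

    public class ZeroCopyReadSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // assumes fs.defaultFS is HDFS
            Path path = new Path("/user/hive/warehouse/fact.db/city"); // example path
            ElasticByteBufferPool pool = new ElasticByteBufferPool();  // fallback buffers
            FSDataInputStream in = fs.open(path);
            try {
                // If the block is cached (mmapped and mlocked) on the local DataNode,
                // this returns a ByteBuffer backed directly by that memory: no disk I/O,
                // no extra copy, and checksums are skipped because the data was already
                // verified when the DataNode loaded it into the cache.
                ByteBuffer buf = in.read(pool, 4 * 1024 * 1024,
                        EnumSet.of(ReadOption.SKIP_CHECKSUMS));
                if (buf != null) {
                    // ... consume buf ...
                    in.releaseBuffer(buf);
                }
            } finally {
                in.close();
            }
        }
    }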
Basic concepts
Cache directive: represents a file or directory to be cached in memory.
Cache pool: used to manage a group of cache directives, similar to a namespace, with UNIX-style read/write/execute permission management. Command example:
hdfs cacheadmin -addDirective -path /user/hive/warehouse/fact.db/city -pool financial -replication 1
The command above puts the file city on HDFS (actually a Hive fact table) into the financial cache pool of the HDFS centralized cache, and caches only one replica of the file.
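Caching happens asynchronously as DataNodes report in, so it is worth checking whether the directive has actually taken effect; hdfs cacheadmin -listDirectives shows this from the command line. Below is a rough sketch of the same check through the DistributedFileSystem API described in the next section (the pool name is the one from the example above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.RemoteIterator;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveEntry;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;

    public class ListCacheDirectivesSketch {
        public static void main(String[] args) throws Exception {
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(new Configuration());
            // Only look at directives in the "financial" pool from the example above.
            CacheDirectiveInfo filter =
                    new CacheDirectiveInfo.Builder().setPool("financial").build();
            RemoteIterator<CacheDirectiveEntry> it = dfs.listCacheDirectives(filter);
            while (it.hasNext()) {
                CacheDirectiveEntry entry = it.next();
                // bytesCached grows toward bytesNeeded as DataNodes finish mmap/mlock.
                System.out.println(entry.getInfo().getPath()
                        + " cached " + entry.getStats().getBytesCached()
                        + " of " + entry.getStats().getBytesNeeded() + " bytes");
            }
        }
    }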
System Architecture and principles
Users explicitly place a file or directory on HDFS into the centralized cache through the hdfs cacheadmin command line or the HDFS API. The centralized cache consists of off-heap memory distributed across the DataNodes and is managed uniformly by the NameNode. Each DataNode uses mmap/mlock to map the HDFS blocks stored in its disk files into off-heap memory and lock them there.
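For completeness, here is a sketch of the equivalent operations through the DistributedFileSystem API, assuming Hadoop 2.3 class names and reusing the earlier pool and path. Note that DataNodes will only cache blocks if dfs.datanode.max.locked.memory is set to a non-zero value (and stays within the process's memlock limit, ulimit -l).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
    import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

    public class AddCacheDirectiveSketch {
        public static void main(String[] args) throws Exception {
            DistributedFileSystem dfs =
                    (DistributedFileSystem) FileSystem.get(new Configuration());

            // Equivalent to: hdfs cacheadmin -addPool financial
            dfs.addCachePool(new CachePoolInfo("financial"));

            // Equivalent to the -addDirective command shown earlier:
            // pin the fact table in the centralized cache with one cached replica.
            long id = dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
                    .setPath(new Path("/user/hive/warehouse/fact.db/city"))
                    .setPool("financial")
                    .setReplication((short) 1)
                    .build());
            System.out.println("added cache directive with id " + id);
        }
    }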
When reading a file, the DFSClient sends a getBlockLocations RPC to the NameNode. The NameNode returns a list of LocatedBlocks to the DFSClient; each LocatedBlock records both the DataNodes holding replicas of the block and the DataNodes that have it cached. A cached replica can be thought of as a high-speed copy in addition to the three on-disk replicas.
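On the client side this cache information is also surfaced through BlockLocation, so a scheduler can prefer hosts that hold a cached replica. A minimal sketch, assuming the getCachedHosts() accessor added with this feature and the same example path:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CachedHostsSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/hive/warehouse/fact.db/city"); // example path
            FileStatus status = fs.getFileStatus(path);
            // getFileBlockLocations wraps the getBlockLocations RPC described above.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                // getHosts(): every DataNode holding an on-disk replica of the block.
                // getCachedHosts(): the subset that also has the replica pinned in memory;
                // preferring these hosts gives memory-locality.
                System.out.println("replicas on " + Arrays.toString(loc.getHosts())
                        + ", cached on " + Arrays.toString(loc.getCachedHosts()));
            }
        }
    }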
Note: The difference between centralized cache and distributed cache:
The distributed cache distributes files to each node's local disk and does not clean them up immediately; a dedicated thread periodically cleans them up according to file-size limits and a maximum file count. In essence, the distributed cache only achieves disk locality, whereas the centralized cache achieves memory locality.