[Hadoop Learning] Centralized Cache Management in HDFS

Hadoop version: 2.6.0

This article is translated from the official documentation. If you reproduce it, please respect the translator's work and cite the following link:

http://www.cnblogs.com/zhangningbo/p/4146398.html

Overview

Centralized cache management in HDFS is an explicit caching mechanism that allows users to specify which HDFS paths should be cached. The NameNode communicates with every DataNode that holds a replica of the blocks in question and instructs them to cache those blocks in off-heap caches.

HDFS centralized cache management offers several significant advantages:

1. Explicit pinning prevents frequently used data from being evicted from memory. This is especially important when the size of the working set exceeds main memory, which is common for many HDFS workloads.

2. Because DataNode caches are managed by the NameNode, applications can query the set of cached block locations when making task placement decisions. Co-locating a task with a cached replica of a block improves read performance.

3. When a block has been cached by a DataNode, clients can use a new, more efficient zero-copy read API. Since checksum verification of cached data is performed only once, by the DataNode, clients incur essentially zero overhead when using this API.

4. Centralized caching can improve overall cluster memory utilization. When relying on the operating system's buffer cache on each DataNode, repeated reads of a block will pull all n replicas of the block into buffer caches. With centralized cache management, a user can explicitly pin only m of the n replicas, saving n-m replicas' worth of memory.

Usage Scenarios

Centralized cache management is useful for files that are accessed repeatedly. For example, a small fact table in Hive that is frequently used in joins is a good candidate for caching. On the other hand, caching the input of a one-year reporting query is probably less useful, because the historical data might be read only once.

Centralized cache management is also useful for mixed workloads with performance SLAs. Caching the working set of a high-priority workload ensures that it does not compete for disk I/O with low-priority workloads.

Architecture

In this architecture, the NameNode is responsible for coordinating all of the DataNode off-heap caches in the cluster. The NameNode periodically receives a cache report from each DataNode, which describes all of the blocks cached on that DataNode. The NameNode manages DataNode caches by piggybacking cache and uncache commands on the DataNode heartbeat.

The NameNode queries its set of cache directives to determine which paths should be cached. Cache directives are stored persistently in the fsimage and edit log, and can be added, removed, and modified via the Java and command-line APIs. The NameNode also stores a set of cache pools, which are administrative entities used to group cache directives together for resource management and to enforce permissions.

The NameNode periodically rescans the namespace and the active cache directives to determine which blocks need to be cached or uncached, and assigns caching work to DataNodes. Rescans can also be triggered by user actions, such as adding or removing a cache directive or removing a cache pool.

Currently, we do not cache blocks that are under construction, corrupt, or otherwise incomplete. If a cache directive covers a symlink, the symlink target is not cached.

Caching is currently implemented only at the file or directory level. Block and sub-block caching are future goals.
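As a quick end-to-end sketch of this flow, the commands below create a pool, add a directive for a path, and then check cache statistics. The pool name and path are hypothetical examples, and each command is described in the sections that follow:

    hdfs cacheadmin -addPool hive-pool
    hdfs cacheadmin -addDirective -path /user/hive/warehouse/dim_date -pool hive-pool
    hdfs cacheadmin -listDirectives -stats -pool hive-pool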

Concepts

Cache directives

A cache directive defines a path that should be cached. A path can refer to either a directory or a file; directories are cached non-recursively, meaning only the files in the directory's first-level listing are cached. A directive also carries additional parameters, such as the cache replication factor and an expiration time, and belongs to exactly one cache pool.

Cache pools

A cache pool is an administrative entity used to group cache directives. Pools have UNIX-like permissions that restrict which users and groups can access the pool, and can also carry resource limits, such as the maximum number of bytes that may be cached by the pool's directives.

Cache Management Command-Line Interface

On the command line, administrators and users can interact with cache pools and directives via the hdfs cacheadmin subcommand. Cache directives are identified by unique, non-repeating 64-bit integer IDs; IDs are not reused even if a directive is later removed. Cache pools are identified by unique string names.

Cache directive commands

addDirective
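A usage sketch, following the 2.6.0 command-line interface:

    hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication <replication>] [-ttl <time-to-live>]

For example, to cache a small Hive dimension table with a single cached replica (the path and pool name are hypothetical):

    hdfs cacheadmin -addDirective -path /user/hive/warehouse/dim_date -pool hive-pool -replication 1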

removeDirective
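A usage sketch; the directive ID to remove can be found with the listDirectives command:

    hdfs cacheadmin -removeDirective <id>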

removeDirectives
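A usage sketch; this removes every cache directive with the specified path (the -path flag form reflects the 2.6 CLI as best I can tell):

    hdfs cacheadmin -removeDirectives -path <path>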

listDirectives
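A usage sketch; -stats additionally prints per-directive statistics:

    hdfs cacheadmin -listDirectives [-stats] [-path <path>] [-pool <pool>]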

Cache pool commands

addPool
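A usage sketch; owner, group, mode, limit, and maxTtl are all optional:

    hdfs cacheadmin -addPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]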

modifyPool
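A usage sketch; only the attributes you specify are changed, the rest are left as-is:

    hdfs cacheadmin -modifyPool <name> [-owner <owner>] [-group <group>] [-mode <mode>] [-limit <limit>] [-maxTtl <maxTtl>]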

removePool
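A usage sketch; removing a pool also uncaches the paths associated with it:

    hdfs cacheadmin -removePool <name>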

listPools
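A usage sketch; with no name argument, all pools are listed:

    hdfs cacheadmin -listPools [-stats] [<name>]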

help
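A usage sketch:

    hdfs cacheadmin -help <command-name>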

Configuration

Native Libraries

In order to lock block files into memory, the DataNode relies on native JNI code found in libhadoop.so (or hadoop.dll on Windows). Be sure to enable JNI if you are using HDFS centralized cache management.

Configuration Properties

Required Properties

Property: dfs.datanode.max.locked.memory
Default value: 0
Description: This parameter determines the maximum amount of memory, in bytes, that a DataNode will use for caching. The "locked-in-memory size" ulimit (ulimit -l) of the DataNode user also needs to be increased to match this parameter (see the OS Limitations section below). When setting this value, remember that you will also need memory for other things, such as the DataNode and application JVM heaps and the operating system page cache.
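A minimal hdfs-site.xml sketch that reserves roughly 4 GB per DataNode for caching (the value is an arbitrary example and must be given in bytes):

    <property>
      <name>dfs.datanode.max.locked.memory</name>
      <value>4294967296</value>
    </property>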

Optional properties

The following properties are not required, but may be used for tuning:

Property: dfs.namenode.path.based.cache.refresh.interval.ms
Default value: 300000
Description: The NameNode uses this as the number of milliseconds between two consecutive path-cache rescans. Each rescan calculates the blocks that should be cached and, for each such block, the DataNodes that should cache a replica of it.

Property: dfs.datanode.fsdatasetcache.max.threads.per.volume
Default value: 4
Description: The DataNode uses this as the maximum number of threads per volume to use for caching new data.

Property: dfs.cachereport.intervalMsec
Default value: 10000
Description: The DataNode uses this as the number of milliseconds between two consecutive cache reports sent to the NameNode.

Property: dfs.namenode.path.based.cache.block.map.allocation.percent
Default value: 0.25
Description: The percentage of the Java heap that is allocated to the cached-blocks map, a hash map using chained hashing. If the number of cached blocks is large, a smaller map will be slower to access, while a larger map will consume more memory.
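As an illustrative hdfs-site.xml sketch, the snippet below raises the per-volume caching threads and makes cache reports more frequent (both values are arbitrary examples, not recommendations):

    <property>
      <name>dfs.datanode.fsdatasetcache.max.threads.per.volume</name>
      <value>8</value>
    </property>
    <property>
      <name>dfs.cachereport.intervalMsec</name>
      <value>5000</value>
    </property>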

OS Limitations

If you encounter the error "Cannot start datanode because the configured max locked memory size ... is more than the datanode's available RLIMIT_MEMLOCK ulimit", it means that the operating system is imposing a lower limit on the amount of lockable memory than the value you have configured. To fix this, you must adjust the locked-memory value ("ulimit -l") that the DataNode runs with. Usually, this value is configured in /etc/security/limits.conf, although it can vary depending on your operating system and distribution.

You will know that you have configured this value correctly when you can run "ulimit -l" from the shell and get back either a value larger than what you set with dfs.datanode.max.locked.memory, or the string "unlimited" (meaning there is no limit). Note that "ulimit -l" typically outputs the memory lock limit in KB, but dfs.datanode.max.locked.memory must be specified in bytes.
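A quick shell sketch of the check described above (the "hdfs" user in the limits.conf line is a hypothetical example):

    # Run as the user that starts the DataNode; the limit is reported in KB.
    ulimit -l

    # Example /etc/security/limits.conf entry raising the limit:
    # hdfs  -  memlock  unlimited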

This information does not apply to deployments on Windows, which has no direct equivalent of "ulimit -l".
