Cache in HDFS

Versions: Hadoop 2.7.2

Hadoop 2.3.0 brought an in-memory caching layer to HDFS. Even though it's quite an old feature (released in February 2014), it's still worth knowing.

This post is organized as follows: the first part presents the architecture of the HDFS cache. The second part describes the two main concepts used in caching: directives and pools. The last part shows how to manipulate the cache through the cacheadmin command.

Architecture

The cache in HDFS is a centralized cache managed by the NameNode and stored on DataNodes, exactly as is the case for blocks. Caching is especially useful for frequently accessed files because they're served directly from memory, without any additional checksum verification (it's done once by the DataNode, when the block is cached).

HDFS uses an off-heap cache. It means that the cached data is stored outside of the JVM heap and is not eligible for garbage collection. Instead, off-heap cache entries are evicted by other strategies, such as the time-to-live (TTL) mechanism implemented in HDFS. The amount of memory reserved for caching on a DataNode is specified in the dfs.datanode.max.locked.memory configuration property.
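
As a quick sketch (assuming a locally running cluster), we can check the configured capacity and the OS locked-memory limit that must accommodate it - the DataNode refuses to start if dfs.datanode.max.locked.memory exceeds the memlock ulimit:

# Configured cache capacity in bytes; defaults to 0, i.e. caching disabled
hdfs getconf -confKey dfs.datanode.max.locked.memory
# OS locked-memory limit (in KB) - must be at least as big as the property above
ulimit -l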

As already mentioned, the NameNode coordinates caching. When a client asks to cache a new entry, the NameNode handles the request and resolves the blocks to cache from the specified path. This information is persisted in the edit log. When a DataNode sends a heartbeat, the NameNode checks whether this DataNode should cache or uncache some blocks. If it's the case, the NameNode sends the appropriate instruction back to the DataNode.

The NameNode tracks the cache state through cache reports. Similarly to block reports, cache reports are sent by DataNodes to the NameNode and contain the list of cached blocks. After receiving such a report, the NameNode responds to the DataNode with the cache instructions to execute.
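
The frequency of these reports is driven by a dedicated configuration property (dfs.cachereport.intervalMsec, expressed in milliseconds and defaulting to 10 seconds), which can be checked as below:

# Interval between 2 consecutive cache reports sent by a DataNode
hdfs getconf -confKey dfs.cachereport.intervalMsec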

Directives and pools

Conceptually, the cache in HDFS is represented by two concepts: cache directives and cache pools. A cache directive represents the path to cache, which can be a file or a directory. In the case of a file, all its blocks are cached. In the case of a directory, the blocks of all the files located directly inside it are cached - caching is not recursive. Each directive is characterized by two parameters: the replication factor and the expiration time.

On the other side, a cache pool defines a set of cached resources. It's described by: UNIX-like permissions, the set of managed cache directives, a maximum size and a maximum time-to-live (a directive's TTL can't be greater than this value).
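
To make these parameters concrete, here is a small sketch (the pool name and the path are invented for the example) of a pool created with a size limit and of a directive defined with an explicit replication factor and TTL:

# A pool limited to 1 GB of cached data, whose directives can live at most 1 hour
hdfs cacheadmin -addPool reports_pool -limit 1073741824 -maxTtl 1h
# A directive caching 2 replicas of the file's blocks during 30 minutes
hdfs cacheadmin -addDirective -path /reports/daily.csv -pool reports_pool -replication 2 -ttl 30m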

Cache management

The cache can be manipulated with the cacheadmin command. It allows managing directives as well as pools. Below you can find an example of directive and pool definitions:

# Copy a non-empty file from the local FS to HDFS
hadoop fs -copyFromLocal ~/code/bigdata/hadoop/tested_file.txt /tested_file.txt
# Create a new pool keeping cached data during 2 minutes at most
hdfs cacheadmin -addPool text_files -maxTtl 120s
Successfully added cache pool text_files.

# Check that the pool was added
hdfs cacheadmin -listPools -stats
Found 1 result.
NAME        OWNER      GROUP      MODE            LIMIT            MAXTTL  BYTES_NEEDED  BYTES_CACHED  BYTES_OVERLIMIT  FILES_NEEDED  FILES_CACHED
text_files  bartosz  bartosz  rwxr-xr-x   unlimited  000:00:02:00.000             0             0                0             0             0

# Create a new directive
hdfs cacheadmin -addDirective -path /tested_file.txt -pool text_files -ttl 60s
Added cache directive 1

# Check that the directive was added
hdfs cacheadmin -listDirectives
Found 1 entry
 ID POOL         REPL EXPIRY                    PATH             
  1 text_files      1 2016-12-04T13:46:55+0100  /tested_file.txt 

# Check that the pool has changed
hdfs cacheadmin -listPools -stats
Found 1 result.
NAME        OWNER      GROUP      MODE            LIMIT            MAXTTL  BYTES_NEEDED  BYTES_CACHED  BYTES_OVERLIMIT  FILES_NEEDED  FILES_CACHED
text_files  bartosz  bartosz  rwxr-xr-x   unlimited  000:00:02:00.000             7             0                0             1             0

# After waiting 1 minute (the TTL of the cached file)
# the pool looks like:
hdfs cacheadmin -listPools -stats
Found 1 result.
NAME        OWNER      GROUP      MODE            LIMIT            MAXTTL  BYTES_NEEDED  BYTES_CACHED  BYTES_OVERLIMIT  FILES_NEEDED  FILES_CACHED
text_files  bartosz  bartosz  rwxr-xr-x   unlimited  000:00:02:00.000             0             0                0             0             0

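Once the test is done, we can clean up by removing the directive by its id and then the pool itself (removing a pool also uncaches the paths associated with it):

# Remove the directive created above, then the pool
hdfs cacheadmin -removeDirective 1
hdfs cacheadmin -removePool text_files
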
We can also see what happens thanks to the events logged on the NameNode's side:

INFO org.apache.hadoop.hdfs.server.namenode.CacheManager: addCachePool of {poolName:text_files, ownerName:null, groupName:null, mode:null, limit:null, maxRelativeExpiryMs:120000} successful.
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true    ugi=bartosz (auth:SIMPLE)     ip=/127.0.0.1   cmd=addCachePool        src={poolName:text_files, ownerName:bartosz, groupName:bartosz, mode:0755, limit:9223372036854775807, maxRelativeExpiryMs:120000}   dst=null        perm=null       proto=rpc
# Cache report content
DEBUG BlockStateChange: *BLOCK* NameNode.cacheReport: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=18a77779-6605-468b-ab90-bd8ae5579062, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-876fa79c-2cc6-483f-aa7e-12818ea211ab;nsid=946764079;c=0) 0 blocks

INFO org.apache.hadoop.hdfs.server.namenode.CacheManager: addDirective of {path: /tested_file.txt, pool: text_files, expiration: 000:00:01:00.000} successful.
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true    ugi=bartosz (auth:SIMPLE)     ip=/127.0.0.1   cmd=addCacheDirective   src={id: 1, path: /tested_file.txt, replication: 1, pool: text_files, expiration: 2016-12-04T13:46:55+0100}     dst=null        perm=null       proto=rpc
 
# Cache report, this time containing the cached block
2016-12-04 13:46:46,453 DEBUG BlockStateChange: *BLOCK* NameNode.cacheReport: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=18a77779-6605-468b-ab90-bd8ae5579062, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-876fa79c-2cc6-483f-aa7e-12818ea211ab;nsid=946764079;c=0) 1 blocks

# Cache report after the expiration time
2016-12-04 13:46:58,454 DEBUG BlockStateChange: *BLOCK* NameNode.cacheReport: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=18a77779-6605-468b-ab90-bd8ae5579062, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-876fa79c-2cc6-483f-aa7e-12818ea211ab;nsid=946764079;c=0) 1 blocks

And below, how the DataNode handles the cache commands:

2016-12-04 13:46:04,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_CACHE for BP-657258792-127.0.1.1-1480941885218 of [1073741825]
2016-12-04 13:47:04,454 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_UNCACHE for BP-657258792-127.0.1.1-1480941885218 of [1073741825]

This post presented the caching feature of HDFS. The first part explained how the NameNode coordinates this task, which appears to work almost the same way as block placement. The second part explained what cache directives and cache pools are: the directives represent the cached elements (files or directories) while the pools are logical groups of directives. The last part showed how to manipulate the cache through the cacheadmin command. It also proved, through some logs, the existence of cache reports and the separation of roles between the NameNode and the DataNode.