Hadoop 2.3.0 brought an in-memory caching layer to HDFS. Even though it's quite an old feature (released in February 2014), it's still worth knowing.
This post is organized as follows: at the beginning, the architecture of the HDFS cache is presented. After that, the second part describes the two main concepts used in caching: directives and pools. The last part shows how to manipulate the cache through the cacheadmin command.
Architecture
The cache in HDFS is a centralized cache managed by the NameNode and stored on DataNodes, exactly as is the case for blocks. Caching is especially useful for frequently accessed files because they are served from memory without any additional checksum operations (the verification is done once by the DataNode).
HDFS uses an off-heap cache, meaning the data is stored outside the heap and is not eligible for garbage collection. Instead, cached entries are evicted by other strategies, such as the time-to-live (TTL) implemented in HDFS. The amount of memory reserved on a DataNode is specified in the dfs.datanode.max.locked.memory configuration property.
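As an illustration, reserving 256 MB for the cache could look like the following hdfs-site.xml entry (the value is an arbitrary example; since cached blocks are locked in memory with mlock, it must stay below the DataNode user's locked-memory ulimit):

<property>
  <!-- Maximum amount of memory (in bytes) a DataNode can use for caching;
       268435456 bytes = 256 MB, an example value only -->
  <name>dfs.datanode.max.locked.memory</name>
  <value>268435456</value>
</property>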
As already mentioned, the NameNode coordinates caching. When a new cache entry is added, the NameNode handles the client's request and resolves the blocks to cache from the specified path. This information is persisted in the edit logs. When a DataNode sends a heartbeat, the NameNode checks if this DataNode should cache or uncache some blocks. If it's the case, the NameNode sends the appropriate instruction back to the DataNode.
The NameNode tracks the cache state through cache reports. Similarly to block reports, cache reports are sent by DataNodes to the NameNode and contain the list of cached blocks. After receiving such a report, the NameNode responds to the DataNode with the cache instructions to execute.
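To observe these reports yourself, you can raise the verbosity of the BlockStateChange logger (the logger name comes from the NameNode logs reproduced later in this post); a minimal log4j.properties entry could look like this:

# Log cache (and block) report handling on the NameNode at DEBUG level
log4j.logger.BlockStateChange=DEBUG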
Directives and pools
Conceptually, the cache in HDFS is represented by two concepts: the cache directive and the cache pool. A cache directive represents the path to cache, which can be a file or a directory. In the case of a file, all its blocks are cached. In the case of a directory, the blocks of all its files are cached, but only at the first level - caching is not recursive. Each directive is characterized by two parameters: the replication factor and the expiration time.
On the other side, a cache pool defines a set of cached resources. It's described by: UNIX-like permissions, the set of managed cache directives, a maximum size and a maximum time-to-live (a directive's TTL can't be bigger than this value).
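Both concepts are also exposed programmatically through the DistributedFileSystem client. Below is a minimal sketch (error handling omitted, values mirroring the cacheadmin session of the next section) creating a pool and a directive from Java:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

public class HdfsCacheExample {
  public static void main(String[] args) throws Exception {
    // Assumes fs.defaultFS points at an HDFS cluster with caching enabled
    Configuration configuration = new Configuration();
    DistributedFileSystem fileSystem =
        (DistributedFileSystem) FileSystem.get(configuration);

    // Pool with a 2-minute maximum TTL for its directives
    fileSystem.addCachePool(
        new CachePoolInfo("text_files").setMaxRelativeExpiryMs(120_000L));

    // Directive caching one file for 60 seconds with 1 cached replica
    CacheDirectiveInfo directive = new CacheDirectiveInfo.Builder()
        .setPath(new Path("/tested_file.txt"))
        .setPool("text_files")
        .setReplication((short) 1)
        .setExpiration(CacheDirectiveInfo.Expiration.newRelative(60_000L))
        .build();
    long directiveId = fileSystem.addCacheDirective(directive);
    System.out.println("Added cache directive " + directiveId);
  }
}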
Cache management
The cache can be manipulated with the cacheadmin command, which manages directives as well as pools. Below you can find an example of pool and directive definitions:
# Copy a non-empty file from the local FS to HDFS
hadoop fs -copyFromLocal ~/code/bigdata/hadoop/tested_file.txt /tested_file.txt

# Create a new pool keeping cached data for at most 2 minutes
hdfs cacheadmin -addPool text_files -maxTtl 120s
Successfully added cache pool text_files.

# Check if the pool was added
hdfs cacheadmin -listPools -stats
Found 1 result.
NAME        OWNER    GROUP    MODE       LIMIT      MAXTTL            BYTES_NEEDED  BYTES_CACHED  BYTES_OVERLIMIT  FILES_NEEDED  FILES_CACHED
text_files  bartosz  bartosz  rwxr-xr-x  unlimited  000:00:02:00.000             0             0                0             0             0

# Create a new directive
hdfs cacheadmin -addDirective -path /tested_file.txt -pool text_files -ttl 60s
Added cache directive 1

# Check if the directive was added
hdfs cacheadmin -listDirectives
Found 1 entry
ID  POOL        REPL  EXPIRY                     PATH
 1  text_files     1  2016-12-04T13:46:55+0100   /tested_file.txt

# Check if the pool changed
hdfs cacheadmin -listPools -stats
Found 1 result.
NAME        OWNER    GROUP    MODE       LIMIT      MAXTTL            BYTES_NEEDED  BYTES_CACHED  BYTES_OVERLIMIT  FILES_NEEDED  FILES_CACHED
text_files  bartosz  bartosz  rwxr-xr-x  unlimited  000:00:02:00.000             7             0                0             1             0

# After waiting 1 minute (the TTL of the cached file), the pool looks like:
hdfs cacheadmin -listPools -stats
Found 1 result.
NAME        OWNER    GROUP    MODE       LIMIT      MAXTTL            BYTES_NEEDED  BYTES_CACHED  BYTES_OVERLIMIT  FILES_NEEDED  FILES_CACHED
text_files  bartosz  bartosz  rwxr-xr-x  unlimited  000:00:02:00.000             0             0                0             0             0
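The progress of a single directive (how many bytes and files are effectively cached) can also be checked by combining the -stats flag of listDirectives with a -path filter:

# Show cache statistics for the directive covering /tested_file.txt
hdfs cacheadmin -listDirectives -stats -path /tested_file.txt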
We can also see what happens thanks to the events logged on the NameNode's side:
INFO org.apache.hadoop.hdfs.server.namenode.CacheManager: addCachePool of {poolName:text_files, ownerName:null, groupName:null, mode:null, limit:null, maxRelativeExpiryMs:120000} successful.
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=bartosz (auth:SIMPLE) ip=/127.0.0.1 cmd=addCachePool src={poolName:text_files, ownerName:bartosz, groupName:bartosz, mode:0755, limit:9223372036854775807, maxRelativeExpiryMs:120000} dst=null perm=null proto=rpc

# Cache report content
DEBUG BlockStateChange: *BLOCK* NameNode.cacheReport: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=18a77779-6605-468b-ab90-bd8ae5579062, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-876fa79c-2cc6-483f-aa7e-12818ea211ab;nsid=946764079;c=0) 0 blocks

INFO org.apache.hadoop.hdfs.server.namenode.CacheManager: addDirective of {path: /tested_file.txt, pool: text_files, expiration: 000:00:01:00.000} successful.
INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=bartosz (auth:SIMPLE) ip=/127.0.0.1 cmd=addCacheDirective src={id: 1, path: /tested_file.txt, replication: 1, pool: text_files, expiration: 2016-12-04T13:46:55+0100} dst=null perm=null proto=rpc

# Cache report, this time containing 1 block
2016-12-04 13:46:46,453 DEBUG BlockStateChange: *BLOCK* NameNode.cacheReport: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=18a77779-6605-468b-ab90-bd8ae5579062, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-876fa79c-2cc6-483f-aa7e-12818ea211ab;nsid=946764079;c=0) 1 blocks

# Cache report after the expiration time
2016-12-04 13:46:58,454 DEBUG BlockStateChange: *BLOCK* NameNode.cacheReport: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=18a77779-6605-468b-ab90-bd8ae5579062, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-876fa79c-2cc6-483f-aa7e-12818ea211ab;nsid=946764079;c=0) 1 blocks
And below, how the DataNode handles the cache requests:
2016-12-04 13:46:04,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_CACHE for BP-657258792-127.0.1.1-1480941885218 of [1073741825]
2016-12-04 13:47:04,454 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_UNCACHE for BP-657258792-127.0.1.1-1480941885218 of [1073741825]
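Instead of waiting for the TTL to expire, the directive (and subsequently its pool) can also be removed explicitly:

# Remove the directive created earlier (ID 1), then the pool itself
hdfs cacheadmin -removeDirective 1
hdfs cacheadmin -removePool text_files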
This post presented the caching feature of HDFS. The first part explained how the NameNode coordinates this task, which appears to work almost the same way as block placement. The second part explained what cache directives and cache pools are: directives represent the cached elements (files or directories) while pools are logical groups of directives. The last part showed how to manipulate the cache through the cacheadmin command. It also proved, through some logs, the existence of cache reports and the separation of roles between the NameNode and the DataNode.