HDFS on disk explained

Versions: Hadoop 2.7.2

Among all previous posts we could learn a lot about HDFS transaction logs, operations on closed files and so on. Thanks to that we can take a look now on data structure of NameNode and DataNode.

This post is divided in 2 parts. The first part explains the files used by DataNode. The second part concerns NameNode and lists its components.

Analyzed directory structure is produced with below configuration entries and after copying 1 file from local file system:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/bartosz/hadoop/hdfs_dir/fsimage</value>
</property>
<property>
  <name>dfs.namenode.edits.dir</name>
  <value>/home/bartosz/hadoop/hdfs_dir/editlog</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/bartosz/hadoop/hdfs_dir/data_blocks</value>
</property>

Whole analyzed directory structure looks like:

hdfs_dir/
β”œβ”€β”€ data_blocks
β”‚Β Β  └── current
β”‚Β Β      β”œβ”€β”€ BP-1817513253-127.0.1.1-1481542921087
β”‚Β Β      β”‚Β Β  β”œβ”€β”€ current
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”œβ”€β”€ dfsUsed
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”œβ”€β”€ finalized
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β  └── subdir0
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β      └── subdir0
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β          β”œβ”€β”€ blk_1073741825
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β          β”œβ”€β”€ blk_1073741825_1002.meta
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β          β”œβ”€β”€ blk_1073741826
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β          └── blk_1073741826_1004.meta
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”œβ”€β”€ rbw
β”‚Β Β      β”‚Β Β  β”‚Β Β  └── VERSION
β”‚Β Β      β”‚Β Β  β”œβ”€β”€ scanner.cursor
β”‚Β Β      β”‚Β Β  └── tmp
β”‚Β Β      └── VERSION
β”œβ”€β”€ editlog
β”‚Β Β  └── current
β”‚Β Β      β”œβ”€β”€ edits_0000000000000000001-0000000000000000008
β”‚Β Β      β”œβ”€β”€ edits_inprogress_0000000000000000009
β”‚Β Β      β”œβ”€β”€ seen_txid
β”‚Β Β      └── VERSION
└── fsimage
    └── current
        β”œβ”€β”€ fsimage_0000000000000000000
        β”œβ”€β”€ fsimage_0000000000000000000.md5
        β”œβ”€β”€ fsimage_0000000000000000008
        β”œβ”€β”€ fsimage_0000000000000000008.md5
        β”œβ”€β”€ seen_txid
        └── VERSION

DataNode directory

Let's analyze this structure part by part, starting with data_blocks subdirectory:

β”œβ”€β”€ data_blocks
β”‚Β Β  └── current
β”‚Β Β      β”œβ”€β”€ BP-1817513253-127.0.1.1-1481542921087
β”‚Β Β      β”‚Β Β  β”œβ”€β”€ current
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”œβ”€β”€ dfsUsed
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”œβ”€β”€ finalized
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β  └── subdir0
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β      └── subdir0
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β          β”œβ”€β”€ blk_1073741825
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β          β”œβ”€β”€ blk_1073741825_1002.meta
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β          β”œβ”€β”€ blk_1073741826
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”‚Β Β          └── blk_1073741826_1004.meta
β”‚Β Β      β”‚Β Β  β”‚Β Β  β”œβ”€β”€ rbw
β”‚Β Β      β”‚Β Β  β”‚Β Β  └── VERSION
β”‚Β Β      β”‚Β Β  β”œβ”€β”€ scanner.cursor
β”‚Β Β      β”‚Β Β  └── tmp
β”‚Β Β      └── VERSION

As indicated in the dfs.datanode.data.dir configuration entry, this subdirectory is responsible for holding all data related to DataNode. At the first level we can find a subdirectory current containing all currently used files. It contains a file called VERSION describing directory structure. Its sample content looks like (with comments ahead each property explaining its its purpose) :

# Unique ID used in the cluster for storage 
storageID=DS-b2084b71-fe36-4a2e-9dd8-dc1b1094de7c
# Unique ID for cluster
clusterID=CID-6133e6a0-7312-4f20-85ac-3545cecc5bfd
# Creation time of file system state, upgraded every time 
# when HDFS is upgraded. After formatting NameNode, this value
# is equal to 0
cTime=0
# DataNode's UUID generated on initialization
datanodeUuid=2e5f9bdf-f444-4a62-88c3-15882e84e1c9
# Storage type
storageType=DATA_NODE
# The version of HDFS metadata format. When new HDFS feature
# needs to change the metadata format, this number is upgraded
# with the release
layoutVersion=-56

After that we can find a directory beginning by BP-. This abbreviation means Block Pool and it collects a set of block under one namespace. It also has a current subdirectory in which we can find directories for finalized and still written (rbw) blocks. finalized subdirectory contains closed blocks (file raw bytes) with their appropriated metadata. 2-levels directory structure composed by directories beginning with subdir contain blocks (more exactly, they are stored by the 2nd level directories). The choice of the final place is not made by chance. The location of each block's directory is done in DatanodeUtil's idToBlockDir function through 9th and 17th bits of block id:

public static File idToBlockDir(File root, long blockId) {
  int d1 = (int)((blockId >> 16) & 0xff);
  int d2 = (int)((blockId >> 8) & 0xff);
  String path = DataStorage.BLOCK_SUBDIR_PREFIX + d1 + SEP +
      DataStorage.BLOCK_SUBDIR_PREFIX + d2;
  return new File(root, path);
}

dfsUsed file is a cache file containing the result for disk space usage statistics. If it's sufficently fresh (10 minutes), it's used as a response to du command.

At the same level as current is located tmp directory grouping temporary files.

NameNode structure

NameNode structure is less complicated:

β”œβ”€β”€ editlog
β”‚Β Β  └── current
β”‚Β Β      β”œβ”€β”€ edits_0000000000000000001-0000000000000000008
β”‚Β Β      β”œβ”€β”€ edits_inprogress_0000000000000000009
β”‚Β Β      β”œβ”€β”€ seen_txid
β”‚Β Β      └── VERSION
└── fsimage
    └── current
        β”œβ”€β”€ fsimage_0000000000000000000
        β”œβ”€β”€ fsimage_0000000000000000000.md5
        β”œβ”€β”€ fsimage_0000000000000000008
        β”œβ”€β”€ fsimage_0000000000000000008.md5
        β”œβ”€β”€ seen_txid
        └── VERSION

The editlog's current subdirectory contains currently used edit logs by NameNode. Some of them (the ones not having _inprogress_ in their names) are finalized and immutable files. Their names contains 2 numbers: the last one represents the end transaction id stored in the file (0000000000000000008 in our example). The before-to-last number tells from which transaction id events are stored (0000000000000000001 in the case).

The file starting with edits_inprogress_ marks currently used edit logs file with the start transaction id. The file called seen_txid has the id of last seen transaction. "Seen" means here that the transaction concerns either the last checkpoint or edit log roll (occurs when current edits_inprogress_ file is closed and new "in progress" transactions log is created at its place).

The VERSION file's content is similar to the one from DataNode:

namespaceID=870060382
clusterID=CID-6133e6a0-7312-4f20-85ac-3545cecc5bfd
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1817513253-127.0.1.1-1481542921087
layoutVersion=-63

The fsimage part is similar to the editlog's one. It contains a current subdirectory keeping all currently available FSImage files. Each file of this type starts its name with fsimage and ends with the ID of the last transaction stored on it. They also have corresponding checksum files, helping to prevent against data corruption.

After that, the purpose of seen_txid is the same - it holds ID of the last transaction. VERSION's content is the same as in the case of edit logs:

namespaceID=870060382
clusterID=CID-6133e6a0-7312-4f20-85ac-3545cecc5bfd
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1817513253-127.0.1.1-1481542921087
layoutVersion=-63

This post explains the physical files composing HDFS. The first part describes the components of DataNode: block pools, block location choice and directory structure. The second part presents how NameNode stores its files on disk: edit logs and FSImage.