FSImage in HDFS

Versions: Hadoop 2.7.2

The edit log would be useless without its complementary structure, the FSImage.

This post describes the main characteristics of FSImage. The first part presents its main features. The second part explores the FSImage structure by looking at the files on disk and by using Hadoop's Offline Image Viewer tool.

FSImage definition

FSImage is an abbreviation for File System Image. As the full name indicates, this structure reflects the complete state of the file system at a point in time. This state is represented by the set of metadata (size, path, owner, group, permissions or block size) describing stored files. It also defines the file tree, i.e. which directory is the root, which are its children, whether these children have child directories of their own, and so on. FSImage doesn't contain the locations of blocks, that is, which DataNodes hold them. This mapping is built when DataNodes report the blocks they hold to the NameNode, and it is kept only in the NameNode's memory.

As with edit logs, HDFS can store multiple FSImage files. By default, 2 files are retained, but this value can be changed through the dfs.namenode.num.checkpoints.retained configuration property. These files are named fsimage_TRANSACTION_ID, where TRANSACTION_ID is the ID of the last transaction merged from the edit logs.
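The naming convention makes it easy to find the most recent image in a storage directory. The snippet below is a minimal sketch of that idea. The helper names (list_fsimages, latest_fsimage) are hypothetical and are not part of Hadoop; the only assumption taken from the text is that image files are named fsimage_ followed by a numeric transaction ID.

```python
import re
from pathlib import Path

# Assumed name pattern: "fsimage_" followed by a zero-padded transaction ID.
FSIMAGE_NAME = re.compile(r"^fsimage_(\d+)$")


def list_fsimages(current_dir):
    """Return (transaction_id, path) pairs sorted from oldest to newest."""
    images = []
    for entry in Path(current_dir).iterdir():
        match = FSIMAGE_NAME.match(entry.name)
        if match:  # ignores fsimage_*.md5, seen_txid and VERSION
            images.append((int(match.group(1)), entry))
    return sorted(images, key=lambda pair: pair[0])


def latest_fsimage(current_dir):
    """Return the (transaction_id, path) of the newest image, or None."""
    images = list_fsimages(current_dir)
    return images[-1] if images else None
```

Comparing the IDs as integers rather than as strings keeps the ordering correct even if the padding width ever differed between files.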

Exploring FSImage

To explore FSImage we follow the same procedure as for edit logs. First, let's see how FSImages are stored on disk:

hdfs_dir/fsimage/current/
├── fsimage_0000000000000000002
├── fsimage_0000000000000000002.md5
├── fsimage_0000000000000000020
├── fsimage_0000000000000000020.md5
├── seen_txid
└── VERSION

0 directories, 6 files

There are 2 FSImage files, one covering transactions 0-2 and the other transactions 0-20. Each has a checksum file (.md5 extension) that helps to detect whether the image is corrupted. The seen_txid file contains an integer representing the last transaction ID of the last checkpoint. The last file, VERSION, defines file system information, such as: the version of the HDFS metadata format, the ID of the cluster, the ID of the managed block pool, the creation date and the storage type (NAME_NODE or JOURNAL_NODE).
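The checksum comparison can be reproduced by hand. The sketch below assumes that the .md5 file stores the hexadecimal digest as its first whitespace-separated token (the usual md5sum-style layout); this is an assumption about the file format, so check it against a real file before relying on it. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def read_expected_md5(md5_path):
    """Return the digest stored as the first token of the .md5 file (assumed layout)."""
    return Path(md5_path).read_text().split()[0].lower()


def compute_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_fsimage_intact(image_path):
    """Compare the computed digest with the one stored next to the image."""
    image_path = Path(image_path)
    md5_path = image_path.with_name(image_path.name + ".md5")
    return compute_md5(image_path) == read_expected_md5(md5_path)
```

Reading the image in chunks keeps memory usage flat, which matters because an FSImage on a large cluster can reach several gigabytes.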

To understand the content of FSImage, we can start with the Javadoc comment of the org.apache.hadoop.hdfs.server.namenode.FSImageFormat class:

FSImage {
   layoutVersion: int, namespaceID: int, numberItemsInFSDirectoryTree: long,
   namesystemGenerationStampV1: long, namesystemGenerationStampV2: long,
   generationStampAtBlockIdSwitch: long, lastAllocatedBlockId: long,
   transactionID: long, snapshotCounter: int, numberOfSnapshots: int,
   numOfSnapshottableDirs: int,
   {FSDirectoryTree, FilesUnderConstruction, SecretManagerState} (can be compressed)
}

FSDirectoryTree {
  # 2 versions depending on FSIMAGE_NAME_OPTIMIZATION support,
  # but globally looks like:
  [] INodeInfo
}

INodeInfo {
  # represented as INode class, contains information about given file, such as:
  id, root, permissions, group, directory or file flag, parent directory,
  space, access and modification time, block (id, size, generation stamp)
}

To see what an FSImage really contains, we can use the Offline Image Viewer. This tool converts a binary FSImage file to its XML representation:

bin/hadoop fs -mkdir /articles
bin/hadoop fs -mkdir /images
bin/hadoop fs -mkdir /logs
bin/hadoop fs -touchz /articles/article1.txt
bin/hadoop fs -touchz /articles/article2.txt
bin/hadoop fs -touchz /articles/article3.txt

# You may have to wait some time before new FSImage is generated through
# checkpoint operation
bin/hdfs oiv -i ~/hdfs_dir/fsimage/current/fsimage_0000000000000000020 -o ./fsimage_sample.xml -p XML

The generated output looks like:

<?xml version="1.0" encoding="UTF-8"?>
<fsimage>
  <NameSection>
    <genstampV1>1000</genstampV1>
    <genstampV2>1000</genstampV2>
    <genstampV1Limit>0</genstampV1Limit>
    <lastAllocatedBlockId>1073741824</lastAllocatedBlockId>
    <txid>20</txid>
  </NameSection>
  <INodeSection>
    <lastInodeId>16394</lastInodeId>
    <inode>
      <id>16385</id>
      <type>DIRECTORY</type>
      <name />
      <mtime>1478427863000</mtime>
      <permission>bartosz:supergroup:rwxr-xr-x</permission>
      <nsquota>9223372036854775807</nsquota>
      <dsquota>-1</dsquota>
    </inode>
    <inode>
      <id>16386</id>
      <type>FILE</type>
      <name>file2.txt</name>
      <replication>1</replication>
      <mtime>1478427863001</mtime>
      <atime>1478427863002</atime>
      <perferredBlockSize>134217728</perferredBlockSize>
      <permission>bartosz:supergroup:rw-r--r--</permission>
    </inode>
    <inode>
      <id>16387</id>
      <type>FILE</type>
      <name>file1.txt</name>
      <replication>1</replication>
      <mtime>1478427863003</mtime>
      <atime>1478427863004</atime>
      <perferredBlockSize>134217728</perferredBlockSize>
      <permission>bartosz:supergroup:rw-r--r--</permission>
    </inode>
    <inode>
      <id>16388</id>
      <type>FILE</type>
      <name>file3.txt</name>
      <replication>1</replication>
      <mtime>1478427863005</mtime>
      <atime>1478427863006</atime>
      <perferredBlockSize>134217728</perferredBlockSize>
      <permission>bartosz:supergroup:rw-r--r--</permission>
    </inode>
    <inode>
      <id>16389</id>
      <type>DIRECTORY</type>
      <name>articles</name>
      <mtime>1478427863007</mtime>
      <permission>bartosz:supergroup:rwxr-xr-x</permission>
      <nsquota>-1</nsquota>
      <dsquota>-1</dsquota>
    </inode>
    <inode>
      <id>16390</id>
      <type>DIRECTORY</type>
      <name>images</name>
      <mtime>1478427863008</mtime>
      <permission>bartosz:supergroup:rwxr-xr-x</permission>
      <nsquota>-1</nsquota>
      <dsquota>-1</dsquota>
    </inode>
    <inode>
      <id>16391</id>
      <type>DIRECTORY</type>
      <name>logs</name>
      <mtime>1478427863009</mtime>
      <permission>bartosz:supergroup:rwxr-xr-x</permission>
      <nsquota>-1</nsquota>
      <dsquota>-1</dsquota>
    </inode>
    <inode>
      <id>16392</id>
      <type>FILE</type>
      <name>article1.txt</name>
      <replication>1</replication>
      <mtime>1478427863010</mtime>
      <atime>1478427863011</atime>
      <perferredBlockSize>134217728</perferredBlockSize>
      <permission>bartosz:supergroup:rw-r--r--</permission>
    </inode>
    <inode>
      <id>16393</id>
      <type>FILE</type>
      <name>article2.txt</name>
      <replication>1</replication>
      <mtime>1478427863012</mtime>
      <atime>1478427863013</atime>
      <perferredBlockSize>134217728</perferredBlockSize>
      <permission>bartosz:supergroup:rw-r--r--</permission>
    </inode>
    <inode>
      <id>16394</id>
      <type>FILE</type>
      <name>article3.txt</name>
      <replication>1</replication>
      <mtime>1478427863014</mtime>
      <atime>1478427863015</atime>
      <perferredBlockSize>134217728</perferredBlockSize>
      <permission>bartosz:supergroup:rw-r--r--</permission>
    </inode>
  </INodeSection>
  <INodeReferenceSection />
  <SnapshotSection>
    <snapshotCounter>0</snapshotCounter>
  </SnapshotSection>
  <INodeDirectorySection>
    <directory>
      <parent>16385</parent>
      <inode>16389</inode>
      <inode>16387</inode>
      <inode>16386</inode>
      <inode>16388</inode>
      <inode>16390</inode>
      <inode>16391</inode>
    </directory>
    <directory>
      <parent>16389</parent>
      <inode>16392</inode>
      <inode>16393</inode>
      <inode>16394</inode>
    </directory>
  </INodeDirectorySection>
  <FileUnderConstructionSection />
  <SnapshotDiffSection>
    <diff>
      <inodeid>16385</inodeid>
    </diff>
  </SnapshotDiffSection>
  <SecretManagerSection>
    <currentId>0</currentId>
    <tokenSequenceNumber>0</tokenSequenceNumber>
  </SecretManagerSection>
  <CacheManagerSection>
    <nextDirectiveId>1</nextDirectiveId>
  </CacheManagerSection>
</fsimage>

The generated XML shows that the FSImage resembles the output of the tree -ughD --inodes command. It holds the information about individual files (INodeSection) as well as the shape of the directory tree (INodeDirectorySection). The files in this example were created with touchz and are empty, so no block entries appear. More generally, the FSImage never records which DataNodes hold which blocks. The NameNode builds this mapping from the DataNodes' block reports and keeps it in memory.
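To make the link between the two sections concrete, the sketch below rebuilds full paths from the XML: INodeSection gives each inode's name, and INodeDirectorySection gives each inode's parent. It relies only on the element names visible in the output above. The build_paths function is a hypothetical helper written for this post, not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def build_paths(fsimage_root):
    """Map each inode id to its absolute path, given the <fsimage> root element."""
    # INodeSection: inode id -> name (the root directory has an empty name).
    names = {}
    for inode in fsimage_root.find("INodeSection").findall("inode"):
        names[inode.findtext("id")] = inode.findtext("name") or ""

    # INodeDirectorySection: child inode id -> parent inode id.
    parents = {}
    for directory in fsimage_root.find("INodeDirectorySection").findall("directory"):
        parent_id = directory.findtext("parent")
        for child in directory.findall("inode"):
            parents[child.text] = parent_id

    # Walk from each inode up to the root, collecting names along the way.
    paths = {}
    for inode_id in names:
        parts = []
        current = inode_id
        while current in parents:
            parts.append(names[current])
            current = parents[current]
        paths[inode_id] = "/" + "/".join(reversed(parts))
    return paths
```

It could be called with build_paths(ET.parse("fsimage_sample.xml").getroot()). With the output shown above, inode 16385 resolves to / and inode 16392 to /articles/article1.txt.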

FSImage is important for HDFS fault tolerance. The first part briefly introduced what it contains. The second part showed it in more detail by exploring the storage directory and the FSImage content. We saw that the output looks like the result of the tree -ughD --inodes command because it defines the directory structure and basic file information.