Edit log in HDFS

Versions: Hadoop 2.7.2

HDFS stores everything that happens to the file system in transaction log files. They're used during checkpoints and file system recovery, so they occupy quite an important place in the HDFS architecture.

This post describes this logical structure called the edit log. The first part defines edit logs and shows what they're used for in HDFS. The second part presents how these files are stored on disk. The last part explores the content of edit logs thanks to the converter tool provided with HDFS.

Edit log definition

The edit log is a logical structure behaving as a transaction log. It's stored in the NameNode's directory configured through the dfs.namenode.edits.dir property. Physically, the edit log is composed of several files called segments. At a given moment, only 1 segment is active, i.e. it's the single one that accepts new write operations. Unlike the inactive ones, its name starts with edits_inprogress_. The sync and flush of this file are done before the transaction ends, i.e. before the client receives the response for its action.
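
As an illustration, the property could be defined in hdfs-site.xml as below (the path is purely illustrative):

<property>
  <name>dfs.namenode.edits.dir</name>
  <value>file:///home/bartosz/hdfs_dir/editlog</value>
</property>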

What information is logged to edit logs? Globally, everything that concerns the file system: file or directory creation, file metadata changes (permissions, replication) and so on. The list of logged operations can be found in the org.apache.hadoop.hdfs.server.namenode.FSEditLog class. This class represents the edit log and has a lot of methods whose names begin with logX (X being the logged action). These actions are appended to the current active segment.
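
As an illustration, here is a minimal client program triggering some of these operations (a sketch assuming a running HDFS reachable through the default configuration; the opcode names in the comments are the operations' identifiers in the log):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Sketch: every call below is appended as 1 or more entries to the active segment
public class EditLogActionsDemo {
  public static void main(String[] args) throws Exception {
    // assumes fs.defaultFS points at the tested NameNode
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      Path dir = new Path("/editlog_demo");
      fs.mkdirs(dir);                                          // OP_MKDIR
      Path file = new Path(dir, "file1.txt");
      fs.create(file).close();                                 // OP_ADD, then OP_CLOSE
      fs.setReplication(file, (short) 2);                      // OP_SET_REPLICATION
      fs.setPermission(file, new FsPermission((short) 0644));  // OP_SET_PERMISSIONS
    }
  }
}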

Each logged action is associated with a transaction ID. It's visible in FSEditLog's method responsible for appending new events to the active segment:

void logEdit(final FSEditLogOp op) {
  synchronized (this) {
    assert isOpenForWrite() :
      "bad state: " + state;
    
    // wait if an automatic sync is scheduled
    waitIfAutoSyncScheduled();
    
    long start = beginTransaction();
    op.setTransactionId(txid);

    try {
      editLogStream.write(op);
    } catch (IOException ex) {
      // All journals failed, it is handled in logSync.
    } finally {
      op.reset();
    }

    endTransaction(start);
    
    // check if it is time to schedule an automatic sync
    if (!shouldForceSync()) {
      return;
    }
    isAutoSyncScheduled = true;
  }
  
  // sync buffered edit log entries to persistent store
  logSync();
}
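
An interesting detail is that logSync() is called outside the synchronized block: many handler threads can append their operations under the lock while the expensive flush to disk is batched and shared between them. Stripped of HDFS specifics, the idea could be sketched as below (a standalone illustration, not the real FSEditLog code - the real class additionally tracks an in-flight sync so a caller never returns before its transaction is durable):

// Standalone sketch of the "append under lock, batch the sync" pattern
class MiniTxLog {
  private long txid = 0;          // last assigned transaction ID
  private long syncedTxid = 0;    // last transaction ID already flushed
  private StringBuilder buffer = new StringBuilder();

  void logEdit(String op) {
    long myTxid;
    synchronized (this) {
      myTxid = ++txid;            // the equivalent of beginTransaction()
      buffer.append(myTxid).append(' ').append(op).append('\n');
    }
    logSync(myTxid);              // the flush happens outside the lock
  }

  private void logSync(long myTxid) {
    StringBuilder toFlush;
    synchronized (this) {
      if (syncedTxid >= myTxid) {
        return;                   // another thread already synced our edit
      }
      toFlush = buffer;           // swap buffers so new appends aren't blocked
      buffer = new StringBuilder();
      syncedTxid = txid;
    }
    System.out.print(toFlush);    // stands in for the flush to persistent storage
  }

  public static void main(String[] args) throws InterruptedException {
    MiniTxLog log = new MiniTxLog();
    Runnable writer = () -> log.logEdit("OP_MKDIR");
    Thread t1 = new Thread(writer);
    Thread t2 = new Thread(writer);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
  }
}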

To prevent the edit log from growing unmanageably, HDFS uses a special node called the Secondary NameNode (SNN). This node regularly - every X registered transactions or every X minutes since the last roll - rolls the current segment. This rolling procedure is part of the checkpoint, described in detail in the post about checkpoints in HDFS. For now we can simply remember that a checkpoint consists of closing the current segment and creating a new, empty one. The content of the previous segments is merged into another NameNode structure called the FSImage (described in the post about FSImage in HDFS).
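
The two "every X" thresholds map to real configuration properties; a minimal hdfs-site.xml excerpt with the 2.7.x default values looks like this:

<property>
  <name>dfs.namenode.checkpoint.period</name>
  <!-- the number of seconds between 2 consecutive checkpoints -->
  <value>3600</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <!-- the number of transactions forcing a checkpoint, even if the period isn't reached -->
  <value>1000000</value>
</property>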

Edit log segments

To better understand edit logs, let's see how they are stored physically:

hdfs_dir/editlog/current/
├── edits_0000000000000000001-0000000000000000002
├── edits_0000000000000000003-0000000000000000004
├── edits_0000000000000000005-0000000000000000006
├── edits_0000000000000000007-0000000000000000008
// similar lines omitted for brevity  
├── edits_0000000000000000159-0000000000000000160
├── edits_0000000000000000161-0000000000000000162
├── edits_0000000000000000163-0000000000000000164
├── edits_inprogress_0000000000000000165
├── seen_txid
└── VERSION

0 directories, 82 files

As told in the previous part, the current segment is prefixed with edits_inprogress_. The number following this prefix represents the ID of the first stored transaction. The other files beginning with edits_ also indicate which transactions they keep: the first number after the edits_ prefix is the ID of the start transaction and, naturally, the last one is the ID of the end transaction. These files are finalized segments and they can't be modified. Apart from segments, there are 2 other files: seen_txid, holding a recent transaction ID (written at each roll or checkpoint) that the NameNode uses at startup to make sure it doesn't load an incomplete set of edit logs, and VERSION, holding edit log metadata (namespace ID, blockpool ID, storage type, HDFS metadata version).
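
To make the naming scheme concrete, the transaction range can be extracted from a segment name with a few lines of Java (an illustrative snippet, not a part of HDFS):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for segment file names
public class SegmentNames {
  private static final Pattern FINALIZED = Pattern.compile("edits_(\\d+)-(\\d+)");
  private static final Pattern IN_PROGRESS = Pattern.compile("edits_inprogress_(\\d+)");

  public static void main(String[] args) {
    String[] names = {"edits_0000000000000000163-0000000000000000164",
        "edits_inprogress_0000000000000000165"};
    for (String name : names) {
      Matcher finalized = FINALIZED.matcher(name);
      Matcher inProgress = IN_PROGRESS.matcher(name);
      if (finalized.matches()) {
        // finalized segment: the first and the last stored transaction IDs
        System.out.println(name + " => finalized segment, transactions "
            + Long.parseLong(finalized.group(1)) + " to " + Long.parseLong(finalized.group(2)));
      } else if (inProgress.matches()) {
        // active segment: only the first transaction ID is known
        System.out.println(name + " => active segment, starting at transaction "
            + Long.parseLong(inProgress.group(1)));
      }
    }
  }
}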

Edit log content

Segments are stored as binary files and we can read them thanks to the Offline Edits Viewer. This tool can convert a segment to a more readable format, such as XML:

current/bin/hdfs oev -i ~/hdfs_dir/editlog/current/edits_0000000000000000033-0000000000000000038 -o ./editlog_sample_read.xml
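
The tool's -p (processor) flag also accepts other output formats; for instance, it can rebuild a binary segment from the XML or print per-opcode statistics:

current/bin/hdfs oev -i ./editlog_sample_read.xml -o ./editlog_rebuilt -p binary
current/bin/hdfs oev -i ~/hdfs_dir/editlog/current/edits_0000000000000000033-0000000000000000038 -o ./editlog_stats.txt -p stats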

The root of the converted XML file is the <EDITS> tag, containing the <EDITS_VERSION> and a list of <RECORD> tags representing logged operations:

<?xml version="1.0" encoding="UTF-8"?>
<EDITS>
  <EDITS_VERSION>-63</EDITS_VERSION>
  <RECORD>
    <OPCODE>OP_START_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>33</TXID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA>
      <TXID>34</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>16386</INODEID>
      <PATH>/file1.txt</PATH>
      <REPLICATION>1</REPLICATION>
      <MTIME>1485076295343</MTIME>
      <ATIME>1485076295343</ATIME>
      <BLOCKSIZE>134217728</BLOCKSIZE>
      <CLIENT_NAME>DFSClient_NONMAPREDUCE_564332931_1</CLIENT_NAME>
      <CLIENT_MACHINE>127.0.0.1</CLIENT_MACHINE>
      <OVERWRITE>true</OVERWRITE>
      <PERMISSION_STATUS>
        <USERNAME>bartosz</USERNAME>
        <GROUPNAME>supergroup</GROUPNAME>
        <MODE>420</MODE>
      </PERMISSION_STATUS>
      <RPC_CLIENTID>bd293c42-83b7-4eb5-baf7-fb0c5b2a3d81</RPC_CLIENTID>
      <RPC_CALLID>2</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_CLOSE</OPCODE>
    <DATA>
      <TXID>35</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>0</INODEID>
      <PATH>/file1.txt</PATH>
      <REPLICATION>1</REPLICATION>
      <MTIME>1485076295343</MTIME>
      <ATIME>1485076295343</ATIME>
      <BLOCKSIZE>134217728</BLOCKSIZE>
      <CLIENT_NAME></CLIENT_NAME>
      <CLIENT_MACHINE></CLIENT_MACHINE>
      <OVERWRITE>false</OVERWRITE>
      <PERMISSION_STATUS>
        <USERNAME>bartosz</USERNAME>
        <GROUPNAME>supergroup</GROUPNAME>
        <MODE>420</MODE>
      </PERMISSION_STATUS>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_ADD</OPCODE>
    <DATA>
      <TXID>36</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>16387</INODEID>
      <PATH>/file2.txt</PATH>
      <REPLICATION>1</REPLICATION>
      <MTIME>1485076295343</MTIME>
      <ATIME>1485076295343</ATIME>
      <BLOCKSIZE>134217728</BLOCKSIZE>
      <CLIENT_NAME>DFSClient_NONMAPREDUCE_-1584596381_1</CLIENT_NAME>
      <CLIENT_MACHINE>127.0.0.1</CLIENT_MACHINE>
      <OVERWRITE>true</OVERWRITE>
      <PERMISSION_STATUS>
        <USERNAME>bartosz</USERNAME>
        <GROUPNAME>supergroup</GROUPNAME>
        <MODE>420</MODE>
      </PERMISSION_STATUS>
      <RPC_CLIENTID>a9380ebc-437f-4c39-ace1-d29691906127</RPC_CLIENTID>
      <RPC_CALLID>2</RPC_CALLID>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_CLOSE</OPCODE>
    <DATA>
      <TXID>37</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>0</INODEID>
      <PATH>/file2.txt</PATH>
      <REPLICATION>1</REPLICATION>
      <MTIME>1485076295343</MTIME>
      <ATIME>1485076295343</ATIME>
      <BLOCKSIZE>134217728</BLOCKSIZE>
      <CLIENT_NAME></CLIENT_NAME>
      <CLIENT_MACHINE></CLIENT_MACHINE>
      <OVERWRITE>false</OVERWRITE>
      <PERMISSION_STATUS>
        <USERNAME>bartosz</USERNAME>
        <GROUPNAME>supergroup</GROUPNAME>
        <MODE>420</MODE>
      </PERMISSION_STATUS>
    </DATA>
  </RECORD>
  <RECORD>
    <OPCODE>OP_END_LOG_SEGMENT</OPCODE>
    <DATA>
      <TXID>38</TXID>
    </DATA>
  </RECORD>
</EDITS>

Each record contains the already mentioned file metadata, such as: length, path, replication factor, modification and access times, permissions or RPC call information. As told, each action is associated with a transaction ID.

Through this post we can learn that all file system actions in HDFS are logged to special files called edit logs. There are 2 segment types: active and finalized. The active segment is the only writable one. It's transformed into a finalized (not writable) segment when the checkpoint operation occurs. We also saw that the Secondary NameNode is used to prevent the current segment from growing too large. The 2 last parts showed how edit logs are stored on disk and what they contain.