States in HDFS

on waitingforcode.com

States in HDFS

Replica and blocks are HDFS entities having more than 1 state in their whole lifecycle. Being aware of these states helps a lot to understand what happens when a new file is written.

This post is divided in 3 parts. The first part describes states available for replicas. The second part shows states available for blocks. The last part will show log events containing replica and block related work.

Replica states

Replica can be in one of following states:

  • finalized - for this state the replica is fully written and closed. Its data and meta data are consistent.
  • being written (RBW) - corresponds to the replica of the last block of unclosed file. Generally the replica goes to this state every time when it's just created or reopened to append. Other replicas belonging to the same block can differ from current replica. RBW replica is visible for readers.
  • waiting to be recovered (RWR) - represents the state when DataNode restarts or dies. When it happens, all replicas held by this DataNode goes to RWR state. After restarting RWR replica can become outdated. If it happens, it will be discarded. In the other case, it will participate in recovery process.
  • under recovery (RUR) - replica is in this state when its recovery is triggered by the lease expiration. This situation is called lease recovery .
  • temporary - replicas in this state are created only for replication and balancing. It looks almost like RBW replica with the difference that it's not visible for readers.

Lease recovery

Lease recovery - lease represents a kind of lock that each client must acquire before be allowed to write that file. This lock must be renewed regular period of time and if it's not the case, it expires and the file is available for writes for other clients. When no signal is sent by the client upon defined time, his lease must be destroyed and file must be make available for writes. This process of making file available for writes is called lease recovery. At its end, depending on the state of the last and the second-to-last block, the file is closed or its blocks are recovered.

Replica states are physically stored in different directories on disk (tmp for temporary, rur for RUR and so on). The first replica is created in tmp directory with temporary state. Once finalized, it's moved to current directory as finalized replica.

A different logic occurs when DataNode restarts. Temporary replicas are discarded (tmp directory is cleared) and all RBW replicas located in rbw directory are loaded as RWR replicas.

Block states

Also block can be in different state:

  • under construction - it's the last block of unclosed file on which data is written (file creation, append). Data is visible to readers.
  • under recovery - occurs for the case of lease recovery. If the lease expires and the last block of leased file is not in complete state, it must pass by recovery procedure consisting on synchronizing its replicas contents.
  • committed - it has finalized its bytes (no data is expected to be written) and generation stamp (GS). But the block is not finalized yet. It means that the number of finalized replicas is lower than the specified minimum replication level.

    This state implies also a constraint for a client. When the client asks to close a file, the last and the second-to-last blocks can't be in committed state. If at least one of them is in this state, the file can't be closed and the client must retry.
  • complete - applies for the case when the minimal number of replicas are in finalized state. When all blocks of a file are in this state, the file can be closed.

For the case of blocks, state transitions workflow is simpler. First, the block goes to the under construction state. Once constructed, the client can trigger file closing or asks to write new block. In both cases the block passes to committed state. Committed blocks goes to complete state when the minimal number of replica are finalized.

Below diagram shows described transitions of block states:

From time to time block doesn't pass from under construction to committed state directly. Instead it passes by intermediary phase of under recovery which can lead the block either to committed or completed state.

Generation Stamp

Generation Stamp - 8-bytes number maintained by NameNode and associated with each block, representing the current state of block. It changes when: file is created, file is reopened for append/truncate, lease recovery is initalized and when client has problems to write to given file and demands new GS to solve this issue.

States in logs

Below a simple command copying file from local file system to HDFS:

hadoop fs -copyFromLocal ~/tested_file.txt /copied_file.txt

By inspecting the logs of NameNode, we can easily find the entries where the blocks of copied file pass from under construction to complete state:

# Block is under construct and its replica is being written
12:39:49,353 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741825_1001{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-8d7d4a0d-6cd0-4f4e-8638-fa691b4316e0:NORMAL:127.0.0.1:50010|RBW]]} for /copied_file.txt._COPYING_

# Block is committed but the replica is still being written
12:39:49,725 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741825_1001{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-8d7d4a0d-6cd0-4f4e-8638-fa691b4316e0:NORMAL:127.0.0.1:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 0 <  minimum = 1) in file /copied_file.txt._COPYING_

12:39:49,743 DEBUG BlockStateChange: *BLOCK* NameNode.blockReceivedAndDeleted: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=dce331d0-4a34-4c24-9da0-dc5aacab
a8f2, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-5a839edf-4e9f-481c-8308-83a012b2370a;nsid=2115806237;c=0) 1 blocks.

12:39:49,745 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073741825_1001{UCState=COMMITTED, truncateBlock=null, pri
maryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-8d7d4a0d-6cd0-4f4e-8638-fa691b4316e0:NORMAL:127.0.0.1:50010|RBW]]} size 7

12:39:49,746 DEBUG BlockStateChange: BLOCK* block RECEIVED_BLOCK: blk_1073741825_1001 is received from 127.0.0.1:50010
12:39:49,746 DEBUG BlockStateChange: *BLOCK* NameNode.processIncrementalBlockReport: from 127.0.0.1:50010 receiving: 0, received: 1, deleted: 0

# Finally the block is completed
12:39:50,154 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /copied_file.txt._COPYING_ is closed by DFSClient_NONMAPREDUCE_2036165751_1

Directory tree of HDFS files looks like in below output:

hdfs_dir/
├── data_blocks
│   ├── current
│   │   ├── BP-1313756114-127.0.1.1-1481283425555
│   │   │   ├── current
│   │   │   │   ├── finalized
│   │   │   │   │   └── subdir0
│   │   │   │   │       └── subdir0
│   │   │   │   │           ├── blk_1073741825
│   │   │   │   │           └── blk_1073741825_1001.meta
│   │   │   │   ├── rbw
│   │   │   │   └── VERSION
│   │   │   ├── scanner.cursor
│   │   │   └── tmp
│   │   └── VERSION
│   └── in_use.lock
├── editlog
│   ├── current
│   │   ├── edits_0000000000000000001-0000000000000000002
│   │   ├── edits_inprogress_0000000000000000003
│   │   ├── seen_txid
│   │   └── VERSION
│   └── in_use.lock
└── fsimage
    ├── current
    │   ├── fsimage_0000000000000000000
    │   ├── fsimage_0000000000000000000.md5
    │   ├── fsimage_0000000000000000002
    │   ├── fsimage_0000000000000000002.md5
    │   ├── seen_txid
    │   └── VERSION
    └── in_use.lock

You can easily see rbw, current and tmp directories described in previous part.

This post describes the states of 2 related entities: replicas and blocks. We can see that their states are mostly influenced by the fact of writing them and that sometimes they can be in exceptional situation related to lease recovery. This post also introduced some new terms, such as lease recovery, which were shortly described here and will be explained more in details in other posts.

Share on: