States in HDFS

Versions: Hadoop 2.7.2

Replica and blocks are HDFS entities having more than 1 state in their whole lifecycle. Being aware of these states helps a lot to understand what happens when a new file is written.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I'm currently writing one on that topic and the first chapters are already available in πŸ‘‰ Early Release on the O'Reilly platform

I also help solve your data engineering problems πŸ‘‰ contact@waitingforcode.com πŸ“©

This post is divided in 3 parts. The first part describes states available for replicas. The second part shows states available for blocks. The last part will show log events containing replica and block related work.

Replica states

Replica can be in one of following states:

Lease recovery

Lease recovery - lease represents a kind of lock that each client must acquire before be allowed to write that file. This lock must be renewed regular period of time and if it's not the case, it expires and the file is available for writes for other clients. When no signal is sent by the client upon defined time, his lease must be destroyed and file must be make available for writes. This process of making file available for writes is called lease recovery. At its end, depending on the state of the last and the second-to-last block, the file is closed or its blocks are recovered.

Replica states are physically stored in different directories on disk (tmp for temporary, rur for RUR and so on). The first replica is created in tmp directory with temporary state. Once finalized, it's moved to current directory as finalized replica.

A different logic occurs when DataNode restarts. Temporary replicas are discarded (tmp directory is cleared) and all RBW replicas located in rbw directory are loaded as RWR replicas.

Block states

Also block can be in different state:

For the case of blocks, state transitions workflow is simpler. First, the block goes to the under construction state. Once constructed, the client can trigger file closing or asks to write new block. In both cases the block passes to committed state. Committed blocks goes to complete state when the minimal number of replica are finalized.

Below diagram shows described transitions of block states:

From time to time block doesn't pass from under construction to committed state directly. Instead it passes by intermediary phase of under recovery which can lead the block either to committed or completed state.

Generation Stamp

Generation Stamp - 8-bytes number maintained by NameNode and associated with each block, representing the current state of block. It changes when: file is created, file is reopened for append/truncate, lease recovery is initalized and when client has problems to write to given file and demands new GS to solve this issue.

States in logs

Below a simple command copying file from local file system to HDFS:

hadoop fs -copyFromLocal ~/tested_file.txt /copied_file.txt

By inspecting the logs of NameNode, we can easily find the entries where the blocks of copied file pass from under construction to complete state:

# Block is under construct and its replica is being written
12:39:49,353 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741825_1001{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-8d7d4a0d-6cd0-4f4e-8638-fa691b4316e0:NORMAL:127.0.0.1:50010|RBW]]} for /copied_file.txt._COPYING_

# Block is committed but the replica is still being written
12:39:49,725 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741825_1001{UCState=COMMITTED, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-8d7d4a0d-6cd0-4f4e-8638-fa691b4316e0:NORMAL:127.0.0.1:50010|RBW]]} is not COMPLETE (ucState = COMMITTED, replication# = 0 <  minimum = 1) in file /copied_file.txt._COPYING_

12:39:49,743 DEBUG BlockStateChange: *BLOCK* NameNode.blockReceivedAndDeleted: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=dce331d0-4a34-4c24-9da0-dc5aacab
a8f2, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-5a839edf-4e9f-481c-8308-83a012b2370a;nsid=2115806237;c=0) 1 blocks.

12:39:49,745 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073741825_1001{UCState=COMMITTED, truncateBlock=null, pri
maryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-8d7d4a0d-6cd0-4f4e-8638-fa691b4316e0:NORMAL:127.0.0.1:50010|RBW]]} size 7

12:39:49,746 DEBUG BlockStateChange: BLOCK* block RECEIVED_BLOCK: blk_1073741825_1001 is received from 127.0.0.1:50010
12:39:49,746 DEBUG BlockStateChange: *BLOCK* NameNode.processIncrementalBlockReport: from 127.0.0.1:50010 receiving: 0, received: 1, deleted: 0

# Finally the block is completed
12:39:50,154 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /copied_file.txt._COPYING_ is closed by DFSClient_NONMAPREDUCE_2036165751_1

Directory tree of HDFS files looks like in below output:

hdfs_dir/
β”œβ”€β”€ data_blocks
β”‚Β Β  β”œβ”€β”€ current
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ BP-1313756114-127.0.1.1-1481283425555
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”œβ”€β”€ current
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β  β”œβ”€β”€ finalized
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β  └── subdir0
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β      └── subdir0
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β          β”œβ”€β”€ blk_1073741825
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β          └── blk_1073741825_1001.meta
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β  β”œβ”€β”€ rbw
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”‚Β Β  └── VERSION
β”‚Β Β  β”‚Β Β  β”‚Β Β  β”œβ”€β”€ scanner.cursor
β”‚Β Β  β”‚Β Β  β”‚Β Β  └── tmp
β”‚Β Β  β”‚Β Β  └── VERSION
β”‚Β Β  └── in_use.lock
β”œβ”€β”€ editlog
β”‚Β Β  β”œβ”€β”€ current
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ edits_0000000000000000001-0000000000000000002
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ edits_inprogress_0000000000000000003
β”‚Β Β  β”‚Β Β  β”œβ”€β”€ seen_txid
β”‚Β Β  β”‚Β Β  └── VERSION
β”‚Β Β  └── in_use.lock
└── fsimage
    β”œβ”€β”€ current
    β”‚Β Β  β”œβ”€β”€ fsimage_0000000000000000000
    β”‚Β Β  β”œβ”€β”€ fsimage_0000000000000000000.md5
    β”‚Β Β  β”œβ”€β”€ fsimage_0000000000000000002
    β”‚Β Β  β”œβ”€β”€ fsimage_0000000000000000002.md5
    β”‚Β Β  β”œβ”€β”€ seen_txid
    β”‚Β Β  └── VERSION
    └── in_use.lock

You can easily see rbw, current and tmp directories described in previous part.

This post describes the states of 2 related entities: replicas and blocks. We can see that their states are mostly influenced by the fact of writing them and that sometimes they can be in exceptional situation related to lease recovery. This post also introduced some new terms, such as lease recovery, which were shortly described here and will be explained more in details in other posts.