Making an immutable distributed file system is easier than building a mutable one. HDFS, even though it was initially designed for write-once data, supports mutability through 2 operations: append and truncate.
This post presents these 2 operations in more detail. The first part covers append. The second part describes truncate.
Append explained
The append operation consists of adding new data at the end of a file. The file thus changes its length and, potentially, its number of blocks. The append algorithm in HDFS can be summarized in the following steps:
- The client sends an append request to the NameNode
- The NameNode checks if the file is closed - otherwise append is not allowed. If the file is closed, it moves to the Under Construction state.
- The NameNode checks the last block of the file: if it's full, the NameNode initializes a new block that will hold the appended fragment. If the block is not full, it's reused to hold the new data.
- The pipeline is resolved: for a full last block a new pipeline is created; for a partially filled block, the pipeline already associated with that block is reused.
- Data is written as in the case of file creation: through the resolved pipeline.
A single append is transparent for snapshots because only the length of the modified file changes.
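To see these steps from the client's perspective, below a minimal Java sketch based on the FileSystem API; the cluster address, file path and appended text are assumptions made for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
  public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    // assumption: a single-node cluster listening on this address
    configuration.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fileSystem = FileSystem.get(configuration);

    Path file = new Path("/copied_file.txt");
    // append() sends the append request to the NameNode; it fails if the
    // file is still open (Under Construction) for another client
    try (FSDataOutputStream outputStream = fileSystem.append(file)) {
      // the data goes through the resolved pipeline, as in file creation
      outputStream.writeBytes("appended content\n");
    }
    System.out.println("New length: " + fileSystem.getFileStatus(file).getLen());
  }
}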
The command-line example below does the same thing, after first creating the file:
hadoop fs -copyFromLocal ~/tested_file.txt /copied_file.txt
hadoop fs -appendToFile ~/tested_file.txt /copied_file.txt
Logs associated with this operation contain:
# NameNode part
12:44:07,139 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=bartosz (auth:SIMPLE) ip=/127.0.0.1 cmd=append src=/copied_file.txt dst=null perm=null proto=rpc
# DataNode execution
12:44:07,327 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Appending to FinalizedReplica, blk_1073741825_1001, FINALIZED
  getNumBytes()     = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()       = /home/bartosz/hdfs_dir/data_blocks/current
  getBlockFile()    = /home/bartosz/hdfs_dir/data_blocks/current/BP-1817513253-127.0.1.1-1481542921087/current/finalized/subdir0/subdir0/blk_1073741825
  unlinked          =false
# NameNode creates pipeline
12:44:07,359 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(blk_1073741825_1001, newGS=1002, newLength=7, newNodes=[127.0.0.1:50010], client=DFSClient_NONMAPREDUCE_772042077_1)
12:44:07,359 INFO BlockStateChange: BLOCK* Removing stale replica from location: [DISK]DS-b2084b71-fe36-4a2e-9dd8-dc1b1094de7c:NORMAL:127.0.0.1:50010
12:44:07,372 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(blk_1073741825_1001 => blk_1073741825_1002) success
# DataNode from the pipeline handles append request
12:44:07,380 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:34210, dest: /127.0.0.1:50010, bytes: 14, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_772042077_1, offset: 0, srvID: 2e5f9bdf-f444-4a62-88c3-15882e84e1c9, blockid: BP-1817513253-127.0.1.1-1481542921087:blk_1073741825_1002, duration: 45130390
12:44:07,380 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1817513253-127.0.1.1-1481542921087:blk_1073741825_1002, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
# NameNode confirms append
12:44:07,381 DEBUG BlockStateChange: *BLOCK* NameNode.blockReceivedAndDeleted: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=2e5f9bdf-f444-4a62-88c3-15882e84e1c9, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-6133e6a0-7312-4f20-85ac-3545cecc5bfd;nsid=870060382;c=0) 1 blocks.
12:44:07,381 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073741825_1002{UCState=UNDER_CONSTRUCTION, truncateBlock=null, primaryNodeIndex=-1, replicas=[ReplicaUC[[DISK]DS-b2084b71-fe36-4a2e-9dd8-dc1b1094de7c:NORMAL:127.0.0.1:50010|RBW]]} size 7
12:44:07,382 DEBUG BlockStateChange: BLOCK* block RECEIVED_BLOCK: blk_1073741825_1002 is received from 127.0.0.1:50010
12:44:07,382 DEBUG BlockStateChange: *BLOCK* NameNode.processIncrementalBlockReport: from 127.0.0.1:50010 receiving: 0, received: 1, deleted: 0
12:44:07,391 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /copied_file.txt is closed by DFSClient_NONMAPREDUCE_772042077_1
Truncate operation in HDFS
The opposite operation to append is truncate. Its goal is to remove data from the end of a file. The algorithm also manipulates the last block(s) to achieve that goal:
- The client sends a truncate request containing the name of the file to truncate and the new length.
- The NameNode checks if the file is closed - otherwise the operation is not permitted.
- If the new length doesn't fall on a block boundary (for example: truncate removed only 2.5 of the last 3 blocks), the NameNode marks the file as Under Construction and acquires a lease on it.
- The last, partially truncated block is put in the Under Recovery state and the NameNode starts the truncate recovery process.
- Truncate recovery consists of bringing all replicas of the partially truncated block to the same, new length. The NameNode picks one DataNode holding a replica of the block and asks it to synchronize the block's new length across all the DataNodes storing it.
- When all DataNodes confirm the change, the selected DataNode informs the NameNode about it.
- The NameNode persists the change in the edit log and releases the lease on the file.
Handling truncate in the presence of snapshots needs a little more work from HDFS. When the last block is not fully truncated and is referenced by one of the snapshots, HDFS creates a new block holding the data remaining after the truncate. The old block still keeps the data from before the truncate, until the last snapshot using it is removed.
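In the Java API, truncate returns a boolean: true when the new length falls on a block boundary and the file is cut immediately, false when the block recovery described above is started. Below a minimal sketch, with the cluster address and target length being assumptions for the example, and with naive polling through DistributedFileSystem#isFileClosed to wait for the recovery:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TruncateExample {
  public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    configuration.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed address
    DistributedFileSystem fileSystem = (DistributedFileSystem) FileSystem.get(configuration);

    Path file = new Path("/copied_file2.txt");
    // true = the new length fell on a block boundary, the cut is immediate;
    // false = the last block is partially truncated and goes through
    // the recovery process described above
    boolean truncatedImmediately = fileSystem.truncate(file, 3L);
    while (!truncatedImmediately && !fileSystem.isFileClosed(file)) {
      Thread.sleep(100); // poll until the block recovery finishes
    }
    System.out.println("New length: " + fileSystem.getFileStatus(file).getLen());
  }
}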
As in the case of append, below a simple use case of truncate from the command line:
# Create file with the length 7
hadoop fs -copyFromLocal ~/hadoop/tested_file.txt /copied_file2.txt
hadoop fs -truncate 3 /copied_file2.txt
Truncating /copied_file2.txt to length: 3. Wait for block recovery to complete before further updating this file.
Logs produced by this call are:
# NameNode receives client's request
12:58:51,539 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=bartosz (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/copied_file2.txt dst=null perm=null proto=rpc
12:58:51,574 INFO BlockStateChange: BLOCK* blk_1073741826_1003{UCState=UNDER_RECOVERY, truncateBlock=blk_1073741826_1004, primaryNodeIndex=0, replicas=[ReplicaUC[[DISK]DS-b2084b71-fe36-4a2e-9dd8-dc1b1094de7c:NORMAL:127.0.0.1:50010|RBW]]} recovery started, primary=ReplicaUC[[DISK]DS-b2084b71-fe36-4a2e-9dd8-dc1b1094de7c:NORMAL:127.0.0.1:50010|RBW]
12:58:51,603 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=bartosz (auth:SIMPLE) ip=/127.0.0.1 cmd=truncate src=/copied_file2.txt dst=null perm=bartosz:supergroup:rw-r--r-- proto=rpc
# DataNode's work - the last block was not fully truncated,
# the recovery is started
12:58:54,094 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode at localhost/127.0.0.1:9000 calls recoverBlock(BP-1817513253-127.0.1.1-1481542921087:blk_1073741826_1003, targets=[DatanodeInfoWithStorage[127.0.0.1:50010,null,null]], newGenerationStamp=1004)
12:58:54,095 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: blk_1073741826_1003, recoveryId=1004, replica=FinalizedReplica, blk_1073741826_1003, FINALIZED
  getNumBytes()     = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()       = /home/bartosz/hdfs_dir/data_blocks/current
  getBlockFile()    = /home/bartosz/hdfs_dir/data_blocks/current/BP-1817513253-127.0.1.1-1481542921087/current/finalized/subdir0/subdir0/blk_1073741826
  unlinked          =false
12:58:54,095 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: initReplicaRecovery: changing replica state for blk_1073741826_1003 from FINALIZED to RUR
12:58:54,097 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: updateReplica: BP-1817513253-127.0.1.1-1481542921087:blk_1073741826_1003, recoveryId=1004, length=3, replica=ReplicaUnderRecovery, blk_1073741826_1003, RUR
  getNumBytes()     = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()       = /home/bartosz/hdfs_dir/data_blocks/current
  getBlockFile()    = /home/bartosz/hdfs_dir/data_blocks/current/BP-1817513253-127.0.1.1-1481542921087/current/finalized/subdir0/subdir0/blk_1073741826
  recoveryId=1004
  original=FinalizedReplica, blk_1073741826_1003, FINALIZED
  getNumBytes()     = 7
  getBytesOnDisk()  = 7
  getVisibleLength()= 7
  getVolume()       = /home/bartosz/hdfs_dir/data_blocks/current
  getBlockFile()    = /home/bartosz/hdfs_dir/data_blocks/current/BP-1817513253-127.0.1.1-1481542921087/current/finalized/subdir0/subdir0/blk_1073741826
  unlinked          =false
# Summary of truncate operation:
12:58:54,131 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: truncateBlock: blockFile=/home/bartosz/hdfs_dir/data_blocks/current/BP-1817513253-127.0.1.1-1481542921087/current/finalized/subdir0/subdir0/blk_1073741826, metaFile=/home/bartosz/hdfs_dir/data_blocks/current/BP-1817513253-127.0.1.1-1481542921087/current/finalized/subdir0/subdir0/blk_1073741826_1004.meta, oldlen=7, newlen=3
# NameNode
12:58:54,132 DEBUG BlockStateChange: *BLOCK* NameNode.blockReceivedAndDeleted: from DatanodeRegistration(127.0.0.1:50010, datanodeUuid=2e5f9bdf-f444-4a62-88c3-15882e84e1c9, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-6133e6a0-7312-4f20-85ac-3545cecc5bfd;nsid=870060382;c=0) 1 blocks.
12:58:54,132 DEBUG BlockStateChange: BLOCK* block RECEIVED_BLOCK: blk_1073741826_1004 is received from 127.0.0.1:50010
12:58:54,132 DEBUG BlockStateChange: *BLOCK* NameNode.processIncrementalBlockReport: from 127.0.0.1:50010 receiving: 0, received: 1, deleted: 0
12:58:54,141 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(oldBlock=BP-1817513253-127.0.1.1-1481542921087:blk_1073741826_1003, newgenerationstamp=1004, newlength=3, newtargets=[127.0.0.1:50010]) successful
Append and truncate are the opposite operations that make mutability possible in HDFS. Append adds new data at the end of a file while truncate cuts off some of its last bytes. The two follow different logic: append is much simpler since it deals mostly with the file length. Truncate, on the other hand, must take into account such aspects as a partially truncated last block or a truncated block referenced in snapshots.