Recovery in HDFS

Versions: Hadoop 2.7.2

The recovery process in HDFS helps to achieve fault tolerance. It concerns the write pipeline as well as blocks.

This post covers 3 types of recovery in HDFS: lease, block and pipeline recovery. Each of them is presented in a separate part.

Lease recovery

As already mentioned in some of the latest HDFS posts, the lease is a method used by HDFS to guarantee exclusive write access to a file. It's granted for a certain period of time to one and only one client. The lease must be renewed at regular intervals if the client still needs to work on the file.

The lease is released explicitly or implicitly by the client. If the client stops working on the file, it closes the file for writes and releases the lease. But if something goes wrong with the client and it can't finish the work normally, the lease is automatically recovered by HDFS. To do so, HDFS uses a technique based on the definition of 2 properties: a soft and a hard limit. Currently they're hardcoded as LeaseManager constants:

public static final long LEASE_SOFTLIMIT_PERIOD = 60 * 1000; // 1 minute, in milliseconds
public static final long LEASE_HARDLIMIT_PERIOD = 60 * LEASE_SOFTLIMIT_PERIOD; // 60 minutes

These 2 limits determine what happens with the lease over time. If the soft limit passes and the client holding the lease doesn't renew it, another client will be able to take the lease over. When the hard limit passes, HDFS automatically recovers the previous lease. This operation can also be triggered manually with the hdfs debug recoverLease command.
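
Besides the command line, lease recovery can also be requested programmatically. The snippet below is a minimal sketch (the file path is made up) relying on DistributedFileSystem's recoverLease method, which asks the NameNode to start lease recovery for the given file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryExample {
  public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    // FileSystem.get() returns a DistributedFileSystem when fs.defaultFS points to HDFS
    DistributedFileSystem fileSystem = (DistributedFileSystem) FileSystem.get(configuration);
    // recoverLease() asks the NameNode to start lease recovery on the given file;
    // it returns true when the file is already closed and the lease released
    boolean recovered = fileSystem.recoverLease(new Path("/tmp/file_with_expired_lease"));
    System.out.println("Lease recovered immediately? " + recovered);
  }
}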

Block recovery

Lease recovery is followed by block recovery. Its goal is to ensure that all the replicas of the last block of the leased file have the same length. Lease recovery can interrupt the client before the last block's data reaches all the DataNodes of the pipeline. As a consequence, some of the DataNodes can hold obsolete replicas.

To fix this problem, block recovery is triggered. First, the NameNode selects one DataNode which will be the Primary DataNode (PD). The purpose of the PD is to coordinate the block recovery with the other DataNodes remaining in the pipeline. The coordination steps are below (a simplified sketch follows the list):

  1. The PD gets a new generation stamp (GStamp) from the NameNode
  2. The PD gets the information about the block's replicas from the other DataNodes of the pipeline
  3. The PD computes the minimal length of the replicas
  4. The PD updates the GStamp and the block length on the other DataNodes
  5. The PD notifies the NameNode that the updates are done
  6. The NameNode updates its own information about the previously updated block
  7. The lease of the block's file is recovered
  8. All changes are validated and persisted to the NameNode's edit logs
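
These steps can be illustrated with a simplified sketch. It's not Hadoop's real implementation: the NameNodeClient and DataNodeClient interfaces and their methods are invented only to show the coordination logic described above.

import java.util.List;

// Simplified, hypothetical view of the Primary DataNode's coordination work
public class PrimaryDataNodeSketch {

  void recoverBlock(String blockId, NameNodeClient nameNode, List<DataNodeClient> replicaNodes) {
    // 1. get a new generation stamp (GStamp) from the NameNode
    long newGStamp = nameNode.nextGenerationStamp(blockId);

    // 2. and 3. collect the replica states and compute the minimal length
    long minLength = Long.MAX_VALUE;
    for (DataNodeClient dataNode : replicaNodes) {
      minLength = Math.min(minLength, dataNode.getReplicaLength(blockId));
    }

    // 4. align every replica on the minimal length and the new GStamp
    for (DataNodeClient dataNode : replicaNodes) {
      dataNode.updateReplica(blockId, newGStamp, minLength);
    }

    // 5. notify the NameNode; it then updates its own metadata, recovers the
    //    lease and persists the changes to the edit logs (steps 6 to 8)
    nameNode.commitBlockSynchronization(blockId, newGStamp, minLength);
  }

  // invented interfaces - only here to make the sketch self-contained
  interface NameNodeClient {
    long nextGenerationStamp(String blockId);
    void commitBlockSynchronization(String blockId, long gStamp, long length);
  }

  interface DataNodeClient {
    long getReplicaLength(String blockId);
    void updateReplica(String blockId, long gStamp, long length);
  }
}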

Pipeline recovery

Another type of recovery concerns the pipeline, i.e. the situation when a DataNode is writing some data and a failure occurs before the end of the operation. But before describing this recovery, let's recall some facts: a block's data is not sent to the DataNode in one piece. Instead, it's moved in small packets. The client sends them only to the 1st DataNode of the execution plan and this DataNode, if the replication factor is at least 2, forwards the packets to the other DataNodes destined to hold the block's replicas. The pipeline represents this process of sending a block's data from the client down to the last DataNode replica. A pipeline is created for each block separately - the blocks of the same file can be replicated to different DataNodes.
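
From the client's point of view the pipeline stays transparent. The snippet below (the path and the content are made up) simply writes a file; under the hood the DFS client splits the stream into packets and pushes them through the pipeline:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWriteExample {
  public static void main(String[] args) throws Exception {
    FileSystem fileSystem = FileSystem.get(new Configuration());
    // create() sets up the first pipeline; every new block of the file gets its own one
    try (FSDataOutputStream outputStream = fileSystem.create(new Path("/tmp/pipeline_demo.txt"))) {
      outputStream.writeUTF("data sent packet by packet through the pipeline");
      // hflush() forces the buffered packets to be pushed to all DataNodes of the current pipeline
      outputStream.hflush();
    }
  }
}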

The pipeline is divided into 3 stages:

  1. Setup - the client asks the NameNode for an execution plan containing all DataNodes concerned by the pipeline and sets the pipeline up across them.
  2. Streaming - the data packets are streamed through the pipeline, from one replica to the next.
  3. Close - the replicas are finalized and the pipeline is shut down.

Failures can occur in each of these stages. When a problem occurs, pipeline recovery takes place, which for each of the previous stages means:

  1. Setup stage - the client writing a file must receive an acknowledgment of its request, coming back from the last DataNode of the execution plan. If something goes wrong here, the scenario depends on the type of the written block. If the client is writing a new block, it will ask the NameNode for a new execution plan. If the block already exists, the client will reuse the current pipeline but without the failing DataNodes.
  2. Streaming stage - the failures in this stage are also handled differently depending on who detects them. If the failure is detected by the DataNode itself, it excludes itself from the pipeline and closes all its connections; the transmission continues without it. If the client detects the failure, it will change the GStamp and create a new pipeline without the failing DataNodes (a simplified sketch of this client-side logic follows the list). DataNodes that have already received some of the resent data simply ignore it.
  3. Close stage - when the number of finalized replicas is at least equal to the expected replication factor, the block is moved to the complete state. If the client detects a failure at this stage, it will create a new pipeline containing only correctly working DataNodes.
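
The client-side reaction to a streaming stage failure can be illustrated with the simplified sketch below. It's not the real DFSOutputStream code: the PipelineClient and DataNodeInfo types are invented only to show the idea of excluding the failed node, getting a new GStamp and resending the unacknowledged packets.

import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the streaming stage recovery done by the client
public class StreamingRecoverySketch {

  void recoverFromStreamingFailure(PipelineClient client, List<DataNodeInfo> currentPipeline,
      DataNodeInfo failedNode, List<byte[]> unacknowledgedPackets) {
    // 1. exclude the failing DataNode from the pipeline
    List<DataNodeInfo> newPipeline = new ArrayList<>(currentPipeline);
    newPipeline.remove(failedNode);

    // 2. get a new GStamp so that the stale replica left on the failed
    //    DataNode can be recognized and discarded later
    long newGStamp = client.requestNewGenerationStamp();

    // 3. rebuild the pipeline and resend the packets that were not acknowledged;
    //    DataNodes that already received some of them simply ignore the duplicates
    client.setupPipeline(newPipeline, newGStamp);
    for (byte[] packet : unacknowledgedPackets) {
      client.send(packet);
    }
  }

  // invented types - only here to make the sketch self-contained
  interface PipelineClient {
    long requestNewGenerationStamp();
    void setupPipeline(List<DataNodeInfo> dataNodes, long gStamp);
    void send(byte[] packet);
  }

  interface DataNodeInfo {}
}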

In some scenarios the failing DataNode is replaced by a new one. This behavior, unlike the soft and hard limits in lease recovery, is parametrized through the dfs.client.block.write.replace-datanode-on-failure.enable entry. At a more fine-grained level, the following properties can also be used: dfs.client.block.write.replace-datanode-on-failure.policy and dfs.client.block.write.replace-datanode-on-failure.best-effort. The first one tells what to do with a failed DataNode. It takes one of the 3 values below:

  1. NEVER - the failed DataNode is never replaced by a new one.
  2. DEFAULT - a new DataNode is added only when the replication factor is at least 3 and either half or more of the pipeline's DataNodes are gone, or the block is hflushed/appended and fewer DataNodes than the replication factor remain.
  3. ALWAYS - a new DataNode is always added when one of the pipeline's DataNodes fails.

The dfs.client.block.write.replace-datanode-on-failure.best-effort property defines the behavior of the write pipeline when the replacement of a failed DataNode fails too. If set to true, the writing operation will continue even if the write to the replacing DataNode fails. When set to false, an exception will be thrown in case of a writing failure on the replacing DataNode.
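
All these properties are client-side settings, so they can be defined in the client configuration, for example directly on the Configuration object used to create the FileSystem. The snippet below is only an illustration with example values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ReplaceDataNodeOnFailureConfiguration {
  public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    // enable the replacement of a failed DataNode in the write pipeline
    configuration.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    // define when the replacement happens
    configuration.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
    // continue the write even if the write to the replacing DataNode fails too
    configuration.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
    // the writes made with this FileSystem instance will use the settings above
    FileSystem fileSystem = FileSystem.get(configuration);
  }
}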

This post summarizes 3 kinds of recovery in HDFS. The first section describes what happens when the lease acquired on a written file expires. The second part concerns block recovery, which helps to ensure that all the replicas of a file's last block have the same length. The last section describes pipeline recovery, i.e. the recovery process triggered when a DataNode fails during file writing.