Dead-letter pattern on the cloud

Data is not always as clean as we would like it to be. The statement is even more true for semi-structured formats like JSON, where we feel working with a structure, but unfortunately, it's not enforced. Hence, from time to time, our code can unexpectedly fail. To handle this problem - as for many others - there is a pattern. It's called dead-letter qnd I will describe it below in the context of cloud services.

New ebook 🔥

Learn 84 ways to solve common data engineering problems with cloud services.

👉 I want my copy

But before I show you how implemented on various cloud providers, let's take a look at the Apache Flink. Sounds surprising, doesn't it? After all, I spend a lot of time explaining data engineering with the help of Apache Spark. However, for the dead-letter pattern, Apache Flink is - in my opinion - more idiomatic. Apache Flink has a concept of side output that you can use to receive invalid input records. Thanks to this clear separation, we can easily define a different sink for this invalid output part:

  val env = StreamExecutionEnvironment.createLocalEnvironment()
  import org.apache.flink.streaming.api.scala._
  val outputTag = OutputTag[String]("invalid-payloads")
  val inputDataSource = env.fromElements("not a JSON", "{empty JSON}", """{"letter": "A"}""")
  val validJsonEvents = inputDataSource.process(
    (value: String, ctx: ProcessFunction[String, String]#Context, out: Collector[String]) => {
    if (value.startsWith("""{"""")) {
    } else {
      ctx.output(outputTag, s"invalid=${value}")
  val invalidJsonEvents = validJsonEvents.getSideOutput(outputTag)

GCP BigQuery

Dead-Letter pattern is very often associated to the streaming processing. However, it's not a pure streaming concept! You can use it in batch pipelines as well! One of the first examples I found - BTW, it inspired me to write it down - is BigQuery, and more exactly, BigQueryIO connector available for Apache Beam.

According to the snippet available in the official Beam documentation, any BigQuery insert failure can be intercepted with the help of getFailedInsertsWithErr() method and passed to a dead-letter table.

AWS Lambda

Another service associated in my mind with the dead-letter pattern is AWS Lambda. This serverless service supports the pattern in 2 ways. The first of them use a dead-letter queue, an SNS or SQS where Lambda will send all events from erroneous executions. The second one is the on-failure destination, and it's currently similar to the dead-letter queue except that it supports more target destinations. In addition to SQS or SNS, you can send the events to Event Bridge or invoke another Lambda function.

It's also worth mentioning various possible configurations for error management. You can define the maximal number of attempts to process a record of a batch of records. Regarding the batch of records, you can also specify the recursive split of the failed batch so that the function can easier isolate the failing record.

Unlike the BigQuery example quoted before, you can notice here that the failure management is fully delegated to the service. Of course, you can write your block of code that will write the output to different places, depending on the processing outcome, but it's your responsibility to do so.

Azure ServiceBus

Regarding Azure, you will find the dead-letter pattern implementation in the Service Bus messaging service. It proposes built-in and application-specific rules that will put the messages to the Dead-Letter Queue. Among the rules applying for the built-in reasons, you will find the message expiration, unsuccessful processing within the max delivery attempts, or a too big header.

In addition to this built-in mechanism, you can also implement your own dead-letter rules with the help of SDK to, for example, forward unparseable messages to the Dead-Letter Queue. In Python, this can be achieved with the help of dead_letter_message(message, reason, error_description) function quoted in "Further reading" section.

As you can see then, the Dead-Letter pattern is one of the popular methods to deal with poison pills, i.e., events that cannot be correctly processed or ingested to the sink. However, using the pattern doesn't stop on forwarding these events. In addition to that, you should always keep an eye on them and analyze the dead-letter reasons. Maybe something changed on the producer's side and you may need to update the consumers? Or simply, you released a buggy version of the consumers and after fixing it, you may need to reingest the failed messages?