Design patterns applied to the data

GoF Design Patterns are pretty easy to understand if you are a programmer. You can read one of the many books or articles about them and analyze their implementation in the programming language of your choice. But they can be less obvious for data people with a weaker software engineering background. If you are in this group and are wondering what these GoF Design Patterns are about, I hope this article will help a bit.


To explain the Design Patterns, I will not dive into their implementation details. Instead, I will compare them to data components that you may have already used in your data engineering life, with only a short illustrative snippet here and there.

Singleton - master dataset

That's the easiest one! Singleton defines an object that should be instantiated only once, i.e. in your application, you will find a single instance of it. "Single" and "instance" are the key terms for finding its analogy in the data world. Initially, I wanted to put here the master dataset concept from the Lambda architecture, but I discovered later that this architectural component is part of a more global concept called Master Data Management (MDM).

MDM is a set of procedures established to ensure that the master (aka reference) dataset is accurate and consistent across the whole organization. In less formal terms, it ensures that all downstream consumers use the same data for a given concept (e.g. customers, products). For consistency reasons, it's then mandatory to expose only one master dataset representing that concept. Hence the analogy with singletons.
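If you want to see the pattern itself, here is a minimal Python sketch; the MasterCustomerDataset class is a made-up, hypothetical accessor standing in for the reference dataset that every consumer should share rather than re-create:

```python
class MasterCustomerDataset:
    """Hypothetical accessor for the organization's single reference customer dataset."""
    _instance = None  # the one and only instance

    def __new__(cls):
        # Create the instance on the first call only; reuse it afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


# Every consumer gets the same object, just like every downstream
# pipeline should read the same master dataset.
assert MasterCustomerDataset() is MasterCustomerDataset()
```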

Facade - view

The next design pattern is related to a data concept that you have certainly met in your work with relational databases: views. What is the relationship with the design patterns? The view concept illustrates the facade pattern pretty well.

The goal of the facade pattern is to hide the complexity of the underlying architecture. The same applies to views, which you can define in various ways, for example with UNION or JOIN operators. If all the view users had to "remember" the underlying query instead of the view, its evolution would be tough because it would require a *synchronized* update of all the clients and the definition. However, thanks to this extra intermediary view abstraction, you, as the data owner, can freely change the definition without disturbing the consumers. Of course, only if you still expose the same structure.
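As a quick illustration, here is a minimal PySpark sketch where the view hides a JOIN from its consumers; the customers and orders tables and their columns are made up for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("facade-view").getOrCreate()

# The "complex" underlying definition: a join between two hypothetical tables
# (customers and orders are assumed to already exist).
spark.sql("""
    CREATE OR REPLACE TEMP VIEW customer_orders AS
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
""")

# Consumers only see the facade; the JOIN can change without impacting them
# as long as the exposed columns stay the same.
spark.sql("SELECT name, amount FROM customer_orders").show()
```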

Memento - Delta Lake versions

The next design pattern is called memento. Its goal is to guarantee state recoverability. A great example is the "Undo" and "Redo" actions in visual text editors. And the data world has plenty of examples too.

Among the data memento examples, you will find object store versioning (e.g. S3 object versions) but also the versioning added in the new ACID-compatible file formats like Delta Lake. 🆘 I'm still looking for a proper definition for these formats (Delta Lake, Hudi, Iceberg). If you have found anything better than "ACID-compatible", I will be happy to learn and maybe use it!
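To make it more concrete, here is a small PySpark sketch using Delta Lake's time travel; the table path is made up and I assume a Spark session already configured with the Delta Lake extensions:

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession configured with the Delta Lake extensions.
spark = SparkSession.builder.appName("memento-delta").getOrCreate()

table_path = "/tmp/orders_delta"  # hypothetical table location

# Current state of the table.
current = spark.read.format("delta").load(table_path)

# The "memento": read the table exactly as it looked at version 0.
previous = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
```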

Observer - a streaming consumer

This one is an easy one. By looking at the observer's definition, you can easily find its counterpart in the data world, because that definition very often includes the "publish-subscribe" keywords.

An example of an observer in the data world is a streaming consumer, e.g. an Apache Kafka consumer. Initially, it consumes all records produced to a given topic. We say that the consumer is an observer of the subject (the topic). Whenever the consumer is no longer interested in receiving the events of that topic, it can simply stop and unsubscribe.

Of course, the streaming example is valid as long as the topic contains all the events the given consumer is interested in. As an alternative, you can think of the bindings in RabbitMQ.
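If you prefer to see the subscribe/unsubscribe part in code, here is a minimal sketch with the kafka-python client; the broker address, topic and group names are made up:

```python
from kafka import KafkaConsumer

# The consumer (observer) subscribes to the topic (subject).
consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="orders-observers",
    auto_offset_reset="earliest",
)
consumer.subscribe(["orders"])

# Get notified about new records as they arrive.
for record in consumer:
    print(record.value)
    break  # stop after the first record for the sake of the example

# No longer interested? Simply unsubscribe and stop observing.
consumer.unsubscribe()
consumer.close()
```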

Adapter - Apache Spark

If you have ever had trouble charging your electronic device on holidays because of a different socket, you already know the meaning of the adapter design pattern. The idea here is to use an abstraction to make incompatible interfaces work together. In the real world it's the socket (plug) adapter, but what would be the adapter's example in the data world?

Apache Spark, of course! And, more exactly, its data source API. Imagine your data engineering life without the possibility of reading and processing different data formats within a single tool. What word comes to your mind when you imagine that? Nightmare? Fortunately, thanks to Apache Spark's unified data processing model - after all, everything ends up represented as a single processing abstraction - your work and mine are much easier to do :).
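A short PySpark sketch shows the idea; the paths and the event_type column are made up, but whatever the on-disk format, the data source API adapts it to the same DataFrame abstraction:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adapter-data-source").getOrCreate()

# Three incompatible on-disk formats, one common abstraction after reading.
json_events = spark.read.format("json").load("/data/events_json")
parquet_events = spark.read.format("parquet").load("/data/events_parquet")
csv_events = spark.read.format("csv").option("header", "true").load("/data/events_csv")

# From here on, the processing code doesn't care where the data came from.
for events in (json_events, parquet_events, csv_events):
    events.groupBy("event_type").count().show()
```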

Chain of responsibility - ETL pipeline

The last design pattern, called chain of responsibility, is a great example of the software engineering principle of preferring composition over inheritance. The idea is to create a chain of classes/functions that will be called sequentially to process the input parameter. In 2017 I wrote an example of its application to Apache Spark UDFs, but it's not the only one you can find in the data world.

For me, this pattern looks a lot like an ETL pipeline, where the input is transformed by consecutive steps until the last stage generates the final output.

Apart from this sequential character, the second point describing the chain of responsibility is the ability of a handler in the chain to skip the request, for example when it doesn't know how to handle it. You can achieve a similar thing for ETL pipelines with the concept of branching.
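Below is a minimal Python sketch of such a chain with made-up handlers; each handler either processes the record or passes it to the next one, which is also how a branching ETL step skips records it's not responsible for:

```python
class Handler:
    """A step in the chain; subclasses decide whether they can process a record."""
    def __init__(self, next_handler=None):
        self.next_handler = next_handler

    def handle(self, record):
        if self.can_handle(record):
            return self.process(record)
        if self.next_handler:
            return self.next_handler.handle(record)  # skip to the next step
        return record  # nobody handled it

    def can_handle(self, record):
        raise NotImplementedError

    def process(self, record):
        raise NotImplementedError


class OrderEnricher(Handler):
    def can_handle(self, record):
        return record.get("type") == "order"

    def process(self, record):
        return {**record, "enriched": True}


class ClickFilter(Handler):
    def can_handle(self, record):
        return record.get("type") == "click"

    def process(self, record):
        return None  # drop clicks


# Build the chain: orders are enriched, clicks are dropped, the rest passes through.
pipeline = OrderEnricher(next_handler=ClickFilter())
print(pipeline.handle({"type": "order", "id": 1}))
print(pipeline.handle({"type": "click", "id": 2}))
print(pipeline.handle({"type": "view", "id": 3}))
```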

The patterns described here represent only a small subset. There are many other patterns and if you are not comfortable with them, finding a data analogy can be an excellent way to improve your understanding.