Landing zone or direct writes?

I don't know whether it's a good sign or not, but I start having some convictions about building data systems. Of course, building an architecture will always be the story of trade-offs but there are some practices that I tend to prefer than the others. And in this article I will share my thoughts on one of them.

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

TL; TR: that's only my thoughts about the topic and I don't have the monopolies of knowledge. Take it carefully and do not consider it as a single source of truth.

First and foremost, what do I mean by landing zone and direct writes? To understand the difference, let's take the example of a data pipeline that processes some JSON files and has to write it to a data warehouse storage (structured data). The dataflow is represented in the following diagram:

How can this pipeline be implemented? One approach would write the data directly to the data warehouse solution from the data processing logic and that's something I'm calling here the direct writes. The second one would pass through intermediary storage and the data loading step would be performed by another process managed by an orchestrator. For the purpose of this blog post I called this intermediary storage landing zone. The following picture summarizes these 2 approaches:

I won't hide it, I like the second one. Why? Due to the following reasons:

But don't get me wrong, I'm still convinced that the solution should be adapted to the problem and maybe for different use cases, a landing zone will not exist. I just hope that thanks to the list you will maybe discover new pros of using a landing zone that may have a real impact on your architecture design! And I'm really curious about your vision for this problem and will be glad if you share it on the comments below :)

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems contact@waitingforcode.com đź“©