Modern data stack. Am I too old?

More and more often in my daily contact with the data world I hear the word "modern", and I couldn't get it. I was doing cloud data engineering with Apache Spark and Apache Beam, so wasn't that modern at all? I have no idea as I'm writing this introduction, but I hope to know more about the term by the end of the article!


That was my first concern: how to define this modern data stack? Some of the already written articles, linked in the "Read more" section under the article, helped me get the keywords describing this modernization of our world:

Without naming any technology, an implementation of the Modern Data Stack could look like this:

"My" architecture can differ from the ones you will find in other blog posts. I didn't put the data warehouse as a single central data repository because it can make sense to use a data lake or a lakehouse instead (data type constraints, data processing strategy, ...). Of course, it's only my vision and you can freely replace the "Data storage" layer with "Data warehouse" to limit its transformation scope to the ELT paradigm. But I decided not to do that and preferred to keep the implementation open.

Am I too old?

Oh, yes, I am! This year I'm going to turn 35 👴 I do my best to keep myself up-to-date with the tech news and to work with up-to-date technologies and people smarter than me. Sure, I'm missing some things, like the ACID file formats I'm trying to catch up on this year, but I didn't feel old reading the promise of a Modern Data Stack.

Don't get me wrong: it does bring new tools to easily manage the obscure parts of a data system, such as observability, quality, and governance. But I have a feeling that its main principle relies on easily scalable, pay-as-you-go, preferably managed resources. And the cloud has provided them for a while already. Maybe the Modern Data Stack is a term to clearly separate the pre-cloud and post-cloud eras?

Even for the core architecture, the MDS remains based on the classical idea of moving data from one point to another with the goal of making it usable by end-users. Indeed, it identifies some tasks with new terms, such as Reverse ETL or Customer Data Platform, but the tasks themselves are not new. I bet you've already pushed data to a 3rd party system via an API or created a user profile table in your data system. The only difference I can see is that you'll now find them identified as service offerings. So you can just use them and delegate the maintenance part to other companies.
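To illustrate how little magic hides behind the "push the data to a 3rd party system via an API" task, here is a minimal sketch of it. Everything in it is my assumption for the sake of the example: the `reverse_etl` and `batched` names, the batch size, and the `send` callable that stands in for whatever HTTP client call the 3rd party API would require.

```python
from typing import Callable, Iterable, Iterator


def batched(rows: Iterable[dict], size: int) -> Iterator[list]:
    """Group rows into lists of at most `size` items."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the last, possibly incomplete, batch.
        yield batch


def reverse_etl(profiles: Iterable[dict],
                send: Callable[[list], None],
                batch_size: int = 2) -> int:
    """Push user profile rows to an external system, batch by batch.

    `send` is a hypothetical placeholder for the real API call.
    Returns the number of rows pushed.
    """
    sent = 0
    for batch in batched(profiles, batch_size):
        send(batch)
        sent += len(batch)
    return sent
```

A managed Reverse ETL service would add retries, rate limiting, and schema mapping on top, and that is exactly the maintenance part you delegate.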

This leads me to another point, the marketing. The MDS also impacts this field. After reading dozens of blog posts to discover the topic, I had a feeling that only ELT-based systems are considered modern. In theory, there is nothing wrong with this data store-based transformation logic. But we should remain pragmatic and not consider it a model or a condition for being "modern". It's impossible to implement everything with SQL and UDFs, or to store all kinds of data in the data warehouse.

But this marketing part also has some positive impact. It promotes data observability, quality, and governance as the top-level components of a viable data system. Again, you've certainly implemented them in your previous projects, but their implementation was maybe considered less important than writing a new ETL/ELT pipeline. This time it should be different as they will be implemented as a part of the pipeline, not as something on the side. And this reminds me of the battle of adding unit tests to software projects. In the beginning, some people considered them an addition that could be done after delivering the feature, maybe in 1 or 2 sprints (and sometimes not at all if the task had a low priority). Today, it's one of the unbreakable rules for promoting the code from the dev to the production environment. Hopefully, the same will happen with observability, quality, and governance!
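To make "a part of the pipeline, not something on the side" more concrete, below is a minimal sketch of a transformation step with a built-in quality gate. The validation rules, the `user_id` and `amount` fields, and the 10% threshold are only assumptions for the example and not a reference to any specific tool.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    total: int
    rejected: int

    @property
    def rejection_rate(self) -> float:
        return self.rejected / self.total if self.total else 0.0


def is_valid(row: dict) -> bool:
    # Hypothetical rules: a non-empty user_id and a non-negative numeric amount.
    amount = row.get("amount")
    return bool(row.get("user_id")) and isinstance(amount, (int, float)) and amount >= 0


def transform_with_quality_gate(rows: list, max_rejection_rate: float = 0.1):
    """Validate, then transform; fail the step if too many rows are invalid."""
    valid = [row for row in rows if is_valid(row)]
    report = QualityReport(total=len(rows), rejected=len(rows) - len(valid))
    if report.rejection_rate > max_rejection_rate:
        # The quality check blocks the pipeline, like a failing unit test blocks a release.
        raise ValueError(f"Rejection rate {report.rejection_rate:.0%} is above the threshold")
    transformed = [{**row, "amount_cents": round(row["amount"] * 100)} for row in valid]
    return transformed, report
```

The returned report is what an observability layer could collect on every run, so the quality signal exists from day one instead of being added "in 1 or 2 sprints".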

Although this second section sounds less enthusiastic, do not take it as criticism. The Modern Data Stack brings some difficult parts of a data architecture to light and proposes as-a-service solutions for them. It also marks a clear distinction between the eras of on-premise and cloud data systems. Maybe it lacks a stronger Machine Learning focus, or it's too much into ELT, but that's only the outcome of my interpretation. I'll be happy to read your thoughts on this topic in the comments under the article!