In our social media and marketing-driven era, it's quite hard to get things right. For me, there is one common misconception brought by the Modern Data Stack idea: that everything should now be ELT. In fact no, it shouldn't; it only can.
A virtual conference at the intersection of Data and AI. This is not a conference for the hype. It's real users talking about real experiences.
- 40+ speakers, with the likes of Hannes from DuckDB, Sol Rashidi, Joe Reis, Sadie St. Lawrence, Ryan Wolf from NVIDIA, and Rebecca from Lidl
- 12th September 2024
- Three simultaneous tracks
- Panels, Lightning Talks, Keynotes, Booth crawls, Roundtables, and Entertainment
- Topics include ingestion, FinOps for data, data for inference (feature platforms), and data for ML observability
- 100% virtual and 100% free
👉 Register here
As a curious data engineer, I have worked with both approaches but, as it turns out, I've never asked myself this fundamental question: what does it involve to have ETL-only or ELT-only pipelines? Since it sounds like a lightweight topic and I'm writing this on a Sunday afternoon, there won't be a better moment to approach it!
Definitions and introduction
If you are in data engineering you certainly know these terms, but just for the heck of it, let me introduce them before talking about any differences. ETL stands for Extract Transform Load, while ELT stands for Extract Load Transform. Both are common approaches for processing data. The difference between them is first of all technological. While ETL favors a pure data processing layer, such as an Apache Spark cluster, ELT prefers a kind of data warehouse-first citizenship, where the data lands directly in the final data store and gets transformed there.
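To make the two orderings concrete, here is a minimal, purely illustrative sketch. Every function in it is a stand-in I made up for the example, not a real library API:

```python
def extract() -> list[dict]:
    # Pretend this pulls rows from a source system.
    return [{"user": "u1", "amount": "42"}]

def transform(rows: list[dict]) -> list[dict]:
    # Reshape the data on a dedicated compute layer (e.g. a Spark job).
    return [{**row, "amount": int(row["amount"])} for row in rows]

def load(rows: list[dict], table: str) -> None:
    # Pretend this writes to the warehouse.
    print(f"loading {len(rows)} row(s) into {table}")

def etl_pipeline():
    load(transform(extract()), "curated.sales")  # the "T" happens before the "L"

def elt_pipeline():
    load(extract(), "raw.sales")  # the raw data lands in the warehouse first...
    # ...and the transformation runs inside it, typically as SQL or a dbt model:
    print("CREATE TABLE curated.sales AS SELECT user, CAST(amount AS INT) ...")

etl_pipeline()
elt_pipeline()
```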
Despite this technological difference, both approaches are somehow in a continuous fight, especially in our social media era. "The end of ETL", "Long live ELT", ... you've certainly seen headlines like these. And they were there even before this social media era, but at that time, the fight was mostly driven by technology. Either the storage was too expensive and people preferred ELT, or the data was so huge that scaling a data warehouse to handle it was impossible, so everyone implemented ETL. I'm oversimplifying a bit, but I found the decisions of that period less influenced by the marketing and superficial opinions I'm challenging in the last section.
Nowadays, ELT seems to have regained popularity but IMO, it's only partially related to purely technical reasons. The Modern Data Stack is an interesting technical concept, but it also has strong marketing support that hides the most important point to keep in mind while designing data platforms: the context. If the transformations you're performing are difficult to express in SQL, even with the support of UDFs or scalable data warehouses, you'll certainly go for ETL pipelines; otherwise, you'll have to write a monster that nobody except you will be able to maintain (can you imagine maintaining a query spanning multiple A4 pages?). If, on the other hand, your data logic is rather simple and can be fully supported with SQL, you may go with ELT. I wrote "may" because ELT implies other things that could make you stay with ETL, and vice versa!
So if you expected to find a definitive statement like "ETL is dead" or "ELT is dead" somewhere in this blog post, sorry to disappoint you. Instead, I'll give you some technical reasons for preferring one approach or the other. Or why not both, because they're not exclusive in a data platform!
Pros
- Flexibility. ETL gives more flexibility for expressing the data transformations. You can stay with SQL for simple operations or use a programming language for more complex ones (see the first sketch after this list). In ELT, even though you can define User-Defined Functions, they still apply at a column level and may not be as easy to understand and implement as the functions or classes in the code.
- Access control. Here ELT looks better. The data lives, by definition, in a single platform, so you have only one layer to manage. In ETL the data can be anywhere, and the difference is even more visible with multiple ETL vs. multiple ELT pipelines. Access management, thanks to the single-platform nature of ELT, should be easier to set up than for ETL.
- Regulatory compliance. If you need to remove some fields before transforming the dataset, you simply cannot do it in ELT, where the data is first loaded and only later transformed. It's possible in ETL because the "T" happens before the load and gives this flexibility (the second sketch after this list illustrates it).
- Data lineage. Since it's all reduced to a single storage, following the lineage of an ELT dataset should be easier than that of an ETL one.
- Scaling. Despite the innovation in storage and compute decoupling in data warehouses, it should be easier to scale ETL pipelines, especially when it comes to choosing purpose-adapted machine types to run, for example, memory-intensive or compute-intensive workloads.
- Stack. It's the downside of flexibility: ETL doesn't bring a uniform stack because the workload can run on any compute, distributed or single node, serverless or not... ELT is good here because the technical stack gets reduced to an orchestrator, a data warehouse, and SQL queries with a transformation framework such as dbt.
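To illustrate the flexibility point, here is a minimal PySpark sketch. The win is that normalize_amount is a plain, unit-testable Python function, while the equivalent SQL quickly turns into nested CASE and regexp expressions. The table and column names are made up for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def normalize_amount(raw: str) -> float | None:
    # Accept "12.5", "12,5" and "1 250,00"; reject anything else.
    if raw is None:
        return None
    cleaned = raw.replace(" ", "")
    if "," in cleaned and "." not in cleaned:
        cleaned = cleaned.replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None

spark = SparkSession.builder.appName("etl-flexibility").getOrCreate()
normalize_udf = udf(normalize_amount, DoubleType())

orders = spark.createDataFrame(
    [("o1", "12,5"), ("o2", "1 250,00"), ("o3", "n/a")],
    ["order_id", "amount_raw"])
orders.withColumn("amount", normalize_udf("amount_raw")).show()
```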
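And a hedged sketch of the compliance point: because the "T" runs before the load, PII can be dropped or pseudonymized before it ever reaches the warehouse, something a load-raw-first ELT pipeline cannot guarantee. The column names and the commented-out load call are illustrative assumptions:

```python
import hashlib
import pandas as pd

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop(columns=["ssn"])  # this field never leaves the pipeline
    df["email"] = df["email"].map(
        lambda e: hashlib.sha256(e.encode()).hexdigest())  # pseudonymize
    return df

raw = pd.DataFrame({"email": ["a@b.com"], "ssn": ["123-45-6789"], "amount": [42]})
clean = anonymize(raw)           # transform first...
# clean.to_sql("facts", engine)  # ...then load; in ELT the raw ssn would land first
print(clean)
```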
You can already see there is no silver bullet. Both patterns have their strong and weak points, and on purpose, I listed three pros for each of them. I won't focus on the cons because you can deduce them from the list above. Instead, I'll directly try to bust some myths about ELT and ETL.
Myths busted
- ELT has a lower latency - if you consider latency as loading raw data into your data warehouse, by definition it'll be faster since the "L" comes second. BUT if you consider latency as the overall data delivery, it's not necessarily true. Imagine a transformation that is hard to express in SQL and that, by bad luck, runs at the same time as other complex transformations of that type. Now compare it with the situation where you can create an isolated compute environment for this operation in an ETL pipeline. The latter will probably finish faster, especially if you have some scaling limits on the data warehouse.
- Low maintenance with ELT (...) since it relies on automated solutions that don't require manual updates - if we're talking here about the "T" part, ETL also relies on automated solutions that don't require manual updates. Think about AWS EMR, Databricks, or GCP Dataflow: have you ever needed to manually update the cluster dependencies? No, it's the cloud provider's responsibility. Of course, you have to maintain the libraries of your processing jobs, but they are there to facilitate your work after all. On the other hand, if you have a SQL query, it's much harder to include 3rd party libraries that can help you avoid reinventing the wheel.
- The main issues of ETL are maintenance, complexity, and expense (...) maintenance will involve frequent code changes, service restoration, modeling changes - first, the code changes issues ARE related to the code, so even an ELT transformation can require frequent code changes to follow the business rules! Complexity? Agreed, if you have a pure SQL or analytics background and have never coded anything. But knowing only SQL in today's context is a limitation that closes many professional opportunities. And if you think that maintaining SQL is easy, you've probably never had to reverse-engineer a SQL query printed on 9 A4 pages (lucky you!).
- ELT has inherent scalable cloud computing abilities, it's better to load and process large datasets (...) while ETL is best suited for smaller data chunks - oh boy, so I learned that ETL doesn't run on the cloud 😭 That was my sarcastic mode, don't get me wrong: ETL is also scalable and can also process large datasets!
- ETL is cheaper or ELT is cheaper - it depends. If you need a huge data warehouse to run all the SQL queries, or if you use long-running clusters for short-living jobs, you pay the price in both cases. It's hard to say that one is more expensive than the other without the context and the setup (the first sketch after this list shows how easily the conclusion flips).
- In ETL you impose business logic on the data and cannot easily rerun the pipeline if the business rules change - really? It's all a matter of trade-offs. If I cannot rerun, it means the pipeline doesn't store the extracted data, so I save some storage and maintenance costs. If I can, the pipeline has this flexibility, but it costs me more. And what about ELT? Agreed, it's more natural to store the outcome of the EL process, but do you always store the raw data? There is the same trade-off to make: if I store, I pay more and need to operationalize that part (access, retention, encryption, ...), but I can replay; if not, I cannot replay, but I reduce the maintenance and the cost. The second sketch after this list shows this replay trade-off in code.
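A back-of-the-envelope illustration of the cost myth, with completely made-up prices; the numbers are assumptions for the arithmetic, not benchmarks of any vendor:

```python
# Hypothetical hourly prices: a warehouse sized for the heaviest query vs.
# a few right-sized worker VMs spun up per run.
warehouse_price_per_hour = 8.0
cluster_price_per_hour = 2.5

# Scenario: a 30-minute transformation that runs 24 times a day.
runs_per_day, run_hours = 24, 0.5

elt_daily = warehouse_price_per_hour * runs_per_day * run_hours  # $96.00
etl_daily = cluster_price_per_hour * runs_per_day * run_hours    # $30.00
print(f"ELT: ${elt_daily:.2f}/day, ETL: ${etl_daily:.2f}/day")

# Flip the assumptions (a small warehouse, an idle long-running cluster)
# and the conclusion flips too, which is the whole point: it depends.
```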
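And a sketch of the replay trade-off: persisting the raw extract makes reruns possible at the cost of extra storage and operations. The paths and function names are illustrative assumptions:

```python
import json
from datetime import date
from pathlib import Path

RAW_DIR = Path("raw")  # the optional "replay buffer"; skip it to save costs

def extract() -> list[dict]:
    # Pretend this hits the source system.
    return [{"user": "u1", "amount": 42}]

def run(partition: str, replayable: bool = True) -> None:
    raw_file = RAW_DIR / f"{partition}.json"
    if replayable and raw_file.exists():
        rows = json.loads(raw_file.read_text())    # rerun from the stored extract
    else:
        rows = extract()                           # otherwise, hit the source again
        if replayable:
            RAW_DIR.mkdir(exist_ok=True)
            raw_file.write_text(json.dumps(rows))  # pay storage, gain replays
    # ... transform and load `rows` with the business rules of the day ...
    print(f"{partition}: processed {len(rows)} row(s)")

run(str(date.today()))  # the first run extracts and stores the raw copy
run(str(date.today()))  # a rerun replays from it, even if the rules changed
```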
There is probably one pattern that will fit your context better, the pattern whose drawbacks you'll accept more easily. Because yes, neither of them is perfect. Both have some downsides, both can be badly implemented, and that despite all the, often purely marketing, content you can find in favor of one or the other. And before I let you go, let me embed a LinkedIn post that I saw while writing the "Myths busted" part. It really shows where your focus should go. Thank you, Daniel, for this one!