Data orchestration on the cloud

When you need to execute a single isolated job, you have many options and a data orchestrator is not always necessary. The situation changes as soon as you have multiple dependent jobs: there, a data orchestrator not only coordinates the workload but also provides a monitoring layer. So the question arises: what should you use on the cloud?

Data orchestration services

That's the obvious choice: orchestrating data workloads with data orchestration technologies. And you shouldn't feel lost among the offered services. AWS and GCP both have a managed version of Apache Airflow. On AWS it's called Amazon Managed Workflows for Apache Airflow (MWAA), whereas on GCP you will find it under the name Cloud Composer. How do they differ from the open source version? Mostly by a smaller setup and maintenance overhead and a better integration with the provider's other services. For example, in Cloud Composer you can rely on Secret Manager to store the connections and variables safely, and manage the whole solution with Infrastructure as Code.

But a managed Apache Airflow is not the only data orchestrator available on the cloud. Azure took a different strategy and stayed with its own solution called Data Factory. It also integrates pretty well with other cloud services, like Azure Key Vault, to store and use secrets in the pipeline. The difference from the 2 previous services is its no-code character. Well, you will still need some code to move the pipeline from the development to the production environment, but to author the workload you will use the UI more often than ARM templates or the Azure SDK.

Serverless

An alternative to the classical data orchestrators? Serverless orchestrators like AWS Step Functions or Azure Durable Functions. They integrate pretty well with the serverless function offerings and are therefore good candidates for event-driven workloads. They seem to be a better approach than chaining these functions manually or with a custom solution, because they provide a big picture of the workload and native error handling.
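To make the native error handling more concrete, here is a minimal sketch of a Step Functions state machine definition written as a Python dictionary in the Amazon States Language format. The Lambda ARNs, the state names and the retry values are hypothetical and only illustrate the shape of the `Retry` and `Catch` fields.

```python
import json

# Hypothetical Lambda ARNs, used only for illustration.
EXTRACT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:extract"
LOAD_ARN = "arn:aws:lambda:us-east-1:123456789012:function:load"
NOTIFY_ARN = "arn:aws:lambda:us-east-1:123456789012:function:notify"

# Any error not handled by a retry is routed to the notification state.
CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}]


def build_state_machine() -> dict:
    """Return a two-step pipeline with declarative retries and a failure branch."""
    return {
        "Comment": "Extract then load, with retries and a failure notification",
        "StartAt": "Extract",
        "States": {
            "Extract": {
                "Type": "Task",
                "Resource": EXTRACT_ARN,
                # Retry failed tasks up to 3 times with an exponential backoff.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                "Catch": CATCH_ALL,
                "Next": "Load",
            },
            "Load": {
                "Type": "Task",
                "Resource": LOAD_ARN,
                "Catch": CATCH_ALL,
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": NOTIFY_ARN,
                "Next": "Failed",
            },
            "Failed": {
                "Type": "Fail",
                "Error": "PipelineFailed",
                "Cause": "A task failed after exhausting its retries",
            },
        },
    }


# JSON document that could be submitted as the state machine definition.
definition_json = json.dumps(build_state_machine(), indent=2)
```

The retries and the failure branch live in the definition itself, so the functions don't need any custom chaining or retry code.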

Besides events, they can also trigger workloads from a simple time-based expression and use compute resources other than serverless functions. For example, on AWS a Step Functions pipeline can automatically create an EMR cluster.
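The EMR scenario could look like the sketch below, which relies on the Step Functions service integrations for EMR (`createCluster.sync`, `addStep.sync`, `terminateCluster`). The cluster name, release label, roles, instance settings, S3 path and schedule are placeholder assumptions, and a real cluster configuration would need more parameters than shown here.

```python
# Assumed EventBridge schedule expression: every day at 06:00 UTC.
SCHEDULE_EXPRESSION = "cron(0 6 * * ? *)"

# Hypothetical pipeline: create a cluster, run one Spark step, terminate the cluster.
emr_pipeline = {
    "StartAt": "CreateCluster",
    "States": {
        "CreateCluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
            "Parameters": {
                "Name": "nightly-spark",  # placeholder values below
                "ReleaseLabel": "emr-6.15.0",
                "Applications": [{"Name": "Spark"}],
                "ServiceRole": "EMR_DefaultRole",
                "JobFlowRole": "EMR_EC2_DefaultRole",
                "Instances": {
                    "KeepJobFlowAliveWhenNoSteps": True,
                    "InstanceGroups": [{
                        "Name": "primary",
                        "InstanceRole": "MASTER",
                        "InstanceType": "m5.xlarge",
                        "InstanceCount": 1,
                    }],
                },
            },
            # Keep the cluster id in the state so later steps can reference it.
            "ResultPath": "$.cluster",
            "Next": "RunSparkJob",
        },
        "RunSparkJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId.$": "$.cluster.ClusterId",
                "Step": {
                    "Name": "spark-job",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://example-bucket/jobs/job.py"],
                    },
                },
            },
            "ResultPath": "$.step",
            # Terminate the cluster even when the Spark step fails.
            "Catch": [{
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": "TerminateCluster",
            }],
            "Next": "TerminateCluster",
        },
        "TerminateCluster": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
            "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
            "End": True,
        },
    },
}
```

The schedule expression would typically be attached to a scheduling rule that targets the state machine, so the cluster only exists for the duration of the run.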

In addition to the runtime capabilities, these orchestrators, since they're serverless, offer a real pay-as-you-go pricing model where the total cost is proportional to the number of invocations or state transitions.
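As a back-of-the-envelope illustration of this pricing model, the sketch below multiplies the number of runs by the number of state transitions per run and by a unit price. The unit price is a made-up value; the real one depends on the provider, the region and the workflow type.

```python
from decimal import Decimal


def estimate_monthly_cost(runs_per_month: int,
                          transitions_per_run: int,
                          price_per_transition: Decimal) -> Decimal:
    """Estimate the orchestration cost as runs x transitions x unit price."""
    return runs_per_month * transitions_per_run * price_per_transition


# Hypothetical unit price, NOT an actual cloud provider price.
ASSUMED_PRICE_PER_TRANSITION = Decimal("0.001")

# One daily pipeline made of 10 state transitions.
monthly_cost = estimate_monthly_cost(30, 10, ASSUMED_PRICE_PER_TRANSITION)
```

The estimate only covers the orchestration layer; the compute triggered by each state (functions, clusters) is billed separately.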

However, this serverless option has some drawbacks too. One of the biggest is the lack of the monitoring and alerting layer that a pure data orchestration solution gives you for complex scenarios with multiple dependencies. Sure, you can still use the cloud monitoring services, but do they provide a global view of the workload, or the possibility to reprocess or backfill the data out-of-the-box? Probably not without an extra coding effort.
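To give an idea of that extra coding effort, here is a minimal sketch of a backfill planner that generates one execution request per day of the period to reprocess. The pipeline name, the `execution_date` input field and the naming scheme are assumptions made for this example.

```python
from datetime import date, timedelta
from typing import Dict, Iterator, List


def daily_partitions(start: date, end: date) -> Iterator[date]:
    """Yield every day between start and end, both included."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_backfill_requests(pipeline_name: str, start: date, end: date) -> List[Dict]:
    """Build one execution request per day to reprocess."""
    return [
        {
            # A deterministic name makes an already submitted day easy to spot.
            "name": f"{pipeline_name}-backfill-{day.isoformat()}",
            "input": {"execution_date": day.isoformat()},
        }
        for day in daily_partitions(start, end)
    ]


requests = build_backfill_requests("sales", date(2021, 1, 30), date(2021, 2, 2))
```

Each request would then be submitted to the serverless orchestrator, for instance through its SDK, with the date passed as the input of the run. A data orchestrator like Apache Airflow provides this kind of feature natively.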

Data processing services

The final orchestration possibility comes from the data processing services like EMR, Databricks or Dataproc. They are all great runtime environments for Apache Spark or Apache Flink jobs, and they can also orchestrate multiple dependent jobs in the same cluster.

Even though the orchestration part of these services grows more and more complex, as shown recently by the announcement of multi-task orchestration in Databricks, it's still an extra feature limited to a single job flow. In data orchestration, you will often need a global picture of the system (= multiple jobs) and the possibility to chain different pipelines.
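The single job flow limitation can be pictured with the sketch below. It models a multi-task job as a list of tasks with dependencies, loosely following the `task_key` and `depends_on` fields of a Databricks multi-task job definition, and derives the execution order. The job name and task keys are hypothetical, and the task bodies (notebooks, JARs, clusters) are left out on purpose.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List

# Hypothetical multi-task job: 3 dependent tasks inside ONE job.
job_definition = {
    "name": "daily_sales",
    "tasks": [
        {"task_key": "ingest"},
        {"task_key": "clean", "depends_on": [{"task_key": "ingest"}]},
        {"task_key": "aggregate", "depends_on": [{"task_key": "clean"}]},
    ],
}


def execution_order(job: Dict) -> List[str]:
    """Return the task keys in an order that respects the declared dependencies."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in job["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(job_definition)
```

The dependencies are resolved inside the job only. Nothing in this definition can express that another job, or a pipeline running on a different service, should start once `aggregate` completes.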

It's quite probable that you will most often use the data orchestration services in data engineering workloads. They integrate very well with batch and classical data workloads. However, the alternatives are also interesting in some contexts, like the serverless orchestrators for event-driven architectures or the task orchestration built into the data processing services.
