When you need to execute a single isolated job, you have many options, and a data orchestrator is not always necessary. That does not hold for the opposite scenario, with many dependent jobs, where a data orchestrator not only orchestrates the workload but also provides a monitoring layer. So the question arises: what should you use on the cloud?
Data orchestration services
That's the obvious choice: orchestrating data workloads with data orchestration technologies. And you shouldn't feel lost among the offered services. Both AWS and GCP have a managed version of Apache Airflow. On AWS it's called Amazon Managed Workflows for Apache Airflow, whereas on GCP you will find it under the name Cloud Composer. How do they differ from the Open Source version? Mostly by a smaller setup and maintenance overhead and a better integration with the provider's other services. For example, in Cloud Composer you can rely on Secret Manager to store the connections and variables safely, and manage the whole solution with Infrastructure as Code.
But managed Apache Airflow is not the only data orchestrator available on the cloud. Azure took a different strategy and stayed with its own solution called Data Factory. It also integrates pretty well with other cloud services, like Key Vault, to store and use any secrets in the pipeline. What sets it apart from the 2 previous services is its no-code character. Well, you will need some code to move the pipeline from the development to the production environment, but to author the workload, you will more often use the UI than ARM templates or the Azure SDK.
An alternative to the classical data orchestrators? Serverless orchestrators like AWS Step Functions or Azure Durable Functions. They integrate pretty well with the serverless function offerings and are therefore good candidates for event-driven workloads. They are a better approach than chaining these functions manually or with a custom solution because they provide the big picture of the workload and native error management.
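To make the native error management more concrete, here is a minimal sketch of a Step Functions workflow written in the Amazon States Language and built as a plain Python dictionary. The two-step extract and load pipeline, the state names and the Lambda ARNs are all hypothetical; only the `Retry` and `Catch` fields are the point of the example.

```python
import json

# Hypothetical Lambda ARNs, used only for illustration.
EXTRACT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:extract"
LOAD_ARN = "arn:aws:lambda:us-east-1:123456789012:function:load"
NOTIFY_ARN = "arn:aws:lambda:us-east-1:123456789012:function:notify"


def build_state_machine(extract_arn, load_arn, notify_arn):
    """Return an Amazon States Language definition as a dict."""
    # Retry a failed task up to 3 times, doubling the wait between attempts.
    retry = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 10,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }]
    # Once the retries are exhausted, route any error to a notification step.
    catch = [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}]
    return {
        "Comment": "Illustrative two-step pipeline with error handling",
        "StartAt": "Extract",
        "States": {
            "Extract": {"Type": "Task", "Resource": extract_arn,
                        "Retry": retry, "Catch": catch, "Next": "Load"},
            "Load": {"Type": "Task", "Resource": load_arn,
                     "Retry": retry, "Catch": catch, "End": True},
            "NotifyFailure": {"Type": "Task", "Resource": notify_arn,
                              "Next": "Failed"},
            "Failed": {"Type": "Fail", "Error": "PipelineFailed",
                       "Cause": "A task failed after all retries"},
        },
    }


definition = build_state_machine(EXTRACT_ARN, LOAD_ARN, NOTIFY_ARN)
print(json.dumps(definition, indent=2))
```

The retries and the failure branch are declared next to each task, so there is no need to write them inside the functions themselves.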
Besides events, they can also trigger workloads from a simple time-based expression and use compute resources other than serverless functions. On AWS, for example, a Step Functions pipeline can automatically create an EMR cluster.
In addition to the runtime capabilities, the serverless orchestrators, since they're serverless, offer a real pay-as-you-go pricing model where the total cost is proportional to the number of invocations or state transitions.
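As a rough illustration of this pricing model, the sketch below estimates a monthly bill from the number of state transitions. The price per transition is a placeholder assumption, not an official figure, and the hourly 6-transition workflow is hypothetical; check the provider's pricing page before relying on any number.

```python
from decimal import Decimal


def monthly_cost(executions_per_month, transitions_per_execution,
                 price_per_transition):
    """Estimate the bill as total state transitions times the unit price."""
    total_transitions = executions_per_month * transitions_per_execution
    return total_transitions * price_per_transition


# Placeholder unit price in USD (assumption, not an official figure).
PRICE_PER_TRANSITION = Decimal("0.000025")

# Hypothetical workflow: runs every hour for 30 days, 6 transitions per run.
cost = monthly_cost(30 * 24, 6, PRICE_PER_TRANSITION)
print(f"Estimated monthly cost: {cost} USD")
```

A workflow that never runs costs nothing here, which is the main difference from an always-on orchestrator cluster.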
However, this serverless option has some drawbacks too. One of the biggest is the lack of a monitoring and alerting layer for complex scenarios with multiple dependencies, which you get with a pure data orchestration solution. Sure, you can still use the cloud monitoring services, but do they provide a global view of the workload, or the possibility to reprocess or backfill the data out-of-the-box? Probably not without an extra coding effort.
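To give an idea of that extra coding effort, here is a minimal sketch of a backfill helper for a daily workload. It assumes one execution per day, and the `trigger` callable is hypothetical: in practice it would wrap whatever call starts an execution of your serverless workflow.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """Yield each daily partition from start to end, both included."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(start, end, trigger):
    """Call trigger once per day to reprocess and return the run names."""
    run_names = []
    for day in backfill_dates(start, end):
        run_name = f"backfill-{day.isoformat()}"
        # The payload tells the workflow which day it has to reprocess.
        trigger(run_name, {"execution_date": day.isoformat()})
        run_names.append(run_name)
    return run_names


# Stand-in for a real trigger: it only records what would be started.
triggered = []
runs = backfill(date(2021, 7, 1), date(2021, 7, 3),
                lambda name, payload: triggered.append((name, payload)))
print(runs)
```

A data orchestrator like Apache Airflow ships this logic, plus the tracking of which runs succeeded, whereas here you would have to build and maintain it yourself.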
Data processing services
The final orchestration possibility comes from the data processing services like EMR, Databricks or Dataproc. They are all great runtime environments for Apache Spark or Apache Flink jobs, but they can also orchestrate multiple dependent jobs in the same cluster.
Even though the orchestration part of these services is getting more and more advanced, as shown recently by the announcement of the Databricks multiple tasks orchestration, it's still an extra feature limited to a single job flow. In data orchestration, you will often need a global picture of the system (= multiple jobs) and the possibility to chain different pipelines.
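Conceptually, orchestrating dependent jobs inside a single job flow comes down to running them in dependency order. The sketch below shows the idea with Python's standard `graphlib` module (Python 3.9+). The job names are hypothetical, and this is not how any of the services above implement it.

```python
from graphlib import TopologicalSorter

# Hypothetical job flow: each key lists the jobs it depends on.
dependencies = {
    "clean_orders": {"ingest"},
    "clean_customers": {"ingest"},
    "join": {"clean_orders", "clean_customers"},
    "export": {"join"},
}


def execution_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The ordering only covers the jobs declared in this one flow. Chaining several flows, or getting a view across all of them, is what the dedicated orchestrators add on top.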
It's quite probable that you will most often use the data orchestration services in data engineering workloads. They integrate very well with batch and classical data workloads. However, the alternatives are quite interesting too, like the serverless orchestrators for event-driven architectures or the task orchestration built into the data processing services.