Scaling data processing on the cloud

Processing static datasets is easier than dynamic ones that may change in time. Hopefully, cloud services offer various more and less manual features to scale the data processing logic. We'll see some of them in this blog post.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩

Orchestrator-based

The most primitive one from the list requires the biggest implementation effort. It relies on the flexibility of the cloud, where you can use various resources. Put another way, a job running on a 10-nodes cluster can freely use twice as many resources one day later.

The following steps summarize the solution:

Analyze the dataset to process - it can be a volume analysis and more detailed analysis if you have some metadata available.
Generate the cluster specification - use the analysis results from the previous step as the requirement for the cluster configuration. For example, if you assume that processing 1GB will "cost" you 2 cores and you have 20GB to process, then you will need 40 cores in the cluster.
Create the cluster with the specification - thanks to cloud elasticity, you should be able to request as many nodes as possible regarding the service quotas.
Submit the job
Terminate the cluster once the job completes.

If you like to have control and avoid runtime scaling activity, it can be the first option to consider. But if you prefer some more automated solution, you can go to the next step!

Metrics-based

This second, more automated, strategy relies on the metrics emitted to the cloud monitoring system. If you look at it, you will certainly have a feeling of defining an alerting strategy because you'll need to specify a rule composed of:

metric to monitor
threshold to observe
observation period
cooldown period
action

For example the following specification...:

metric to monitor - average CPU usage of the cluster
threshold to observe - > 80%
observation period - 5 minutes
cooldown period - 10 minutes
action - add 3 nodes

...would translate to: Add 3 nodes if the average CPU usage is higher than 80% for 5 minutes and the previous scaling action terminated at least 10 minutes ago.

You will find this mode in EMR (automatic scaling with a custom policy) and Dataproc (autoscaling policy).

Fully managed

The last option is fully managed by the cloud provider. All you have to do is define the cluster boundaries so the minimum and maximum number of nodes the service can use. The scaling logic is often hidden in the service, and you can only have a vague knowledge of it.

For example, AWS EMR comes with a managed autoscaling based on cluster metrics. Databricks is also based on the cluster being over- or under-utilized. When it comes to GCP Dataflow, it's based on the CPU and workload, but also uses a predictive scaling and has a possibility to "communicate" the scaling information to the service from BoundedSource#getEstimatedSizeBytes and BoundedReader#getFractionConsumed methods.

To sum up, you have a choice in defining your auto-scaling strategy. If you like to control this, you can choose the first or the second one. But if you prefer to trust your service provider, you can only define the min and max number of nodes and let the service auto-scale the job on its own.

Consulting

With nearly 16 years of experience, including 8 as data engineer, I offer expert consulting to design and optimize scalable data solutions. As an O’Reilly author, Data+AI Summit speaker, and blogger, I bring cutting-edge insights to modernize infrastructure, build robust pipelines, and drive data-driven decision-making. Let's transform your data challenges into opportunities—reach out to elevate your data engineering game today!

👉 contact@waitingforcode.com
🔗 past projects