Scaling data processing on the cloud

Processing static datasets is easier than dynamic ones that may change in time. Hopefully, cloud services offer various more and less manual features to scale the data processing logic. We'll see some of them in this blog post.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I'm currently writing one on that topic and the first chapters are already available in πŸ‘‰ Early Release on the O'Reilly platform

I also help solve your data engineering problems πŸ‘‰ πŸ“©


The most primitive one from the list requires the biggest implementation effort. It relies on the flexibility of the cloud, where you can use various resources. Put another way, a job running on a 10-nodes cluster can freely use twice as many resources one day later.

The following steps summarize the solution:

  1. Analyze the dataset to process - it can be a volume analysis and more detailed analysis if you have some metadata available.
  2. Generate the cluster specification - use the analysis results from the previous step as the requirement for the cluster configuration. For example, if you assume that processing 1GB will "cost" you 2 cores and you have 20GB to process, then you will need 40 cores in the cluster.
  3. Create the cluster with the specification - thanks to cloud elasticity, you should be able to request as many nodes as possible regarding the service quotas.
  4. Submit the job
  5. Terminate the cluster once the job completes.

If you like to have control and avoid runtime scaling activity, it can be the first option to consider. But if you prefer some more automated solution, you can go to the next step!


This second, more automated, strategy relies on the metrics emitted to the cloud monitoring system. If you look at it, you will certainly have a feeling of defining an alerting strategy because you'll need to specify a rule composed of:

For example the following specification...:

...would translate to: Add 3 nodes if the average CPU usage is higher than 80% for 5 minutes and the previous scaling action terminated at least 10 minutes ago.

You will find this mode in EMR (automatic scaling with a custom policy) and Dataproc (autoscaling policy).

Fully managed

The last option is fully managed by the cloud provider. All you have to do is define the cluster boundaries so the minimum and maximum number of nodes the service can use. The scaling logic is often hidden in the service, and you can only have a vague knowledge of it.

For example, AWS EMR comes with a managed autoscaling based on cluster metrics. Databricks is also based on the cluster being over- or under-utilized. When it comes to GCP Dataflow, it's based on the CPU and workload, but also uses a predictive scaling and has a possibility to "communicate" the scaling information to the service from BoundedSource#getEstimatedSizeBytes and BoundedReader#getFractionConsumed methods.

To sum up, you have a choice in defining your auto-scaling strategy. If you like to control this, you can choose the first or the second one. But if you prefer to trust your service provider, you can only define the min and max number of nodes and let the service auto-scale the job on its own.

If you liked it, you should read:

πŸ“š Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!