Scaling data processing on the cloud

Processing static datasets is easier than dynamic ones that may change in time. Hopefully, cloud services offer various more and less manual features to scale the data processing logic. We'll see some of them in this blog post.

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

Orchestrator-based

The most primitive one from the list requires the biggest implementation effort. It relies on the flexibility of the cloud, where you can use various resources. Put another way, a job running on a 10-nodes cluster can freely use twice as many resources one day later.

The following steps summarize the solution:

  1. Analyze the dataset to process - it can be a volume analysis and more detailed analysis if you have some metadata available.
  2. Generate the cluster specification - use the analysis results from the previous step as the requirement for the cluster configuration. For example, if you assume that processing 1GB will "cost" you 2 cores and you have 20GB to process, then you will need 40 cores in the cluster.
  3. Create the cluster with the specification - thanks to cloud elasticity, you should be able to request as many nodes as possible regarding the service quotas.
  4. Submit the job
  5. Terminate the cluster once the job completes.

If you like to have control and avoid runtime scaling activity, it can be the first option to consider. But if you prefer some more automated solution, you can go to the next step!

Metrics-based

This second, more automated, strategy relies on the metrics emitted to the cloud monitoring system. If you look at it, you will certainly have a feeling of defining an alerting strategy because you'll need to specify a rule composed of:

For example the following specification...:

...would translate to: Add 3 nodes if the average CPU usage is higher than 80% for 5 minutes and the previous scaling action terminated at least 10 minutes ago.

You will find this mode in EMR (automatic scaling with a custom policy) and Dataproc (autoscaling policy).

Fully managed

The last option is fully managed by the cloud provider. All you have to do is define the cluster boundaries so the minimum and maximum number of nodes the service can use. The scaling logic is often hidden in the service, and you can only have a vague knowledge of it.

For example, AWS EMR comes with a managed autoscaling based on cluster metrics. Databricks is also based on the cluster being over- or under-utilized. When it comes to GCP Dataflow, it's based on the CPU and workload, but also uses a predictive scaling and has a possibility to "communicate" the scaling information to the service from BoundedSource#getEstimatedSizeBytes and BoundedReader#getFractionConsumed methods.

To sum up, you have a choice in defining your auto-scaling strategy. If you like to control this, you can choose the first or the second one. But if you prefer to trust your service provider, you can only define the min and max number of nodes and let the service auto-scale the job on its own.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems contact@waitingforcode.com đź“©