Processing static datasets is easier than handling dynamic ones that change over time. Fortunately, cloud services offer various features, more or less manual, to scale the data processing logic. We'll look at some of them in this blog post.
Orchestrator-based
The most primitive option on the list requires the biggest implementation effort. It relies on the elasticity of the cloud, where you can provision varying amounts of resources on demand. Put another way, a job running on a 10-node cluster one day can freely use twice as many resources the next.
The following steps summarize the solution:
- Analyze the dataset to process - this can be a simple volume analysis, or a more detailed analysis if you have some metadata available.
- Generate the cluster specification - use the analysis results from the previous step as the requirement for the cluster configuration. For example, if you assume that processing 1GB will "cost" you 2 cores and you have 20GB to process, then you will need 40 cores in the cluster.
- Create the cluster with the specification - thanks to cloud elasticity, you should be able to request as many nodes as you need, within the service quotas.
- Submit the job.
- Terminate the cluster once the job completes.
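The steps above can be sketched in Python. The sizing rule comes from the 1GB-costs-2-cores example; everything else (the orchestrator task, the commented-out `cluster_client` calls) is a hypothetical placeholder for your cloud SDK, not a specific vendor API:

```python
import math

# Sizing assumption from the example above: 2 cores per GB of input data.
CORES_PER_GB = 2

def required_cores(dataset_size_gb: float, cores_per_gb: int = CORES_PER_GB) -> int:
    """Translate the analyzed data volume into a core requirement."""
    return math.ceil(dataset_size_gb * cores_per_gb)

def required_nodes(total_cores: int, cores_per_node: int) -> int:
    """Turn the core requirement into a node count for the cluster spec."""
    return math.ceil(total_cores / cores_per_node)

def run_scaled_job(dataset_size_gb: float, cores_per_node: int = 8) -> int:
    """Hypothetical orchestrator task, e.g. an Airflow PythonOperator callable."""
    nodes = required_nodes(required_cores(dataset_size_gb), cores_per_node)
    # cluster_client is a placeholder for your cloud SDK (EMR, Dataproc, ...):
    # cluster_id = cluster_client.create_cluster(nodes=nodes)
    # cluster_client.submit_job(cluster_id, job_definition)
    # cluster_client.terminate_cluster(cluster_id)  # once the job completes
    return nodes

# 20GB at 2 cores/GB -> 40 cores -> 5 nodes of 8 cores each
```

The key point is that the cluster size is computed before the job starts, from the dataset analysis, rather than adjusted at runtime.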
If you like to keep control and avoid scaling activity at runtime, this can be the first option to consider. But if you prefer a more automated solution, move on to the next one!
Metrics-based
This second, more automated, strategy relies on the metrics emitted to the cloud monitoring system. It will certainly feel like defining an alerting strategy, because you'll need to specify a rule composed of:
- metric to monitor
- threshold to observe
- observation period
- cooldown period
- action
For example, the following specification...:
- metric to monitor - average CPU usage of the cluster
- threshold to observe - > 80%
- observation period - 5 minutes
- cooldown period - 10 minutes
- action - add 3 nodes
...would translate to: Add 3 nodes if the average CPU usage is higher than 80% for 5 minutes and the previous scaling action terminated at least 10 minutes ago.
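To make the rule concrete, here is a minimal sketch of the evaluation logic such a policy implements. The function and class names are my own assumptions for illustration; real services (CloudWatch alarms behind EMR custom scaling policies, Dataproc autoscaling policies) run this logic inside the platform:

```python
from dataclasses import dataclass

@dataclass
class ScalingRule:
    threshold: float            # e.g. 80.0 (% average CPU usage)
    observation_period_s: int   # e.g. 300 (5 minutes)
    cooldown_s: int             # e.g. 600 (10 minutes)
    nodes_to_add: int           # e.g. 3

def should_scale_out(rule: ScalingRule, samples: list[tuple[int, float]],
                     now_s: int, last_scaling_end_s: int) -> bool:
    """samples: (timestamp_s, metric_value) pairs, e.g. average cluster CPU."""
    # Respect the cooldown: the previous scaling action must have ended
    # at least cooldown_s seconds ago.
    if now_s - last_scaling_end_s < rule.cooldown_s:
        return False
    # Keep only the samples inside the observation window.
    window = [v for (t, v) in samples if now_s - t <= rule.observation_period_s]
    # Every sample in the window must breach the threshold.
    return bool(window) and all(v > rule.threshold for v in window)

rule = ScalingRule(threshold=80.0, observation_period_s=300,
                   cooldown_s=600, nodes_to_add=3)
# CPU above 80% for the whole 5-minute window, sampled every minute
samples = [(t, 85.0) for t in range(700, 1001, 60)]
```

With `now_s=1000` and a last scaling action that ended at `0`, the rule fires and the action adds `rule.nodes_to_add` nodes; if the last action ended at `500`, the cooldown blocks it.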
You will find this mode in EMR (automatic scaling with a custom policy) and Dataproc (autoscaling policy).
Fully managed
The last option is fully managed by the cloud provider. All you have to do is define the cluster boundaries, i.e. the minimum and maximum number of nodes the service can use. The scaling logic is often hidden inside the service, so you can only have a vague knowledge of it.
For example, AWS EMR comes with managed autoscaling based on cluster metrics. Databricks autoscaling is also based on the cluster being over- or under-utilized. As for GCP Dataflow, it scales based on the CPU and workload, but also uses predictive scaling and can "communicate" scaling information to the service from the BoundedSource#getEstimatedSizeBytes and BoundedReader#getFractionConsumed methods.
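As an illustration of "defining only the boundaries", a Databricks cluster can be created with an `autoscale` block in its specification; the service then decides when to add or remove workers between the two bounds. The runtime version and node type below are placeholders, not recommendations:

```python
import json

# Payload shape for the Databricks Clusters API (clusters/create).
# Only the autoscale boundaries matter here; the other fields are placeholders.
cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "13.3.x-scala2.12",   # placeholder runtime version
    "node_type_id": "i3.xlarge",           # placeholder node type
    "autoscale": {
        "min_workers": 2,   # lower boundary: the service never goes below this
        "max_workers": 8,   # upper boundary: the service never goes above this
    },
}

print(json.dumps(cluster_spec, indent=2))
```

Everything between 2 and 8 workers is left to the provider's hidden scaling logic.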
To sum up, you have a choice when defining your auto-scaling strategy. If you want to control it, choose the first or the second option. But if you prefer to trust your service provider, you only define the min and max number of nodes and let the service auto-scale the job on its own.
How to define an auto-scaling strategy for cloud pipelines? I prepared a short summary in the new blog post: https://t.co/rwGdVcMkLX
— Bartosz Konieczny (@waitingforcode) December 12, 2021
