Processing static datasets is easier than dynamic ones that may change in time. Hopefully, cloud services offer various more and less manual features to scale the data processing logic. We'll see some of them in this blog post.
New ebook 🔥
Learn 84 ways to solve common data engineering problems with cloud services.
The most primitive one from the list requires the biggest implementation effort. It relies on the flexibility of the cloud, where you can use various resources. Put another way, a job running on a 10-nodes cluster can freely use twice as many resources one day later.
The following steps summarize the solution:
- Analyze the dataset to process - it can be a volume analysis and more detailed analysis if you have some metadata available.
- Generate the cluster specification - use the analysis results from the previous step as the requirement for the cluster configuration. For example, if you assume that processing 1GB will "cost" you 2 cores and you have 20GB to process, then you will need 40 cores in the cluster.
- Create the cluster with the specification - thanks to cloud elasticity, you should be able to request as many nodes as possible regarding the service quotas.
- Submit the job
- Terminate the cluster once the job completes.
If you like to have control and avoid runtime scaling activity, it can be the first option to consider. But if you prefer some more automated solution, you can go to the next step!
This second, more automated, strategy relies on the metrics emitted to the cloud monitoring system. If you look at it, you will certainly have a feeling of defining an alerting strategy because you'll need to specify a rule composed of:
- metric to monitor
- threshold to observe
- observation period
- cooldown period
For example the following specification...:
- metric to monitor - average CPU usage of the cluster
- threshold to observe - > 80%
- observation period - 5 minutes
- cooldown period - 10 minutes
- action - add 3 nodes
...would translate to: Add 3 nodes if the average CPU usage is higher than 80% for 5 minutes and the previous scaling action terminated at least 10 minutes ago.
You will find this mode in EMR (automatic scaling with a custom policy) and Dataproc (autoscaling policy).
The last option is fully managed by the cloud provider. All you have to do is define the cluster boundaries so the minimum and maximum number of nodes the service can use. The scaling logic is often hidden in the service, and you can only have a vague knowledge of it.
For example, AWS EMR comes with a managed autoscaling based on cluster metrics. Databricks is also based on the cluster being over- or under-utilized. When it comes to GCP Dataflow, it's based on the CPU and workload, but also uses a predictive scaling and has a possibility to "communicate" the scaling information to the service from BoundedSource#getEstimatedSizeBytes and BoundedReader#getFractionConsumed methods.
To sum up, you have a choice in defining your auto-scaling strategy. If you like to control this, you can choose the first or the second one. But if you prefer to trust your service provider, you can only define the min and max number of nodes and let the service auto-scale the job on its own.