Vertical scaling has caught my attention a few times already while I have been reading about cloud updates. I've always considered horizontal scaling to be the one true scaling policy for elastic data processing pipelines. Have I been wrong?
Horizontal and vertical scaling have different focuses. The former optimizes the workload by adding extra nodes to the cluster. It's a good candidate if you scaled your data source (e.g. added new partitions) or got more data to read (e.g. more files for unpartitioned data sources) and need some extra compute power to handle that extra load. Vertical scaling is different. It doesn't change the number of nodes but their type. Typically, if your job uses only 50% of its allocated memory, runs close to its allocation limits, or struggles with asynchronous computations, adding extra machines helps only when the workload can be distributed. If it can't, the new nodes will remain idle and you will need a different strategy: the vertical one.
So far, the only way to modify the composition of a running cluster has been to stop it, change the underlying node types, and restart it. As you can imagine, this involves downtime. But that's the past!
Modern data processing offerings, including EMR, Databricks, and Dataflow (? I'm still unsure about its compute environment), use Kubernetes under the hood. And Kubernetes has great autoscaling support. There is a Horizontal Pod Autoscaler for horizontal scaling. There is a Cluster Autoscaler to automatically adapt the cluster capacity to the scaling needs. Finally, there is also a Vertical Pod Autoscaler (VPA) that supports the vertical approach.
How does it work? The VPA analyzes historical resource utilization, the amount of resources available in the cluster, and real-time events, such as OOMs. It uses the outcome of this analysis to adjust pod requirements in real time, with the goal of improving the overall cluster utilization.
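To build an intuition for the Recommender's job, here is a toy sketch of deriving a memory target from historical usage samples. It's only an illustration under a simple percentile-plus-margin heuristic; the real VPA relies on decaying histograms and also reacts to events like OOM kills.

```python
# Toy sketch of the idea behind the VPA Recommender: pick a memory target
# from past usage. NOT the real algorithm, which uses decaying histograms
# and reacts to OOM events; this is a simplified percentile heuristic.

def recommend_memory_mib(usage_samples_mib, percentile=0.9, safety_margin=0.15):
    """Return a recommended memory request (MiB) based on past usage."""
    if not usage_samples_mib:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples_mib)
    # Position of the chosen percentile (e.g. the 90th) in the ordered samples.
    idx = int(percentile * (len(ordered) - 1))
    # Add a safety margin so short spikes don't immediately trigger an OOM.
    return round(ordered[idx] * (1 + safety_margin))

# A pod that mostly uses ~400 MiB with a rare spike gets a request slightly
# above its typical usage instead of keeping an over-provisioned 1 GiB.
print(recommend_memory_mib([380, 390, 400, 410, 420, 900]))
```

The safety margin and the percentile are the two knobs: a higher percentile chases the spikes, a lower one favors tighter packing of the cluster.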
The VPA architecture has 3 main components:
- Recommender. It provides the recommended values for the running containers by analyzing the aforementioned metrics.
- Updater. It performs the physical replacement of pods with incorrectly set resources.
- Admission controller. It intercepts pod creation requests. If the request concerns a VPA-managed pod and its update mode is different from "Off", the controller rewrites the requested compute resources with the values returned by the Recommender.
The Updater supports different modes of resource management via the updatePolicy configuration. With the "Initial" setting, it only overwrites the resources at pod creation time. When you need to extend this behavior with runtime adjustments, set the policy to "Auto". And if you only need the resource recommendations without any update action, set the policy to "Off".
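For illustration, a minimal VerticalPodAutoscaler manifest with the updatePolicy set could look like this (the Deployment name and the resource bounds are placeholders, not values from a real setup):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app           # placeholder workload
  updatePolicy:
    updateMode: "Auto"     # or "Initial", or "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 256Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi
```

The resourcePolicy section is optional but useful in practice, since it bounds how far the Updater can grow or shrink a container's requests.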
The update consists of replacing the pods, i.e. shutting down the old ones and creating their replacements. It may then involve some downtime, but the impact should be smaller than with the "old days" method.
Vertical autoscaling and data processing
I didn't mention the VPA without reason. AWS EMR on EKS uses it in its Vertical autoscaling feature. All you need to do, besides installing the VPA of course, is to add these 2 configurations to the driver while submitting the job:
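The snippet with the two configurations isn't reproduced here, but based on the AWS documentation they are driver pod annotations passed as Spark properties, roughly as below; treat the exact keys as an assumption to verify against your EMR release:

```shell
# Hedged sketch: enabling EMR on EKS vertical autoscaling for a Spark job.
# The annotation keys come from the AWS documentation; double-check them
# for your EMR release. "my-job-v1" is a placeholder signature.
spark-submit \
  --conf spark.kubernetes.driver.annotation.emr-containers.amazonaws.com/dynamic.sizing=true \
  --conf spark.kubernetes.driver.annotation.emr-containers.amazonaws.com/dynamic.sizing.signature=my-job-v1 \
  ...
```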
The former parameter enables or disables dynamic sizing, while the latter provides a unique id for the job's resources. You can also extend the configuration with more VPA-related entries and hence limit the memory or CPU that can be allocated, or configure the VPA updatePolicy.
Besides EMR on EKS, there is another service supporting vertical autoscaling. GCP Dataflow Prime uses the feature to fine-tune memory usage and avoid OOM errors. The feature replaces existing workers with new ones better adapted to the real needs. Each vertical autoscaling event deactivates horizontal autoscaling for 10 minutes. Unlike the VPA, the feature on Dataflow impacts memory only.
Relocating your workers to better-fitting nodes is a great cost-saving and workload-optimization feature. However, it seems neither feature is task-aware, so a worker may be replaced while it's running a task.