Spark on Kubernetes articles

Docker images and Apache Spark applications

Containers have been with us, data engineers, for several years. The concept was already present in YARN, but the technology that really made containers popular was Docker. In this post I will focus on Docker's recommended practices that can make our Apache Spark images better.

Setting up Apache Spark on Kubernetes with microk8s

When I discovered microk8s I was delighted! After an easy installation in very few steps, you can start to play with Kubernetes locally (tried on Ubuntu 16). However, running Apache Spark 2.4.4 on top of microk8s is not a piece of cake. In this post I will show you 4 different problems you may encounter and propose possible solutions.

Apache Spark on Kubernetes - init containers

Initialization is the very first step of almost every application. Unsurprisingly, it's also the case for Kubernetes, which uses Init Containers to execute setup operations before a pod's application containers start.

Apache Spark on Kubernetes - useful commands

Getting started with a new tool and its CLI is never easy, and having a list of useful debugging commands always helps. The Spark on Kubernetes project is no exception.

Apache Spark on Kubernetes - global overview

Recent years have seen Kubernetes grow steadily in popularity. Thanks to its replication and scalability properties, it's used more and more often in distributed architectures. Apache Spark, through a dedicated working group, is steadily integrating Kubernetes. In the current (2.3.1) version, this new method of scheduling jobs is included in the project as an experimental feature.

What can Kubernetes bring to Apache Spark pipelines?

The commercial version of Apache Spark distributed by Databricks offers a serverless and auto-scalable approach for applications written in this framework. Over time, other companies tried to provide similar alternatives, even going as far as putting Apache Spark pipelines into AWS Lambda functions. With version 2.3.0, another alternative appears as a solution to the scalability and elasticity overhead: Kubernetes.