Apache Spark articles

Apache Spark on Kubernetes - init containers

Initialization is a very first step of almost all applications. Unsurprisingly it's also the case of Kubernetes that uses Init Containers to execute some setup operations before launching the pods.

Continue Reading β†’

Apache Spark on Kubernetes - useful commands

Beginning with new tool and its CLI is never easy. Having a list of useful debugging commands is always helpful. And the rule is not different for Spark on Kubernetes project.

Continue Reading β†’

Apache Spark on Kubernetes - global overview

Last years are the symbol of popularization of Kubernetes. Thanks to its replication and scalability properties it's more and more often used in distributed architectures. Apache Spark, through a special group of work, integrates Kubernetes steadily. In current (2.3.1) version this new method to schedule jobs is integrated in the project as experimental feature.

Continue Reading β†’

External shuffle service in Apache Spark

To scale Spark applications automatically we need to enable dynamic resource allocation. But to make it work we need another feature called external shuffle service that will be covered here.

Continue Reading β†’

What Kubernetes can bring to Apache Spark pipelines ?

Commercial version of Apache Spark distributed by Databricks offers a serverless and auto-scalable approach for the applications written in this framework. Among the time some other companies tried to provide similar alternatives, going even to put Apache Spark pipelines into AWS Lambda functions. But with the version 2.3.0 another alternative appears as a solution for scalability and elasticity overhead - Kubernetes.

Continue Reading β†’

Docker-composing Apache Spark on YARN image

Some months ago I written the notes about my experience from building Docker image for Spark on YARN cluster. Recently I decided to improve the project and transform it to Docker-compose format.

Continue Reading β†’

RPC in Apache Spark

The communication in distributed systems is an important element. The cluster members rarely share the hardware components and the single solution to communicate is the exchange of messages in the client-server model.

Continue Reading β†’

Spark failure detection - heartbeats

One of problems in distributed computing is the failure detection. How a master node can know that some of its workers went down just a minute ? A popular and quite simple solution uses heartbeats sent at regular interval by the workers. Spark also implements this technique.

Continue Reading β†’

Spark data locality

If you've ever analyzed Spark UI, you've certainly seen the part of Locality level in the table with tasks. Even if this concept is less exposed than the topics as shuffle, it remains quite important in efficient data processing.

Continue Reading β†’

Partitioning internals in Spark

In October I published the post about Partitioning in Spark. It was an introduction to the partitioning part, mainly focused on basic information, as partitioners and partitioning transformations (coalesce and repartition). This time it's a good moment to take other partition points up.

Continue Reading β†’

Spak UI meaning - common parts

Spark UI is a good method to track jobs execution and detect performances issues. But the multiple parts of the UI, some of them depending on used Spark library, can scare at first glance. This post tries to explain all necessary points to understand better the common parts of Spark UI.

Continue Reading β†’

Speculative execution in Spark

Spark has a lot of interesting features and one of them is the speculative execution of tasks.

Continue Reading β†’

Dynamic resource allocation in Spark

Defining the universal workload and associating corresponding resources is always difficult. Even if most of time expected resources will support the load, there always will be some interval in the year when data activity will grow (e.g. Black Friday). One of Spark's mechanisms helping to prevent processing failures in such situations is dynamic resource allocation.

Continue Reading β†’

Collecting a part of data to the driver with RDD toLocalIterator

The golden rule, when you deal with a lot of data, is to avoid bringing all these data on a single node. It can easily and pretty quickly lead to OOM errors. Spark isn't an exception for this rule. But Spark provides one solution that can reduce the amount of objects brought the driver, when this move is mandatory - toLocalIterator method.

Continue Reading β†’

Shading as solution for dependency hell in Spark

Using Spark in AWS environment can sometimes be problematic. It especially is when the dependency hell problem appears. But fortunately, it can be resolved pretty easily with shading.

Continue Reading β†’

Apache Spark blocks explained

In Spark blocks are everywhere. They represent broadcasted objects, they are used as support for intermediate steps in shuffle process, or finally they're used to store temporary files. But very often they're disregarded at the beginning because of more meaningful concepts, as transformations and actions - even if without blocks, both of them won't be possible.

Continue Reading β†’

Failed tasks resubmit

A lot of things are automatized in Spark: metadata and data checkpointing, task distribution, to quote only some of them. Another one, not mentioned very often, is the automatic retry in the case of task failures.

Continue Reading β†’

JARs split personality problem

Often making errors helps to progress. It was my case with spark-submit and local/remote JAR pair. They helped me to understand the role of driver, closures, serialization and some configuration properties.

Continue Reading β†’

Dockerize Spark on YARN - lessons learned

Even if a lot of Docker containers exist for Apache Spark, it's always a good exercise to make one in your own. It can help to understand some new concepts as well as improve skills of building Docker images.

Continue Reading β†’

Zoom at broadcast variables

Broadcast variables send object to executors only once and can be easily used to reduce network transfer and thus are precious in terms of distributed computing.

Continue Reading β†’