Recent years have seen Kubernetes grow steadily in popularity. Thanks to its replication and scalability properties, it's used more and more often in distributed architectures. Apache Spark, through a dedicated working group, is integrating Kubernetes step by step. In the current version (2.3.1), this new way of scheduling jobs ships as an experimental feature.
To scale Spark applications automatically, we need to enable dynamic resource allocation. But to make it work, we need another feature, the external shuffle service, which is covered here.
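As a rough sketch of how the two pieces fit together (the application name and the use of the SparkSession builder are illustrative, not taken from the post), the two properties are typically enabled side by side, since dynamic allocation needs the external shuffle service to keep shuffle files available when executors are removed:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative configuration: dynamic allocation together with the
// external shuffle service that preserves shuffle files when Spark
// removes idle executors.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo") // hypothetical application name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
```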
The commercial version of Apache Spark distributed by Databricks offers a serverless, auto-scalable approach for applications written in this framework. Over time, other companies have tried to provide similar alternatives, going as far as running Apache Spark pipelines inside AWS Lambda functions. With version 2.3.0, another answer to the scalability and elasticity overhead appeared: Kubernetes.
Some months ago I wrote notes about my experience building a Docker image for a Spark on YARN cluster. Recently I decided to improve the project and convert it to the Docker Compose format.
Communication is an important element of distributed systems. Cluster members rarely share hardware components, so the only way for them to communicate is to exchange messages, following the client-server model.
One of the problems in distributed computing is failure detection. How can a master node know that one of its workers went down just a minute ago? A popular and quite simple solution uses heartbeats sent at regular intervals by the workers. Spark also implements this technique.
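A minimal sketch of the configuration side of this mechanism, with the documented default values used purely for illustration: executors report to the driver at the heartbeat interval, and an executor that stays silent longer than the network timeout is considered lost.

```scala
import org.apache.spark.SparkConf

// Illustrative values only: the heartbeat interval must stay well below
// the network timeout, otherwise healthy executors get declared dead.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s")
  .set("spark.network.timeout", "120s")
```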
If you've ever analyzed the Spark UI, you've certainly seen the Locality Level column in the tasks table. Even though this concept gets less exposure than topics such as shuffle, it remains quite important for efficient data processing.
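As a hedged illustration (the values shown are simply the documented defaults, not a tuning recommendation), the scheduler's willingness to wait for a better locality level is driven by the spark.locality.wait family of properties:

```scala
import org.apache.spark.SparkConf

// How long the scheduler waits for a slot at the preferred locality level
// (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL) before falling back to a less
// local one. Values are the defaults, shown here only for illustration.
val conf = new SparkConf()
  .set("spark.locality.wait", "3s")      // global fallback delay
  .set("spark.locality.wait.node", "3s") // can also be tuned per level
```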
In October I published a post about partitioning in Spark. It was an introduction to the topic, mainly focused on the basics, such as partitioners and the partitioning transformations (coalesce and repartition). This time it's a good moment to take up other partitioning points.
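As a quick reminder of the two transformations mentioned above, here is a minimal sketch (the partition counts and the local master are arbitrary choices for the example):

```scala
import org.apache.spark.sql.SparkSession

// repartition triggers a full shuffle and can increase the partition count;
// coalesce only merges existing partitions, so it can only decrease it.
val spark = SparkSession.builder()
  .appName("partitioning-demo") // hypothetical application name
  .master("local[*]")
  .getOrCreate()

val numbers = spark.sparkContext.parallelize(1 to 100, 10)

val repartitioned = numbers.repartition(20) // shuffle
val coalesced = numbers.coalesce(2)         // no shuffle

println(s"${repartitioned.getNumPartitions}, ${coalesced.getNumPartitions}")
```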
The Spark UI is a good way to track job execution and detect performance issues. But its many parts, some of which depend on the Spark library in use, can be intimidating at first glance. This post tries to explain the points needed to better understand the common parts of the Spark UI.
Spark has a lot of interesting features, and one of them is the speculative execution of tasks.
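A small configuration sketch, with illustrative values, of the properties that drive this feature: once a given fraction of the tasks in a stage has finished, any task running much slower than the median is relaunched speculatively on another executor.

```scala
import org.apache.spark.SparkConf

// Values chosen for illustration only.
val conf = new SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.quantile", "0.75")   // fraction of finished tasks before checking
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
```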
Defining a universal workload and assigning the corresponding resources is always difficult. Even if, most of the time, the provisioned resources can handle the load, there will always be periods in the year when data activity grows (e.g. Black Friday). One of Spark's mechanisms for preventing processing failures in such situations is dynamic resource allocation.
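A hedged sketch of how the allocation bounds might be set for such a scenario (the executor counts and the idle timeout are invented for the example): Spark scales the executor count between the two bounds, requesting more executors when tasks queue up and releasing those that stay idle.

```scala
import org.apache.spark.SparkConf

// Illustrative bounds: a small baseline for quiet periods and a higher
// ceiling for peaks such as Black Friday.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
```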
The golden rule when you deal with a lot of data is to avoid bringing it all onto a single node. Doing so can easily and pretty quickly lead to OOM errors. Spark is no exception to this rule. But when this move is unavoidable, Spark provides one solution that can reduce the number of objects held on the driver at once: the toLocalIterator method.
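A minimal sketch of the method in action (the dataset, partition count and application name are made up for the example): toLocalIterator pulls the data to the driver one partition at a time, so the driver only needs to hold a single partition in memory instead of the whole dataset, as collect would require.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("to-local-iterator-demo") // hypothetical application name
  .master("local[*]")
  .getOrCreate()

// 100 partitions, fetched to the driver one at a time.
val rdd = spark.sparkContext.parallelize(1 to 1000000, 100)
rdd.toLocalIterator.take(5).foreach(println)
```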
Using Spark in an AWS environment can sometimes be problematic, especially when the dependency hell problem appears. Fortunately, it can be resolved pretty easily with shading.
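As an illustration only, here is a build.sbt fragment using the sbt-assembly plugin's shade rules; the relocated package and the target prefix are assumptions for the example, and the same idea applies with maven-shade-plugin if the project is built with Maven.

```scala
// build.sbt fragment (requires the sbt-assembly plugin in project/plugins.sbt).
// Relocates a conflicting dependency under a new package so it no longer
// clashes with the version provided by the runtime environment.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shaded.com.fasterxml.jackson.@1").inAll
)
```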
In Spark, blocks are everywhere. They represent broadcast objects, they back the intermediate steps of the shuffle process, and they store temporary files. Yet they're very often overlooked at first in favor of more prominent concepts, such as transformations and actions, even though neither would be possible without blocks.
A lot of things are automated in Spark: metadata and data checkpointing and task distribution, to name only a few. Another one, not mentioned very often, is the automatic retry in case of task failures.
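For reference, a small sketch of the property behind this behavior (the value shown is simply the default): a failed task is resubmitted until it has failed this many times, after which the whole job is aborted.

```scala
import org.apache.spark.SparkConf

// 4 is the default; shown here only for illustration.
val conf = new SparkConf()
  .set("spark.task.maxFailures", "4")
```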
Making mistakes often helps you progress. That was my case with spark-submit and the local/remote JAR pair. Those mistakes helped me understand the role of the driver, closures, serialization and some configuration properties.
Even though a lot of Docker images already exist for Apache Spark, it's always a good exercise to build one on your own. It can help you understand some new concepts as well as improve your skills in building Docker images.
Broadcast variables send an object to the executors only once and can easily be used to reduce network transfer, which makes them precious in distributed computing.
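A minimal sketch of the pattern (the lookup map and its contents are invented for the example): the map is shipped to each executor once as a broadcast variable instead of being serialized with every task closure.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("broadcast-demo") // hypothetical application name
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Shipped to each executor only once.
val countryNames = sc.broadcast(Map("fr" -> "France", "pl" -> "Poland"))

val resolved = sc.parallelize(Seq("fr", "pl", "fr"))
  .map(code => countryNames.value.getOrElse(code, "unknown"))

resolved.collect().foreach(println)
```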
Some time ago I was wondering why an object created once on the driver is recreated on the executors with every new stage, even when that object is sent through a broadcast variable. After some code digging, the answer turned out to be related to Java serialization.
A previous post (Serialization issues - part 1) presented some solutions to serialization problems. This post is its continuation.