Apache Spark articles

on waitingforcode.com

What Kubernetes can bring to Apache Spark pipelines ?

Commercial version of Apache Spark distributed by Databricks offers a serverless and auto-scalable approach for the applications written in this framework. Among the time some other companies tried to provide similar alternatives, going even to put Apache Spark pipelines into AWS Lambda functions. But with the version 2.3.0 another alternative appears as a solution for scalability and elasticity overhead - Kubernetes. Continue Reading →

RPC in Apache Spark

The communication in distributed systems is an important element. The cluster members rarely share the hardware components and the single solution to communicate is the exchange of messages in the client-server model. Continue Reading →

Spark failure detection - heartbeats

One of problems in distributed computing is the failure detection. How a master node can know that some of its workers went down just a minute ? A popular and quite simple solution uses heartbeats sent at regular interval by the workers. Spark also implements this technique. Continue Reading →

Spark data locality

If you've ever analyzed Spark UI, you've certainly seen the part of Locality level in the table with tasks. Even if this concept is less exposed than the topics as shuffle, it remains quite important in efficient data processing. Continue Reading →

Spak UI meaning - common parts

Spark UI is a good method to track jobs execution and detect performances issues. But the multiple parts of the UI, some of them depending on used Spark library, can scare at first glance. This post tries to explain all necessary points to understand better the common parts of Spark UI. Continue Reading →

Dynamic resource allocation in Spark

Defining the universal workload and associating corresponding resources is always difficult. Even if most of time expected resources will support the load, there always will be some interval in the year when data activity will grow (e.g. Black Friday). One of Spark's mechanisms helping to prevent processing failures in such situations is dynamic resource allocation. Continue Reading →

Apache Spark blocks explained

In Spark blocks are everywhere. They represent broadcasted objects, they are used as support for intermediate steps in shuffle process, or finally they're used to store temporary files. But very often they're disregarded at the beginning because of more meaningful concepts, as transformations and actions - even if without blocks, both of them won't be possible. Continue Reading →

Failed tasks resubmit

A lot of things are automatized in Spark: metadata and data checkpointing, task distribution, to quote only some of them. Another one, not mentioned very often, is the automatic retry in the case of task failures. Continue Reading →