Apache Spark articles

on waitingforcode.com

Validating JSON with Apache Spark and Cerberus

At a recent Meetup I heard that one of the most difficult data engineering tasks is ensuring good data quality. I fully agree with that statement, which is why in this post I will share one solution for detecting data issues with PySpark (my first PySpark code!) and a Python library called Cerberus. Continue Reading →

FAIR jobs scheduling in Apache Spark

During my exploration of Apache Spark configuration options, I found an entry called spark.scheduler.mode. After looking for its possible values, I ended up with a pretty intriguing concept called FAIR scheduling that I will detail in this post. Continue Reading →
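
To make the idea concrete, here is a toy sketch of the difference between the two values of spark.scheduler.mode, FIFO (the default) and FAIR. It is not Spark code: it assumes a single execution slot and ignores pools, weights and minimum shares. The function names and the two jobs are hypothetical and exist only for illustration. FIFO runs all tasks of the first job before the next one, while fair sharing hands out tasks between jobs in a round-robin fashion.

```python
from collections import deque


def fifo_order(jobs):
    """Run every task of the first job before touching the next one."""
    return [(name, task) for name, n_tasks in jobs for task in range(n_tasks)]


def fair_order(jobs):
    """Hand out one task per job in turn (round robin) until all are done."""
    queue = deque((name, 0, n_tasks) for name, n_tasks in jobs if n_tasks > 0)
    order = []
    while queue:
        name, next_task, n_tasks = queue.popleft()
        order.append((name, next_task))
        if next_task + 1 < n_tasks:
            # The job still has tasks left, so it goes to the back of the line.
            queue.append((name, next_task + 1, n_tasks))
    return order


def finish_position(order, job_name):
    """1-based position of the last task of `job_name` in the execution order."""
    return max(i for i, (name, _) in enumerate(order) if name == job_name) + 1


# Hypothetical workload: (job name, number of tasks), in submission order.
jobs = [("long_job", 4), ("short_job", 2)]
fifo = fifo_order(jobs)
fair = fair_order(jobs)

print(finish_position(fifo, "short_job"))  # 6: waits behind the whole long job
print(finish_position(fair, "short_job"))  # 4: interleaved with the long job
```

In this sketch the short job submitted behind a long one finishes earlier under fair sharing, which is the main reason to consider the FAIR mode for multi-user applications.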

Bzip2 compression in Apache Spark

Compression has a lot of benefits in the data context. It reduces the size of stored data, so you will save some space and also have less data to transfer across the network in the case of a data processing pipeline. And if you use Bzip2, you can process the compressed data in parallel. In this post, I will try to explain how that happens. Continue Reading →
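
As a small illustration of why parallel processing is possible, the sketch below uses only Python's standard bz2 module, not Spark. A bzip2 stream is a sequence of independently compressed blocks, each opened by the same 48-bit marker, and that is the property a splittable reader can exploit. The input size and the random seed are arbitrary assumptions chosen so that the stream spans several blocks. How Spark actually assigns splits is not shown here.

```python
import bz2
import random

BLOCK_MAGIC = 0x314159265359  # 48-bit marker that opens every bzip2 block

# Hard-to-compress input, so the compressed stream spans several blocks.
rng = random.Random(42)
raw = bytes(rng.getrandbits(8) for _ in range(350_000))

# compresslevel=1 selects the smallest block size (100 kB of input per block).
compressed = bz2.compress(raw, compresslevel=1)

# Blocks are bit-aligned rather than byte-aligned, so search the bit string.
bits = bin(int.from_bytes(compressed, "big"))[2:].zfill(len(compressed) * 8)
marker = format(BLOCK_MAGIC, "048b")
block_count = bits.count(marker)

print(compressed[:4])  # stream header: b'BZh1'
print(block_count)     # number of block markers found in the stream
```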

Memory and Apache Spark classes

In previous posts about memory in Apache Spark, I explored the memory behavior of Apache Spark when the input files are much bigger than the allocated memory. Now is a good moment to sum that up in a post dedicated to the classes involved in memory-using tasks. Continue Reading →

Apache Spark 2.4.0 features - barrier execution mode

Data-driven systems continuously change. We moved from static, batch-oriented daily processing jobs to real-time streaming-based pipelines running all the time. Nowadays, the workflows have more and more AI components. Apache Spark tries to keep up with this trend and, in the new release, proposes an implementation of the barrier execution mode as a new way to schedule tasks. Continue Reading →

Apache Spark and off-heap memory

In data-intensive applications such as streaming ones, bad memory management can cause long GC pauses. Luckily, we can reduce this impact by writing memory-optimized code and by using storage outside the heap, called off-heap. Continue Reading →

Neo4j scalability and Apache Spark

Even though Apache Spark provides the GraphX module, it's still possible to use the framework with other graph-based engines. One of them is Neo4j. But before using its Spark connector, it's good to know some internal execution details - especially the ones related to scalability. Continue Reading →

Apache Spark and data compression

Compressed data takes less space and thus may be sent faster across the network. However, these advantages turn into drawbacks in parallel distributed data processing when the engine doesn't know how to split the data for better parallelization. Fortunately, some compression formats can be split. Continue Reading →

Apache Spark on Kubernetes - global overview

Recent years have seen the popularization of Kubernetes. Thanks to its replication and scalability properties, it's used more and more often in distributed architectures. Apache Spark, through a dedicated working group, has been steadily integrating Kubernetes. In the current (2.3.1) version, this new way of scheduling jobs is included in the project as an experimental feature. Continue Reading →

What Kubernetes can bring to Apache Spark pipelines?

The commercial version of Apache Spark distributed by Databricks offers a serverless and auto-scalable approach for applications written in this framework. Over time, some other companies tried to provide similar alternatives, even going as far as putting Apache Spark pipelines into AWS Lambda functions. But with version 2.3.0, another alternative appears as a solution to the scalability and elasticity overhead: Kubernetes. Continue Reading →

RPC in Apache Spark

Communication is an important element of distributed systems. Cluster members rarely share hardware components, so the only way for them to communicate is to exchange messages in a client-server model. Continue Reading →

Spark failure detection - heartbeats

One of the problems in distributed computing is failure detection. How can a master node know that some of its workers went down just a minute ago? A popular and quite simple solution uses heartbeats sent at regular intervals by the workers. Spark also implements this technique. Continue Reading →
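
The heartbeat idea can be sketched in a few lines of plain Python. This is a generic, minimal illustration and not Spark's implementation: the HeartbeatMonitor class, the worker names and the 10-second timeout are all hypothetical. A worker is considered lost when its last heartbeat is older than the timeout, and the clock is injected so the example runs without real waiting.

```python
import time


class HeartbeatMonitor:
    """Toy failure detector: a worker is considered lost when its last
    heartbeat is older than `timeout_s` seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock   # injectable so the example is deterministic
        self.last_seen = {}  # worker id -> time of the last heartbeat

    def heartbeat(self, worker_id):
        # Called each time a worker reports in.
        self.last_seen[worker_id] = self.clock()

    def lost_workers(self):
        # Workers whose last heartbeat is older than the timeout.
        now = self.clock()
        return sorted(worker for worker, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


# Simulated clock instead of real sleeping.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=10, clock=lambda: fake_now[0])

monitor.heartbeat("worker-1")
monitor.heartbeat("worker-2")
fake_now[0] = 8.0
monitor.heartbeat("worker-1")  # worker-2 stays silent from now on
fake_now[0] = 15.0

lost = monitor.lost_workers()
print(lost)  # ['worker-2']
```

The trade-off sits in the timeout: a short one detects failures quickly but may flag workers that are only slow, while a long one delays the reaction to real failures.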

Spark data locality

If you've ever analyzed the Spark UI, you've certainly seen the Locality level entry in the tasks table. Even if this concept gets less attention than topics such as shuffle, it remains quite important for efficient data processing. Continue Reading →

Spark UI meaning - common parts

Spark UI is a good way to track job execution and detect performance issues. But its many parts, some of which depend on the Spark library in use, can be intimidating at first glance. This post tries to explain everything needed to better understand the common parts of Spark UI. Continue Reading →