Compressed data takes less place and thus may be sent faster across the network. However these advantages transform in drawbacks in the case of parallel distributed data processing where the engine doesn't know how to split it for better parallelization. Fortunately, some of compression formats can be splitted.
Initialization is a very first step of almost all applications. Unsurprisingly it's also the case of Kubernetes that uses Init Containers to execute some setup operations before launching the pods.
Beginning with new tool and its CLI is never easy. Having a list of useful debugging commands is always helpful. And the rule is not different for Spark on Kubernetes project.
Last years are the symbol of popularization of Kubernetes. Thanks to its replication and scalability properties it's more and more often used in distributed architectures. Apache Spark, through a special group of work, integrates Kubernetes steadily. In current (2.3.1) version this new method to schedule jobs is integrated in the project as experimental feature.
To scale Spark applications automatically we need to enable dynamic resource allocation. But to make it work we need another feature called external shuffle service that will be covered here.
Commercial version of Apache Spark distributed by Databricks offers a serverless and auto-scalable approach for the applications written in this framework. Among the time some other companies tried to provide similar alternatives, going even to put Apache Spark pipelines into AWS Lambda functions. But with the version 2.3.0 another alternative appears as a solution for scalability and elasticity overhead - Kubernetes.
Some months ago I written the notes about my experience from building Docker image for Spark on YARN cluster. Recently I decided to improve the project and transform it to Docker-compose format.
The communication in distributed systems is an important element. The cluster members rarely share the hardware components and the single solution to communicate is the exchange of messages in the client-server model.
One of problems in distributed computing is the failure detection. How a master node can know that some of its workers went down just a minute ? A popular and quite simple solution uses heartbeats sent at regular interval by the workers. Spark also implements this technique.
If you've ever analyzed Spark UI, you've certainly seen the part of Locality level in the table with tasks. Even if this concept is less exposed than the topics as shuffle, it remains quite important in efficient data processing.
In October I published the post about Partitioning in Spark. It was an introduction to the partitioning part, mainly focused on basic information, as partitioners and partitioning transformations (coalesce and repartition). This time it's a good moment to take other partition points up.
Spark UI is a good method to track jobs execution and detect performances issues. But the multiple parts of the UI, some of them depending on used Spark library, can scare at first glance. This post tries to explain all necessary points to understand better the common parts of Spark UI.
Spark has a lot of interesting features and one of them is the speculative execution of tasks.
Defining the universal workload and associating corresponding resources is always difficult. Even if most of time expected resources will support the load, there always will be some interval in the year when data activity will grow (e.g. Black Friday). One of Spark's mechanisms helping to prevent processing failures in such situations is dynamic resource allocation.
The golden rule, when you deal with a lot of data, is to avoid bringing all these data on a single node. It can easily and pretty quickly lead to OOM errors. Spark isn't an exception for this rule. But Spark provides one solution that can reduce the amount of objects brought the driver, when this move is mandatory - toLocalIterator method.
Using Spark in AWS environment can sometimes be problematic. It especially is when the dependency hell problem appears. But fortunately, it can be resolved pretty easily with shading.
In Spark blocks are everywhere. They represent broadcasted objects, they are used as support for intermediate steps in shuffle process, or finally they're used to store temporary files. But very often they're disregarded at the beginning because of more meaningful concepts, as transformations and actions - even if without blocks, both of them won't be possible.
A lot of things are automatized in Spark: metadata and data checkpointing, task distribution, to quote only some of them. Another one, not mentioned very often, is the automatic retry in the case of task failures.
Often making errors helps to progress. It was my case with spark-submit and local/remote JAR pair. They helped me to understand the role of driver, closures, serialization and some configuration properties.
Even if a lot of Docker containers exist for Apache Spark, it's always a good exercise to make one in your own. It can help to understand some new concepts as well as improve skills of building Docker images.