Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam, Apache Flink, Apache Spark, Apache Spark GraphFrames, Apache Spark GraphX, Apache Spark SQL, Apache Spark Streaming, Apache Spark Structured Streaming, PySpark

If not, below you can find all articles belonging to Data processing.

Spark SQL pivot table

If you came to data engineering after a BI career, you certainly know what a pivot is. That was not my case, and I was quite amazed by this operation that transforms values from rows into columns. If you want to understand how it's possible, this article presents some internals of pivoting data in Apache Spark.
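To make the rows-to-columns idea concrete, here is a minimal plain-Python sketch of what a pivot computes. It is not Spark's implementation and does not use the Spark API; the function name `pivot_sum` and the sample sales rows are made up for illustration.

```python
from collections import defaultdict


def pivot_sum(rows, group_key, pivot_key, value_key):
    """Pivot a list of dict rows: one output row per group_key value,
    one output column per distinct pivot_key value, cells summed."""
    # Distinct pivot values become the new column names (sorted for stable output).
    pivot_values = sorted({row[pivot_key] for row in rows})
    # group value -> pivot value -> running sum
    sums = defaultdict(lambda: defaultdict(int))
    for row in rows:
        sums[row[group_key]][row[pivot_key]] += row[value_key]
    result = []
    for group in sorted(sums):
        out = {group_key: group}
        for pivot_value in pivot_values:
            # A missing combination stays None, like a null cell in a pivoted table.
            out[pivot_value] = sums[group].get(pivot_value)
        result.append(out)
    return result


# Hypothetical sample data: one row per (team, quarter) sale.
sales = [
    {"team": "A", "quarter": "Q1", "amount": 10},
    {"team": "A", "quarter": "Q2", "amount": 5},
    {"team": "A", "quarter": "Q1", "amount": 3},
    {"team": "B", "quarter": "Q2", "amount": 7},
]

pivoted = pivot_sum(sales, "team", "quarter", "amount")
```

The distinct `quarter` values from the rows end up as columns, which is the transformation the article explores inside Apache Spark.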

Continue Reading →

Apache Spark performance tips - look at your code

Very often you will find Apache Spark performance tips related to the hardware (memory, GC) or the configuration parameters (number of shuffle partitions, broadcast join threshold). But they're not the only ones you can implement. Moreover, IMO, you should start with the ones presented in this article and optimize your pipeline code before going into more complicated hardware and configuration tuning.

Continue Reading →

Data+AI Summit: custom state store integration feedback

After the introductory part, it's time to share what I learned from the custom state store implementation.

Continue Reading →

Data+AI Summit: Custom state store - API

After the previous introductory posts, it's time to delve deep into the state store API and implement our own custom state store.

Continue Reading →

PySpark schema inference and 'Can not infer schema for type str' error

The error in the title of this blog post may be one of the first problems you encounter with PySpark (it was for me). Even though it looks quite mysterious, it makes sense once you take a look at the root cause.

Continue Reading →

Structured Streaming and temporary views

I don't know about you, but when I first saw code using the createTempView method, I thought it created a temporary table in the metastore. But that's not true, and in this blog post you will see why.

Continue Reading →

Data+AI Summit follow-up: arbitrary stateful processing and state management

After previous posts about native stateful operations, it's time to focus on the one where you can define your custom stateful logic.

Continue Reading →

Data+AI Summit follow-up: joins and state management

Streaming joins are an interesting feature that heavily uses the state store. Even though I already blogged about them in the past (2018), some changes have been made since then and, I hope, my ability to explain them has improved.

Continue Reading →

Watermark and window-based processing

One of the less obvious things about the watermark is how it applies to windows. At first glance, you could think that it filters out the records produced before the watermark value. But that's not how it works for windows.
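The difference can be sketched in plain Python. This is an illustration under an assumption, not Spark code: it assumes that for windowed aggregations the watermark is compared with the end of the record's window rather than with the record's event time. The helper names `window_bounds` and `is_accepted`, the tumbling windows, and the numeric values are all hypothetical.

```python
def window_bounds(event_time, window_size):
    """Return the [start, end) tumbling window containing event_time (plain seconds)."""
    start = (event_time // window_size) * window_size
    return start, start + window_size


def is_accepted(event_time, watermark, window_size):
    """Assumed rule: keep the record while its window's end is after the watermark."""
    _, end = window_bounds(event_time, window_size)
    return end > watermark


# Hypothetical values: 10-second windows, watermark currently at 15.
# Event time 12 is older than the watermark, but its window [10, 20) is still open.
late_but_kept = is_accepted(12, 15, 10)
# Event time 8 belongs to window [0, 10), which ended before the watermark.
too_late = is_accepted(8, 15, 10)
```

Under that assumption, a record older than the watermark can still be counted as long as the window it belongs to has not expired yet.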

Continue Reading →

Apache Spark and shuffle management - external services

Shuffle has accompanied distributed data processing from the very beginning. Apache Spark is no exception, and one of the prominent features targeted for the 3.1 release is full support for the pluggable shuffle backend. But it's not the only effort the community is making these days to handle shuffle drawbacks, as you will see in this blog post.

Continue Reading →

Data+AI Summit follow-up: aggregations and state management

In previous blog posts you discovered how the state store interacts with dropDuplicates and limit operators. This time you will see how it's used in aggregations.

Continue Reading →

Shuffle in Apache Spark, back to the basics

If you are a newcomer to the distributed world, someone has certainly told you that shuffle is bad and will slow down your processing. But what does it mean? What happens when this infamous shuffle appears in your code? In this article you should find some answers about the shuffle in Apache Spark.

Continue Reading →

Partition-wise joins and Apache Spark SQL

Apache Spark has this great capacity to optimize joins of bucketed tables, but does it work on partitions as well? No, and to understand why, I invite you to read the following sections of this blog post.

Continue Reading →

Data+AI Summit follow-up: drop duplicates and state management

Another stateful operation requiring the state store is drop duplicates. You can use it to deduplicate your streaming data before pushing it to the sink.

Continue Reading →

Drop is a...select

Have you ever wondered what the relationship is between the drop and select operations in Apache Spark SQL? If not, I will shed some light on it in this short blog post.

Continue Reading →

Data+AI Summit follow-up: global limit and state management

It's the second follow-up Data+AI Summit post but the first one focusing on the stateful operations and their interaction with the state store.

Continue Reading →

Data+AI follow-up: StateStoreRDD - building block for stateful processing

The main Apache Spark component enabling stateful processing is StateStoreRDD. It creates a partition-based state store instance but also triggers state-based computation.

Continue Reading →

What's new in Apache Spark 3.0 - Kubernetes

I believe Kubernetes is the next big step for the framework after proposing Catalyst Optimizer, modernizing streaming processing with Structured Streaming, and introducing Adaptive Query Execution. Especially since Apache Spark 3 brings a lot of changes in this area!

Continue Reading →

Broadcasting in Structured Streaming

Some time ago @ArunJijo36 mentioned me on Twitter with a question about broadcasting in Structured Streaming. If, like me at that time, you don't know what happens, I think this article will be useful for you 👊

Continue Reading →

What's new in Apache Spark 3.0 - GPU-aware scheduling

GPU-awareness was one of the topics I postponed the most in my Apache Spark 3.0 exploration. But its time has come, and in this blog post you will discover what changed in version 3 of the framework regarding GPU-based computation.

Continue Reading →