Articles about PySpark on waitingforcode.com - articles for the pleasure of learning and discovery

May 8, 2025 • PySpark

Abstracting column access in PySpark with Proxy design pattern

One of the biggest changes for PySpark has been the DataFrame API. It greatly reduces the JVM-to-PVM communication overhead and improves performance. However, it also complicates the code. Some of you have probably already seen, written, or worked with code like this...

December 3, 2022 • PySpark

Shuffle in PySpark

Shuffle is for me a never-ending story. Last year I spent long weeks analyzing the readers and writers and was hoping for some rest in 2022. However, it didn't happen. My recent PySpark investigation led me to the shuffle.py file and my first reaction was "Oh, so PySpark has its own shuffle mechanism?". Let's check this out!

November 26, 2022 • PySpark

Serializers in PySpark

In the previous PySpark blog posts we learned about the serialization overhead between the Python application and the JVM. An intrinsic part of this overhead is the Python serializers. They are the topic of this article, which will hopefully provide a more complete overview of the Python <=> JVM serialization.

October 8, 2022 • PySpark

PySpark and pyspark.zip story

The topic of this blog post is one of my first big surprises while I was learning to debug PySpark jobs. Usually I run the code locally in debug mode and the defined breakpoints help me understand what happens. That time, it was different!

October 1, 2022 • PySpark

PySpark and vectorized User-Defined Functions

The Scala API of Apache Spark SQL has various ways of transforming the data, from native and User-Defined column-based functions to more custom, row-level map functions. PySpark doesn't have this mapping feature but does have User-Defined Functions, with an optimized version called vectorized UDF!

July 30, 2022 • PySpark

What's new in Apache Spark 3.3.0 - PySpark

It's time for the last "What's new in Apache Spark 3.3.0..." before a break. Today we'll see what changed in PySpark. Spoiler alert: Pandas users should find one feature very exciting!

June 11, 2022 • PySpark

Generators and PySpark

I remember the first PySpark code I saw. It was pretty similar to the Scala code I used to work with except for one small detail, the yield keyword. Since then, I've understood its purpose but have been actively looking for an occasion to blog about it. Growing the PySpark section is a great opportunity for this!
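
As a rough illustration of the yield keyword mentioned above, here is a minimal plain-Python sketch. It is not taken from the article; the add_length function and the sample data are hypothetical. It only mimics the iterator-in, iterator-out shape of the functions typically passed to RDD.mapPartitions, and it runs without Spark.

```python
from typing import Iterable, Iterator, Tuple


def add_length(rows: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Lazily pair each input string with its length.

    Hypothetical helper: it has the iterator-in, iterator-out shape of a
    function passed to RDD.mapPartitions, but needs no Spark to run.
    """
    for row in rows:
        # yield hands back one result at a time instead of building a list
        yield row, len(row)


# Plain-Python stand-in for one partition (made-up sample data).
partition = iter(["spark", "py4j", "jvm"])
results = add_length(partition)

first = next(results)      # only the first input element has been consumed
remaining = list(results)  # drains the rest of the generator
```

Because the function is a generator, nothing is computed until a consumer asks for the next element, so a whole partition never has to be materialized as a list in memory.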

June 4, 2022 • PySpark

PySpark and the JVM - introduction, part 2

Last time I introduced Py4J, which is the bridge between the Apache Spark JVM codebase and Python client applications. Today is a great moment to take a deeper look at their interaction in the context of data processing defined with the RDD and DataFrame APIs.

April 30, 2022 • PySpark

PySpark and the JVM - introduction, part 1

In my quest to understand PySpark better, the JVM in the Python world is a must-have stop. In this first blog post I'll focus on the Py4J project and its usage in PySpark.

December 4, 2021 • PySpark

What's new in Apache Spark 3.2.0 - PySpark and Pandas

Project Zen is an initiative to make PySpark more Pythonic and facilitate the Python programming experience. Apache Spark 3.2.0 took the next step in this direction by bringing Pandas to the API!

April 3, 2021 • PySpark

What's new in Apache Spark 3.1 - Project Zen

I mentioned it very briefly in my first ever blog post about PySpark. Thanks to the Project Zen initiative, the Python part of Apache Spark will become more Pythonic and user-friendly. How? Let's check that in this blog post!

January 16, 2021 • PySpark

PySpark schema inference and 'Can not infer schema for type str' error

The error in the title of this blog post may be one of the first problems you encounter with PySpark (it was mine). Even though it's quite mysterious, it makes sense if you take a look at the root cause.
