Data+AI Summit 2024 - Retrospective - Apache Spark

Welcome to the second blog post dedicated to the previous Data+AI Summit. This time I'm going to share with you a summary of Apache Spark talks.

Introducing the New Python Data Source API for Apache Spark™

Let's start this blog post with a feature that I briefly introduced last week, a new data source API available in Python! Allison Wang and Ryan Nienhuis explained all the ins and outs, with additional code examples.

Notes from the talk:

Connect with speakers and watch the talk online:

Exploring UDTFs (User-Defined Table Functions) in PySpark

Another new feature in PySpark is User-Defined Table Functions (UDTFs), very well introduced by Haejoon Lee and Takuya Ueshin.
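To give an idea of what a UDTF looks like, here is a minimal sketch that is not taken from the talk. It assumes PySpark 3.5 or newer; the `SplitWords` class, the `run_with_spark` helper and the sample sentence are hypothetical names made up for the illustration. Unlike a regular UDF, which returns one value per input row, a table function can emit many rows from its `eval` method.

```python
class SplitWords:
    """Hypothetical table function: emits one output row per word of the input text."""

    def eval(self, text: str):
        # Each yielded tuple becomes one row of the resulting table.
        for word in text.split():
            yield (word,)


def run_with_spark():
    # Assumption: PySpark 3.5+ and a local JVM are available. The imports live
    # inside the function so that the class above stays importable without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit, udtf

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    # Wrap the plain class into a UDTF and declare the schema of the emitted rows.
    split_words = udtf(SplitWords, returnType="word: string")
    # Calling the UDTF returns a DataFrame with one row per yielded tuple.
    split_words(lit("hello data ai summit")).show()
    spark.stop()
```

Since the row-generation logic is plain Python, `list(SplitWords().eval("a b"))` can be checked without starting Spark; `run_with_spark()` is only needed to see the function used from the DataFrame API.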

Notes from the talk:

Connect with speakers and watch the talk online:

Dependency Management in Spark Connect: Simple, Isolated, Powerful

Spark Connect has been a top topic for the past year. It also got some attention at the Data+AI Summit. Hyukjin Kwon and Akhil Gudesa shared how to manage dependencies in Spark Connect.

Notes from the talk:

Connect with speakers and watch the talk online:

Best Practices for Unit Testing PySpark

Another interesting talk in the "operationalizing Apache Spark" theme was given by Matthew Powers, who has contributed to the community with various libraries and ebooks. At the Summit he recalled how to test PySpark jobs locally, without having to set up a distributed cluster.

Notes from the talk:

Connect with speakers and watch the talk online:

Stranger Triumphs: Automating Spark Upgrades & Migrations at Netflix

The next two talks on my list are great examples that Apache Spark is not only about writing data processing jobs. In the first of them, Holden Karau and Robert Morck explained how to automate Apache Spark version upgrades at Netflix scale!

Notes from the talk:

Connect with speakers and watch the talk online:

Uber's Batch Analytics Evolution from Hive to Spark

At a scale similar to the Spark upgrades at Netflix, Kumudini Kakwani and Akshayaprakash Sharma migrated batch jobs from Apache Hive to Apache Spark at Uber, also in an automated way!

Notes from the talk:

Connect with speakers and watch the talk online:

Apache Spark was only one of the open source projects present at the Summit. Another one is Delta Lake, which I'm going to focus on in the next blog post of the series.

