Data+AI Summit 2024 - Retrospective - Streaming

Welcome to the first Data+AI Summit 2024 retrospective blog post. I'm opening the series with a topic close to my heart at the moment: stream processing!


Effective Lakehouse Streaming with Delta Lake and Friends

After watching last year's "The Hitchhiker's Guide to Delta Lake Streaming", I was impatient to see what Scott Haines had prepared this year! And I was not disappointed. Together with Ashok Singamaneni, he gave an excellent overview of building a reliable Bronze Layer with the help of Protobuf, Brickflow, and Spark-Expectations, the last two being open source projects developed at Nike.

Notes from the talk:

State Reader API: the New "Statestore" Data Source

I was pretty excited about the next talk, given by Craig Lukasik, on the State Store Data Source. I have been following the feature since the SPIP, and seeing it running on stage was great!

Notes from the talk:

Fast, Cheap and Easy Data Ingestion with AWS Lambda and Delta Lake

After Craig's talk, I was impatient to see how to integrate a serverless environment, AWS Lambda, with Delta Lake. Tyler Croy explained that pretty well. Besides the happy path, he also shared the points to watch out for when implementing the solution.

Notes from the talk:

Databricks Streaming: Project Lightspeed Goes Hyperspeed

Project Lightspeed has been on my radar since day 1. When I saw that it was going Hyperspeed, I couldn't miss the opportunity to see what that involved. Ryan Nienhuis and Praveen Gattu did a great job of explaining the term and showing the impact it will have on streaming pipelines on Databricks!

Notes from the talk:

Processing a Trillion Rows Per Day with Delta Lake at Adobe

After Ashok's and Scott's talk about Protobuf, I didn't expect to see any JSON on my list. However, Yeshwanth Vijayakumar proved me wrong, and he was right! Even though JSON is often considered an archaic data ingestion format, it's still there, and its flexibility can be useful in various scenarios. That's the case for Adobe's multi-tenant architecture, which Yeshwanth presented.

Notes from the talk:

How Boeing Uses Streaming Data to Enhance the Flight Deck and OCC

I chose this talk out of pure curiosity, to see the impact of streaming on the aviation industry. I was expecting low-level communication protocols and other hard-to-grasp material, but I was positively surprised in the end! Will Jenden proved that it's possible to bring data intelligence with Apache Spark Structured Streaming even to an environment as challenging as aviation!

Notes from the talk:

How to Use Delta Sharing for Streaming Data

That was another intriguing talk, since to me Delta Sharing had always been batch-oriented! Thankfully, I didn't miss the chance to learn from Matt Slack that this was a wrong assumption.

Notes from the talk:

Incremental Change Data Capture: A Data-Informed Journey

This year I picked two Change Data Capture (CDC) talks. I was a bit worried that, since CDC has been around for a while, there would be nothing new to learn, but thankfully that was not the case! In the first talk, Christina Taylor presented her journey through various CDC implementations on top of AWS and Databricks.

Notes from the talk:

How DLT Stretched CDC Capabilities and Kept ETL Limber at Hinge Health

A different Databricks-based solution for Change Data Capture was presented by Alex Owen and Veeranagouda Mukkanagoudar.

Notes from the talk:

Building Metrics Store with Incremental Processing

The last talk on my list was also loosely related to the Change Data Capture theme. Hang Li shared the challenges she faced at Instacart while building a metrics store with incremental processing.

Notes from the talk:

That's all for the first retrospective. Thanks for reading. I know it was pretty long compared to what you usually read here, but I'd say that's the cost of wisdom 🤓 Stay tuned for the next retrospective, this time about Apache Spark!

