I'm the author of Data Engineering Design Patterns (O'Reilly),
a Databricks MVP, and
a freelance data engineer specializing in Apache Spark and Databricks.
I help teams move from working pipelines to resilient architectures.
I'm currently accepting new projects for May 2026. Whether you need a 2-day architectural audit, a hands-on lead for a
complex data engineering problem, or a workshop
let's discuss your project here.
Welcome to the first Data+AI Summit 2024 retrospective blog post. I'm opening the series with the topic close to my heart at the moment, stream processing!
With this blog I'm starting a follow-up series for my Data+AI Summit 2024 talk. I missed this family of blog posts a lot as the previous DAIS with me as speaker was 4 years ago! As previously, this time too I'll be writing several blog posts that should help you remember the talk and also cover some of the topics left aside because of the time constraints.
Welcome to the second blog post dedicated to the previous Data+AI Summit. This time I'm going to share with you a summary of Apache Spark talks.
Testing batch jobs is not the same as testing streaming ones. Although the transformation (the WHAT from the previous article) is similar in both cases, more complete validation tests on the job logic are not. After all, streaming jobs often iteratively build the final outcome while the batch ones generate it in a single pass.
Code organization and assertions flow are both important but even them, they can't guarantee your colleagues' adherence to the unit tests. There are other user-facing attributes to consider as well.