A single bug in production costs your team between €400 and €800 per day (a typical data engineering daily rate) to investigate and fix, multiplied by however many days it takes to find it. A 3-day bug hunt on a 3-person team adds up to €7,200 in lost engineering time. One workshop pays for itself the first time it prevents one incident.
This practical workshop teaches data engineers how to write tests that actually catch bugs before your stakeholders do, covering unit tests, integration tests, and data tests for PySpark and Databricks Lakeflow.
Sound familiar?
The NULL values that slipped in three weeks ago are now polluting every dashboard in the company. The data is technically "there" — it's just wrong, and fixing it means a costly reprocessing run.
You know the 2-year-old Spark job needs a rewrite. But there are no tests, no safety net. Touch it and everything might break.
Every deployment means someone manually running SQL queries to "check if it looks right." It's tedious, unreliable, and doesn't scale.
AI-generated code can accelerate your workflow dramatically, but volume is not the same as reliability. Without proper validation, you have no way of knowing how that code will actually behave in production.
Software engineering testing practices don't map directly to data pipelines. So most teams just... don't test. Until something explodes.
You know tests would help. But you don't know where to start, what to test, or how to convince your teammates and manager it's worth the investment.
You have a test suite, but it's painful to maintain and you don't know what you're doing wrong.
And the one colleague who knew how it all worked just left — taking the setup on his machine with him.
What you'll learn
Unlike generic software testing courses, this workshop is firmly grounded in the Databricks platform (Apache Spark, Delta Lake, Lakeflow Spark Declarative Pipelines, and Declarative Automation Bundles) while also covering the universal testing principles that apply to any data stack.
Four full days of live, hands-on training with the instructor — in person at your location or online via video call, whichever works best for you. Every session includes exercises you write and run yourself.
We open by answering the foundational question: why can't we just apply standard software testing practices to data pipelines? Non-determinism, external state, and data evolution make data systems a genuinely different beast. Once that's clear, we establish the software engineering principles that bring discipline to the chaos.
With the foundations in place, we turn to the first line of defense: unit tests. They remain essential for catching logic errors early and fast. We cover what makes a unit test genuinely useful versus one that gives false confidence, the code patterns and libraries that make writing them less painful, and how to handle the unique challenges of testing PySpark and Databricks workloads. We close by embedding unit tests into the development lifecycle and clearing up the misconceptions that lead teams to either over-rely on them or abandon them too soon.
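To give a taste of the core pattern: when business logic lives in a pure function, the unit test needs no cluster and runs in milliseconds. A minimal sketch — the function name and mapping are illustrative, not taken from the workshop materials:

```python
# Hypothetical example: business logic extracted into a pure function,
# so the unit test runs without a SparkSession or any cluster access.
def normalize_country(code):
    """Map free-text country codes to ISO 3166 alpha-2, or None if unknown."""
    mapping = {"france": "FR", "fr": "FR", "germany": "DE", "de": "DE"}
    if code is None:
        return None
    return mapping.get(code.strip().lower())

def test_normalize_country():
    assert normalize_country(" FR ") == "FR"
    assert normalize_country("Germany") == "DE"
    assert normalize_country(None) is None        # NULL-safety: the bug class from the intro
    assert normalize_country("atlantis") is None  # unknown codes don't crash the job

test_normalize_country()
```

The same function can then be wrapped in a Spark UDF or used inside a DataFrame transformation; the slow, cluster-bound surface you still need to test shrinks to a thin layer.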
Moving beyond unit tests, we explore how data tests operate at a different level — validating that the pieces work together against real data and real systems. We build a data quality layer that turns passive observations into active test controls, and rethink assertions in this context. The section closes by wiring everything into the CI/CD process so data tests become a natural checkpoint in every deployment.
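The difference between a passive observation and an active control is simply whether a violation stops the pipeline. A framework-free sketch of that idea — the rule name, threshold, and exception type are illustrative:

```python
# Hypothetical sketch: a data-quality rule that fails the run instead of
# just logging a metric. Thresholds and names are illustrative.
class DataQualityError(Exception):
    """Raised when a quality rule is violated, failing the pipeline run."""

def check_not_null(rows, column, max_null_ratio=0.0):
    """Raise if the NULL ratio in `column` exceeds the allowed threshold."""
    if not rows:
        raise DataQualityError("empty input: nothing to validate")
    nulls = sum(1 for row in rows if row.get(column) is None)
    ratio = nulls / len(rows)
    if ratio > max_null_ratio:
        raise DataQualityError(
            f"{column}: {ratio:.0%} NULLs exceeds allowed {max_null_ratio:.0%}"
        )
    return ratio  # observed ratio, useful for trend monitoring

rows = [{"order_id": 1}, {"order_id": None}, {"order_id": 3}, {"order_id": 4}]
check_not_null(rows, "order_id", max_null_ratio=0.5)  # 25% NULLs: passes
```

Run in CI/CD against a sample of real data, a check like this turns "the dashboard looks odd" into a failed deployment gate.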
Here we tackle integration tests in practice, focusing on the specific challenges that come with Databricks environments. We look at how to keep maintenance overhead from becoming a burden as the test suite grows, and how to automate execution so integration tests run reliably without manual intervention.
We apply the testing strategies to the Medallion architecture, following data as it moves through the Bronze, Silver, and Gold layers. Each layer introduces its own failure modes, so we map out where problems are most likely to originate and how to trace them back to their source. We then address the most common issues that surface at each layer, showing how unit tests, data tests, and integration tests each play a role in keeping the pipeline healthy end to end.
Spark Declarative Pipelines come with a common misconception: because there is no explicit SparkSession, they must be hard to test. We unpack how the main SDP script works like a Python `__main__` entry point — it declares the workflow while the real logic lives elsewhere, and that separation is actually your testing advantage. Any business logic extracted from the pipeline can be covered with the unit tests from the previous section. We walk through concrete test examples and finish by integrating SDP tests into the CI/CD pipeline.
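The separation described above can be sketched in a few lines. Assume the transformation lives in its own module and the pipeline file only declares tables; everything here (function names, the commented `dlt` usage) is illustrative, not the workshop's reference solution:

```python
# transformations.py (hypothetical): pure logic, no pipeline framework imported.
def keep_valid_orders(rows):
    """Business rule: drop orders with a missing id or a non-positive amount."""
    return [r for r in rows
            if r.get("order_id") is not None and r.get("amount", 0) > 0]

# The pipeline file would only declare the flow and delegate, roughly:
#
#   import dlt
#   @dlt.table
#   def silver_orders():
#       # apply the same rule via a DataFrame filter built from keep_valid_orders
#       return dlt.read("bronze_orders").filter(...)
#
# A unit test imports the logic directly and never touches the pipeline file:
assert keep_valid_orders([
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 5.0},
    {"order_id": 2, "amount": 0},
]) == [{"order_id": 1, "amount": 10.0}]
```

The declaration stays thin and framework-bound; the logic stays plain Python and fully testable.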
Before diving into the capstone project, we take a step back to look at one of the most exciting shifts happening in the data engineering space right now: using LLMs as an active partner in building and maintaining your testing layer. We explore how LLMs can generate unit test cases from your pipeline code, produce realistic synthetic datasets, translate business requirements into data quality rules, and help you spot logic paths your current test suite is missing.
We close the workshop with a capstone project that brings everything together. Starting from a realistic Databricks data pipeline, participants apply the full testing stack hands-on: writing unit tests to guard business logic, data tests to enforce quality at each layer of the Medallion architecture, and integration tests to validate the pipeline end to end. The project is designed to reflect the real challenges data engineers face (non-determinism, external state, PySpark specifics) and challenges participants to make deliberate choices about what to test, at which level, and how to wire it all into a CI/CD pipeline. By the end, the testing strategies covered throughout the workshop stop being abstract concepts and become a working, cohesive test suite.
We close the day with an open retrospective and Q&A. This is a space to surface lingering doubts, challenge the approaches presented, and share lessons from the capstone project. No slides, no structure — just an honest conversation about what works, what doesn't.
Your instructor
Bartosz Konieczny
Freelance Data Engineer & Author
I'm a freelance data engineer who has held senior hands-on positions across the industry, working on data engineering problems in both batch and stream processing. My work spans Apache Spark, Databricks, Apache Kafka, and Delta Lake across major public cloud platforms.
I write about everything I learn on waitingforcode.com — one of the most comprehensive data engineering blogs on the internet, with deep dives on Apache Spark and Databricks internals, stream processing, and distributed systems. I've spoken at the Spark+AI Summit, the Data+AI Summit, among others.
This workshop distills everything I've learned building, breaking, and fixing real data systems. Not slides — code, patterns, and hard-won lessons.
I'm also the author of:
What's included
Your choice. On-site at your office, at an external venue, or via video call — same content, same instructor.
Four full days of direct access. Ask questions as you go, get unstuck in real time, no asynchronous delays.
A real GitHub repo with exercises for every topic. You write and run tests yourself, not just watch.
Production-ready files and architectures; adapt them and ship them to your own environment.
A 4-hour time credit to ask follow-up questions as you apply what you've learned.
Investment
Four full days with an O'Reilly author and Databricks MVP since 2020, maximum 10 participants, fully focused on Databricks and PySpark. Here's how it compares.
Market context
In-person or online · Your choice of format · Travel costs separate for in-person
Questions
Ready to start?
€7,000 · Max 10 participants · In-person or online
Testing for Data Engineers · contact@waitingforcode.com · © 2026 All rights reserved.
4-day live workshop · In-person or online · Dates on request · waitingforcode.com