A single bug in production costs your team between €400 and €800 per day (a typical data engineering daily rate) to investigate and fix, multiplied by however many days it takes to find it. A 3-day bug hunt on a 3-person team adds up to €7,200 in lost engineering time. One workshop pays for itself the first time it prevents one incident.
This practical workshop teaches data engineers how to write tests that actually catch bugs before your stakeholders do, covering unit tests, integration tests, and data tests for PySpark and Databricks Lakeflow.
Sound familiar?
The NULL values that slipped in three weeks ago are now polluting every dashboard in the company. The data is technically "there" — it's just wrong, and fixing it means a costly reprocessing run.
You know the 2-year-old Spark job needs a rewrite. But there are no tests, no safety net. Touch it and everything might break.
Every deployment means someone manually running SQL queries to "check if it looks right." It's tedious, unreliable, and doesn't scale.
AI-generated code can accelerate your workflow dramatically, but volume is not the same as reliability. Without proper validation, you have no way of knowing how that code will actually behave in production.
Software engineering testing practices don't map directly to data pipelines. So most teams just... don't test. Until something explodes.
You know tests would help. But you don't know where to start, what to test, or how to convince your teammates and manager it's worth the investment.
You have a test suite, but it's painful to maintain and you don't know what you're doing wrong.
And the one colleague who knew how it all worked just left — taking the setup on his machine with him.
What you'll learn
Unlike generic software testing courses, this workshop is firmly grounded in the Databricks platform (Apache Spark, Delta Lake, Lakeflow Spark Declarative Pipelines, and Declarative Automation Bundles) while also covering the universal testing principles that apply to any data stack.
Four full days of live, hands-on training with the instructor — in person at your location or online via video call, whichever works best for you. Every session includes exercises you write and run yourself.
We open by answering the foundational question: why can't we just apply standard software testing practices to data pipelines? Non-determinism, external state, and data evolution make data systems a genuinely different beast. Once that's clear, we establish the software engineering principles that bring discipline to the chaos.
With the foundations in place, we turn to the first line of defense: unit tests. They remain essential for catching logic errors early and fast. We cover what makes a unit test genuinely useful versus one that gives false confidence, the code patterns and libraries that make writing them less painful, and how to handle the unique challenges of testing PySpark and Databricks workloads. We close by embedding unit tests into the development lifecycle and clearing up the misconceptions that lead teams to either over-rely on them or abandon them too soon.
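To give a taste of the core pattern: when business logic lives in a pure function, the unit test needs no cluster and runs in milliseconds. A minimal sketch — the function name and mapping are illustrative, not taken from the workshop materials:

```python
# Hypothetical example: business logic extracted into a pure function,
# so the unit test runs without a SparkSession or any cluster access.
def normalize_country(code):
    """Map free-text country codes to ISO 3166 alpha-2, or None if unknown."""
    mapping = {"france": "FR", "fr": "FR", "germany": "DE", "de": "DE"}
    if code is None:
        return None
    return mapping.get(code.strip().lower())

def test_normalize_country():
    assert normalize_country(" FR ") == "FR"
    assert normalize_country("Germany") == "DE"
    assert normalize_country(None) is None        # NULL-safety: the bug class from the intro
    assert normalize_country("atlantis") is None  # unknown codes don't crash the job

test_normalize_country()
```

The same function can then be wrapped in a Spark UDF or used inside a DataFrame transformation; the slow, cluster-bound surface you still need to test shrinks to a thin layer.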
Moving beyond unit tests, we explore how data tests operate at a different level — validating that the pieces work together against real data and real systems. We build a data quality layer that turns passive observations into active test controls, and rethink assertions in this context. The section closes by wiring everything into the CI/CD process so data tests become a natural checkpoint in every deployment.
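The difference between a passive observation and an active control is simply whether a violation stops the pipeline. A framework-free sketch of that idea — the rule name, threshold, and exception type are illustrative:

```python
# Hypothetical sketch: a data-quality rule that fails the run instead of
# just logging a metric. Thresholds and names are illustrative.
class DataQualityError(Exception):
    """Raised when a quality rule is violated, failing the pipeline run."""

def check_not_null(rows, column, max_null_ratio=0.0):
    """Raise if the NULL ratio in `column` exceeds the allowed threshold."""
    if not rows:
        raise DataQualityError("empty input: nothing to validate")
    nulls = sum(1 for row in rows if row.get(column) is None)
    ratio = nulls / len(rows)
    if ratio > max_null_ratio:
        raise DataQualityError(
            f"{column}: {ratio:.0%} NULLs exceeds allowed {max_null_ratio:.0%}"
        )
    return ratio  # observed ratio, useful for trend monitoring

rows = [{"order_id": 1}, {"order_id": None}, {"order_id": 3}, {"order_id": 4}]
check_not_null(rows, "order_id", max_null_ratio=0.5)  # 25% NULLs: passes
```

Run in CI/CD against a sample of real data, a check like this turns "the dashboard looks odd" into a failed deployment gate.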
Here we tackle integration tests in practice, focusing on the specific challenges that come with Databricks environments. We look at how to keep maintenance overhead from becoming a burden as the test suite grows, and how to automate execution so integration tests run reliably without manual intervention.
We apply the testing strategies to the Medallion architecture, following data as it moves through the Bronze, Silver, and Gold layers. Each layer introduces its own failure modes, so we map out where problems are most likely to originate and how to trace them back to their source. We then address the most common issues that surface at each layer, showing how unit tests, data tests, and integration tests each play a role in keeping the pipeline healthy end to end.
Spark Declarative Pipelines come with a common misconception: because there is no explicit SparkSession, they must be hard to test. We unpack how the main SDP script works like a Python `__main__` entry point — it declares the workflow while the real logic lives elsewhere, and that separation is actually your testing advantage. Any business logic extracted from the pipeline can be covered with the unit tests from the previous section. We walk through concrete test examples and finish by integrating SDP tests into the CI/CD pipeline.
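The separation described above can be sketched in a few lines. Assume the transformation lives in its own module and the pipeline file only declares tables; everything here (function names, the commented `dlt` usage) is illustrative, not the workshop's reference solution:

```python
# transformations.py (hypothetical): pure logic, no pipeline framework imported.
def keep_valid_orders(rows):
    """Business rule: drop orders with a missing id or a non-positive amount."""
    return [r for r in rows
            if r.get("order_id") is not None and r.get("amount", 0) > 0]

# The pipeline file would only declare the flow and delegate, roughly:
#
#   import dlt
#   @dlt.table
#   def silver_orders():
#       # apply the same rule via a DataFrame filter built from keep_valid_orders
#       return dlt.read("bronze_orders").filter(...)
#
# A unit test imports the logic directly and never touches the pipeline file:
assert keep_valid_orders([
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 5.0},
    {"order_id": 2, "amount": 0},
]) == [{"order_id": 1, "amount": 10.0}]
```

The declaration stays thin and framework-bound; the logic stays plain Python and fully testable.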
Before diving into the capstone project, we take a step back to look at one of the most exciting shifts happening in the data engineering space right now: using LLMs as an active partner in building and maintaining your testing layer. We explore how LLMs can generate unit test cases from your pipeline code, produce realistic synthetic datasets, translate business requirements into data quality rules, and help you spot logic paths your current test suite is missing.
We close the workshop with a capstone project that brings everything together. Starting from a realistic Databricks data pipeline, participants apply the full testing stack hands-on: writing unit tests to guard business logic, data tests to enforce quality at each layer of the Medallion architecture, and integration tests to validate the pipeline end to end. The project is designed to reflect the real challenges data engineers face (non-determinism, external state, PySpark specifics) and challenges participants to make deliberate choices about what to test, at which level, and how to wire it all into a CI/CD pipeline. By the end, the testing strategies covered throughout the workshop stop being abstract concepts and become a working, cohesive test suite.
We close the day with an open retrospective and Q&A. This is a space to surface lingering doubts, challenge the approaches presented, and share lessons from the capstone project. No slides, no structure — just an honest conversation about what works, what doesn't.
Your instructor
Bartosz Konieczny
Freelance Data Engineer & Author
I'm a freelance data engineer who has held senior hands-on positions across the industry, working on data engineering problems in both batch and stream processing. My work spans Apache Spark, Databricks, Apache Kafka, and Delta Lake across major public cloud platforms.
I write about everything I learn on waitingforcode.com — one of the most comprehensive data engineering blogs on the internet, with deep dives on Apache Spark and Databricks internals, stream processing, and distributed systems. I've spoken at the Spark+AI Summit, the Data+AI Summit, among others.
This workshop distills everything I've learned building, breaking, and fixing real data systems. Not slides — code, patterns, and hard-won lessons.
I'm also the author of:
What's included
Your choice. On-site at your office, at an external venue, or via video call — same content, same instructor.
Four full days of direct access. Ask questions as you go, get unstuck in real time, no asynchronous delays.
A real GitHub repo with exercises for every topic. You write and run tests yourself, not just watch.
Production-ready files and architectures; adapt them and ship them to your own environment.
A 4-hour time credit to ask follow-up questions as you apply what you've learned.
Investment
Four full days with an O'Reilly author and Databricks MVP since 2020, maximum 10 participants, fully focused on Databricks and PySpark. Here's how it compares.
Market context
In-person or online · Your choice of format · Travel costs separate for in-person
Questions
Ready to start?
€7,000 · Max 10 participants · In-person or online
Testing for Data Engineers · contact@waitingforcode.com · © 2026 All rights reserved.
4-day live workshop · In-person or online · Dates on request · waitingforcode.com