Programming articles on waitingforcode.com - blog posts about Big Data, Spark, Kafka, Scala, Java, JVM and other programming stuff

April 14, 2025 • Databricks

Data quality on Databricks - Spark Expectations

Previously we learned how to control data quality with Delta Live Tables. Now, it's time to see an open source library in action, Spark Expectations.

Continue Reading →

April 9, 2025 • Databricks

Data quality on Databricks - Delta Live Tables

Data quality is one of the key factors of a successful data project. Without a good quality, even the most advanced engineering or analytics work will not be trusted, therefore, not used. Unfortunately, data quality controls are very often considered as a work item to implement in the end, which sometimes translates to never.

Continue Reading →

March 26, 2025 • General data engineering

Data contracts and Bitol project

Data contracts was a hot topic in the data space before LLMs and GenAI came out. They promised a better world with less communication issues between teams, leading to more reliable and trustworthy data. Unfortunately, the promise has been too hard to put into practice. Has been, or should I write "was"?

Continue Reading →

March 19, 2025 • Databricks

Apache Airflow XCom in Databricks with task values

If you have been working with Apache Airflow already, you certainly met XComs at some point. You know, these variables that you can "exchange" between tasks within the same DAG. If after switching to Databricks Workflows for data orchestration you're wondering how to do the same, there is good news. Databricks supports this exchange capability natively with Task values.

Continue Reading →

March 5, 2025 • Databricks

File trigger in Databricks

For over two years now you can leverage file triggers in Databricks Jobs to start processing as soon as a new file gets written to your storage. The feature looks amazing but hides some implementation challenges that we're going to see in this blog post.

Continue Reading →

February 26, 2025 • Apache Spark SQL

The saveAsTable in Apache Spark SQL, alternative to insertInto

Is there an easier way to address the insertInto position-based data writing in Apache Spark SQL? Totally, if you use a column-based method such as saveAsTable with append mode.

Continue Reading →

February 19, 2025 • Apache Spark Structured Streaming

Dealing with quotas and limits - Apache Spark Structured Streaming for Amazon Kinesis Data Streams

Using cloud managed services is often a love and hate story. On one hand, they abstract a lot of tedious administrative work to let you focus on the essentials. From another, they often have quotas and limits that you, as a data engineer, have to take into account in your daily work. These limits become even more serious when they operate in a latency-sensitive context, as the one of stream processing.

Continue Reading →

February 12, 2025 • Apache Spark SQL

Overwriting partitioned tables in Apache Spark SQL

After publishing a release of my blog post about the insertInto trap, I got an intriguing question in the comments. The alternative to the insertInto, the saveAsTable method, doesn't work well on partitioned data in overwrite mode while the insertInto does. True, but is there an alternative to it that doesn't require using this position-based function?

Continue Reading →

January 23, 2025 • Apache Spark SQL

The insertInto trap in Apache Spark SQL

Even though Apache Spark SQL provides an API for structured data, the framework sometimes behaves unexpectedly. It's the case of an insertInto operation that can even lead to some data quality issues. Why? Let's try to understand in this short article.

Continue Reading →

January 15, 2025 • Apache Spark Structured Streaming

Event time skew and global watermark in Apache Spark Structured Streaming

A few months ago I wrote a blog post about event skew and how dangerous it is for a stateful streaming job. Since it was a high-level explanation, I didn't cover Apache Spark Structured Streaming deeply at that moment. Now the watermark topic is back to my learning backlog and it's a good opportunity to return to the event skew topic and see the dangers it brings for Structured Streaming stateful jobs.

Continue Reading →

January 10, 2025 • Delta Lake

Delta Lake and restore - traveling in time differently

Time travel is a quite popular Delta Lake feature. But do you know it's not the single one you can use to interact with the past versions? An alternative is the RESTORE command, and it'll be the topic of this blog post.

Continue Reading →

January 6, 2025 • Blog

2024 retrospective on waitingforcode.com

Even though I was blogging less in the second half of the previous year, the retrospective is still the blog post I'm waiting for each year. Every year I summarize what happened in the past 12 months and share with you my future plans. It's time for the 2024 Edition!

Continue Reading →

August 22, 2024 • Apache Spark Structured Streaming

DAIS 2024: Unit tests - configuration and declaration

Code organization and assertions flow are both important but even them, they can't guarantee your colleagues' adherence to the unit tests. There are other user-facing attributes to consider as well.

Continue Reading →

August 9, 2024 • Apache Spark Structured Streaming

DAIS 2024: Orchestrating and scoping assertions in Apache Spark Structured Streaming

Testing batch jobs is not the same as testing streaming ones. Although the transformation (the WHAT from the previous article) is similar in both cases, more complete validation tests on the job logic are not. After all, streaming jobs often iteratively build the final outcome while the batch ones generate it in a single pass.

Continue Reading →

August 1, 2024 • General data engineering

Data+AI Summit 2024 - Retrospective - Apache Spark

Welcome to the second blog post dedicated to the previous Data+AI Summit. This time I'm going to share with you a summary of Apache Spark talks.

Continue Reading →

Welcome to the blog!