I'm the author of Data Engineering Design Patterns (O'Reilly),
a Databricks MVP, and
a freelance data engineer specializing in Apache Spark and Databricks.
I help teams move from working pipelines to resilient architectures.
I'm currently accepting new projects for June 2026. Whether you need a 2-day architectural audit, a hands-on lead for a complex data engineering problem, or a workshop, let's discuss your project here.
There are different ways to create a DataFrame in Apache Spark SQL. The same applies to an empty Dataset, which may be useful if you prefer to deal with emptiness rather than missing values (null ...
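For illustration, a minimal sketch of two common ways to get an empty Dataset, assuming a local SparkSession and a hypothetical Customer case class:

import org.apache.spark.sql.{Encoders, Row, SparkSession}

// Hypothetical record type, used only for this illustration
case class Customer(id: Int, name: String)

object EmptyDatasetSketch extends App {
  val spark = SparkSession.builder().appName("empty-dataset").master("local[*]").getOrCreate()
  import spark.implicits._

  // Strongly typed empty Dataset; the schema comes from the case class
  val emptyCustomers = spark.emptyDataset[Customer]
  emptyCustomers.printSchema()

  // Untyped variant: an empty DataFrame built from an empty RDD plus an explicit schema
  val emptyDf = spark.createDataFrame(
    spark.sparkContext.emptyRDD[Row],
    Encoders.product[Customer].schema
  )
  assert(emptyDf.count() == 0)
}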
In your daily work you've certainly run into the Table or view not found problem. The error happens in situations like this: val customerIds = JoinHelper.insertCustomers(1) JoinHelper.insertOrde...
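One frequent cause of this error (an assumption here, since the excerpt is truncated) is querying a temporary view from a different SparkSession than the one that registered it; a minimal sketch:

import org.apache.spark.sql.SparkSession

object TempViewScopeSketch extends App {
  val spark = SparkSession.builder().appName("temp-view-scope").master("local[*]").getOrCreate()
  import spark.implicits._

  // The temporary view lives in the catalog of THIS session only
  Seq((1, 100.0)).toDF("customer_id", "amount").createOrReplaceTempView("orders")
  spark.sql("SELECT * FROM orders").show()

  // A new session has its own temp view registry, so the same query
  // fails there with AnalysisException: Table or view not found: orders
  val anotherSession = spark.newSession()
  // anotherSession.sql("SELECT * FROM orders") // would throw
}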
Sometimes the entries in the processed dataset can be duplicated. For instance, an IoT device can mistakenly send the same metrics more than once, or our ingestion step can badly format the message. Spark...
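A minimal sketch of deduplication with dropDuplicates, using made-up IoT metrics:

import org.apache.spark.sql.SparkSession

object DeduplicationSketch extends App {
  val spark = SparkSession.builder().appName("dedup").master("local[*]").getOrCreate()
  import spark.implicits._

  val metrics = Seq(
    ("device-1", "2024-01-01T10:00", 21.5),
    ("device-1", "2024-01-01T10:00", 21.5), // the same metric sent twice
    ("device-2", "2024-01-01T10:00", 19.0)
  ).toDF("device_id", "event_time", "temperature")

  // Keeps one arbitrary row per (device_id, event_time) pair, so it is
  // safe only when the duplicated payloads are really identical
  metrics.dropDuplicates("device_id", "event_time").show()
}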
Apache Spark SQL lets us manipulate JSON fields in many different ways. One of the features is field extraction from a stringified JSON with the json_tuple(json: Column, fields: String*) function: ...
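A short example of json_tuple on a hypothetical payload column; the extracted fields come back as separate columns:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.json_tuple

object JsonTupleSketch extends App {
  val spark = SparkSession.builder().appName("json-tuple").master("local[*]").getOrCreate()
  import spark.implicits._

  val logs = Seq("""{"user": "u1", "city": "Paris"}""").toDF("payload")

  // Each requested field of the stringified JSON becomes its own column
  logs.select(json_tuple($"payload", "user", "city").as(Seq("user", "city"))).show()
}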
A schema helps to ensure good data quality and to optimize data exploration. However, in some cases defining the schema may be hard. That's especially true if your dataset has a lot of nested levels, which...
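One way to sidestep hand-written schemas (a sketch only, with a hypothetical /tmp/events path) is to let Spark infer the nested schema from a sample of the records and reuse the result:

import org.apache.spark.sql.SparkSession

object SchemaInferenceSketch extends App {
  val spark = SparkSession.builder().appName("schema-inference").master("local[*]").getOrCreate()

  // Infer the nested schema from ~10% of the records instead of
  // writing the StructType by hand
  val inferred = spark.read
    .option("samplingRatio", "0.1")
    .json("/tmp/events/*.json")
  inferred.printSchema()

  // Reuse the inferred schema to avoid a second inference pass
  val withKnownSchema = spark.read.schema(inferred.schema).json("/tmp/events/*.json")
}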
You need to migrate SQL Server code that uses the STRING_AGG function. Unfortunately, it's a SQL Server-specific feature. To migrate it, you will have to use other transformations available in Apache Sp...
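One possible migration path (a sketch, not necessarily the article's solution) combines collect_list with concat_ws; note that collect_list gives no ordering guarantee, unlike STRING_AGG ... WITHIN GROUP:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws}

object StringAggMigrationSketch extends App {
  val spark = SparkSession.builder().appName("string-agg").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders = Seq((1, "book"), (1, "pen"), (2, "mug")).toDF("customer_id", "product")

  // SQL Server: SELECT customer_id, STRING_AGG(product, ',') FROM orders GROUP BY customer_id
  orders.groupBy($"customer_id")
    .agg(concat_ws(",", collect_list($"product")).as("products"))
    .show()
}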
Sometimes your data may be stored in a nested hierarchy, like: bartosz:/tmp/test-nested-wildcard$ tree . ├── 11 │ ├── 11.json │ └── 22 │ └── 22a.json │...
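A sketch of two ways to read such a tree, reusing the directory above and assuming JSON files; both options are standard DataFrameReader ones:

import org.apache.spark.sql.SparkSession

object NestedWildcardSketch extends App {
  val spark = SparkSession.builder().appName("nested-wildcard").master("local[*]").getOrCreate()

  // Wildcards match exactly one directory level per asterisk...
  val oneLevelDeep = spark.read.json("/tmp/test-nested-wildcard/*/*.json")

  // ...while recursiveFileLookup (Spark 3.0+) walks the whole tree
  val allLevels = spark.read
    .option("recursiveFileLookup", "true")
    .json("/tmp/test-nested-wildcard")
}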
Nested data structures are an interesting solution for data organization. They let us bring values with similar characteristics into common logical groups. Most of the time it improves raw data readabilit...
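For illustration, a minimal sketch that groups hypothetical address columns into a struct and reads them back with dot notation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

object NestedStructSketch extends App {
  val spark = SparkSession.builder().appName("nested-struct").master("local[*]").getOrCreate()
  import spark.implicits._

  val users = Seq(("u1", "Paris", "75001")).toDF("id", "city", "zip_code")

  // Group the address-related columns into a single nested struct...
  val nested = users.select($"id", struct($"city", $"zip_code").as("address"))
  nested.printSchema()

  // ...and access the nested fields back with dot notation
  nested.select($"address.city").show()
}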
To debug your Apache Spark SQL programs, or simply to understand better how they work, you can use the debugging features exposed through the org.apache.spark.sql.execution.debug package. One of them lets you se...
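A minimal sketch of the two extension methods that package adds to Dataset:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

object DebugSketch extends App {
  val spark = SparkSession.builder().appName("debug").master("local[*]").getOrCreate()
  import spark.implicits._

  val numbers = Seq(1, 2, 3).toDF("nr").filter($"nr" > 1)

  // Executes the query and prints per-operator tuple counts and column metadata
  numbers.debug()
  // Prints the Java source produced by whole-stage code generation
  numbers.debugCodegen()
}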
Using a UDF in a SQL statement or programmatically is quite easy: you either reference the function's name or simply call the object returned after registration. Using it in column-based operat...
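A short sketch contrasting the two usages, with a hypothetical upper-casing UDF:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch extends App {
  val spark = SparkSession.builder().appName("udf").master("local[*]").getOrCreate()
  import spark.implicits._

  val letters = Seq("a", "b").toDF("letter")

  // functions.udf returns a UserDefinedFunction that can be applied
  // directly in column-based operations...
  val upper = udf((s: String) => s.toUpperCase)
  letters.select(upper(col("letter")).as("upper_letter")).show()

  // ...while spark.udf.register makes the same function callable from SQL
  spark.udf.register("upper_udf", (s: String) => s.toUpperCase)
  letters.selectExpr("upper_udf(letter)").show()
}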
Spark SQL provides support for a lot of standard SQL operations, including the IN clause. It can be easily used by importing the implicits of the created SparkSession object: private val spark...
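A minimal sketch, with made-up data, of isin as the programmatic counterpart of the SQL IN clause:

import org.apache.spark.sql.SparkSession

object InClauseSketch extends App {
  val spark = SparkSession.builder().appName("in-clause").master("local[*]").getOrCreate()
  import spark.implicits._

  val users = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "letter")

  // Programmatic IN, thanks to the $ interpolator from the imported implicits
  users.filter($"letter".isin("a", "b")).show()

  // The same clause expressed as plain SQL
  users.createOrReplaceTempView("users")
  spark.sql("SELECT * FROM users WHERE letter IN ('a', 'b')").show()
}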
I appreciate Apache Spark SQL because you can use it either as a data engineer, with some programmatic logic, or as a data analyst, only by writing SQL queries. And sometimes writing these queries can...
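To illustrate the two audiences, a small sketch running the same aggregation once as SQL and once programmatically, on made-up orders data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit, sum}

object SqlVsApiSketch extends App {
  val spark = SparkSession.builder().appName("sql-vs-api").master("local[*]").getOrCreate()
  import spark.implicits._

  Seq((1, 100.0), (2, 50.0)).toDF("order_id", "amount").createOrReplaceTempView("orders")

  // Analyst style: plain SQL
  spark.sql("SELECT COUNT(*) AS orders_count, SUM(amount) AS total FROM orders").show()

  // Engineer style: the programmatic equivalent
  spark.table("orders").agg(count(lit(1)).as("orders_count"), sum($"amount").as("total")).show()
}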