Data engineering articles

Looking for something else? Check the categories of Data engineering:

Apache Airflow · Big Data · Big Data algorithms · Big Data problems - solutions · Data engineering patterns · Databricks · General Big Data · General data engineering · Graphs · SQL

If not, below you can find all articles belonging to Data engineering.

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that: unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny

Dynamic File Pruning and MERGE on Databricks

Some time ago I had an unpleasant surprise with a MERGE query that, despite a small source table and liquid clustering enabled on the target table, was taking ages. The solution came from a Photon feature called Dynamic File Pruning.

Continue Reading →

On tests in data systems

Modern data platforms make our lives easier. They abstract the compute layer with serverless capabilities, provide a built-in data governance framework, and simplify data democratization by hiding complex technical details from the end users. But despite this simplification, one thing remains on your, the data engineer's, end: the tests! In this blog post we're going to discover various testing patterns from the software engineering world and see how to fit them to data engineering needs.
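To give a feel for the unit-test level discussed here, below is a minimal pytest-style sketch: the transformation logic is kept as a pure Python function (the function and its mappings are illustrative, not taken from the article), which makes it testable without any Spark or Databricks runtime.

```python
# A pure transformation extracted from a hypothetical pipeline step.
# Keeping it free of Spark dependencies makes it trivially unit-testable.
def normalize_country_code(raw: str) -> str:
    """Map free-text country input to a 2-letter code, or UNKNOWN."""
    aliases = {"france": "FR", "fr": "FR", "poland": "PL", "pl": "PL"}
    return aliases.get(raw.strip().lower(), "UNKNOWN")


def test_normalize_country_code():
    assert normalize_country_code(" France ") == "FR"
    assert normalize_country_code("pl") == "PL"
    assert normalize_country_code("Mars") == "UNKNOWN"


test_normalize_country_code()
```

The same function could later be wrapped in a Spark UDF or applied in a DataFrame transformation, while the business rule itself stays covered by fast, dependency-free tests.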

Continue Reading →

Repairing and backfilling on Lakeflow Jobs

If you are running data processing jobs on Lakeflow Jobs, you have certainly noticed these two options that may look the same but in fact serve two different purposes. If not, even better, because I'm now sure this blog post will be useful for you!

Continue Reading →

Variables in Databricks Asset Bundles

Variables are an essential part of any deployment process. You don't want to write a dedicated YAML or Python script for every environment, do you? Databricks Asset Bundles (DAB) is no exception, as its variable handling is designed to significantly simplify your workflow.
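As a quick illustration, here is a minimal `databricks.yml` fragment showing a variable with a default, overridden per target and referenced with `${var.…}` interpolation; the resource and catalog names are made up for the example:

```yaml
# databricks.yml (fragment) - names are illustrative
variables:
  catalog:
    description: Unity Catalog to deploy into
    default: dev_catalog

targets:
  prod:
    variables:
      catalog: prod_catalog

resources:
  jobs:
    my_job:
      name: my-job-${var.catalog}
```

Deploying with `--target prod` picks up `prod_catalog`, while the default target falls back to `dev_catalog`, so a single bundle definition serves every environment.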

Continue Reading →

Poe The Poet as handy extension for Databricks Asset Bundles

Make and Makefiles have been around for a while to facilitate task definitions, even in Python projects. But Python has an alternative that we are going to discover in this blog post.
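For a taste of what this looks like, Poe the Poet tasks live directly in `pyproject.toml` under `[tool.poe.tasks]`; the task names and commands below are illustrative:

```toml
# pyproject.toml (fragment) - task names and commands are illustrative
[tool.poe.tasks]
test = "pytest tests/"
validate = "databricks bundle validate"
deploy = "databricks bundle deploy --target dev"
```

With the package installed, `poe deploy` runs the corresponding command, so the task catalog ships with the project instead of a separate Makefile.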

Continue Reading →

RELY clause on keys for Databricks Unity Catalog tables

Since the early days of data lakes, datasets persisted in object stores have not had primary and foreign key constraints enforced. Databricks is no exception; however, the platform supports unenforced PRIMARY KEY and FOREIGN KEY constraints, which the query optimizer uses to improve performance.

Continue Reading →

Databricks and INSERT...REPLACE

Even though you mostly find ANSI-supported SQL features on Databricks, there are some useful Databricks-specific functions. One of them is the INSERT...REPLACE statement that you can use to overwrite datasets matching given conditions.

Continue Reading →

On multiple Lakeflow Jobs triggers

You need to write a Lakeflow job that is going to start upon a file upload. Sounds easy, doesn't it? But what if the same job also had to support a CRON trigger? Unfortunately, you cannot set multiple triggers on a job, so you will have to engineer the workflow differently.

Continue Reading →

Excel processing on Databricks

Databricks has recently extended its natively supported data formats with Excel!

Continue Reading →

Input parameters for PySpark jobs on Databricks

Software applications, including the data engineering ones you're working on, may require flexible input parameters. These parameters are important because they often identify the tables or data stores the job interacts with, and also show what the expected outputs are. Despite their utility, they can also cause confusion within the code, especially when not managed properly. Let's see how to address them for PySpark jobs on Databricks.
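One common way to keep such parameters explicit is standard-library `argparse`; the sketch below shows a hypothetical job entry point (the parameter names and table identifiers are illustrative, not from the article):

```python
import argparse


def parse_job_args(argv: list[str]) -> argparse.Namespace:
    """Parse the input parameters of a hypothetical PySpark job."""
    parser = argparse.ArgumentParser(description="Illustrative job parameters")
    parser.add_argument("--input-table", required=True,
                        help="Fully qualified source table")
    parser.add_argument("--output-table", required=True,
                        help="Fully qualified target table")
    parser.add_argument("--execution-date", default=None,
                        help="Optional partition to process")
    return parser.parse_args(argv)


args = parse_job_args(
    ["--input-table", "raw.orders", "--output-table", "silver.orders"]
)
print(args.input_table)   # raw.orders
print(args.execution_date)  # None
```

Declaring required versus optional parameters in one place also gives you `--help` output and fail-fast validation before any Spark session is even created.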

Continue Reading →

Recursive CTE on Databricks

I discovered recursive CTEs during my in-depth SQL exploration back in 2018. However, I had never had an opportunity to implement them in production, until recently, when I was migrating workflows from SQL Server to Databricks and one of them was using recursive CTEs to build a hierarchy table. If this is the first time you've heard about recursive CTEs, let me share my findings with you!
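The hierarchy-building idea can be sketched with SQLite's `WITH RECURSIVE`, which follows the same anchor-plus-recursive-member shape as other SQL engines (the employees table is a made-up example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'CEO', NULL),
        (2, 'VP', 1),
        (3, 'Engineer', 2);
""")

# The anchor member selects the roots; the recursive member walks down
# one management level per iteration, tracking the depth.
rows = conn.execute("""
    WITH RECURSIVE hierarchy(id, name, level) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, h.level + 1
        FROM employees e JOIN hierarchy h ON e.manager_id = h.id
    )
    SELECT name, level FROM hierarchy ORDER BY level
""").fetchall()
print(rows)  # [('CEO', 0), ('VP', 1), ('Engineer', 2)]
```

The same query shape, adapted to the target dialect, turns a parent-child table into a level-annotated hierarchy without procedural loops.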

Continue Reading →

For each task in Databricks Jobs (Lakeflow Jobs)

Databricks Jobs is still one of the best ways to run data processing code on Databricks. It supports a wide range of processing modes, from native Python and Scala jobs to framework-based dbt queries. It doesn't require installing anything yourself, as it's a fully serverless offering. Finally, it's also flexible enough to cover most common data engineering use cases. One of these great flexibility features is the support for different input arguments via the For Each task.
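As a rough sketch of the shape of such a job, a For Each task wraps a nested task and feeds it one input per iteration via the `{{input}}` reference; the keys, notebook path, and inputs below are illustrative:

```yaml
# job definition fragment - keys, paths and values are illustrative
tasks:
  - task_key: process_dates
    for_each_task:
      inputs: '["2024-01-01", "2024-01-02"]'
      concurrency: 2
      task:
        task_key: process_one_date
        notebook_task:
          notebook_path: ./process.py
          base_parameters:
            execution_date: "{{input}}"
```

Each element of `inputs` spawns one run of the nested task, with `concurrency` capping how many iterations execute in parallel.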

Continue Reading →

Decimals and doubles on Databricks

Dealing with numbers can be easy and challenging at the same time. When you operate on integers, you can encounter integer overflow. When you deal with floating-point types, which are the topic of this blog post, you can encounter rounding issues.
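The classic rounding surprise is easy to reproduce in plain Python: binary doubles cannot represent 0.1 exactly, while `Decimal` keeps an exact base-10 representation at a performance cost.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum drifts away from 0.3.
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004

# Decimal stores exact base-10 digits, so the comparison holds.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```

The same trade-off shows up in SQL engines: DOUBLE is fast but approximate, DECIMAL is exact within its declared precision and scale.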

Continue Reading →

Running scripts as hooks with Databricks Asset Bundles

Databricks Asset Bundles (DAB) simplify managing Databricks jobs and resources a lot. And they are also flexible because besides the YAML-based declarative way you can add some dynamic behavior with scripts.

Continue Reading →

NULL in SQL, other traps

Last time I wrote about a special, but logical, behavior of NULLs in joins. Today it's time to see other queries where NULLs behave differently than columns with values.
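Two of the classic traps can be reproduced with SQLite from the standard library: `NOT IN` against a list containing a NULL never matches, and `COUNT(column)` silently skips NULLs (the tiny table is a made-up example):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# 5 <> NULL evaluates to UNKNOWN, and NOT IN requires every
# comparison to be TRUE, so the predicate never becomes TRUE.
not_in = db.execute("SELECT 1 WHERE 5 NOT IN (10, NULL)").fetchall()
print(not_in)  # []

# COUNT(c) ignores NULL values, unlike COUNT(*).
db.executescript("""
    CREATE TABLE t (c INTEGER);
    INSERT INTO t VALUES (1), (NULL);
""")
counts = db.execute("SELECT COUNT(*), COUNT(c) FROM t").fetchone()
print(counts)  # (2, 1)
```

Both behaviors follow from three-valued logic, but they routinely surprise people reviewing query results.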

Continue Reading →

NULL is not a value - on joining nulls

If you know it, lucky you. If not, I bet you'll spend some time figuring out why two apparently identical rows don't match in your full outer join statement.
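The underlying behavior is easy to demonstrate with SQLite from the standard library: `NULL = NULL` evaluates to UNKNOWN, so rows with NULL join keys never match each other (the two one-column tables are a made-up example):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE left_t (k INTEGER);
    CREATE TABLE right_t (k INTEGER);
    INSERT INTO left_t VALUES (1), (NULL);
    INSERT INTO right_t VALUES (1), (NULL);
""")

# NULL = NULL is UNKNOWN, not TRUE, so the two NULL rows never pair up.
matched = db.execute(
    "SELECT l.k FROM left_t l JOIN right_t r ON l.k = r.k"
).fetchall()
print(matched)  # [(1,)] - only the non-NULL key joins
```

In a full outer join the unmatched NULL-keyed rows would each come back padded with NULLs from the other side, which is exactly the "identical rows that don't match" surprise.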

Continue Reading →

Get it once: a few words on data deduplication patterns in data engineering

This blog post complements the data deduplication coverage from my recent Data Engineering Design Patterns book by approaching the issue from a different angle.
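One common pattern in this space, deduplicating on a business key while keeping the most recent record, can be sketched in plain Python; the field names (`order_id`, `updated_at`) are illustrative:

```python
def deduplicate_keep_latest(records, key_field="order_id",
                            version_field="updated_at"):
    """For each business key, keep only the record with the highest version."""
    latest = {}
    for record in records:
        key = record[key_field]
        if key not in latest or record[version_field] > latest[key][version_field]:
            latest[key] = record
    return list(latest.values())


events = [
    {"order_id": 1, "updated_at": 1, "status": "created"},
    {"order_id": 1, "updated_at": 2, "status": "shipped"},
    {"order_id": 2, "updated_at": 1, "status": "created"},
]
deduplicated = deduplicate_keep_latest(events)
print(deduplicated)
# keeps the 'shipped' record for order 1 and the single record for order 2
```

In SQL the same idea is usually expressed as a window function (`ROW_NUMBER() OVER (PARTITION BY key ORDER BY version DESC)`) followed by a filter on the first row.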

Continue Reading →

Alerts, guards, and data engineering

While I was writing about agnostic data quality alerts with ydata-profiling a few weeks ago, I had an idea for another blog post which generally can be summarized as "what do alerts do in data engineering projects". Since the answer is "it depends", let me share my thoughts on that.

Continue Reading →

Agnostic data alerts with ydata-profiling

Defining data quality rules and alerts is not an easy task. Thankfully, there are various ways to help you automate the work. One of them is data profiling, which we're going to focus on in this blog post!

Continue Reading →

Semantic versioning with a Databricks volume-based package

One of the recommended ways of sharing a library on Databricks is to use Unity Catalog and store the packages in volumes. That's the theory, but the question is: how do you connect the dots between the release preparation and the release process? I'll try to answer this in the blog post.

Continue Reading →