Software applications, including the data engineering ones you're working on, may require flexible input parameters. These parameters matter because they often identify the tables or data stores the job interacts with and define the expected outputs. Despite their utility, they can also cause confusion within the code, especially when not managed properly. Let's see how to handle them for PySpark jobs on Databricks.
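To give you a taste before diving in, here is a minimal sketch of one common approach: parsing named arguments in a Python task with argparse. The --input_table and --output_table parameters are illustrative placeholders, not necessarily the ones discussed in the post.

```python
import argparse

from pyspark.sql import SparkSession


def parse_arguments() -> argparse.Namespace:
    # The parameter names below are only illustrative; a Databricks Jobs
    # Python task would pass them through the task's parameters list.
    parser = argparse.ArgumentParser(description="Example parameterized PySpark job")
    parser.add_argument("--input_table", required=True, help="Fully qualified source table")
    parser.add_argument("--output_table", required=True, help="Fully qualified target table")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_arguments()
    spark = SparkSession.builder.getOrCreate()
    # Read the configured input and write it to the configured output.
    spark.table(args.input_table).write.mode("overwrite").saveAsTable(args.output_table)
```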
I discovered recursive CTEs during my in-depth SQL exploration back in 2018. However, I never had an opportunity to implement them in production. Until recently, when I was migrating workflows from SQL Server to Databricks and one of them used recursive CTEs to build a hierarchy table. If this is the first time you've heard of recursive CTEs, let me share my findings with you!
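To show the kind of problem they solve, here is a minimal PySpark sketch of the loop-based alternative you typically end up writing when the target engine doesn't support recursive CTEs. The employees/manager_id schema is a made-up example, not the hierarchy table from the migrated workflow.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up example data: (id, manager_id); None marks the hierarchy root.
employees = spark.createDataFrame(
    [(1, None), (2, 1), (3, 1), (4, 2)], ["id", "manager_id"]
)

# Seed level: the roots, equivalent to the anchor member of a recursive CTE.
hierarchy = employees.where(F.col("manager_id").isNull()).select("id", F.lit(0).alias("level"))
current_level = hierarchy

# Iterate until no new rows appear, mimicking the recursive member.
# Note: this assumes the hierarchy is acyclic; a cycle would loop forever.
while True:
    next_level = (
        employees.alias("e")
        .join(current_level.alias("h"), F.col("e.manager_id") == F.col("h.id"))
        .select(F.col("e.id"), (F.col("h.level") + 1).alias("level"))
    )
    if next_level.limit(1).count() == 0:
        break
    hierarchy = hierarchy.unionByName(next_level)
    current_level = next_level

hierarchy.show()
```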
Databricks Jobs is still one of the best ways to run data processing code on Databricks. It supports a wide range of processing modes, from native Python and Scala jobs to framework-based dbt queries. It doesn't require installing anything yourself, as it's a fully serverless offering. Finally, it's also flexible enough to cover most common data engineering use cases. One of these flexibility features is support for different input arguments via the For Each task.
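To picture how a For Each iteration consumes its input, here is a hypothetical sketch of the nested notebook task; the table_name parameter is my own example of a value the For Each task could pass to each iteration.

```python
# Notebook task executed by each For Each iteration (Databricks notebook
# context, where `dbutils` and `spark` are provided by the runtime).
# The "table_name" widget is a made-up example: the For Each task would pass
# one element of its input collection into this parameter.
dbutils.widgets.text("table_name", "")
table_name = dbutils.widgets.get("table_name")

# Each iteration processes its own table.
row_count = spark.table(table_name).count()
print(f"{table_name} has {row_count} rows")
```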
Dealing with numbers can be both easy and challenging at the same time. When you operate on integers, you can encounter integer overflow. When you deal with floating-point types, which are the topic of this blog post, you can encounter rounding issues.
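To make the rounding issue concrete, here is the classic plain-Python illustration (not taken from the post's examples):

```python
from decimal import Decimal

# Binary floating-point cannot represent 0.1 or 0.2 exactly,
# so the sum is slightly off and the equality check fails.
print(0.1 + 0.2)                 # 0.30000000000000004
print(0.1 + 0.2 == 0.3)          # False

# Decimal keeps an exact decimal representation, at the cost of performance.
print(Decimal("0.1") + Decimal("0.2"))                    # 0.3
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```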
Databricks Asset Bundles (DAB) greatly simplify managing Databricks jobs and resources. They are also flexible: besides the YAML-based declarative approach, you can add dynamic behavior with scripts.
One of the recommended ways of sharing a library on Databricks is to store the packages in Unity Catalog volumes. That's the theory, but the question is: how do you connect the dots between release preparation and the release process? I'll try to answer this in the blog post.
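As a hint of one possible glue step, here is a hedged sketch that uses the Databricks SDK for Python to push a freshly built wheel into a Unity Catalog volume. The catalog, schema, volume, and wheel names are placeholders, and this is not necessarily the workflow described in the post.

```python
from databricks.sdk import WorkspaceClient

# Placeholder names: adapt the catalog/schema/volume and the wheel path.
WHEEL_PATH = "dist/my_library-1.0.0-py3-none-any.whl"
VOLUME_TARGET = "/Volumes/main/libraries/wheels/my_library-1.0.0-py3-none-any.whl"

# Authentication comes from the environment (e.g. DATABRICKS_HOST / DATABRICKS_TOKEN).
workspace = WorkspaceClient()

# Upload the wheel built during release preparation to the shared volume.
with open(WHEEL_PATH, "rb") as wheel_file:
    workspace.files.upload(VOLUME_TARGET, wheel_file, overwrite=True)
```

The uploaded wheel can then be referenced by its /Volumes/... path in a job or cluster library configuration.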
In the last blog post of the data quality on Databricks series, we're going to discover a Databricks Labs product, the DQX library.
Previously, we learned how to control data quality with Delta Live Tables. Now it's time to see an open-source library in action: Spark Expectations.
Data quality is one of the key factors in a successful data project. Without good quality, even the most advanced engineering or analytics work will not be trusted, and therefore not used. Unfortunately, data quality controls are very often treated as a work item to implement at the end, which sometimes translates to never.
If you have already been working with Apache Airflow, you have certainly met XComs at some point. You know, those variables that you can "exchange" between tasks within the same DAG. If, after switching to Databricks Workflows for data orchestration, you're wondering how to do the same, there is good news: Databricks supports this exchange capability natively with Task values.
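For a quick taste of the API before reading further, here is a minimal sketch; the task and key names are made up:

```python
# In the upstream task's notebook (Databricks notebook context, where
# `dbutils` is provided by the runtime): publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="processed_rows", value=42)

# In a downstream task of the same job run: read it back.
# debugValue is returned when the notebook runs outside of a job.
processed_rows = dbutils.jobs.taskValues.get(
    taskKey="upstream_task",
    key="processed_rows",
    default=0,
    debugValue=0,
)
print(processed_rows)
```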
For over two years now, you have been able to leverage file triggers in Databricks Jobs to start processing as soon as a new file gets written to your storage. The feature looks amazing but hides some implementation challenges that we're going to cover in this blog post.