A 3-day bug hunt on a 3-person team costs up to β¬7,200 in lost engineering time. This workshop teaches you to prevent that β unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.
Some time ago I found a site listing Big Data patterns (link in "Read also" section). However, that site describes them from a very general point of view and it's not always obvious to figure out the what, why and how. That's why I decided to start a new series of posts where I will try to describe these patterns and give some more technical context.
Sometimes metadata is disregarded but very often it helps to retrieve the information easier and faster. One of such use cases are the headers of Apache Parquet where the stats about the column's content are stored. The reader can, without parsing all the lines, know whether what is he looking for is in the file or not. The metadata is also a part of one of Big Data patterns called automated processing metadata insertion.
The series about the implementation of Big Data patterns continues. This time I will focus on a streaming pattern called fan-in ingress.