Big Data Warsaw 2023 retrospective - for data engineers

After a 2-year break, I had a chance to speak again, this time at Big Data Warsaw 2023. Even though I couldn't be in Warsaw that day, I enjoyed the experience and also watched other sessions available through the conference platform.

I learned the hard way that it's impossible to watch all the talks of any conference. That's why I only select the ones that match my current topics of interest. Below is the list for Big Data Warsaw 2023, covering stream processing, data lineage, and some DataOps.

Stream processing

Apache Flink: Introduction and new Features by Robert Metzger

In the first talk Robert Metzger gave a quick introduction to Apache Flink and its 3 primitives:

All these primitives are available from different APIs:

Besides the introduction, Robert also shared more low-level details about:

Where is my bottleneck? Performance troubleshooting in Flink by Piotr Nowojski

Another Apache Flink-related talk was about job troubleshooting. Piotr Nowojski shared how to fix different job issues. But before giving the list, Piotr highlighted a very important point. Before any action, we should start by identifying the troubleshooting category:

After that, Piotr started his investigation. The first step is to identify the bottleneck, and the easiest way to do so is to use the UI and the Apache Flink color scheme applied to overloaded or poorly performing operators. A summary of that part is in the slide below:

Later on, the debugging flow looks similar to what you could have done before, including checking the JVM status (CPU usage, GC pauses) or analyzing the code. Moreover, there is a Flame Graphs integration that gives an even more detailed view of what the code is doing. And if that's not enough, you can always attach a profiler to the subtask(s) causing problems. In the case of multiple threads run by the task manager, you should find the thread named after the problematic subtask(s) and analyze it.

When it comes to checkpointing, as you already know, Apache Flink uses checkpoint barriers injected into the execution graph. Despite their innocent nature, checkpoints can also become a bottleneck. How to identify them? Again, start with the UI. If there is nothing wrong in the general view, dig into each subtask and check its checkpoint statistics.
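If the barriers turn out to be the culprit, a common remediation (my own note, not something quoted from the talk) is to enable unaligned checkpoints and to give the checkpoints more headroom. A minimal PyFlink sketch:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Take a checkpoint every 60 seconds
env.enable_checkpointing(60 * 1000)

checkpoint_config = env.get_checkpoint_config()
# Let the barriers overtake in-flight records so backpressure doesn't block checkpoints
checkpoint_config.enable_unaligned_checkpoints()
# Give slow checkpoints more time before they're declared failed
checkpoint_config.set_checkpoint_timeout(10 * 60 * 1000)
# Guarantee some processing time between two consecutive checkpoints
checkpoint_config.set_min_pause_between_checkpoints(30 * 1000)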

The final bottleneck issue was related to job recovery. No luck here, as there are no metrics for that part in the UI. The starting point should then be the debug logs. The remediation actions will probably be related to the state itself. For example, adding more reading threads for RocksDB or playing with the state size could help.

Monitor Fleet in Real-Time Using Stream SQL by Gang Tao

Gang Tao gave another talk about streaming. He showed how to use SQL in the real-time context, recalling some important fundamentals. The first one is the difference between classical batch SQL and real-time SQL. Nothing better than a screenshot to summarize it:

There is also a difference in latency. Batch SQL measures latency as the query execution time, while streaming SQL measures it as the gap between the event time and the processing time:
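The slide isn't reproduced in these notes, but to illustrate the continuous nature of a streaming query, here is a sketch of a fleet-style aggregation written in Flink SQL through PyFlink (the talk may have used a different Stream SQL engine, so treat the syntax as an illustration only):

from pyflink.table import EnvironmentSettings, TableEnvironment

# Streaming mode: the query never finishes; it emits a new result per window as events arrive
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE vehicle_positions (
        vehicle_id STRING,
        speed DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen')
""")

# Average speed per vehicle over 1-minute tumbling windows; the latency of each emitted row
# is the gap between its event time and the processing time at which it's produced
t_env.execute_sql("""
    SELECT vehicle_id,
           TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS window_end,
           AVG(speed) AS avg_speed
    FROM vehicle_positions
    GROUP BY vehicle_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
""").print()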

Near-Real-Time streaming applications monitoring and automated maintenance at billions of records/day scale: the architecture allowing us to sleep at night by Giuseppe Lorusso

Giuseppe Lorusso presented a way to monitor streaming applications with open source solutions. In a nutshell, the architecture looks like this:

More specifically:

Grafana alerts on 2 different situations:

Data lineage

Column-level lineage is coming to the rescue by Paweł Leszczyński and Maciej Obuchowski

The first data lineage talk on my list was given by Paweł Leszczyński and Maciej Obuchowski, in a pretty original format: a Jupyter notebook!

In the introduction, Paweł mentioned the data mess that is a side effect of implementing a data mesh without data lineage. From this standpoint, data lineage is important not only to solve problems when they arise but also to avoid them!

In a nutshell, OpenLineage is a dependency that injects some code to extract metadata from the data processing layer, such as a SQL query or an Apache Spark job. The extraction result later goes to one of the supported data lineage backends (Amundsen, Marquez, ...).

After the introduction, Paweł and Maciej moved to the demo. First, Paweł showed how to use OpenLineage for an Apache Spark job with no extra code! It was only a matter of adding a new dependency and configuring the lineage transport (HTTP, console, and Apache Kafka being among the supported ones):
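The demo code isn't reproduced here, but to give an idea, a PySpark session could be configured roughly as follows; the exact artifact version and configuration keys depend on the OpenLineage release, so check them against the documentation:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("openlineage-demo")
    # Pick a current openlineage-spark version; 0.21.1 is only an example
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.21.1")
    # The listener is the injected code that observes the job and extracts the metadata
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Transport: http (e.g. towards a Marquez backend), console, or kafka
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "http://localhost:5000")
    .config("spark.openlineage.namespace", "big-data-warsaw-demo")
    .getOrCreate())

# The job code itself stays unchanged; the lineage events are emitted automatically
spark.range(10).write.mode("overwrite").parquet("/tmp/openlineage_demo_output")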

Later, Maciej proved that there is not a lot of work to do to integrate SQL queries with OpenLineage either. As with Spark, it was only a matter of bringing in an extra library and calling it on top of the executed query:
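Again, this is an illustration rather than the exact demo: the OpenLineage SQL parser ships with Python bindings in the openlineage-sql package, and extracting the inputs and outputs of a statement looks roughly like this:

from openlineage_sql import parse

# The parser returns the tables the statement reads from and writes to;
# recent releases also expose column-level lineage details.
metadata = parse([
    "INSERT INTO daily_orders SELECT order_id, amount FROM orders WHERE day = '2023-03-29'"
])
print(metadata.in_tables)   # tables read by the query (orders)
print(metadata.out_tables)  # tables written by the query (daily_orders)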

Both demos showed column-level lineage, which is one of the many features of OpenLineage!

An open standard for data lineage by Ross Turk

Another OpenLineage talk was given by Ross Turk. Unlike the one about column-level lineage, this one covered the data lineage topic more broadly.

First, Ross explained 3 different approaches that you can use in a data lineage implementation:

  1. Pipeline observation. Here the tool integrates with the data orchestrator and, as soon as the pipeline runs, it updates the internal lineage repository. It provides real-time semantics but is often limited to the data orchestration layer and won't cover datasets created elsewhere.
  2. SQL parsing. It's not real-time but captures lineage from other dataset creation places, such as ad hoc queries.
  3. Source code analysis. The approach analyzes the code, looks for queries, and prepares the lineage before the job execution.

There is no one-size-fits-all solution able to capture all scenarios. Ross calls it a patchwork: combining the presented approaches gives a more complete overview of the lineage.

In the next part, Ross shared some important information about OpenLineage itself:

DataOps

Applying consumer-driven contract testing to safely compose data products in a data mesh by Arif Wider

Arif Wider gave an interesting talk about data products and data mesh. He started with a data mesh refresher and an interesting data product definition:


Data product is the smallest independently deployable unit that you use in a data mesh architecture to basically compose anything else of. It's the smallest building block to create solutions.

The composition is made possible with input and output ports, as the following slide shows:

Direct data exposure also brings some data quality risks, and contracts are there to mitigate them. Generally, a contract is about assumptions and expectations. Data mesh promotes SLOs and SLAs as contracts because they make things more explicit.

Additionally, there is a complementary way to reduce the risk: consumer-driven contracts. The idea comes from software engineering and was initially shared in The Practical Test Pyramid blog post. How does it apply to data products? Arif answered that in these 2 slides:

The idea is to associate tests with the output ports, but the tricky part is the responsibility shift. Usually the producers write the tests, such as unit or integration tests at the code level. Here, creating the tests is the responsibility of the consumers (aka other data products)! They put their expectations and assumptions about the consumed data product somewhere and expect the output ports to meet them. On the second slide you can see an example with the test classes stored in folders associated with each output port.
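To make it more concrete, here is a hypothetical consumer-side contract test; the dataset path, columns, and checks are invented for the illustration and aren't taken from Arif's slides:

import pandas as pd

# Hypothetical location of the producer's output port consumed by this data product
OUTPUT_PORT_PATH = "s3://orders-product/output/daily_orders.parquet"

def test_daily_orders_contract():
    df = pd.read_parquet(OUTPUT_PORT_PATH)

    # Expectation 1: the columns this consumer relies on are present
    assert {"order_id", "customer_id", "order_total"}.issubset(df.columns)

    # Expectation 2: business assumptions about the values
    assert df["order_total"].ge(0).all()
    assert df["order_id"].is_unique

The producer would then run these consumer-owned tests against the corresponding output port before exposing a new version of the data.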

DataOps in action with Nessie, Iceberg and Great Expectations by Antonio Murgia

In his presentation, Antonio Murgia demonstrated how to leverage Git-for-data tools, in this case Nessie, to avoid datastrophes:

To avoid that issue, Antonio introduced the Write-Audit-Publish pattern. The idea is to avoid publishing data that doesn't meet the requirements. It turns out to be quite challenging to implement on top of object stores because of the following issues:


However, thanks to Git-for-data-lake tools like Nessie, the pattern works better. Antonio shared an Apache Airflow DAG implementing it:

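Antonio's DAG isn't reproduced in these notes, but a minimal, hypothetical sketch of the Write-Audit-Publish flow in Airflow could look like the following; the *_etl_branch callables are placeholders for the real Nessie, Iceberg, and Great Expectations calls:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def write_to_etl_branch(**context):
    ...  # write the new data into an isolated 'etl' Nessie branch

def audit_etl_branch(**context):
    ...  # run the data quality checks (e.g. Great Expectations) against that branch

def publish_etl_branch(**context):
    ...  # merge the 'etl' branch into 'main' so consumers only see audited data

with DAG("write_audit_publish", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    write = PythonOperator(task_id="write", python_callable=write_to_etl_branch)
    audit = PythonOperator(task_id="audit", python_callable=audit_etl_branch)
    publish = PythonOperator(task_id="publish", python_callable=publish_etl_branch)

    write >> audit >> publish

If the audit task fails, the publish task never runs and the bad data stays isolated on the branch.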

In the end Antonio also mentioned an Apache Iceberg implementation of the pattern prepared by Samuel Redai.

Terraforming hardened Azure Databricks environments by Filip Rzyszkiewicz

In the last talk of my DataOps section, Filip Rzyszkiewicz shared some lessons learned for securing a Databricks setup on Azure:

Even though I couldn't watch all the sessions in real time, I'm quite happy to have had this online access and to have taken some notes after the fact. Hopefully you enjoyed them too, and who knows, maybe see you some day at a conference or meetup?


