Data+AI Summit Europe 2020 articles articles

4-day workshop Β· In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that β€” unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants Β· production-ready templates
See the full curriculum β†’ €7,000 flat fee Β· cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

Data+AI follow-up: StateStoreRDD - building block for stateful processing

The main Apache Spark component enabling stateful processing is StateStoreRDD. It creates a partition-based state store instance but also triggers state-based computation.

Continue Reading β†’

Data+AI Summit follow-up: global limit and state management

It's the second follow-up Data+AI Summit post but the first one focusing on the stateful operations and their interaction with the state store.

Continue Reading β†’

Data+AI Summit follow-up: drop duplicates and state management

Another stateful operation requiring the state store is drop duplicates. You can use it to deduplicate your streaming data before pushing it to the sink.

Continue Reading β†’

Data+AI Summit follow-up: aggregations and state management

In previous blog posts you discovered how the state store interacts with dropDuplicates and limit operators. This time you will see how it's used in aggregations.

Continue Reading β†’

Data+AI Summit follow-up: joins and state management

Streaming joins are an interesting feature that heavily uses state store. Even though I already blogged about it in the past (2018), some changes were made and also - I hope so - my explanation capacity improved.

Continue Reading β†’

Data+AI Summit follow-up: arbitrary stateful processing and state management

After previous posts about native stateful operations, it's time to focus on the one where you can define your custom stateful logic.

Continue Reading β†’

Data+AI follow-up: Introduction to MapDB

Since there are already 2 Open Source implementations for RocksDB state store, I decided to use another backend to illustrate how to customize the state store in Structured Streaming. Initially, I wanted to try with Badger which is the store behind DGraph database but didn't find any Java-facing interface and dealing with the Java Native Interface or any other wrapper, was not an option. Fortunately, I ended up by finding MapDB, a Kotlin-based - hence a Java-facing interface - embedded database.

Continue Reading β†’

Data+AI Summit: Custom state store - API

After previous introductory posts, it's time to deep delve into the state store API and implement our own custom state store.

Continue Reading β†’

Data+AI Summit: custom state store integration feedback

After the introductory part, it's time to share what I learned from the custom state store implementation.

Continue Reading β†’

Data+AI Summit follow-up post: Why RocksDB rocks?

One reason why you can think about using a custom state store is the performance issues, or rather unpredictable execution time due to the shared memory between the default state store implementation and Apache Spark task execution. To overcome that, you can try to switch the state store implementation to an off-heap-based one, like RocksDB.

Continue Reading β†’