Data+AI Summit Europe 2020 articles

Data+AI Summit follow-up post: Why RocksDB rocks?

One reason to consider a custom state store is performance, or rather unpredictable execution time, because the default state store implementation shares memory with Apache Spark task execution. To overcome that, you can switch the state store implementation to an off-heap one, such as one based on RocksDB.

Data+AI Summit: custom state store integration feedback

After the introductory part, it's time to share what I learned from the custom state store implementation.

Data+AI Summit: Custom state store - API

After the previous introductory posts, it's time to delve deep into the state store API and implement our own custom state store.

Data+AI follow-up: Introduction to MapDB

Since there are already two open source implementations of a RocksDB state store, I decided to use another backend to illustrate how to customize the state store in Structured Streaming. Initially, I wanted to try Badger, the store behind the DGraph database, but I didn't find any Java-facing interface, and dealing with the Java Native Interface or any other wrapper was not an option. Fortunately, I ended up finding MapDB, an embedded database written in Kotlin and therefore usable through a Java-facing interface.

Data+AI Summit follow-up: arbitrary stateful processing and state management

After previous posts about native stateful operations, it's time to focus on the one where you can define your custom stateful logic.

Data+AI Summit follow-up: joins and state management

Streaming joins are an interesting feature that heavily uses the state store. Even though I already blogged about them in the past (2018), some changes have been made since then and, I hope, my ability to explain things has improved too.

Data+AI Summit follow-up: aggregations and state management

In previous blog posts you discovered how the state store interacts with dropDuplicates and limit operators. This time you will see how it's used in aggregations.

Data+AI Summit follow-up: drop duplicates and state management

Another stateful operation requiring the state store is drop duplicates. You can use it to deduplicate your streaming data before pushing it to the sink.

Data+AI Summit follow-up: global limit and state management

This is the second Data+AI Summit follow-up post, but the first one focusing on stateful operations and their interaction with the state store.

Data+AI follow-up: StateStoreRDD - building block for stateful processing

The main Apache Spark component enabling stateful processing is StateStoreRDD. It not only creates a partition-based state store instance but also triggers the state-based computation.
