One reason why you can think about using a custom state store is the performance issues, or rather unpredictable execution time due to the shared memory between the default state store implementation and Apache Spark task execution. To overcome that, you can try to switch the state store implementation to an off-heap-based one, like RocksDB.
After the introductory part, it's time to share what I learned from the custom state store implementation.
After previous introductory posts, it's time to deep delve into the state store API and implement our own custom state store.
Since there are already 2 Open Source implementations for RocksDB state store, I decided to use another backend to illustrate how to customize the state store in Structured Streaming. Initially, I wanted to try with Badger which is the store behind DGraph database but didn't find any Java-facing interface and dealing with the Java Native Interface or any other wrapper, was not an option. Fortunately, I ended up by finding MapDB, a Kotlin-based - hence a Java-facing interface - embedded database.
After previous posts about native stateful operations, it's time to focus on the one where you can define your custom stateful logic.
Streaming joins are an interesting feature that heavily uses state store. Even though I already blogged about it in the past (2018), some changes were made and also - I hope so - my explanation capacity improved.
In previous blog posts you discovered how the state store interacts with dropDuplicates and limit operators. This time you will see how it's used in aggregations.
Another stateful operation requiring the state store is drop duplicates. You can use it to deduplicate your streaming data before pushing it to the sink.
It's the second follow-up Data+AI Summit post but the first one focusing on the stateful operations and their interaction with the state store.
The main Apache Spark component enabling stateful processing is StateStoreRDD. It creates a partition-based state store instance but also triggers state-based computation.