distributed stateful processing articles

Stateful transformations with mapGroupsWithState

Stateful stream processing in Apache Spark has evolved considerably since the first versions of the framework. It began with updateStateByKey, which was later judged inefficient and replaced by mapWithState. With the arrival of Structured Streaming, mapWithState was in turn replaced by mapGroupsWithState.

Continue Reading →

Dealing with state lifecycle in Apache Beam

As we saw in the previous post, Apache Beam makes it possible to work with state. However, as we learned there, the state only lets us keep something in memory for the duration of the window. After that, the state is removed. Thanks to another Beam feature called timers, though, we can handle the expiring state just before it is removed from the state store.

Continue Reading →

Stateful processing in Apache Beam

Real-time processing is, most of the time, related in some way to stateful processing: we may need to solve a sessionization problem, count the number of visitors per minute, and so on. Not surprisingly, Apache Beam comes with an API suited to implementing such solutions.

Continue Reading →

Stateful transformations with mapWithState

The updateStateByKey function, explained in the post about Stateful transformations in Spark Streaming, is not the only solution Spark Streaming provides for dealing with state. Another one, much more optimized, is mapWithState.

Continue Reading →

Stateful transformations in Spark Streaming

Spark Streaming is able to handle state-based operations, i.e. operations that hold a state which can be modified by subsequent batches of data.

Continue Reading →