Spark offers different methods to reduce data loss, including during stream processing. It provides the well-known checkpointing, but also a less obvious operation invoked when processing stops - graceful shutdown.
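As a minimal sketch of a graceful shutdown, the snippet below stops a StreamingContext only after the data already received has been processed. The socket source on localhost:9999 and the timeout value are arbitrary assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("graceful-shutdown-example")
      .setMaster("local[2]")
      // also stop gracefully when the JVM receives a shutdown signal
      .set("spark.streaming.stopGracefullyOnShutdown", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.socketTextStream("localhost", 9999).count().print()

    ssc.start()
    // wait at most 60 seconds, then stop; stopGracefully = true asks Spark
    // to finish processing the already received data before shutting down
    ssc.awaitTerminationOrTimeout(60000)
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```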
The updateStateByKey function, explained in the post about stateful transformations in Spark Streaming, is not the only solution provided by Spark Streaming to deal with state. Another one, much more optimized, is mapWithState.
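A minimal mapWithState sketch, assuming a socket source on localhost:9999 and a local checkpoint directory, both placeholders: the StateSpec function keeps a running count per word and only touches the keys present in the current batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("map-with-state-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // mapWithState needs checkpointing to persist the state between batches
    ssc.checkpoint("/tmp/map-with-state-checkpoint")

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // the mapping function receives the key, the new value (if any) and the
    // state accumulated so far; it returns the updated (word, count) pair
    val countState = StateSpec.function(
      (word: String, one: Option[Int], state: State[Int]) => {
        val newCount = state.getOption().getOrElse(0) + one.getOrElse(0)
        state.update(newCount)
        (word, newCount)
      })

    pairs.mapWithState(countState).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```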
One of the previous posts talked about checkpoint types in Spark Streaming. This one focuses on one of them - the metadata checkpoint.
The metadata checkpoint is useful for quickly restoring failing jobs. However, it won't work if the context creation and processing parts aren't declared correctly.
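The sketch below shows the declaration pattern that makes a restore from the metadata checkpoint possible: all DStream definitions live inside a creation function passed to StreamingContext.getOrCreate. The checkpoint directory and socket source are hypothetical values used only for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MetadataCheckpointExample {
  // hypothetical checkpoint location
  val checkpointDirectory = "/tmp/metadata-checkpoint"

  // the whole processing pipeline must be declared inside the creation
  // function so it can be rebuilt from the metadata checkpoint after a failure
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("metadata-checkpoint-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDirectory)

    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // restores the context from the checkpoint if it exists, otherwise creates it
    val ssc = StreamingContext.getOrCreate(checkpointDirectory, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```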
Compared to batch-oriented processing in Spark, Spark Streaming introduces new transformation types based on time periods.
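A minimal sketch of such a time-based (window) transformation, assuming a socket source on localhost:9999: words are counted over the last 30 seconds of data, recomputed every 10 seconds.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowTransformationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // window of 30 seconds, sliding every 10 seconds (both multiples of the batch interval)
    words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```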
Spark Streaming is able to handle state-based operations, i.e. operations whose state can be modified by subsequent batches of data.
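A minimal sketch of such a stateful operation with updateStateByKey, assuming a placeholder checkpoint directory and socket source: the update function merges each batch's values with the count accumulated from all previous batches.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("update-state-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    // stateful operations require a checkpoint directory
    ssc.checkpoint("/tmp/update-state-checkpoint")

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // merges the values of the current batch with the state from previous batches
    val updateCount: (Seq[Int], Option[Int]) => Option[Int] =
      (newValues, runningCount) => Some(newValues.sum + runningCount.getOrElse(0))

    words.map((_, 1)).updateStateByKey(updateCount).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```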
Checkpoints allow Spark to truncate dependencies on previously computed RDDs. In the case of stream processing, their role is extended. In addition, they're not the only method of protecting against failures.
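The sketch below illustrates both points: the DStream is checkpointed at a fixed interval to cut the RDD lineage, and the receiver write-ahead log is enabled as a complementary protection. The directory, source and interval values are assumptions for the example.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointIntervalExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("checkpoint-example")
      .setMaster("local[2]")
      // complementary protection: persist received data in a write-ahead log
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/stream-checkpoint")

    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // checkpointing the DStream every 25 seconds truncates the RDD lineage,
    // so recovery doesn't have to recompute the whole dependency chain
    counts.checkpoint(Seconds(25))
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```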
Even if Spark Streaming globally uses the same configuration as batch processing, some entries are specific to streaming.
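As a small sketch, a few of those streaming-specific entries set on a SparkConf; the chosen values are arbitrary examples, not recommendations.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingConfigurationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("streaming-configuration-example")
      .setMaster("local[2]")
      // dynamically adapts the ingestion rate to the processing speed
      .set("spark.streaming.backpressure.enabled", "true")
      // how often received data is grouped into blocks before becoming RDD partitions
      .set("spark.streaming.blockInterval", "200ms")
      // caps the number of records read per second by each receiver
      .set("spark.streaming.receiver.maxRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.socketTextStream("localhost", 9999).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```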
Standard data sources, such as files, queues or sockets, are natively implemented in the Spark Streaming context. But the framework also allows the creation of more flexible data consumers, called receivers.
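A minimal custom receiver sketch: the ConstantMessageReceiver class is purely hypothetical and simply emits the same message every second, but it shows the onStart/onStop/store contract of the Receiver API.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// hypothetical receiver emitting a constant message every second
class ConstantMessageReceiver(message: String)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    // onStart must not block, so the data is produced in a separate thread
    new Thread("constant-message-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store(message)
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  override def onStop(): Unit = {
    // nothing to clean up: the thread exits once isStopped() returns true
  }
}

object CustomReceiverExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("custom-receiver-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    ssc.receiverStream(new ConstantMessageReceiver("hello")).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```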
Spark Streaming is not static and allows DStreams to be converted to new types. This can be done, exactly as in batch-oriented processing, through transformations.
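A short sketch of such transformations, again assuming a socket source on localhost:9999: a DStream[String] is turned into per-batch word counts, and transform() exposes the underlying RDD of each batch directly.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamTransformationsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("transformations-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)

    // RDD-like transformations convert the DStream[String]
    // into a DStream[(String, Int)] of word counts per batch
    val counts = lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // transform() gives access to the RDD backing each micro-batch
    counts.transform(rdd => rdd.sortBy(_._2, ascending = false)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```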
In batch-oriented Spark, the RDD was the data abstraction. In Spark Streaming, RDDs are still present, but another data type is exposed to the programmer - the DStream.
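The sketch below illustrates that relationship with a queue-backed DStream, convenient for local experiments: each queued RDD becomes the micro-batch of one interval, and foreachRDD gives access to those RDDs one by one. The queued values are arbitrary sample data.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-example").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // each RDD in the queue is consumed as the micro-batch of one interval
    val rddQueue = mutable.Queue[RDD[Int]](sc.parallelize(1 to 5), sc.parallelize(6 to 10))
    val numbers = ssc.queueStream(rddQueue)

    // under the hood a DStream is a sequence of RDDs, one per batch interval
    numbers.foreachRDD { rdd =>
      println(s"batch sum = ${rdd.sum()}")
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}
```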
Spark Streaming is a powerful extension of Spark which helps to work with streams efficiently. In this article we'll present the basic concepts of this extension.