Introduction to Spark Streaming

Versions: Spark 2.0.0

Spark Streaming is a powerful extension of Spark which helps to work with streams efficiently. In this article we'll present basic concepts of this extension.

The first part presents shared concepts between Spark and Spark Streaming. The second part describes some specific features of Spark Streaming. The goal of this article is to provide a global image of concepts used in Spark Streaming and not to explain them exactly. The explanation will be the topic of other posts.

Common concepts

Basically the logic is the same. The application is executed on a driver and its tasks are further dispatched to executors. Another common point is data representation. In Spark Streaming it's based on a structure called DStream which is a sequence of RDDs. The same RDD is the main component holding data in Spark. Also, both frameworks are based on the same concept of lazy-evaluated transformations, really executed only on action call. Note at this occasion that in Spark Streaming actions are often called output operations.

We can also find similar transformations in both frameworks. Since DStream relies on RDD, all transformations shared with RDD use the same methods (= they're executed on underlied RDDs). For example, map and filter transformations look like:

override def compute(validTime: Time): Option[RDD[U]] = {
  parent.getOrCompute(validTime).map(_.map[U](mapFunc))
}

override def compute(validTime: Time): Option[RDD[T]] = {
  parent.getOrCompute(validTime).map(_.filter(filterFunc))
}

The difference concerns only transformations reserved to DStream, such as window or state-related ones.

Programmatically, Spark Streaming depends on Spark. Thanks to that we can build Spark Streaming applications with the context created for Spark. Both share also other concepts, as checkpointing or caching. The only difference for the caching is default level. For Spark it was MEMORY_ONLY while for Spark Streaming is MEMORY_ONLY_SER.

Differences

Spark Streaming is built on Spark. But both have different purposes. Spark is great to process static data in a batch mode, for example every hour. Spark Streaming has different goal. It's destined to work on real-time data streams. It means that Spark Streaming will read incoming data as soon as it enters to the system. It can prove to be useful for all kind of real-time analytic like tracking user navigation or clicked ads.

Because of different use cases, Spark Streaming contains supplementary features. The first one are windowed computations. Thanks to it defined computations can apply to a sliding period of time. It means that during a single window one or more DStreams are treated together. Another feature concerns state tracking. Thanks to it we can handle state changes of particular object at the receiving of each new DStream. With this feature we could, for example, make a counter for the number of visited pages of given user identified by his session ID.

Another concept proper to Spark Streaming is the idea of receiver. In Spark data was loaded once, either from file or from Java objects. Since Spark Streaming has to deal with real-time data which can be produced at every moment, it has to be able to consume it continuously. This consumption is the role of receiver which connects to external datasource (as Kafka, RabbitMQ) and listens to new data.

This post introduces the main topic of Spark Streaming. The first part shows that it shares some basic concepts with Spark, such as RDD, driver-executor pattern, caching and checkpointing. But there are some new features, such as windowed computations, state tracking or receiver.

If you liked it, you should read:

The comments are moderated. I publish them when I answer, so don't worry if you don't see yours immediately :)

📚 Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!