Spark Streaming configuration

Versions: Spark 2.0.0

Even if Spark Streaming uses globally the same configuration as batch, there are some of entries specific to streaming.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩

The first part of this article describes the configuration of backpressure feature introduced in Spark 1.5. The second section presents configuration for driver. The last part will be dedicated to the configuration of receivers.

Backpressure configuration

Before talking about backpressure configuration, let's explain what does it mean. Simply speaking, backpressure is a protection against sudden and great increase of data to process. It can guarantee that the application receives data only as fast as it can process it.

In Spark Streaming it consists on defining the number of records per second received by each receiver. They're defined internally by Spark. To determine that, it relies on stats of previously computed batches. More exactly, RateController is a listener listening for batch completion events. On every received event the listener updated backpressure rate with the help of RealEstimator.

Backpressure control must be activated in the configuration through spark.streaming.backpressure.enabled entry (defaults to false). The rate is calculated on previously executed batch tasks. The initial value can be set through spark.streaming.backpressure.initialRate property.

Backpressure mechanism is upper bounded (backpressure ratio can't be greater) by spark.streaming.receiver.maxRate property, described in the last section, and by spark.streaming.kafka.maxRatePerPartition is Kafka is used as data source.

Driver configuration

Driver configuration takes mostly the same parameters as in the case of batch processing. In additional, it enforces fault-tolerance configuration with checkpoint directory setting (spark.streaming.checkpoint.directory) and Write Ahead Logs (WAL) location (spark.streaming.driver.writeAheadLog.closeFileAfterWrite). The second property take true or false value and tells if WAL file should be closed after writing log record. It should be set to true if file system doesn't support flushing (for example: S3).

Other properties related to the driver are also associated with WAL. However, they are not exposed publicly in Spark Streaming configuration page. Among these properties we can distinguish: spark.streaming.driver.writeAheadLog.allowBatching, spark.streaming.driver.writeAheadLog.rollingIntervalSecs, spark.streaming.driver.writeAheadLog.class, spark.streaming.driver.writeAheadLog.maxFailures or spark.streaming.driver.writeAheadLog.batchingTimeout.

Receiver configuration

Richer than driver's configuration is the configuration of receiver. The first property spark.streaming.blockInterval indicates the interval at which received data is chunked into blocks before it sends to Spark.

Other, and already mentioned entry, spark.streaming.receiver.maxRate defines the maximum number of records processed by each receivers in 1 second. It can be particularly useful when cluster resources don't fit with data ingestion frequency.

Two other properties concern WAL. The first one enables them on receiver's side (spark.streaming.receiver.writeAheadLog.enable). The second (spark.streaming.receiver.writeAheadLog.closeFileAfterWrite) tells if file should be closed after writing, exactly as in the already described case of driver.

This post presents some main configuration entries in Spark Streaming. The first part describes the configuration of a feature called backpressure, influencing the rate of processed data. The second part presented configuration available for driver and the last for receiver.

Consulting

With nearly 16 years of experience, including 8 as data engineer, I offer expert consulting to design and optimize scalable data solutions. As an O’Reilly author, Data+AI Summit speaker, and blogger, I bring cutting-edge insights to modernize infrastructure, build robust pipelines, and drive data-driven decision-making. Let's transform your data challenges into opportunities—reach out to elevate your data engineering game today!

👉 contact@waitingforcode.com
🔗 past projects