Become a better Data Engineer with waitingforcode.com

Master Apache Spark Structured Streaming

Context-based learning:

You've joined a news company. The company publishes news on a website and is just at the beginning of its data journey.

So far they've been relying on batch processing to generate insights. They have been using tools like Apache Spark SQL, Apache Airflow, an object store, and a data warehouse.

However, the project requires near real-time processing capabilities in many places. You're the one who will lead this batch-to-streaming transformation!

Your goal is to go through the course and solve each homework exercise with the elements learned so far. By the end of the course your system will become streaming-first and you'll take a few months off followed by a raise, as promised by your Head of Data Engineering 🙂

Have you written your first streaming pipeline and think you know it all?

What will I learn?

  1. Introduction
    • Welcome
    • Streaming processing
    • Streaming history in Apache Spark
  2. Apache Spark Structured Streaming 101
    • Plan
    • A streaming query
    • Shared components with Apache Spark SQL
    • Pure streaming components
    • Anatomy of a query
    • Running a query
    • API
    • Homework
  3. Data sources
    • Plan
    • Apache Kafka
    • Delta Lake
    • Raw file formats
    • Custom source, API-based
    • Homework
  4. Data sinks
    • Plan
    • Apache Kafka
    • Delta Lake
    • Raw file formats
    • ForeachBatch
    • ForeachWriter
    • Custom sink, API-based
    • Idempotency 💻
    • Homework
  5. Streaming concepts in depth
    • Plan
    • Data transformations
    • Trigger
    • Checkpoints
    • Homework
  6. Streaming concepts in depth
    • Plan
    • Data transformations
    • Trigger
    • Checkpoints
    • Homework
  7. Stateful processing 101
    • Plan
    • Watermark
    • State store
    • Stateful transformations: Joins, Aggregations, Windows, dropDuplicates, arbitrary stateful processing
    • Homework
  8. Stateful processing in depth
    • Plan
    • Aggregations
    • Windows
    • Deduplication
    • Arbitrary stateful processing
    • Output modes
    • Homework
  9. State store
    • Plan
    • Default
    • RocksDB
    • Fault-tolerance
    • Homework
  10. Scaling
    • Plan
    • Hardware
    • Dynamic Resource Allocation
    • Offset limits
    • Homework
  11. Performance considerations
    • Plan
    • Code
    • Pitfalls
    • Late data
    • Homework
  12. Idempotency
    • Plan
    • Definition
    • Overwrite
    • In stateful processing
    • Homework
  13. Tests
    • Plan
    • Unit tests
    • Streaming mode
    • Integration tests
    • Homework
  14. Operationalization
    • Plan
    • Monitoring
    • Alerting
    • Releases
    • Homework
  15. Reprocessing
    • Plan
    • Restart
    • Batch layer: stateless
    • Batch layer: stateful
    • Homework
  16. Gotchas
    • Plan
    • Cache
    • Joins
    • Processing time
    • Raw data files
    • Heterogeneous logic
    • Homework

Demos and homework exercises are implemented in Scala and Python.

Join the waiting list 📨