Master stream processing waitingforcode.com

Become a better Data Engineer with waitingforcode.com

Master Apache Spark Structured Streaming

Context-based learning:

You've joined a news company. The company publishes news on a website and is just at the beginning of its data journey.

So far they've been relying on batch processing to generate insights, using tools like Apache Spark SQL, Apache Airflow, an object store, and a data warehouse.

However, the project requires near real-time processing capabilities in many places. You're the one who will lead this batch-to-streaming transformation!

Your goal is to go through the course and solve each homework exercise using what you've learned so far. By the end of the course your system will be streaming-first and you'll take a few months off followed by a raise, as promised by your Head of Data Engineering 🙂

Have you written your first streaming pipeline and think you know it all?

What will I learn?

Introduction
- Welcome
- Streaming processing
- Streaming history in Apache Spark
Apache Spark Structured Streaming 101
- Plan
- A streaming query
- Shared components with Apache Spark SQL
- Pure streaming components
- Anatomy of a query
- Running a query
- API
- Homework
Data sources
- Plan
- Apache Kafka
- Delta Lake
- Raw file formats
- Custom source, API-based
- Homework
Data sinks
- Plan
- Apache Kafka
- Delta Lake
- Raw file formats
- ForeachBatch
- ForeachWriter
- Custom sink, API-based
- Idempotency 💻
- Homework
Streaming concepts in depth
- Plan
- Data transformations
- Trigger
- Checkpoints
- Homework
Stateful processing 101
- Plan
- Watermark
- State store
- Stateful transformations: Joins, Aggregations, Windows, dropDuplicates, arbitrary stateful processing
- Homework
Stateful processing in depth
- Plan
- Aggregations
- Windows
- Deduplication
- Arbitrary stateful processing
- Output modes
- Homework
State store
- Plan
- Default
- RocksDB
- Fault-tolerance
- Homework
Scaling
- Plan
- Hardware
- Dynamic Resource Allocation
- Offset limits
- Homework
Performance considerations
- Plan
- Code
- Pitfalls
- Late data
- Homework
Idempotency
- Plan
- Definition
- Overwrite
- In stateful processing
- Homework
Tests
- Plan
- Unit tests
- Streaming mode
- Integration tests
- Homework
Operationalization
- Plan
- Monitoring
- Alerting
- Releases
- Homework
Reprocessing
- Plan
- Restart
- Batch layer: stateless
- Batch layer: stateful
- Homework
Gotchas
- Plan
- Cache
- Joins
- Processing time
- Raw data files
- Heterogeneous logic
- Homework

Demos and homework exercises are implemented in Scala and Python.

Join the waiting list 📨