Worth reading for data engineers - part 3

Welcome to the 3rd part of the series, with summaries of great blog posts about streaming and project organization!

State Rebalancing in Structured Streaming by Alex Balikov and Tristen Wentling

The first blog post of the series shows how to deal with infrastructure scaling for stateful streaming applications in Apache Spark Structured Streaming. The authors, Alex Balikov and Tristen Wentling, present the state rebalancing implementation available in the Databricks runtime for Apache Spark.
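To picture the kind of job this feature targets, here is a minimal sketch of a stateful Structured Streaming query in Scala. It is only an illustration: the rebalancing flag name, the broker address and the topic are my assumptions, not details taken from the blog post, so check the Databricks release notes for the exact property.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StatefulStreamingJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("state-rebalancing-demo")
      // Hypothetical flag name; state rebalancing is a Databricks runtime
      // feature, so verify the exact property in the official documentation.
      .config("spark.sql.streaming.statefulOperator.stateRebalancing.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    // Any stateful operator (here a windowed count) keeps its state on the
    // executors; without rebalancing, executors added by autoscaling stay
    // idle for the state-heavy part of the job.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // assumed address
      .option("subscribe", "events")                     // assumed topic
      .load()
      .selectExpr("CAST(key AS STRING) AS userId", "timestamp")

    val counts = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"), $"userId")
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/state-rebalancing-demo")
      .start()
      .awaitTermination()
  }
}
```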

The key takeaways are:

Link to the blog post: https://www.databricks.com/blog/2022/10/04/state-rebalancing-structured-streaming.html.

You can reach out to the authors on LinkedIn: Tristen Wentling and Alexander Balikov.

Building a Data Lake on PB scale with Apache Spark by David Vrba

Even though my current focus is table file formats and I'm trying to keep myself away from the previous era of data lakes built on top of columnar formats such as Apache Parquet or ORC, I appreciated David Vrba's article about building a scalable data lake on top of Apache Spark and the aforementioned columnar formats.

As with many big data projects, David and his team had to deal with several challenges. I'll focus here on only a few of them, but I invite you to discover the rest in the blog post directly!
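The article stays at the architecture level, but to recall the building blocks it relies on, here is a generic sketch of writing a partitioned Parquet dataset with Apache Spark. The paths, the schema and the partitioning column are made up for the example; it is not code from David's project.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._

object DailyEventsToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pb-scale-data-lake-sketch").getOrCreate()

    // Read one day of raw data (path and schema are hypothetical).
    val rawEvents = spark.read.json("s3://raw-bucket/events/2022-10-04/")

    rawEvents
      // Derive the partition column once at write time instead of at query time.
      .withColumn("event_date", to_date(col("event_time")))
      // Control the number and the size of the output files; too many small
      // files is one of the classic scalability issues in Parquet-based lakes.
      .repartition(col("event_date"))
      .write
      .mode(SaveMode.Overwrite)
      .partitionBy("event_date")
      .parquet("s3://lake-bucket/events/")
  }
}
```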

Link to the blog post: https://towardsdatascience.com/building-a-data-lake-on-pb-scale-with-apache-spark-1622d7073d46.

You can reach out to the author on LinkedIn: David Vrba.

Apache Kafka: 8 things to check before going live by Aris Koliopoulos

In his 3-year-old blog post, Aris shares several interesting notes about Apache Kafka. I particularly enjoyed the last 3, which cover some low-level details:
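To give a flavor of the kind of pre-production checks the post is about, here is a sketch of producer-side durability settings with the Java Kafka client used from Scala. It illustrates the topic in general and is not Aris's exact checklist; the broker address, topic and payload are assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object DurableProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    // Wait for the in-sync replicas before acknowledging a write; the
    // broker-side min.insync.replicas (e.g. 2 with a replication factor
    // of 3) defines how many replicas "all" really means.
    props.put(ProducerConfig.ACKS_CONFIG, "all")
    // Avoid duplicates introduced by producer retries.
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord("orders", "order-1", """{"amount": 42}"""))
    producer.close()
  }
}
```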

Link to the blog post: https://ariskk.com/kafka-8-things.

You can reach out to the author on LinkedIn: Aris Kyriakos Koliopoulos.

Designing a Production-Ready Kappa Architecture for Timely Data Stream Processing by Amey Chaugule

Reprocessing (aka backfilling) is not an easy task in streaming systems. It's even more challenging for stateful applications. The blog post I'm sharing here was written by Amey Chaugule almost 3 years ago, but despite its age, the solution is very smart and deserves its place in the "Worth reading" series!

To introduce the problem, let me quote Amey from his blog post:


The data which the streaming pipeline produced serves use cases that span dramatically different needs in terms of correctness and latency. Some teams use our sessionizing system on analytics that require second-level latency and prioritize fast calculations. At the other end of the spectrum, teams also leverage this pipeline for use cases that value correctness and completeness of data over a much longer time horizon for month-over-month business analyses as opposed to short-term coverage. We discovered that a stateful streaming pipeline without a robust backfilling strategy is ill-suited for covering such disparate use cases.

At first, I thought about backfilling as fixing invalid data, for example data generated by buggy code. Amey's definition extends that purpose: it turns out that "A backfill pipeline is thus not only useful to counter delays, but also to fill minor inconsistencies and holes in data caused by the streaming pipeline".

So, how do you backfill those pipelines? Before introducing the implemented solution, Amey recalls 2 well-known approaches: reprocessing the archived data with a dedicated batch job, or replaying the events into the streaming pipeline through a temporary Kafka topic.

The solution? Combining both approaches! The implementation treats Hive as an unbounded data source. It addresses 2 main issues: first, it avoids overloading the infrastructure with one huge backfill batch task; second, it avoids the overhead of creating a temporary Kafka topic to ingest the reprocessed data.
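As a rough, open source approximation of that idea (Uber built its own connector, so this is not their implementation), here is a sketch that reads an archived, Hive-backed table directory as a throttled streaming source in Apache Spark. The paths, the schema and the trigger size are assumptions for the illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object HiveBackfillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kappa-backfill-sketch").getOrCreate()

    // Schema of the archived events; hypothetical for the example.
    val eventSchema = StructType(Seq(
      StructField("userId", StringType),
      StructField("eventTime", TimestampType),
      StructField("payload", StringType)
    ))

    // Read the table directory as an unbounded source: Spark picks up the
    // files in small increments instead of one huge backfill batch.
    val archivedEvents = spark.readStream
      .schema(eventSchema)
      .option("maxFilesPerTrigger", "50") // throttle the backfill pressure
      .parquet("s3://warehouse/events_archive/")

    // From here on, the backfill can reuse the same streaming logic as the
    // live Kafka pipeline (sessionization, watermarks, sinks, ...).
    archivedEvents.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/backfill")
      .start()
      .awaitTermination()
  }
}
```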

Link to the blog post: https://www.uber.com/en-FR/blog/kappa-architecture-data-stream-processing/.

You can reach out to the author on LinkedIn: Amey Chaugule.

See you next month for the 4th part of the series!

