Broadcast join and changing static dataset

Last year I wrote a blog post about broadcasting in Structured Streaming and I got an interesting question under one of the demo videos. What happens if the static dataset joined in broadcast mode gets new data? Let's check it out!

Serverless MapReduce?

Is it possible to implement the MapReduce paradigm on top of cloud serverless functions? Technically yes, and there are some reference architectures I'm going to discuss in this blog post. Is it a good idea? It depends on the context, and hopefully you'll be able to figure out the answer after reading my thoughts.

Task retries in Apache Spark Structured Streaming

Unexpected things happen and sooner or later, any pipeline can fail. Fortunately, some errors are temporary and can be recovered automatically after a few retries. What if the job is a streaming one? Let's see how Apache Spark Structured Streaming handles task retries in micro-batch and continuous modes!

Reverse ETL

The first "reverse" term I ever encountered in programming was reverse proxy. Since then, I've come across "reverse engineering" and "reverse iterator", but neither of them was a pure data term. Until recently, when I heard about reverse ETL.

Kubernetes concepts for Apache Spark

I had the idea for this blog post when I was preparing the "What's new in Apache Spark..." series. At that time, I was writing about Kubernetes in the context of Apache Spark but needed to "google" a lot of things on the side, mostly the Kubernetes API terms.

What's new on the cloud for data engineers - part 5 (09-12.2021)

It's time for the 5th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between September and December.

Distinct vs group by key difference

I've heard an opinion that using DISTINCT can have a negative impact on big data workloads, and that queries with GROUP BY are more performant. Is that true for Apache Spark SQL?

Retrospective: 2021 on waitingforcode.com

2021 is coming to an end and, like last year, it's a great moment to summarize what happened and what will happen in 2022!

Schema management in cloud streaming services

When I say "schema management" and "streaming", you'll certainly think about the schema registry of Apache Kafka. That's true, but cloud streaming services also manage schemas, and in this blog post we'll see how.

What's new in Apache Spark 3.2.0 - miscellaneous changes

My Apache Spark 3.2.0 series is coming to an end. Today I'll focus on the miscellaneous changes, that is, all the improvements I couldn't categorize in the previous blog posts.

Testing streaming data systems on the cloud - ideas

That's one of the biggest problems I've faced in my whole career. The development environment! I'm not talking here about creating cloud resources in a different subscription but about an environment that shares similar characteristics with production. In this blog post I'll share with you different strategies to put in place in the context of the cloud and streaming applications.

What's new in Apache Spark 3.2.0 - Apache Parquet and Apache Avro improvements

I still have 2 topics remaining in my "What's new..." backlog. I'd like to share the first of them with you today, and see what changed for Apache Parquet and Apache Avro data sources.

Scaling data processing on the cloud

Processing static datasets is easier than processing dynamic ones that may change over time. Fortunately, cloud services offer various features, some more manual than others, to scale the data processing logic. We'll see some of them in this blog post.

What's new in Apache Spark 3.2.0 - performance optimizations

Apache Spark 3.0 extended the static execution engine with a runtime optimization engine called Adaptive Query Execution. It has changed a lot since that first release, including in the most recent version! But AQE is not the only performance improvement and I hope you'll see this in the blog post!

Hush! It's a secret on the cloud

How to manage secrets is probably one of the first problems you may encounter while deploying resources from a CI/CD pipeline. The simple answer is: don't manage them at all! Let the cloud services do it.
