What's new on the cloud for data engineers - part 11 (06-09.2023)

It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 4 months.

Continue Reading β†’

Apache Flink best practices - Flink Forward lessons learned

I won't hide it, I'm still a fresher in the Apache Flink world and despite my past streaming experiences with Apache Spark Structured Streaming and GCP Dataflow, I need to learn. And to learn a new tool or concept, there is nothing better than watching some conference talks!

Continue Reading β†’

ETL vs. ELT?

In our social media and marketing-driven era, it's quite hard to get things right. For me there is one common misconception brought by the Modern Data Stack idea that everything should be now ELT. In fact no, it shouldn't but only can.

Continue Reading β†’

Table file formats - isolation levels: Delta Lake

If Delta Lake implemented the commits only, I could stop exploring this transactional part after the previous article. But as for RDBMS, Delta Lake implements other ACID-related concepts. One of these are isolation levels.

Continue Reading β†’

Table file formats - commits: Delta Lake

One of the great features of modern table file formats is the ability to handle write conflicts. It wouldn't be possible without commits that are the topic of this new blog post.

Continue Reading β†’

Don't sleep when you code...about sleep issue in KPL

Lessons learned why it's always worth checking the code implementation to avoid surprises later. Even for vendor-supported solutions.

Continue Reading β†’

_spark_metadata in Apache Spark Structured Streaming issue is no more!

There are probably not that many people working today on the flat files with Structured Streaming than 5 years ago thanks to the table file formats. However, if you are in this group and are still generating CSVs or JSONs with the streaming sink, brace yourself, the memory problems are coming if you don't take action!

Continue Reading β†’

The first state in Apache Spark Structured Streaming arbitrary stateful processing

When you wrote your first arbitrary stateful processing pipelines, the state expiration is maybe the first tricky point you had to deal with. Why is that? After all, it's just about setting the timeout, doesn't it? Most of the time, yes, but there is an exception.

Continue Reading β†’

State expiration in stream-to-stream joins with event time range condition

You certainly know it, the watermark (aka GC Watermark) is responsible for cleaning state store in Apache Spark Structured Streaming. But you may not know that it's not the single time-based condition. There is a different one involved in the stream-to-stream joins.

Continue Reading β†’

How to initialize state in Apache Spark Structured Streaming stateful jobs?

Starting from Apache Spark 3.2.0 is now possible to load an initial state of the arbitrary stateful pipelines. Even though the feature is easy to implement, it hides some interesting implementation details!

Continue Reading β†’

Berlin Buzzwords 2023 - notes for data engineers

That's the conference I've heard only recently about. What a huge mistake! Despite the lack of "data" word in the name, it covers many interesting data topics and before I share with you my notes from this year's Data+AI Summit, let me do the same for Berlin Buzzwords!

Continue Reading β†’

Multiple queries running in Apache Spark Structured Streaming

That's often a dilemma, whether we should put multiple sinks working on the same data source in the same or in different Apache Spark Structured Streaming applications? Both solutions may be valid depending on your use case but let's focus here on the former one including multiple sinks together.

Continue Reading β†’

Yes, I'm learning Apache Flink - beginner's problems

Surprised? You shouldn't. I've always been eager to learn, including 5 years ago when for the first time, I left my Apache Spark comfort zone to explore Apache Beam. Since then I had a chance to write some Dataflow streaming pipelines to fully appreciate this technology and work on AWS, GCP, and Azure. But there is some excitement for learning-from scratch I miss.

Continue Reading β†’

What's new in Apache Spark 3.4.0 - shuffle changes

Shuffle is a permanent point in the What's new in Apache Spark series. Why? It's often one the most time consuming part of the jobs and knowing the improvement simply helps writing better pipelines.

Continue Reading β†’

What's new in Apache Spark 3.4.0 - Spark Connect

Spark Connect is probably the most expected feature in Apache Spark 3.4.0. It was announced in the Data+AI Summit 2022 keynotes and has a lot of coverage in social media right now. I'll try to add my small contribution to this by showing some implementation details.

Continue Reading β†’