The "vertical scaling" has caught my attention a few times already when I have been reading about cloud updates. I've always considered horizontal scaling as the single true scaling policy for elastic data processing pipelines. Have I been wrong?
In March I wrote a blog post showing how to use accumulators to track the application of each filter statement. It turns out the solution may not be perfect, as Aravind pointed out in one of the comments. I bet you already have an idea why, but if not, keep reading. Everything will become clear in the end!
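To make the context concrete, here is a minimal sketch of the accumulator-based approach, assuming a toy RDD filter; the names (passed_filter, keep_positive) are illustrative and not taken from the original post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# Accumulator counting how many rows pass the filter.
passed_filter = sc.accumulator(0)

def keep_positive(value: int) -> bool:
    keep = value > 0
    if keep:
        passed_filter.add(1)  # side effect inside a transformation
    return keep

kept = sc.parallelize([-2, -1, 0, 1, 2, 3]).filter(keep_positive).collect()

print(kept)                 # [1, 2, 3]
print(passed_filter.value)  # 3 here, but task retries or speculative
                            # execution can make the counter drift
```

The comment about drift hints at why a counter updated inside a transformation is not guaranteed to be exact; whether that is the issue raised in the comments is something the post itself answers.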
Even though you may already be thinking about Data+AI Summit 2024, I still owe you my retrospective of the 2023 edition. Let's start with the first part, covering stream processing talks!
Have you written your first successful Apache Flink job and are you still wondering how the high-level API translates into the executable details? I did, and I decided to answer that question in this new blog post.
Checkpoints are a well-known fault-tolerance mechanism in stream processing. But what do they have to do with Delta Lake?
Watermarks, or rather the management of multiple watermarks, have been a thorn in the side of Apache Spark Structured Streaming. Things improved in the previous release (3.4.0) but there was still some room for improvement. Was, because the 3.5.0 release brought a serious fix for the multiple watermarks scenario.
It's time to start the series covering Apache Spark 3.5.0 features. As the first topic, I'm going to cover Structured Streaming, which got a lot of RocksDB improvements and some major API changes.
I've already written about watermarks in a few places on the blog but despite that, I still find things to refresh. One of them is the watermark used to filter out late data, which will be the topic of this blog post.
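As a quick refresher before the post, here is a minimal sketch of that late-data watermark, assuming the rate source stands in for a real event stream:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.master("local[*]").getOrCreate()

events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
          .withColumnRenamed("timestamp", "event_time"))

# Rows older than max(event_time) - 10 minutes become eligible to be dropped
# as late data once the watermark advances past them.
aggregated = (events
              .withWatermark("event_time", "10 minutes")
              .groupBy(window("event_time", "5 minutes"))
              .agg(count("*").alias("events")))

query = (aggregated.writeStream
         .outputMode("append")   # append mode requires a watermark on aggregations
         .format("console")
         .start())
query.awaitTermination()
```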
If you have some experience with an RDBMS, and who doesn't by the way, you have probably run a VACUUM command to reclaim the storage space occupied by deleted or obsolete rows. If you're now working with Delta Lake, you can do the same!
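For illustration, a minimal sketch of the Delta Lake side, assuming the delta-spark package is on the classpath and a hypothetical table at /tmp/orders_delta:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Remove data files no longer referenced by the table and older than
# the retention period (the default is 7 days, i.e. 168 hours).
DeltaTable.forPath(spark, "/tmp/orders_delta").vacuum(retentionHours=168)

# The SQL equivalent:
spark.sql("VACUUM delta.`/tmp/orders_delta` RETAIN 168 HOURS")
```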
Do not get the title wrong! Having applyInPandasWithState in the PySpark API is huge! However, due to Python's duck typing, some operations are more difficult and riskier to express in code than in the strongly typed Scala API.
It's always a huge pleasure to see the PySpark API covering more and more of the Scala API features. Starting from Apache Spark 3.4.0, you can even write arbitrary stateful processing jobs! But since the API is a little different from the one available on the Scala side, I wanted to take a deeper look.
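To give you an idea before that deeper look, here is a minimal sketch of the API, assuming a hypothetical per-user click counter fed by the rate source; the column names and state shape are mine, not from the posts:

```python
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Fake click events: the rate source plays the role of a real input stream.
events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .selectExpr("CAST(value % 3 AS STRING) AS user", "1 AS clicks"))

def count_clicks(key: Tuple[str],
                 pdfs: Iterator[pd.DataFrame],
                 state: GroupState) -> Iterator[pd.DataFrame]:
    # The state is a plain tuple; nothing enforces its shape before runtime,
    # which is the duck-typing risk compared to the strongly typed Scala API.
    (total,) = state.get if state.exists else (0,)
    for pdf in pdfs:
        total += int(pdf["clicks"].sum())
    state.update((total,))
    yield pd.DataFrame({"user": [key[0]], "total_clicks": [total]})

counts = events.groupBy("user").applyInPandasWithState(
    count_clicks,
    outputStructType="user STRING, total_clicks BIGINT",
    stateStructType="total_clicks BIGINT",
    outputMode="update",
    timeoutConf=GroupStateTimeout.NoTimeout)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```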
It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 4 months.
I won't hide it, I'm still a newcomer to the Apache Flink world and, despite my past streaming experience with Apache Spark Structured Streaming and GCP Dataflow, I still have a lot to learn. And to learn a new tool or concept, there is nothing better than watching some conference talks!
In our social media and marketing-driven era, it's quite hard to get things right. For me, one common misconception brought by the Modern Data Stack is the idea that everything should now be ELT. In fact no, it shouldn't; it only can be.
If Delta Lake implemented only commits, I could have stopped exploring this transactional part after the previous article. But as with an RDBMS, Delta Lake implements other ACID-related concepts. One of them is isolation levels.