Data sharing on the cloud

One of the big announcements of the previous Data+AI Summit was Delta Sharing, a protocol for exchanging live data with internal and external users. The question I asked myself at that moment was "Does it exist on the cloud?". Let's see.

Continue Reading →

What's new in Apache Spark 3.2.0 - Data Source V2

Even though Data Source V2 has been present in the API for a while, every release brings something new to it. This one is no exception, and we'll see what's new in this blog post!

Continue Reading →

Data orchestration on the cloud

When it comes to executing one isolated job, there are many choices and using a data orchestrator is not always necessary. However, that doesn't hold in the opposite scenario, where a data orchestrator not only orchestrates the workload but also provides a monitoring layer. So the question arises: what to do on the cloud?

Continue Reading →

What's new in Apache Spark 3.2.0 - push-based shuffle

In previous Apache Spark releases you could see many shuffle evolutions, such as shuffle file tracking or the pluggable storage interface. Things are no different for 3.2.0, which comes with the push-based merge shuffle.

Continue Reading →

Time travel on the cloud

I first heard about the time travel feature with Delta Lake. But after digging a bit, I found that it's not a purely Delta Lake concept! In this blog post I will show you which cloud services implement it too.

Continue Reading →

What's new in Apache Spark 3.2.0 - SQL changes

Apache Spark SQL evolves and, with each new release, it gets closer to the ANSI standard. The 3.2.0 release is no different and you can find many ANSI-related changes in it. But they are not the only ones and, hopefully, you'll discover all of them in this blog post, which has an unusual form because this time I won't focus on the implementation details.

Continue Reading →

Data Loss Prevention on the cloud

When I was writing my previous blog post about losing data on the cloud, I wanted to call it "data loss prevention". It turns out that this term is already reserved for a different problem, the one I will cover just below.

Continue Reading →

What's new in Apache Spark 3.2.0 - Structured Streaming

After the previous blog posts focusing on 2 specific Structured Streaming features, it's time to complete them with a list of the other changes made in the 3.2.0 version!

Continue Reading →

Not losing data on the cloud - strategies

Data is a valuable asset and nobody wants to lose it. Unfortunately, it's possible, even with cloud services. Luckily, thanks to their features, we can reduce this risk!

Continue Reading →

What's new in Apache Spark 3.2.0 - session windows

Initially I wanted to include session windows in the blog post about Structured Streaming changes. But I changed my mind when I saw how many things they involve!

Continue Reading →

Lesser-known data services on the cloud

You have all certainly heard about EMR, Databricks, Dataflow, DynamoDB, BigQuery or Cosmos DB. Those are well-known data services of AWS, Azure and GCP, but besides them, cloud providers offer some other, often lesser-known, services to consider in data projects. Let's see some of them in this blog post!

Continue Reading →

What's new in Apache Spark 3.2.0 - RocksDB state store

It's big news for Apache Spark Structured Streaming users. RocksDB is now available as a state store backend in vanilla Apache Spark!

Continue Reading →

Azure Durable Functions

When I first heard about Durable Functions, my reaction was: "So cool! We can now build fully serverless streaming stateful pipelines!". Indeed, we can, but it's not their only feature!

Continue Reading →

Stage level scheduling

The idea of writing this blog post came to me when I was analyzing Kubernetes changes in Apache Spark 3.1.1. Starting from this version, we can use stage level scheduling on Kubernetes; so far it was available only for YARN. Even though it's probably a very low-level feature, it intrigued me enough to write a few words here!

Continue Reading →

What's new on the cloud for data engineers - part 4 (05-08.2021)

It's time for the 4th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between May and August.

Continue Reading →