What's new on the cloud for data engineers - part 2 (11.2020-01.2021)

It's time for the second update with the news on cloud data services. This time too, a lot of things happened!

Continue Reading →

Data+AI Summit: Custom state store - API

After previous introductory posts, it's time to deep delve into the state store API and implement our own custom state store.

Continue Reading →

Data+AI follow-up: Introduction to MapDB

Since there are already 2 Open Source implementations for RocksDB state store, I decided to use another backend to illustrate how to customize the state store in Structured Streaming. Initially, I wanted to try with Badger which is the store behind DGraph database but didn't find any Java-facing interface and dealing with the Java Native Interface or any other wrapper, was not an option. Fortunately, I ended up by finding MapDB, a Kotlin-based - hence a Java-facing interface - embedded database.

Continue Reading →

PySpark schema inference and 'Can not infer schema for type str' error

The title of this blog post is maybe one of the first problems you may encounter with PySpark (it was mine). Even though it's quite mysterious, it makes sense if you take a look at the root cause.

Continue Reading →

Structured Streaming and temporary views

I don't know you, but me, when I first saw the code with createTempView method, I thought it created a temporary table in the metastore. But it's not true and in this blog post, you will see why.

Continue Reading →

Data+AI Summit follow-up: arbitrary stateful processing and state management

After previous posts about native stateful operations, it's time to focus on the one where you can define your custom stateful logic.

Continue Reading →

Data Vault 2.0 and Big Data

In the previous blog post you discovered the first version of Data Vault methodology. But since the very first iteration, the specification evolved and a few years ago a version 2 was proposed. More adapted to the Big Data world, with several deprecation notes, and more examples adapted to the constantly evolving data world.

Continue Reading →

Data+AI Summit follow-up: joins and state management

Streaming joins are an interesting feature that heavily uses state store. Even though I already blogged about it in the past (2018), some changes were made and also - I hope so - my explanation capacity improved.

Continue Reading →

2020 on waitingforcode.com

It's the second summary blog post after the Let's celebrate - the 600th post is coming one published last year. As previously, it's time to summarize last year and share my plans for 2021.

Continue Reading →

Lakehouse and BigQuery?

You know me already, I'm a big fan of Apache Spark but also of all kinds of patterns. And one of the patterns that nowadays gains in popularity is lakehouse. Most of the time (always?), this pattern is implemented on top of an ACID-compatible file system like Apache Hudi, Apache Iceberg or Delta Lake. But can we do it differently and use another storage, like BigQuery?

Continue Reading →

Watermark and window-based processing

One of the not obvious things about the watermark is how it applies on the windows. At first glance, you could think that it will filter out the records produced before the watermark value. But it's not how it works for windows.

Continue Reading →

What cloud features for data processing patterns (ETL/ELT)?

During my study of BigQuery I found an ETL pattern called feedback loop. Since I never heard about it before, I decided to spend some time and search for other ETL patterns and the cloud features we could use in them.

Continue Reading →

Apache Spark and shuffle management - external services

Shuffle accompanies distributed data processing from the very beginning. Apache Spark is not an exception, and one of the prominent features targeted for 3.1 release is the full support for the pluggable shuffle backend. But it's not the single effort made these days by the community to handle shuffle drawbacks. And you will see it in this blog post.

Continue Reading →

Good books to read for data engineers

Few weeks ago I got a comment asking me about the recommended data engineering books. I mentioned few of them in Becoming a data engineer - a feedback of my journey blog post but without explaining why. I will try to complete that in this blog post then.

Continue Reading →

Data+AI Summit follow-up: aggregations and state management

In previous blog posts you discovered how the state store interacts with dropDuplicates and limit operators. This time you will see how it's used in aggregations.

Continue Reading →