What's new on the cloud for data engineers - part 6 (01-04.2022)

It's time for the first cloud news blog post this year. This update summarizes all changes to data and data-related services between January 1 and April 25.

Continue Reading →

PySpark and the JVM - introduction, part 1

In my quest to understand PySpark better, the JVM in the Python world is a must-see stop. In this first blog post I'll focus on the Py4J project and its usage in PySpark.
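
To give a taste of the topic, here's a minimal sketch of how Python reaches the JVM through Py4J in PySpark. It relies on the internal _jvm attribute, so treat it as an illustration rather than a public API:

```python
from pyspark.sql import SparkSession

# Minimal sketch: PySpark keeps a Py4J gateway to a JVM running next to
# the Python process; the internal _jvm attribute exposes it, so Python
# can call Java classes directly (illustration only, not a public API).
spark = SparkSession.builder.master("local[*]").getOrCreate()
jvm = spark.sparkContext._jvm

# A plain Java static method invoked from Python through the gateway.
print(jvm.java.lang.System.currentTimeMillis())
```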

Continue Reading →

HTTP-based data ingestion to streaming brokers

Data ingestion is the starting point for all data systems. It can work in batch or streaming mode. I've already covered batch ingestion pretty extensively in previous blog posts, but I haven't written anything about streaming yet. That changes today, with a few words about HTTP-based data ingestion to cloud streaming brokers.
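
As a teaser, here's a hedged sketch of the idea. The endpoint URL and payload are made up; real brokers (Kafka REST Proxy, Amazon Kinesis, Azure Event Hubs, ...) each define their own URL scheme, record format, and authentication:

```python
import json
import requests

# Hypothetical HTTP endpoint of a cloud streaming broker; real services
# each have their own URL scheme, payload format, and authentication.
BROKER_ENDPOINT = "https://broker.example.com/topics/visits/records"

event = {"visit_id": "abc-123", "page": "/index.html"}
response = requests.post(
    BROKER_ENDPOINT,
    data=json.dumps({"records": [{"value": event}]}),
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()  # fail fast if the broker rejects the record
```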

Continue Reading →

Tables and Apache Spark

If you're like me and haven't had the opportunity to work with Spark on Hive, you're probably as confused as I was about tables. Hopefully, after reading this blog post, you'll understand the concept better!
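
To illustrate one distinction covered in the post, here's a short sketch contrasting a managed table with a session-scoped temporary view (the table and view names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(5)

# A managed table: metadata goes to the catalog and Spark owns the files.
df.write.mode("overwrite").saveAsTable("managed_numbers")

# A temporary view: metadata only, scoped to this SparkSession.
df.createOrReplaceTempView("temp_numbers")

# Both show up here, but only the first one survives a new session.
spark.sql("SHOW TABLES").show()
```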

Continue Reading →

Data migration on the cloud

Data is a living being. It gets queried, written, overwritten, backfilled and... migrated. Since that last point is the least obvious on the list, I've recently spent some time trying to understand it better in the context of the cloud.

Continue Reading →

Pluggable Catalog API

Despite working with Apache Spark for a while, I still have some components left to discover. One of them crossed my path while I was writing the first blog post of the ACID file formats series. The lucky one is the Catalog API.
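
As a teaser, here's a minimal sketch of the plugin side of that API: catalog implementations are registered through spark.sql.catalog.* properties. The example below uses Delta Lake's implementation and assumes the delta-spark package is on the classpath:

```python
from pyspark.sql import SparkSession

# Registering a catalog implementation through the pluggable Catalog API;
# assumes the delta-spark package is available on the classpath.
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Table identifiers are now resolved through Delta's catalog implementation.
spark.sql("CREATE TABLE demo (id INT) USING delta")
```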

Continue Reading →

Database management services on the cloud

Data migration is one of the scenarios you can face as a data engineer. It's not always an easy task, but managed cloud services can help you set up the pipeline and solve many common problems.

Continue Reading →

Beware of .withColumn

The .withColumn function looks like an inoffensive operation, just a way to add or change a column. True, but it also hides some pitfalls that can even lead to memory issues, and we'll see them in this blog post.
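
One of those pitfalls, as a quick sketch: every .withColumn call returns a new Dataset with an extra projection in the plan, so adding many columns in a loop can inflate the plan the driver has to analyze, while a single select does the same work in one projection:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Each .withColumn call creates a new Dataset with one more Project node;
# chaining hundreds of them inflates the plan the driver has to analyze.
df = spark.range(10)
for i in range(100):
    df = df.withColumn(f"col_{i}", F.lit(i))

# A single select produces the same columns with one projection.
df_select = spark.range(10).select(
    "id", *[F.lit(i).alias(f"col_{i}") for i in range(100)]
)
```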

Continue Reading →

ACID file formats - file system layout

Last week I presented the APIs of the 3 analyzed ACID file formats. Under the hood, they obviously generate data files, but not only that. And that's what we'll focus on in this blog post.
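
As a hedged preview with one of the three formats: assuming the delta-spark package is installed, writing a tiny Delta table and walking its directory shows the layout, Parquet data files next to a _delta_log/ directory of JSON commit files (the table path here is made up):

```python
import os
from pyspark.sql import SparkSession

# Assumes the delta-spark package is installed; writes a tiny Delta table
# and lists the files it produced on disk.
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

spark.range(5).write.format("delta").save("/tmp/demo_delta_table")

# Expect Parquet data files plus _delta_log/*.json commit files.
for root, _, files in os.walk("/tmp/demo_delta_table"):
    for name in files:
        print(os.path.join(root, name))
```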

Continue Reading →

Integration tests and Structured Streaming

Unit tests are the backbone of modern software, but they only verify a particular unit of the application. What if we want to check the interaction between all these units? One of the solutions is automated integration tests. While they're relatively easy to implement against data at rest, they're more challenging for streaming scenarios.
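
One way to tame that, sketched below: fake the streaming input with a file source and capture the output with the memory sink, so the test runs without any external broker (the query name and schema are made up for the example):

```python
import os
import tempfile
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Fake the streaming input: drop a JSON file into a directory and read
# the directory as a stream, so no external broker is required.
input_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, "batch1.json"), "w") as f:
    f.write('{"value": 1}\n{"value": 2}\n')

stream = (spark.readStream.schema("value INT").json(input_dir)
          .withColumn("doubled", F.col("value") * 2))

# The memory sink exposes the results as an in-memory table to assert on.
query = (stream.writeStream.format("memory")
         .queryName("test_output").outputMode("append").start())
query.processAllAvailable()  # block until the available files are processed

rows = spark.sql("SELECT * FROM test_output").collect()
assert sorted(row.doubled for row in rows) == [2, 4]
query.stop()
```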

Continue Reading →

ACID file formats - API

It's time to start a new series on the blog! I hope to catch up on the ACID file formats that are gaining more and more importance. It's also a good occasion to test a new learning method. Instead of writing one blog post per feature and format, I'll try to compare Delta Lake, Apache Iceberg, and Apache Hudi concepts in the same article. Besides this personal challenge, I hope you'll enjoy the series and also learn something interesting!

Continue Reading →

Shuffle configuration demystified - part 3

It's time for the last part of the shuffle configuration overview. This time you'll see the properties related to the shuffle service, reducer, I/O, and a few others.
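
To give a flavor of the properties in question, here's a small sketch with a few of them, set explicitly to what are, to my knowledge, their default values, just to show where they plug in:

```python
from pyspark.sql import SparkSession

# A few of the shuffle properties covered in the post, set to their
# defaults purely for illustration.
spark = (SparkSession.builder
         .master("local[*]")
         # external shuffle service serving blocks on behalf of executors
         .config("spark.shuffle.service.enabled", "false")
         # max size of map outputs fetched simultaneously by each reducer
         .config("spark.reducer.maxSizeInFlight", "48m")
         # retries for shuffle block fetches on I/O failures
         .config("spark.shuffle.io.maxRetries", "3")
         .getOrCreate())
```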

Continue Reading →

Data ingestion to the cloud object store

The volume of data to migrate from an on-premise to a cloud environment will probably be less significant than in previous years, since a lot of organizations are already on the cloud. However, it's still interesting to see the different methods for bringing data there, and that's what I'll show you in this blog post.
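
As a sketch of the most basic of those methods, an SDK upload to an object store (the bucket and key names here are hypothetical):

```python
import boto3

# Simplest ingestion path: pushing a local file to an object store with
# the cloud SDK; the bucket and key names are made up for the example.
s3_client = boto3.client("s3")
s3_client.upload_file(
    Filename="exports/customers.csv",
    Bucket="my-ingestion-bucket",
    Key="raw/customers/customers.csv",
)
```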

Continue Reading →

Shuffle configuration demystified - part 2

It's time for part 2 of the 3 parts dedicated to the shuffle configuration in Apache Spark.

Continue Reading →

Data catalog services

Writing data processing jobs is a fascinating task. But it can be worthless if users can't find and use the generated data. Fortunately, we can count on data catalogs and leverage the power of metadata to overcome this discoverability issue.

Continue Reading →