Data validation frameworks - Great Expectations classes

In my previous post I presented a very simplified version of a Great Expectations data validation pipeline. Today, before going further and integrating the pipeline with a data orchestration tool, it's a good moment to see what's inside the framework.

Continue Reading →

What's new in Apache Spark 3.0 - shuffle partitions coalesce

In my previous blog post you could learn about the Adaptive Query Execution improvement added to Apache Spark 3.0. There, you learned only about the general execution flow for adaptive queries. Today it's time to see one of the possible optimizations that can happen at this stage: the shuffle partition coalesce.
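As a teaser, here is a minimal sketch of how to turn the feature on, assuming the Adaptive Query Execution configuration entries of Apache Spark 3.0; the application name and master are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .appName("AQE shuffle partition coalesce") // illustrative name
  .master("local[*]")
  // enable Adaptive Query Execution
  .config("spark.sql.adaptive.enabled", "true")
  // let AQE merge small shuffle partitions after the map stage completes
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .getOrCreate()
```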

Continue Reading →

Global and local Apache ZooKeeper in Apache Pulsar - part 2

In my last post about Apache Pulsar, I introduced global and local ZooKeepers. In this follow-up, I'll check what both of them contain.

Continue Reading →

What's new in Apache Spark 3.0 - shuffle service changes

One of the Apache Spark components that makes it hard to scale is shuffle. Fortunately, the community is well on the way to overcoming this limitation, and the new release of the framework brings some important improvements in this area.

Continue Reading →

Data validation frameworks - introduction to Great Expectations

When I published my blog post about Deequ and Apache Griffin in March 2020, I thought that there was nothing more to do with data validation frameworks. Fortunately, Alexander Wagner pointed me to another framework, Great Expectations, which I will discover in a series of 3 blog posts.

Continue Reading →

What's new in Apache Spark 3.0 - Adaptive Query Execution

A query adapting to the data characteristics discovered one-by-one at runtime? Yes, in Apache Spark 3.0 it's possible thanks to the Adaptive Query Execution!

Continue Reading →

Landing zone or direct writes?

I don't know whether it's a good sign or not, but I'm starting to have some convictions about building data systems. Of course, building an architecture will always be a story of trade-offs, but there are some practices that I tend to prefer over others. In this article I will share my thoughts on one of them.

Continue Reading →

What's new in Apache Spark 3.0 - PostgreSQL feature parity

Apart from the date and time management, another big feature of Apache Spark 3.0 is the work on PostgreSQL feature parity, which will be the topic of my next article in the series.

Continue Reading →

File sink and Out-Of-Memory risk

A few weeks ago I wrote 3 posts about the file sink in Structured Streaming. At that time I wasn't aware of one potential issue, namely an Out-Of-Memory problem that will happen at some point.

Continue Reading →

What's new in Apache Spark 3.0 - Apache Kafka integration improvements

After the previous presentations of the new date-time and SQL function features in Apache Spark 3.0, it's time to see what's new on the streaming side in the Structured Streaming module, and more precisely, in its Apache Kafka integration.
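One of the visible changes on that side is the possibility to expose Kafka record headers in the read DataFrame. Below is a minimal sketch assuming the includeHeaders option introduced in the 3.0 release; the broker address and topic name are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .appName("Kafka headers") // illustrative name
  .master("local[*]")
  .getOrCreate()

// Read a topic and expose the record headers as an extra column
val kafkaRecords = session.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // illustrative broker
  .option("subscribe", "test_topic")                    // illustrative topic
  .option("includeHeaders", "true")                     // added in Apache Spark 3.0
  .load()

// headers arrive as an array of (key, value) structs next to key, value, topic, ...
kafkaRecords.selectExpr("CAST(value AS STRING)", "headers").printSchema()
```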

Continue Reading →

Managing different date time formats with DateTimeFormatterBuilder

In one of the homework assignments of my Become a Data Engineer course, I ask students to normalize a dataset in which a date time field comes in several different formats. When I was analyzing the possible solutions, I found a class I had never come across before, the DateTimeFormatterBuilder. It will be the topic of this post.
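To illustrate the idea, here is a minimal sketch of how the class can combine several formats into a single parser; the formats themselves are illustrative and not the ones from the course dataset:

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeFormatterBuilder}

// Each optional section is tried in turn; a failed section resets the parse position
val multiFormatParser = new DateTimeFormatterBuilder()
  .appendOptional(DateTimeFormatter.ISO_LOCAL_DATE)          // 2020-06-20
  .appendOptional(DateTimeFormatter.ofPattern("dd/MM/yyyy")) // 20/06/2020
  .appendOptional(DateTimeFormatter.ofPattern("yyyyMMdd"))   // 20200620
  .toFormatter()

// Both inputs normalize to the same LocalDate
println(LocalDate.parse("2020-06-20", multiFormatParser))
println(LocalDate.parse("20/06/2020", multiFormatParser))
```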

Continue Reading →

What's new in Apache Spark 3.0 - binary data source

I remember my first days with Apache Spark and the analysis of the available RDD data sources. Since then, I have used a lot of them, except the binary data source, which is newly implemented in Apache Spark SQL in release 3.0.
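As a quick preview, here is a minimal sketch of reading binary files with the new source; the directory path and the glob filter are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .appName("Binary data source") // illustrative name
  .master("local[*]")
  .getOrCreate()

// Every matched file becomes a row with its path, modification time, length and raw bytes
val images = session.read
  .format("binaryFile")
  .option("pathGlobFilter", "*.png") // keep only PNG files
  .load("/tmp/images")               // illustrative input directory

images.printSchema()
```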

Continue Reading →

Becoming a data engineer - a feedback of my journey

Recently a reader asked me in a PM about the things to know and to learn before starting to work as a data engineer. Since I think that my point of view may be interesting to more than one person (if not, I'm really sorry), I decided to write a few words about it.

Continue Reading →

What's new in Apache Spark 3.0 - new SQL functions

After date time management, it's time to see another important feature of Apache Spark 3.0, the new SQL functions.

Continue Reading →

Ignoring files issues in Apache Spark SQL

I have to consider myself a lucky guy since I've never had to deal with incorrectly formatted files. However, that's not the case for everyone. Fortunately, Apache Spark comes with a few configuration options to manage that.
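As a preview of those options, here is a minimal sketch assuming the spark.sql.files.ignoreCorruptFiles and spark.sql.files.ignoreMissingFiles properties; the input path and application name are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .appName("Ignore problematic files") // illustrative name
  .master("local[*]")
  // skip files that cannot be read instead of failing the whole job
  .config("spark.sql.files.ignoreCorruptFiles", "true")
  // skip files deleted between query planning and execution
  .config("spark.sql.files.ignoreMissingFiles", "true")
  .getOrCreate()

// Problematic files are silently skipped; the rows from valid files are still returned
val rows = session.read.json("/tmp/input_dataset") // illustrative path
rows.show()
```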

Continue Reading →