Data processing articles

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Flink Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

If not, below you can find all articles belonging to Data processing.

Ignoring files issues in Apache Spark SQL

I have to consider myself as a lucky guy since I've never had to deal with incorrectly formatted files. However, that's not the case of everyone. Hopefully, Apache Spark comes with few configuration options to manage that.

Continue Reading β†’

What's new in Apache Spark 3.0 - Proleptic Calendar and date time management

When I was writing my blog post about datetime conversion in Apache Spark 2.4, I wanted to check something on Apache Spark's Github. To my surprise, the code had nothing in common with the code I was analyzing locally. And that's how I discovered the first change in Apache Spark 3.0. The first among few others that I will cover in a new series "What's new in Apache Spark 3.0".

Continue Reading β†’

Structured Streaming file sink and reprocessing

I presented in my previous posts how to use a file sink in Structured Streaming. I focused there on the internal execution and its use in the context of data reprocessing. In this post I will address a few of the previously described points.

Continue Reading β†’

Unions in Apache Spark SQL

You have 2 different datasets and want to process them as a single unit? Maybe you have some legacy data that you need to process alongside the brand new dataset? JOIN is not an option because the goal is to build a single processing unit and not combine the rows. UNION operation can be a good fit for that.

Continue Reading β†’

File sink and manifest compaction

In my previous post I introduced the file sink in Apache Spark Structured Streaming. Today it's time to focus on an important concept of this output format which is the manifest file lifecycle.

Continue Reading β†’

File sink in Apache Spark Structured Streaming

One of the homework tasks of my Become a Data Engineer course is about synchronizing streaming data with a file system storage. When I was trying to implement this part, I found a manifest-based file stream that I will explore in this and next blog posts.

Continue Reading β†’

Idempotent logic for stateful processing and late data

Sometimes I come back to the topics I already covered, often because by mistake I discover something new that can improve them. And that's the case for my today's article about idempotence in stateful processing.

Continue Reading β†’

Apache Spark SQL partitionBy - shuffle or not to shuffle?

I remember my first time with partitionBy method. I was reading data from an Apache Kafka topic and writing it into hourly-based partitioned directories. To my surprise, Apache Spark was generating always 1 file and my first thought... oh, it's shuffling the data. But I was wrong and in this post will explain why.

Continue Reading β†’

Idempotent file generation in Apache Spark SQL

Some time ago I was thinking how to partition the data and ensure that we can reprocess it easily. Overwrite mode was not an option since the data of one partition could be generated by 2 different batch executions. That's why I started to think about implementing an idempotent file output generator and, therefore, discover file sink internals in practice.

Continue Reading β†’

Watermarks and not grouped query - why they don't work

Several weeks ago I played with watermark, just to recall some concepts. I wrote a query and...the watermark didn't work! Of course, my query was wrong but this intrigued me enough to write this short article.

Continue Reading β†’

Nested fields, dropDuplicates and watermark in Apache Spark Structured Streaming

When I was playing with my data-generator and Apache Spark Structured Streaming, I was surprised by one behavior that I would like to share and explain in this post. To not deep delve into the details right now, the story will be about the use of nested structures in several operations.

Continue Reading β†’

Two topics, two schemas, one subscription in Apache Spark Structured Streaming

After my January's talk about Apache Kafka integration in Structured Streaming I got an interesting question on off. The question was, how to process 2 topics simultaneously with Structured Streaming? The "small" problem was the fact that both had different schemas.

Continue Reading β†’

Apache Spark's _SUCESS anatomy

_SUCCESS file generated by Apache Spark SQL when you successfully generate a dataset, is often a big question for newcomers. Why does the framework need this file? How is it generated? I will cover these aspects in this article.

Continue Reading β†’

sortWithinPartitions in Apache Spark SQL

Few weeks ago when I was preparing a talk for one local meetup, I wanted to list the most common operations we can do with Spark for the newcomers. And I found one I haven't used before, namely sortWithinPartitions.

Continue Reading β†’

Corrupted records aka poison pill records in Apache Spark Structured Streaming

Some time ago I watched an interesting Devoxx France 2019 talk about poison pills in streaming systems presented by LoΓ―c Divad. I learned a few interesting patterns like sentinel value that may help to deal with corrupted data but the talk was oriented on Kafka Streams. And since I didn't find a corresponding resource for Apache Spark Structured Streaming [and also because I'm simply an Apache Spark enthusiast ;)], I decided to write one trying to implement LoΓ―c's ideas in the Structured Streaming world.

Continue Reading β†’

Reorder JOIN optimizer - star schema

I didn't know that join reordering is quite interesting, though complex, topic in Apache Spark SQL. The queries not only can be transformed into the ones using JOIN ... ON clauses. They can also be reordered accordingly to the star schema which we'll try to see in this post.

Continue Reading β†’

Reorder JOIN optimizer - cost-based optimization

In my previous post I explained how Apache Spark can reorder JOINs based on the logical plan. Today I'll focus on another aspect of reordering which uses cost estimation for the proposed plans.

Continue Reading β†’

Reorder JOIN optimizer

One of the reasons why I like my blogging activity is that from time to time the exchange is bidirectional. It happens mostly on Github but also on the comments under the post and I appreciate the situation when I don't know the answer and must dig a little to explain it in a blog post :) I wrote this one thanks to bithw1 issue created on my Spark playground repository (thank you for another interesting question btw :)).

Continue Reading β†’

Apache Kafka source in Structured Streaming - "beyond the offsets"

Even though I've already written a few posts about Apache Kafka as a data source in Apache Spark Structured Streaming, I still had some questions in my head. In this post I will try to answer them and let this Kafka integration in Spark topic for investigation later.

Continue Reading β†’

Docker images and Apache Spark applications

Containers are with us, data engineers, for several years. The concept was already introduced on YARN but the technology that really made them popular was Docker. In this post I will focus on its recommended practices to make our Apache Spark images better.

Continue Reading β†’