Below you can find all articles belonging to Data processing.
In my previous blog post you could learn about the Adaptive Query Execution improvement added in Apache Spark 3.0. There, you learned only about the general execution flow of adaptive queries. Today it's time to see one of the possible optimizations that can happen at runtime: the shuffle partition coalesce.
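To give you a first idea before the deep dive, here is a minimal sketch (my illustration, not the article's code) of the configuration entries enabling this optimization in Apache Spark 3.0:

```scala
import org.apache.spark.sql.SparkSession

// Enable Adaptive Query Execution and the shuffle partition coalesce
// optimization; the advisory size is a made-up value for the example.
val spark = SparkSession.builder()
  .appName("aqe-coalesce-demo").master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
  .getOrCreate()
import spark.implicits._

// A grouped aggregation triggers a shuffle; with the settings above, Spark
// can coalesce the 200 default shuffle partitions into fewer, bigger ones.
val counts = (1 to 1000).toDF("nr").groupBy($"nr" % 10).count()
counts.explain() // the plan is wrapped in an AdaptiveSparkPlan node
```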
One of the Apache Spark components that makes it hard to scale is shuffle. Fortunately, the community is on a good path to overcoming this limitation, and the new release of the framework brings some important improvements in this field.
A query that adapts to the data characteristics discovered one by one at runtime? Yes, in Apache Spark 3.0 it's possible, thanks to Adaptive Query Execution!
Apart from date and time management, another big feature of Apache Spark 3.0 is the work on PostgreSQL feature parity, which will be the topic of my next article in the series.
A few weeks ago I wrote 3 posts about the file sink in Structured Streaming. At that time I wasn't aware of one potential issue, namely an Out-Of-Memory problem that will happen at some point.
After the previous presentations of the new date-time and SQL functions features in Apache Spark 3.0, it's time to see what's new on the streaming side in the Structured Streaming module and, more precisely, in its Apache Kafka integration.
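One of the additions covered there is the support for Kafka record headers. A hedged sketch (it assumes the spark-sql-kafka dependency; the broker address and topic name are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// includeHeaders is one of the Apache Spark 3.0 additions to the Kafka source
val records = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("includeHeaders", "true")
  .load()

// with includeHeaders enabled, every row exposes a headers column next to
// the usual key, value, topic, partition, offset and timestamp columns
records.selectExpr("topic", "CAST(value AS STRING)", "headers").printSchema()
```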
I remember my first days with Apache Spark and the analysis of the available RDD data sources. Since then, I have used a lot of them, except the binary data source, which is a newly implemented part of Apache Spark SQL in release 3.0.
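If you want to try it out, here is a minimal sketch of the new source (the input path and glob filter are made up for the example):

```scala
import org.apache.spark.sql.SparkSession

// The binaryFile source is new in Apache Spark 3.0; /tmp/images is a
// made-up input location.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

val binaryFiles = spark.read
  .format("binaryFile")
  .option("pathGlobFilter", "*.png") // only read the matching files
  .load("/tmp/images")

// the source exposes path, modificationTime, length and content columns
binaryFiles.printSchema()
```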
After date and time management, it's time to see another important feature of Apache Spark 3.0: the new SQL functions.
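As a teaser, one of them in action (a minimal sketch; make_date is among the functions added in this release):

```scala
import org.apache.spark.sql.SparkSession

// make_date is one of the SQL functions added in Apache Spark 3.0
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sql("SELECT make_date(2020, 6, 18) AS release_date").show()
```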
I have to consider myself a lucky guy since I've never had to deal with incorrectly formatted files. However, that's not the case for everyone. Fortunately, Apache Spark comes with a few configuration options to manage that.
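As a quick illustration of these options (a sketch using a made-up /tmp/input.json file supposed to contain some corrupted lines):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// DROPMALFORMED silently skips the records that do not match the schema
val dropped = spark.read
  .option("mode", "DROPMALFORMED")
  .json("/tmp/input.json")

// PERMISSIVE (the default) keeps the bad rows and can store the raw
// malformed text in a dedicated column; FAILFAST would throw instead
val permissive = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/tmp/input.json")
```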
When I was writing my blog post about datetime conversion in Apache Spark 2.4, I wanted to check something on Apache Spark's GitHub. To my surprise, the code had nothing in common with the code I was analyzing locally. And that's how I discovered the first change in Apache Spark 3.0, the first among a few others that I will cover in a new series, "What's new in Apache Spark 3.0".
In my previous posts I presented how to use a file sink in Structured Streaming, focusing on the internal execution and its use in the context of data reprocessing. In this post I will address a few of the previously described points.
You have 2 different datasets and want to process them as a single unit? Maybe you have some legacy data that you need to process alongside a brand new dataset? A JOIN is not an option because the goal is to build a single processing unit, not to combine the rows. The UNION operation can be a good fit for that.
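To show the idea, here is a minimal sketch with made-up datasets sharing the same schema:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val legacyUsers = Seq((1, "user_a"), (2, "user_b")).toDF("id", "login")
val newUsers = Seq((3, "user_c")).toDF("id", "login")

// union appends the rows of one dataset to the other; unlike a JOIN, no
// row matching happens
val allUsers = legacyUsers.union(newUsers)
// unionByName resolves the columns by name instead of by position
val allUsersByName = legacyUsers.unionByName(newUsers)
allUsers.show()
```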
In my previous post I introduced the file sink in Apache Spark Structured Streaming. Today it's time to focus on an important concept of this output format: the manifest file lifecycle.
One of the homework tasks of my Become a Data Engineer course is about synchronizing streaming data with file system storage. When I was trying to implement this part, I found a manifest-based file sink that I will explore in this and the next blog posts.
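To set the scene, here is a hedged sketch of such a sink (the built-in rate source stands in for the course's dataset; the paths are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val input = spark.readStream.format("rate").load()

// next to the data files, the sink maintains a _spark_metadata manifest
// directory listing the files committed in every micro-batch
val query = input.writeStream
  .format("parquet")
  .option("path", "/tmp/file-sink-output")
  .option("checkpointLocation", "/tmp/file-sink-checkpoint")
  .start()
query.awaitTermination()
```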
Sometimes I come back to topics I have already covered, often because I accidentally discover something new that can improve them. And that's the case for today's article about idempotence in stateful processing.
I remember my first time with the partitionBy method. I was reading data from an Apache Kafka topic and writing it into hourly partitioned directories. To my surprise, Apache Spark was always generating 1 file, and my first thought was... oh, it's shuffling the data. But I was wrong, and in this post I will explain why.
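For context, this is the kind of writer I'm talking about (a sketch with made-up columns and output path):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq(
  ("2020-06-26-10", "click"),
  ("2020-06-26-11", "view")
).toDF("event_hour", "event_type")

// partitionBy creates one event_hour=.../ directory per distinct value;
// it does not shuffle the data, every task writes its own files
events.write
  .partitionBy("event_hour")
  .json("/tmp/partitioned-events")
```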
Some time ago I was thinking about how to partition the data and ensure that we can reprocess it easily. Overwrite mode was not an option since the data of one partition could be generated by 2 different batch executions. That's why I started to think about implementing an idempotent file output generator and, therefore, to discover file sink internals in practice.
Several weeks ago I played with watermarks, just to recall some concepts. I wrote a query and... the watermark didn't work! Of course, my query was wrong, but this intrigued me enough to write this short article.
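To illustrate the kind of query involved, here is a minimal sketch on the built-in rate source (a classic trap is to define the watermark on a column different from the one used in the aggregation):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// the watermark column must be the event time column used in the window
val counts = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()
```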
When I was playing with my data generator and Apache Spark Structured Streaming, I was surprised by one behavior that I would like to share and explain in this post. Not to delve too deep into the details right now, the story is about the use of nested structures in several operations.
After my January talk about the Apache Kafka integration in Structured Streaming, I got an interesting question offline: how to process 2 topics simultaneously with Structured Streaming? The "small" problem was that both had different schemas.
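One possible direction (a hedged sketch, not necessarily the article's final answer; the topic names, schemas and broker address are made up) is to read both topics in a single source and to dispatch the deserialization on the topic column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val ordersSchema = new StructType().add("order_id", StringType)
val usersSchema = new StructType().add("user_login", StringType)

val kafkaRecords = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "orders,users") // both topics in one source
  .load()

// every record carries its origin in the topic column, so each schema can
// be applied to the relevant subset of rows
val parsed = kafkaRecords.select(
  col("topic"),
  from_json(col("value").cast("string"), ordersSchema).as("order"),
  from_json(col("value").cast("string"), usersSchema).as("user")
)
```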