Looking for something else? Check the categories of Data processing: Apache Beam, Apache Flink, Apache Spark, Apache Spark GraphFrames, Apache Spark GraphX, Apache Spark SQL, Apache Spark Streaming, Apache Spark Structured Streaming, PySpark.
If not, below you can find all articles belonging to Data processing.
After several months spent as an "experimental" feature in Apache Spark, Kubernetes was officially promoted to a Generally Available scheduler in the 3.1 release! In this blog post, we'll discover the last changes made before this promotion.
I have a feeling that a lot of scalability-related things happened in the 3.1 release. The General Availability of Kubernetes, which I will cover next week, is only one of them. Another one is node decommissioning!
Predicate pushdown is a data processing technique that takes user-defined filters and executes them while reading the data. Apache Spark already supported it for Apache Parquet and RDBMS. Starting from Apache Spark 3.1.1, you can also use it for the Apache Avro, JSON and CSV formats!
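As a quick illustration, here is a minimal sketch, assuming a local SparkSession and a hypothetical /tmp/events.json file, where the filter declared on the DataFrame can be pushed down to the JSON data source at read time (you should see it under PushedFilters in the physical plan):

```scala
import org.apache.spark.sql.SparkSession

object PredicatePushdownExample extends App {
  val spark = SparkSession.builder()
    .appName("json-predicate-pushdown")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical JSON dataset with an explicit schema; starting from
  // Apache Spark 3.1.1 the filter below can be pushed down to the JSON scan.
  val errors = spark.read
    .schema("id INT, category STRING")
    .json("/tmp/events.json")
    .where("category = 'error'")

  // The physical plan should show the pushed filter for the scan node.
  errors.explain(true)
}
```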
I mentioned it briefly in the first blog post ever about PySpark. Thanks to the Project Zen initiative, the Python part of Apache Spark will become more Pythonic and user-friendly. How? Let's check that in this blog post!
Aside from the joins presented in the previous blog post, Structured Streaming also got a few other interesting new features that I will present here.
In the previous blog post, you discovered what changed for joins in Apache Spark 3.1. If you remember the summary, those were not the only join changes in this new release. Apart from them, you can also do a bit more with Structured Streaming joins!
I have been waiting to write this blog post since the Data+AI Summit 2020, where Cheng Su presented the ongoing effort to improve shuffle and stream-to-stream joins in Apache Spark 3.1. In this blog post, I will start by sharing what changed for joins in the new release of the framework!
That's probably one of the most common questions you may have heard in preliminary job interviews: what's the difference between coalesce and repartition? Many answers exist, but instead of repeating them, I will try to dig a bit deeper in this blog post and see how coalesce works.
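To make the difference concrete, here is a small sketch, assuming a local SparkSession, that compares the plans of both operations on the same dataset: coalesce only merges existing partitions, while repartition triggers a full shuffle.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition extends App {
  val spark = SparkSession.builder()
    .appName("coalesce-vs-repartition")
    .master("local[4]")
    .getOrCreate()

  // A dataset spread over 8 partitions, just for illustration
  val numbers = spark.range(0, 1000, 1, numPartitions = 8)

  // coalesce(2) merges existing partitions, so no shuffle appears in the plan
  println(numbers.coalesce(2).rdd.getNumPartitions)
  numbers.coalesce(2).explain()

  // repartition(2) redistributes the rows with a full (round-robin) shuffle
  println(numbers.repartition(2).rdd.getNumPartitions)
  numbers.repartition(2).explain()
}
```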
The pivot operation presented 2 weeks ago transforms some cells into columns. The reverse operation is called stack, and it's time to see how it works!
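As a teaser, here is a minimal sketch, with a hypothetical already-pivoted dataset, that uses the stack SQL function to turn the quarter columns back into rows:

```scala
import org.apache.spark.sql.SparkSession

object StackExample extends App {
  val spark = SparkSession.builder()
    .appName("stack-example")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical pivoted dataset: one column per quarter
  val pivoted = Seq(("dept_a", 10, 20), ("dept_b", 30, 40))
    .toDF("department", "q1_sales", "q2_sales")

  // stack(n, label1, value1, label2, value2, ...) produces one row per (label, value) pair
  pivoted.selectExpr(
    "department",
    "stack(2, 'q1', q1_sales, 'q2', q2_sales) AS (quarter, sales)"
  ).show()
}
```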
I wish I could say one day: "I optimized Apache Spark pipelines in all possible ways". But I'm aware of the reality, and that can be very hard to achieve. That's why I decided to rely on the experience shared by experienced Spark users at Spark+AI and, more recently, Data+AI Summit, and write a summary list of interesting optimization tips from past talks.
If you came to data engineering after a BI career, you certainly know what a pivot is. It was not my case, and I was quite amazed by this operation that transforms values from rows into columns. If you want to understand how it's possible, this article will present some internals of pivoting data in Apache Spark.
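For readers new to the operation, here is a minimal sketch, with a hypothetical sales dataset, showing how distinct quarter values become columns after a pivot:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object PivotExample extends App {
  val spark = SparkSession.builder()
    .appName("pivot-example")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Hypothetical sales rows: one row per (department, quarter)
  val sales = Seq(
    ("dept_a", "q1", 10), ("dept_a", "q2", 20),
    ("dept_b", "q1", 30), ("dept_b", "q2", 40)
  ).toDF("department", "quarter", "amount")

  // pivot turns the listed quarter values into columns, one per value
  sales.groupBy("department")
    .pivot("quarter", Seq("q1", "q2"))
    .agg(sum("amount"))
    .show()
}
```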
Very often you will find Apache Spark performance tips related to the hardware (memory, GC) or the configuration parameters (shuffle partitions number, broadcast join threshold). But they're not the only ones you can apply. Moreover, IMO, you should start with the ones presented in this article and optimize your pipeline code before going into more complicated hardware and configuration tuning.
After the introductory part, it's time to share what I learned from the custom state store implementation.
After previous introductory posts, it's time to delve deeper into the state store API and implement our own custom state store.
The title of this blog post is maybe one of the first problems you may encounter with PySpark (it was mine). Even though it's quite mysterious, it makes sense if you take a look at the root cause.
I don't know about you, but when I first saw code with the createTempView method, I thought it created a temporary table in the metastore. But that's not true, and in this blog post, you will see why.
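As a quick preview, here is a minimal sketch, assuming a local SparkSession, showing that the view only lives in the session-scoped catalog and is never written to the metastore:

```scala
import org.apache.spark.sql.SparkSession

object TempViewExample extends App {
  val spark = SparkSession.builder()
    .appName("temp-view-example")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val letters = Seq("a", "b", "c").toDF("letter")

  // Registers the DataFrame in the session-scoped catalog only;
  // the view disappears when the SparkSession ends.
  letters.createTempView("letters_view")

  spark.sql("SELECT * FROM letters_view").show()

  // The catalog lists the view and flags it as temporary
  spark.catalog.listTables().show(false)
}
```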
After previous posts about native stateful operations, it's time to focus on the one where you can define your custom stateful logic.
Streaming joins are an interesting feature that heavily uses the state store. Even though I already blogged about it in the past (2018), some things have changed since then, and - I hope - my ability to explain them has improved.
One of the less obvious things about the watermark is how it applies to windows. At first glance, you could think that it will filter out the records produced before the watermark value. But that's not how it works for windows.
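To set the scene, here is a minimal sketch, using the built-in rate source as a stand-in for a real stream, of a windowed aggregation with a watermark; in append mode a window's result is emitted (and its state evicted) only once the watermark passes the end of that window, rather than individual late records being dropped up front:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

object WatermarkWindowExample extends App {
  val spark = SparkSession.builder()
    .appName("watermark-window-example")
    .master("local[*]")
    .getOrCreate()

  // The rate source generates (timestamp, value) rows; renamed here for clarity
  val events = spark.readStream
    .format("rate")
    .option("rowsPerSecond", "5")
    .load()
    .withColumnRenamed("timestamp", "event_time")

  // 5-minute windows with a 10-minute watermark on the event time
  val counts = events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "5 minutes"))
    .count()

  // In append mode, a window is output only after the watermark passes its end
  counts.writeStream
    .format("console")
    .outputMode("append")
    .start()
    .awaitTermination()
}
```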
Shuffle has accompanied distributed data processing from the very beginning. Apache Spark is no exception, and one of the prominent features targeted for the 3.1 release is full support for a pluggable shuffle backend. But it's not the only effort made these days by the community to handle shuffle drawbacks. And you will see that in this blog post.