If you're like me and haven't had the opportunity to work with Spark on Hive, you're probably as confused about the tables as I was. Hopefully, after reading this blog post you will understand that concept better!
Despite working with Apache Spark for a while, there are still some components I haven't discovered yet. One of them crossed my path while I was writing the first blog post of the ACID file formats series. The lucky one is the Catalog API.
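To give you a first idea, here is a minimal sketch of the Catalog API; the table name is made up for the example and the session uses the default in-memory catalog:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# register a table in the session's catalog; "demo_table" is a made-up name
spark.sql("CREATE TABLE IF NOT EXISTS demo_table (id INT) USING parquet")

# the Catalog API exposes the metadata known to the session
print(spark.catalog.listDatabases())
print(spark.catalog.listTables())
print(spark.catalog.isCached("demo_table"))
```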
The .withColumn function looks like an inoffensive operation, just a way to add or change a column. That's true, but it also hides some pitfalls that can even lead to memory issues, and we'll see them in this blog post.
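As a hedged sketch of what's at stake: every chained .withColumn adds a new projection to the plan, whereas a single select expresses the same columns in one step. The column names below are purely illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# each withColumn call produces a new projected DataFrame
chained = (df
    .withColumn("id_plus_one", F.col("id") + 1)
    .withColumn("letter_upper", F.upper("letter")))

# the single-projection equivalent
selected = df.select(
    "*",
    (F.col("id") + 1).alias("id_plus_one"),
    F.upper("letter").alias("letter_upper"))

# compare the generated plans
chained.explain()
selected.explain()
```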
Unit tests are the backbone of modern software, but they only verify a particular unit of the application. What can we do if we want to check the interaction between all these units? One of the solutions is automated integration tests. While they are relatively easy to implement against data at rest, they are more challenging in streaming scenarios.
It's time for the last part of the shuffle configuration overview. This time you'll see the properties related to the shuffle service, reducer, I/O, and a few others.
It's time for part 2 of the 3 parts dedicated to the shuffle configuration in Apache Spark.
Probably the most popular configuration entry related to the shuffle is the number of shuffle partitions. But it's not the only one, as you will see in this new blog post series!
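For reference, the entry in question is spark.sql.shuffle.partitions; the value below is arbitrary, 200 being the default:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[*]")
    # 200 by default; 50 is only an illustrative value
    .config("spark.sql.shuffle.partitions", "50")
    .getOrCreate())

df = spark.range(1000)
# any wide transformation, like this aggregation, shuffles into 50 partitions
# (AQE may coalesce some of them at runtime)
counts = df.groupBy((df.id % 10).alias("bucket")).count()
print(counts.rdd.getNumPartitions())
```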
Structured Streaming's micro-batch mode inherits a lot of features from the batch part. Apart from the retry mechanism presented previously, it also has the same auto-scaling logic relying on Dynamic Resource Allocation.
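For context, the Dynamic Resource Allocation behavior is driven by a handful of properties; the property names below are real Spark configs, the values are illustrative only:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    # shuffle tracking lets executors be released without an external shuffle service
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate())
```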
A good CI/CD process avoids many pitfalls related to manual operations. Apache Spark also has one, based on GitHub Actions. Since this part of the project has been a small mystery to me, I wanted to spend some time exploring it.
Last year I wrote a blog post about broadcasting in Structured Streaming and got an interesting question under one of the demo videos. What happens if the static dataset joined in broadcast mode gets new data? Let's check this out!
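To picture the setup behind the question, here is a simplified stream-static join with a broadcast hint; the rate source, the static path and the join columns are only placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").getOrCreate()

# static side, read once and broadcast to the executors
static_df = spark.read.parquet("/tmp/static_dataset")  # placeholder path

# streaming side; the built-in rate source generates (timestamp, value) rows
stream_df = spark.readStream.format("rate").load()

joined = stream_df.join(broadcast(static_df), stream_df.value == static_df.id)

query = joined.writeStream.format("console").start()
query.awaitTermination()
```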
Unexpected things happen and sooner or later, any pipeline can fail. Fortunately, the errors are sometimes only temporary and can be automatically recovered after a few retries. But what if the job is a streaming one? Let's see how Apache Spark Structured Streaming handles task retries in micro-batch and continuous modes!
I had the idea for this blog post when I was preparing the "What's new in Apache Spark..." series. At that time, I was writing about Kubernetes in the context of Apache Spark but needed to "google" a lot of things on the side - mostly Kubernetes API terms.
I've heard the opinion that using DISTINCT can have a negative impact on big data workloads, and that queries with GROUP BY are more performant. Is that true for Apache Spark SQL?
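The comparison boils down to two logically equivalent queries; comparing their physical plans, as sketched below on made-up data, is the way to answer the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.range(100).selectExpr("id % 5 AS category").createOrReplaceTempView("events")

distinct_query = spark.sql("SELECT DISTINCT category FROM events")
group_by_query = spark.sql("SELECT category FROM events GROUP BY category")

# the generated physical plans tell whether one form is really faster
distinct_query.explain()
group_by_query.explain()
```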
My Apache Spark 3.2.0 series comes to its end. Today I'll focus on the miscellaneous changes, i.e. all the improvements I couldn't categorize in the previous blog posts.
I still have 2 topics remaining in my "What's new..." backlog. I'd like to share the first of them with you today and see what changed for the Apache Parquet and Apache Avro data sources.
Apache Spark 3.0 extended the static execution engine with a runtime optimization engine called Adaptive Query Execution. It has changed a lot since the very first release, even in the most recent version! But AQE is not a single performance improvement, and I hope you'll see this in the blog post!
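If you want to experiment alongside the post, AQE is controlled by a few flags; the snippet below only shows how to toggle them:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # enabled by default since Apache Spark 3.2.0
    .config("spark.sql.adaptive.enabled", "true")
    # runtime coalescing of shuffle partitions
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # runtime skew join optimization
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate())
```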
Project Zen is an initiative to make PySpark more Pythonic and facilitate the Python programming experience. Apache Spark 3.2.0 made the next step in this direction by bringing pandas to the API!
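In practice it means you can write pandas-like code that runs on Spark, as in this small sketch with made-up data:

```python
# requires Apache Spark >= 3.2.0
import pyspark.pandas as ps

psdf = ps.DataFrame({"user": ["a", "b", "a"], "amount": [10, 20, 30]})

# familiar pandas operations, executed by Spark under the hood
print(psdf.groupby("user")["amount"].sum())
```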
Even though Data Source V2 has been present in the API for a while, every release brings something new to it. This one is no different, and we'll see what through this blog post!
In the previous Apache Spark releases you could see many shuffle evolutions, such as shuffle file tracking or the pluggable storage interface. And things don't change for 3.2.0, which comes with push-based shuffle.
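As a preview, the feature is opt-in. The sketch below only shows the client-side switch; the server side relies on an external shuffle service able to merge the pushed blocks, which I'm not showing here:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # client-side switch for push-based shuffle
    .config("spark.shuffle.push.enabled", "true")
    # push-based shuffle also requires the external shuffle service
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate())
```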
Apache Spark SQL evolves and with each new release, it gets closer to the ANSI standard. The 3.2.0 release is no different and you can find many ANSI-related changes in it. But not only, and hopefully you'll discover all of this in this blog post, which has an unusual form because this time, I won't focus on the implementation details.
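Many of those changes revolve around the ANSI mode flag; the sketch below only shows how to turn it on and one kind of behavior change to expect:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.ansi.enabled", "true")
    .getOrCreate())

# with ANSI mode on, an invalid cast raises an error instead of returning NULL
try:
    spark.sql("SELECT CAST('abc' AS INT)").show()
except Exception as error:
    print(f"ANSI mode rejected the cast: {error}")
```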