Articles about Data processing on waitingforcode.com - articles for the pleasure of learning and discovery


March 5, 2022 • Apache Spark Structured Streaming

Dynamic resource allocation in Structured Streaming

The micro-batch mode of Structured Streaming inherits a lot of features from the batch part of Apache Spark. Apart from the retry mechanism presented previously, it also shares the same auto-scaling logic, which relies on Dynamic Resource Allocation.

February 5, 2022 • Apache Spark

Ops practices in Apache Spark project

A good CI/CD process avoids many pitfalls related to manual operations. Apache Spark also has one, based on GitHub Actions. Since this part of the project has been a small mystery to me, I wanted to spend some time exploring it.

January 29, 2022 • Apache Spark Structured Streaming

Broadcast join and changing static dataset

Last year I wrote a blog post about broadcasting in Structured Streaming and I got an interesting question under one of the demo videos. What happens if the static dataset joined in broadcast mode gets new data? Let's check this out!

January 15, 2022 • Apache Spark Structured Streaming

Task retries in Apache Spark Structured Streaming

Unexpected things happen and, sooner or later, any pipeline can fail. Fortunately, some errors are temporary and can be recovered automatically after a few retries. What if the job is a streaming one? Let's see how Apache Spark Structured Streaming handles task retries in micro-batch and continuous modes!

January 8, 2022 • Apache Spark

Kubernetes concepts for Apache Spark

I had the idea for this blog post when I was preparing the "What's new in Apache Spark..." series. At that time, I was writing about Kubernetes in the context of Apache Spark but needed to "google" a lot of things on the side, mostly Kubernetes API terms.

January 1, 2022 • Apache Spark SQL

Distinct vs group by key difference

I've heard an opinion that using DISTINCT can have a negative impact on big data workloads, and that queries with GROUP BY are more performant. Is it true for Apache Spark SQL?

December 25, 2021 • Apache Spark

What's new in Apache Spark 3.2.0 - miscellaneous changes

My Apache Spark 3.2.0 series comes to its end. Today I'll focus on the miscellaneous changes, that is, all the improvements I couldn't categorize in the previous blog posts.

December 20, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - Apache Parquet and Apache Avro improvements

I still have 2 topics remaining in my "What's new..." backlog. I'd like to share the first of them with you today, and see what changed for Apache Parquet and Apache Avro data sources.

December 11, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - performance optimizations

Apache Spark 3.0 extended the static execution engine with a runtime optimization engine called Adaptive Query Execution. It has changed a lot since that very first release, and the most recent version is no exception! But AQE is not the only performance improvement, and I hope you'll see this in the blog post!

December 4, 2021 • PySpark

What's new in Apache Spark 3.2.0 - PySpark and Pandas

Project Zen is an initiative to make PySpark more Pythonic and to improve the Python programming experience. Apache Spark 3.2.0 took another step in this direction by bringing Pandas to the API!

November 27, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - Data Source V2

Even though Data Source V2 has been present in the API for a while, every release brings something new to it. This release is no exception, and we'll see what changed in this blog post!

November 20, 2021 • Apache Spark

What's new in Apache Spark 3.2.0 - push-based shuffle

Previous Apache Spark releases brought many shuffle evolutions, such as shuffle file tracking or the pluggable storage interface. The 3.2.0 release continues the trend with the push-based merge shuffle.

November 13, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - SQL changes

Apache Spark SQL evolves and, with each new release, it gets closer to the ANSI standard. The 3.2.0 release is no different and you can find many ANSI-related changes in it. But not only those! Hopefully, you'll discover all of them in this blog post, which has an unusual form because this time I won't focus on the implementation details.

November 6, 2021 • Apache Spark Structured Streaming

What's new in Apache Spark 3.2.0 - Structured Streaming

After previous blog posts focusing on 2 specific Structured Streaming features, it's time to complete them with a list of other changes made in the 3.2.0 version!

October 30, 2021 • Apache Spark Structured Streaming

What's new in Apache Spark 3.2.0 - session windows

Initially, I wanted to include the session windows in the blog post about Structured Streaming changes. But I changed my mind when I saw how many things they involve!

October 23, 2021 • Apache Spark Structured Streaming

What's new in Apache Spark 3.2.0 - RocksDB state store

It's big news for Apache Spark Structured Streaming users. RocksDB is now available as a state store backend in vanilla Apache Spark!

October 16, 2021 • Apache Spark

Stage level scheduling

The idea of writing this blog post came to me when I was analyzing Kubernetes changes in Apache Spark 3.1.1. Starting from this version, stage level scheduling, so far available only for YARN, can also be used on Kubernetes. Even though it's probably a very low level feature, it intrigued me enough to write a few words here!

August 28, 2021 • Apache Spark

Iterators in Apache Spark

I had this "aha moment" while I was preparing the blog posts about the shuffle readers. Apache Spark uses iterators a lot! In this blog post you will see the places where I have met them over the last few months.

August 21, 2021 • Apache Spark SQL

Shuffle reading in Apache Spark SQL - wrapping iterators and beyond

It's time for the 2nd blog post about the shuffle readers. Recently, we discovered how Apache Spark fetches the shuffle blocks from local and remote hosts. Today, I would like to share with you the wrapping iterators. Sounds mysterious? It won't be if we start by looking at the iterators participating in the processing of shuffle block files.

August 14, 2021 • Apache Spark SQL

Shuffle reading in Apache Spark SQL

So far I've covered the writing part of the shuffle files. You've learned about 3 different shuffle writers, but what happens with the files they generate? Who reads them, and how? Is the reading an in-memory operation? I will try to answer these and some other questions in this blog post.
