Articles about Data processing on waitingforcode.com - articles for the pleasure of learning and discovery


June 27, 2022 • Apache Spark SQL

What's new in Apache Spark 3.3 - joins

Joins are probably the most popular operation for combining datasets, and Apache Spark already supports multiple types of them! In the new release, the framework got 2 new features: storage-partitioned joins and row-level runtime filters.

June 18, 2022 • Apache Spark SQL

Radix and Tim sort

The topic of this blog post is not new because the discussed sort algorithms have been there since Apache Spark 2. But I've never had a chance to present them, so today I'll try to do it.

June 11, 2022 • PySpark

Generators and PySpark

I remember the first PySpark code I saw. It was pretty similar to the Scala code I used to work with except for one small detail, the yield keyword. Since then, I've understood the purpose of generators and have been actively looking for an occasion to blog about them. Growing the PySpark section is a great opportunity for this!
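To give an idea of what the yield keyword looks like in this context, here is a minimal sketch rather than code taken from the article. The function name `numbers_with_squares` and the sample data are made up for illustration, and a plain Python iterator stands in for a Spark partition so the snippet runs without a cluster.

```python
from typing import Iterable, Iterator, Tuple


def numbers_with_squares(partition: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Lazily emit (number, square) pairs for one partition of input rows."""
    for number in partition:
        # yield hands back one record at a time instead of building a list,
        # so the function never materializes the whole partition itself
        yield number, number * number


# Local check without a cluster: a plain iterator plays the role of a partition.
pairs = list(numbers_with_squares(iter([1, 2, 3])))
```

In PySpark, a generator function of this shape can be passed to `RDD.mapPartitions`, for example `rdd.mapPartitions(numbers_with_squares)`, since that method expects a function taking an iterator and returning an iterable.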

June 4, 2022 • PySpark

PySpark and the JVM - introduction, part 2

Last time I introduced Py4J, which is the bridge between the Apache Spark JVM codebase and Python client applications. Today is a great moment to take a deeper look at their interaction in the context of data processing defined with the RDD and DataFrame APIs.

April 30, 2022 • PySpark

PySpark and the JVM - introduction, part 1

In my quest to understand PySpark better, the JVM in the Python world is a must-have stop. In this first blog post I'll focus on the Py4J project and its usage in PySpark.

April 23, 2022 • Apache Spark SQL

Tables and Apache Spark

If you're like me and haven't had an opportunity to work with Spark on Hive, you're probably as confused as I was about the tables. Hopefully, after reading this blog post you will understand that concept better!

April 16, 2022 • Apache Spark SQL

Pluggable Catalog API

Despite working with Apache Spark for a while, I still have some undiscovered components. One of them crossed my path while I was writing the first blog post from the ACID file formats series. The lucky one is the Catalog API.

April 9, 2022 • Apache Spark SQL

Beware of .withColumn

The .withColumn function looks like a harmless operation, just a way to add or change a column. True, but it also hides some pitfalls that can even lead to memory issues, and we'll see them in this blog post.

April 2, 2022 • Apache Spark Structured Streaming

Integration tests and Structured Streaming

Unit tests are the backbone of modern software but they only verify a particular unit of the application. What to do if we wanted to check the interaction between all these units? One of the solutions is automated integration tests. While they are relatively easy to implement against data at rest, they are more challenging for streaming scenarios.

March 26, 2022 • Apache Spark

Shuffle configuration demystified - part 3

It's time for the last part of the shuffle configuration overview. This time you'll see the properties related to the shuffle service, reducer, I/O, and a few others.

March 19, 2022 • Apache Spark

Shuffle configuration demystified - part 2

It's time for the second of the 3 parts dedicated to the shuffle configuration in Apache Spark.

March 12, 2022 • Apache Spark

Shuffle configuration demystified - part 1

Probably the most popular configuration entry related to the shuffle is the number of shuffle partitions. But it's not the only one, as you will see in this new blog post series!

March 5, 2022 • Apache Spark Structured Streaming

Dynamic resource allocation in Structured Streaming

The Structured Streaming micro-batch mode inherits a lot of features from the batch part. Apart from the retry mechanism presented previously, it also has the same auto-scaling logic relying on Dynamic Resource Allocation.

February 5, 2022 • Apache Spark

Ops practices in Apache Spark project

A good CI/CD process avoids many pitfalls related to manual operations. Apache Spark also has one, based on GitHub Actions. Since this part of the project has been a small mystery for me, I wanted to spend some time exploring it.

January 29, 2022 • Apache Spark Structured Streaming

Broadcast join and changing static dataset

Last year I wrote a blog post about broadcasting in Structured Streaming and I got an interesting question under one of the demo videos. What happens if the static dataset joined in broadcast mode gets new data? Let's check this out!

January 15, 2022 • Apache Spark Structured Streaming

Task retries in Apache Spark Structured Streaming

Unexpected things happen and sooner or later, any pipeline can fail. Fortunately, some errors are temporary and can be recovered automatically after a few retries. What if the job is a streaming one? Let's see here how Apache Spark Structured Streaming handles task retries in micro-batch and continuous modes!

January 8, 2022 • Apache Spark

Kubernetes concepts for Apache Spark

I had the idea for this blog post when I was preparing the "What's new in Apache Spark..." series. At that time, I was writing about Kubernetes in the context of Apache Spark but needed to "google" a lot of things on the side, mostly the Kubernetes API terms.

January 1, 2022 • Apache Spark SQL

Distinct vs group by key difference

I've heard an opinion that using DISTINCT can have a negative impact on big data workloads, and that queries with GROUP BY are more performant. Is it true for Apache Spark SQL?

December 25, 2021 • Apache Spark

What's new in Apache Spark 3.2.0 - miscellaneous changes

My Apache Spark 3.2.0 series comes to its end. Today I'll focus on the miscellaneous changes, that is, all the improvements I couldn't categorize in the previous blog posts.

December 20, 2021 • Apache Spark SQL

What's new in Apache Spark 3.2.0 - Apache Parquet and Apache Avro improvements

I still have 2 topics remaining in my "What's new..." backlog. I'd like to share the first of them with you today, and see what changed for Apache Parquet and Apache Avro data sources.
