Articles about Data processing on waitingforcode.com - articles for the pleasure of learning and discovery

May 25, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.4.0 - Structured Streaming and correctness issue

Apache Spark is infamous for its correctness issue with chained stateful operations. Fortunately, things improve with each release. The most recent one, 3.4.0, also brought some important changes in that field!

May 17, 2023 • Apache Spark Structured Streaming

What's new in Apache Spark 3.4.0 - Async progress tracking for Structured Streaming

Finally, the time has come to start analyzing the new features in Apache Spark. The first one that grabbed my attention was async progress tracking in Structured Streaming.

April 16, 2023 • Apache Spark SQL

Spark SQL checkpoints

In my long - but not long enough! - journey with Apache Spark, I've met the word "checkpointing" mostly in the context of Structured Streaming. But this term also applies to other modules, including Apache Spark SQL, and so to batch processing!

March 18, 2023 • Apache Spark

Introduction to Apache Spark History

If you need to go back in time and analyze your past Apache Spark applications, you can use the native Apache Spark History server. However, it can also become an infrastructure problem because of the continuously growing historical logs of streaming applications. In this blog post we'll try to understand this component and look at its different configuration options.

March 2, 2023 • Apache Spark SQL

Filtering rules accumulator

Data can have various quality issues, from missing to badly formatted values. However, there is another issue fewer people talk about: erroneous filtering logic.

February 3, 2023 • Apache Spark

Apache Spark as you don't know it

It's difficult to see all the use cases of a framework. Back when I was a backend engineer, I never managed to see all the applications of the Spring framework. Now that I'm a data engineer, I feel the same about Apache Spark. Fortunately, the community is there to show me some outstanding features!

December 3, 2022 • PySpark

Shuffle in PySpark

Shuffle is for me a never-ending story. Last year I spent long weeks analyzing the readers and writers and was hoping for some rest in 2022. However, it didn't happen. My recent PySpark investigation led me to the shuffle.py file and my first reaction was "Oh, so PySpark has its own shuffle mechanism?". Let's check this out!

November 26, 2022 • PySpark

Serializers in PySpark

In the previous PySpark blog posts we learned about the serialization overhead between the Python application and the JVM. A key contributor to this overhead is the Python serializers. They are the topic of this article, which will hopefully provide a more complete overview of the Python <=> JVM serialization.

November 19, 2022 • Apache Spark SQL

Generated method too long to be JIT compiled

There are days like that. You inherit some code and it doesn't really work as expected. While digging into the issues, you find the usual weird warnings but also several new things. For me, one of these things was the "Generated method too long to be JIT compiled..." info message.

November 12, 2022 • Apache Spark

Apache Spark listeners

A message bus is a common architectural design in the Enterprise Design Patterns. But it's also present at a lower level to enable event-driven behavior. Apache Spark is no exception. It uses a publish/subscribe approach in various places.

November 5, 2022 • Apache Spark SQL

Wildcard path and partitions

Let's suppose you store the partitioned data under the /data/mydir location. What will be the difference if you read this directory with Apache Spark as /data/mydir/ and /data/mydir/* ? You should find the answer to the question just below.

October 8, 2022 • PySpark

PySpark and pyspark.zip story

The topic of this blog post is one of my first big surprises while I was learning to debug PySpark jobs. Usually I run the code locally in debug mode and the defined breakpoints help me understand what happens. That time, it was different!

October 1, 2022 • PySpark

PySpark and vectorized User-Defined Functions

The Scala API of Apache Spark SQL has various ways of transforming the data, from column-based functions, both native and User-Defined, to more custom row-level map functions. PySpark doesn't have this mapping feature but does have User-Defined Functions, with an optimized version called vectorized UDF!

September 24, 2022 • Apache Spark SQL

Observable metrics

Observability is a hot topic nowadays, not only in the data industry but also in the software industry. Apache Spark innovates a lot in this field, including new metrics for Structured Streaming and an important update added in the 3.0.0 release that I missed at the time: observable metrics.

September 17, 2022 • Apache Spark SQL

Predicate pushdown, why it doesn't work every time?

Pushdowns in Apache Spark are a great way to delegate some operations to the data sources and reduce the data volume processed in the job. However, there is one important gotcha. Watch out for how you define your predicate because, from time to time, even though the pushdown predicate is supported by the data source, the predicate can still be executed by the Apache Spark job!

September 3, 2022 • Apache Spark

YARN or Kubernetes for Apache Spark?

I wrote my first Kubernetes on Apache Spark blog post in 2018, trying to answer the question: what can Kubernetes bring to Apache Spark? Four years later this resource manager is a mature Spark component, but a new question has arisen in my head: should I stay on YARN or switch to Kubernetes?

July 30, 2022 • PySpark

What's new in Apache Spark 3.3.0 - PySpark

It's time for the last "What's new in Apache Spark 3.3.0..." before a break. Today we'll see what changed in PySpark. Spoiler alert: Pandas users should find one feature very exciting!

July 24, 2022 • Apache Spark Structured Streaming

What's new in Apache Spark 3.3.0 - Structured Streaming

Even though Project Lightspeed is not there yet, Apache Spark Structured Streaming 3.3.0 has several interesting features that should make your daily life easier.

July 23, 2022 • Apache Spark SQL

What's new in Apache Spark 3.3.0 - Data Source V2

After a break for the Data+AI Summit retrospective, it's time to return to Apache Spark 3.3.0 and see what changed for the DataSource V2 API.

June 30, 2022 • Apache Spark SQL

What's new in Apache Spark 3.3 - new functions

New Apache Spark SQL functions are a regular item in my "What's new in Apache Spark..." series. Let's see what has changed in the most recent (3.3.0) release!
