Table file formats - reading path: Apache Hudi

After Delta Lake and Apache Iceberg, it's time to look at the reading path of Apache Hudi. Despite an apparent similarity with those table formats, Apache Hudi has an interesting reading particularity related to its different table types.

PySpark and vectorized User-Defined Functions

The Scala API of Apache Spark SQL has various ways of transforming data, from native and User-Defined column-based functions to more custom, row-level map functions. PySpark doesn't have this mapping feature, but it does have User-Defined Functions with an optimized version called vectorized UDF!

Table file formats - reading path: Apache Iceberg

Last week you could read about data reading in Delta Lake. Today it's time to cover this part in Apache Iceberg!

Observable metrics

Observability is a hot topic nowadays, not only in the data industry but also in the software industry. Apache Spark innovates a lot in this field, with new metrics for Structured Streaming and an important update added in the 3.0.0 release that I missed at the time: observable metrics.

Table formats - reading: Delta Lake

In the previous blog post about Delta Lake, you discovered the logic of the writing part. In the meantime, Delta Lake 2 was released, and it's for this brand new version that I'm going to share some findings related to data reading.

Predicate pushdown, why it doesn't work every time?

Pushdowns in Apache Spark are a great way to delegate some operations to the data sources and reduce the data volume processed by the job. However, there is one important gotcha. Watch out for the definition of your predicate because from time to time, even though the predicate pushdown is supported by the data source, the predicate can still be executed by the Apache Spark job!

What's new on the cloud for data engineers - part 7 (05-08.2022)

Four months is a huge period of time in cloud history, even when 2 of the 4 months are the usual "holiday" months. As you can guess from the title, it's time to see what changed recently on the cloud from a data engineering perspective!

YARN or Kubernetes for Apache Spark?

I wrote my first blog post about Apache Spark on Kubernetes in 2018, in an attempt to answer the question: what can Kubernetes bring to Apache Spark? Four years later, this resource manager is a mature Spark component, but a new question has arisen in my head: should I stay on YARN or switch to Kubernetes?

My ideal data engineer job posting

"Data is the new Oil" is one of the popular sentences describing the huge role of data in our world. And like other resources, data must be extracted too. To find these "Oil workers", organizations look for, among others, data engineers. The task can be more or less easy, and its difficulty depends on various factors. From my 6 years of experience, one of the key starting elements is the job announcement.

What's new in Apache Spark 3.3.0 - PySpark

It's time for the last "What's new in Apache Spark 3.3.0..." before a break. Today we'll see what changed in PySpark. Spoiler alert: Pandas users should find one feature very exciting!

What's new in Apache Spark 3.3.0 - Structured Streaming

Even though Project Lightspeed is not there yet, Apache Spark Structured Streaming 3.3.0 has several interesting features that should make your daily life easier.

What's new in Apache Spark 3.3.0 - Data Source V2

After a break for the Data+AI Summit retrospective, it's time to return to Apache Spark 3.3.0 and see what changed for the DataSource V2 API.

Data+AI Summit 2022 retrospective - part 2

Yesterday I shared with you the human part of my Data+AI Summit. It's time now to give you my takeaways from the technical talks.

Data+AI Summit 2022 retrospective - part 1

There will be many "first times" in our lives. For me, the Data+AI Summit 2022 was the first time I visited the USA, added a third dimension to the pictures of my virtual friends, and felt huge community support in a very troubled moment. Besides, I also enjoyed the talks and the walking, even though the latter wasn't so good for my skin ;)

Shedding some light on Azure SQL

When I prepare the "What's new on the cloud..." series, I'm pretty sure that for Azure most of the updates will go to the Azure SQL service. The main idea of the service is simple, but if you analyze it more deeply, you'll find some concepts that might not be the easiest to understand at first.
