Looking for something else? Check the categories of Data processing:
Apache Beam Apache Flink Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark
If not, below you can find all articles belonging to Data processing.
Observability is a hot topic nowadays, not only for the data but also the software industry. Apache Spark innovates in this field a lot, including new metrics for Structured Streaming and an important update added in the 3.0.0 release that I missed at the time, which are the observable metrics.
Pushdowns in Apache Spark are great to delegate some operations to the data sources. It's a great way to reduce the data volume to be processed in the job. However, there is one important gotcha. Watch out the definition of your predicate because from time to time, even though the pushdown predicate is supported by the data source, the predicate can still be executed by the Apache Spark job!
I've written my first Kubernetes on Apache Spark blog post in 2018 with a try to answer the question, what Kubernetes can bring to Apache Spark? Four years later this resource manager is a mature Spark component, but a new question has arisen in my head. Should I stay on YARN or switch to Kubernetes?
It's time for the last "What's new in Apache Spark 3.3.0..." before a break. Today we'll see what changed in PySpark. Spoiler alert: Pandas users should find one feature very exciting!
Even though the Project Lightspeed is not there yet, Apache Spark Structured Streaming 3.3.0 has several interesting features that should make your daily life easier.
After a break for the Data+AI Summit retrospective, it's time to return to Apache Spark 3.3.0 and see what changed for the DataSource V2 API.
New Apache SQL functions are a regular position in my "What's new in Apache Spark..." series. Let's see what has changed in the most recent (3.3.0) release!
Joins are probably the most popular operation for combining datasets and Apache Spark supports multiple types of them already! In the new release, the framework got 2 new strategies, the storage-partitioned and row-level runtime filters.
The topic of this blog post is not new because the discussed sort algorithms are there from Apache Spark 2. But it happens that I've never had a chance to present them and today I'll try to do it now.
I remember the first PySpark codes I saw. They were pretty similar to the Scala ones I used to work with except one small detail, the yield keyword. Since then, I've understood their purpose but have been actively looking for an occasion to blog about them. Growing the PySpark section is a great opportunity for this!
Last time I introduced Py4j which is the bridge between Apache Spark JVM codebase and Python client applications. Today it's a great moment to take a deeper look at their interaction in the context of data processing defined with the RDD and DataFrame APIs.
In my quest for understanding PySpark better, the JVM in the Python world is the must-have stop. In this first blog post I'll focus on Py4J project and its usage in PySpark.
If you're like me and haven't had an opportunity to work with Spark on Hive, you're probably as confused as I had been about the tables. Hopefully, after reading this blog post you will understand that concept better!
Despite working with Apache Spark for a while, I still have some undiscovered components. One of them crossed my path while I was writing the first blog post from the ACID file formats series. The lucky one is the Catalog API.
The .withColumn function is apparently an inoffensive operation, just a way to add or change a column. True, but also hides some points that can even lead to the memory issues and we'll see them in this blog post.
Unit tests are the backbone of modern software but they only verify a particular unit of the application. What to do if we wanted to check the interaction between all these units? One of the solutions are automated integration tests. While they are relatively easy to implement against data in-rest, they are more challenging for streaming scenarios.
It's time for the last part of the shuffle configuration overview. This time you'll see the properties related to the shuffle service, reducer, I/O, and a few others.
It's time for the 2 of 3 parts dedicated to the shuffle configuration in Apache Spark.
Probably the most popular configuration entry related to the shuffle is the number of shuffle partitions. But it's not the only one and you will see it in this new blog post series!
Structured Streaming micro-batch mode inherits a lot of features from the batch part. Apart from the retry mechanism presented previously, it also has the same auto-scaling logic relying on the Dynamic Resource Allocation.