Python alternatives to PySpark

PySpark has been getting interesting improvements making it more Python and user-friendly in each release. However, it's not the single Python-based framework for distributed data processing and people talk more and more often about the alternatives like Dask or Ray. Since both are completely new for me, I'm going to use this blog post to shed some light on them, and why not plan a deeper exploration next year?

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and are still writing the same jobs all over again? You have now different options. You can try to look for a new job, now or later, or learn from the others! "Become a Better Data Engineer" initiative is one of these places where you can find online learning resources where the theory meets the practice. They will help you prepare maybe for the next job, or at least, improve your current skillset without looking for something else.

👉 I'm interested in improving my data engineering skillset

See you there, Bartosz

I'm analyzing Dask and Ray from a data engineering perspective, mainly focusing on the processing part. I'm not a data scientist but am going to do my best in trying to understand the pain points these 2 frameworks help solve in this domain. That being said, the analysis covers high-level aspects only I could understand from the documentation.

Dask in a nutshell

The best definition of Dask I've found so far comes from the Is Dask mature? Why should we trust it? question in the Dask FAQ. Let me quote it before exploring other aspects:

Yes. While Dask itself is relatively new (it began in 2015) it is built by the NumPy, Pandas, Jupyter, Scikit-Learn developer community, which is well trusted.Dask is a relatively thin wrapper on top of these libraries and, as a result, the project can be relatively small and simple. It doesn’t reinvent a whole new system.

At first it sounds familiar to the Apache Beam project which is also a wrapper on top of other data processing frameworks and services (Apache Spark, Dataflow, Apache Flink, ...). But after a deeper analysis, there are some important points that don't reduce Dask into a simple API-based project:

Before I pass to Ray, I must share one sentence that caught my attention in the Dask marketing space:

And that's totally the feeling I had while reading the documentation. If you check out the section comparing Dask and Spark, you will see there is nothing like "we're the best". It's a very pragmatic vision of two tools that will outperform in 2 different domains (ML workloads vs. data engineering workloads).

Ray in a nutshell

Ray has a lot of similarities with Dask, including:

And PySpark?

PySpark has been improving a lot in the last releases to better fit into this Python data landscape. There is an ongoing Project Zen initiative to make PySpark more Pythonic with the support for Pandas DataFrames, easier onboarding, or type hints.

Despite this ongoing effort, I can now better understand the difference between PySpark, Dask, and Ray. While I would start with PySpark for any pure data processing workload, I would think about using Dask or Ray for a more data science-specific use case. Again, I'm not a data science expert but my feeling is that there are some use cases where a PySpark implementation will be way more complex (if not, impossible to write!) than a Dask's or Ray's one. That being said, it's not necessarily PySpark but more the Map/Reduce model that seems to be less adapted to the complex ML algorithms.

Dask and Ray communities do a great job on comparing things. Comparison to Spark explains the differences between Dask and Apache Spark. There is also a great Github issue focusing on the Dask and Ray. So if you're still hungry after reading this short blog post, these links are a great way to discover more!