Data partitioning strategies

Every data processing pipeline can have a source of contention. One of them can be the data localization. When all entries are read from single place by dozens or hundreds of workers, the data source can respond slower. One of solutions to this problem can be the partitioning.

4-day workshop · In-person or online

What would it take for you to trust your Databricks pipelines in production?

A 3-day bug hunt on a 3-person team costs up to €7,200 in lost engineering time. This workshop teaches you to prevent that — unit tests, data tests, and integration tests for PySpark and Databricks Lakeflow, including Spark Declarative Pipelines.

Unit, data & integration tests
Medallion architecture & Lakeflow SDP
Max 10 participants · production-ready templates
See the full curriculum → €7,000 flat fee · cohort of up to 10
Bartosz Konieczny
Bartosz
Konieczny

This post explains the concept of partitioning in data-driven applications. The first part defines the partitioning and describes its 3 possible schemes. The second part lists available partitioning algorithms, showing their sample implementations in Scala.

Partitioning definition

The partitioning divides a dataset into smaller pieces called partitions. Each partition is identified by a unique name and any operation involving it should be transparent to the final user, i.e. the data engine should handle the query to the appropriate partition (even if user could also be able to interact with one explicit partition).

Thanks to the partitioning any big dataset can be divided and distributed in different places (e.g. nodes). This distribution brings the possibility of processing parallelisation. To take a simple example, one worker can deal with the data accumulated in 2016 while the second one can work on the data from 2017. However, to achieve a good optimization gain, the partitions should be equilibrated, i.e. the subdivisions of data stored on each of them should have similar size.

Three popular partitioning schemes exist:

Partitioning algorithms

Big Data processing frameworks (e.g. Apache Spark) and NoSQL engines (e.g. Cassandra, Kafka) use the concept of partitioning to the same purposes as described in previous section. Among all these solutions we can distinguish some common partitioning approaches:

As proved in this post, the partitioning can have a beneficial influence on the data processing. It can speed up querying and writing by distributing the load, without any complexification of client code. However, in order to work efficiently, the data must be distributed evenly. And as it was shown in the second section describing some basic partitioning algorithms, it's not always easy.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems contact@waitingforcode.com đź“©