Data partitioning strategies

Every data processing pipeline can have a source of contention. One of them can be the data localization. When all entries are read from single place by dozens or hundreds of workers, the data source can respond slower. One of solutions to this problem can be the partitioning.

A virtual conference at the intersection of Data and AI. This is not a conference for the hype. Its real users talking about real experiences.
- 40+ speakers with the likes of Hannes from Duck DB, Sol Rashidi, Joe Reis, Sadie St. Lawrence, Ryan Wolf from nvidia, Rebecca from lidl
- 12th September 2024
- Three simultaneous tracks
- Panels, Lighting Talks, Keynotes, Booth crawls, Roundtables and Entertainment.
- Topics include (ingestion, finops for data, data for inference (feature platforms), data for ML observability
- 100% virtual and 100% free

👉 Register here

This post explains the concept of partitioning in data-driven applications. The first part defines the partitioning and describes its 3 possible schemes. The second part lists available partitioning algorithms, showing their sample implementations in Scala.

Partitioning definition

The partitioning divides a dataset into smaller pieces called partitions. Each partition is identified by a unique name and any operation involving it should be transparent to the final user, i.e. the data engine should handle the query to the appropriate partition (even if user could also be able to interact with one explicit partition).

Thanks to the partitioning any big dataset can be divided and distributed in different places (e.g. nodes). This distribution brings the possibility of processing parallelisation. To take a simple example, one worker can deal with the data accumulated in 2016 while the second one can work on the data from 2017. However, to achieve a good optimization gain, the partitions should be equilibrated, i.e. the subdivisions of data stored on each of them should have similar size.

Three popular partitioning schemes exist:

Partitioning algorithms

Big Data processing frameworks (e.g. Apache Spark) and NoSQL engines (e.g. Cassandra, Kafka) use the concept of partitioning to the same purposes as described in previous section. Among all these solutions we can distinguish some common partitioning approaches:

As proved in this post, the partitioning can have a beneficial influence on the data processing. It can speed up querying and writing by distributing the load, without any complexification of client code. However, in order to work efficiently, the data must be distributed evenly. And as it was shown in the second section describing some basic partitioning algorithms, it's not always easy.