Data partitioning strategies

Every data processing pipeline can have a source of contention. One of them can be the data localization. When all entries are read from single place by dozens or hundreds of workers, the data source can respond slower. One of solutions to this problem can be the partitioning.

New ebook 🔥

Learn 84 ways to solve common data engineering problems with cloud services.

👉 I want my Early Access edition

This post explains the concept of partitioning in data-driven applications. The first part defines the partitioning and describes its 3 possible schemes. The second part lists available partitioning algorithms, showing their sample implementations in Scala.

Partitioning definition

The partitioning divides a dataset into smaller pieces called partitions. Each partition is identified by a unique name and any operation involving it should be transparent to the final user, i.e. the data engine should handle the query to the appropriate partition (even if user could also be able to interact with one explicit partition).

Thanks to the partitioning any big dataset can be divided and distributed in different places (e.g. nodes). This distribution brings the possibility of processing parallelisation. To take a simple example, one worker can deal with the data accumulated in 2016 while the second one can work on the data from 2017. However, to achieve a good optimization gain, the partitions should be equilibrated, i.e. the subdivisions of data stored on each of them should have similar size.

Three popular partitioning schemes exist:

Partitioning algorithms

Big Data processing frameworks (e.g. Apache Spark) and NoSQL engines (e.g. Cassandra, Kafka) use the concept of partitioning to the same purposes as described in previous section. Among all these solutions we can distinguish some common partitioning approaches:

As proved in this post, the partitioning can have a beneficial influence on the data processing. It can speed up querying and writing by distributing the load, without any complexification of client code. However, in order to work efficiently, the data must be distributed evenly. And as it was shown in the second section describing some basic partitioning algorithms, it's not always easy.