parallelization unit articles

Sacks - data parallelization unit in Gnocchi

To facilitate parallel processing, Apache Spark and Apache Kafka have the concept of partitions, Apache Beam works with bundles, and Gnocchi deals with sacks. Despite the different names, sacks play the same role in Gnocchi as partitions do in Spark or Kafka: they are the unit of work parallelization.
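The idea of a "unit of work parallelization" can be illustrated with a minimal sketch. It assumes that a metric is assigned to a sack by taking its UUID as an integer modulo the number of sacks; that scheme, the `NUM_SACKS` value, and the function names are assumptions made for illustration, not a copy of Gnocchi's code.

```python
import uuid
from collections import defaultdict

NUM_SACKS = 8  # hypothetical sack count, chosen only for this example


def sack_for(metric_id: uuid.UUID, num_sacks: int = NUM_SACKS) -> int:
    """Map a metric id to a sack index (assumed scheme: UUID integer modulo sack count)."""
    return metric_id.int % num_sacks


def group_by_sack(metric_ids, num_sacks: int = NUM_SACKS) -> dict:
    """Group metric ids by sack so that each sack can be handled by a separate worker."""
    sacks = defaultdict(list)
    for metric_id in metric_ids:
        sacks[sack_for(metric_id, num_sacks)].append(metric_id)
    return dict(sacks)


if __name__ == "__main__":
    metrics = [uuid.uuid4() for _ in range(20)]
    for sack, ids in sorted(group_by_sack(metrics).items()):
        print(f"sack {sack}: {len(ids)} metric(s)")
```

Because the assignment is deterministic, every worker can compute which sack a given metric belongs to without coordinating with the others.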

Continue Reading →

Data partitioning in Apache Beam

The power of Big Data processing platforms lies mainly in their ability to parallelize processing across different nodes. Each framework has its own unit of parallelism. In Spark it's called a partition. Apache Beam calls it a bundle.

Continue Reading →

Partitioning internals in Spark

In October I published a post about Partitioning in Spark. It was an introduction to partitioning, focused mainly on the basics, such as partitioners and partitioning transformations (coalesce and repartition). Now is a good moment to take up other aspects of partitions.

Continue Reading →

Partitioning RDBMS data in Spark SQL

Without an explicit partitioning definition, Spark SQL won't partition the data, i.e. all rows will be processed by a single executor. That's not optimal, since Spark was designed for parallel and distributed processing.
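Spark's JDBC data source accepts the `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` options for this purpose. The sketch below shows, in plain Python, one way a numeric range can be cut into per-partition `WHERE` clauses. It is a simplified illustration under assumed integer bounds, not Spark's actual implementation, and `range_predicates` is a hypothetical helper name.

```python
def range_predicates(column: str, lower: int, upper: int, num_partitions: int) -> list:
    """Split [lower, upper) into num_partitions WHERE clauses on an integer column.

    Simplified sketch: rows outside the bounds are not dropped, they fall into
    the first or the last partition. NULLs go to the first partition.
    """
    if num_partitions <= 1 or upper <= lower:
        return ["1 = 1"]  # a single partition that reads every row
    stride = max((upper - lower) // num_partitions, 1)
    # Cut points between consecutive partitions.
    cuts = [lower + stride * i for i in range(1, num_partitions)]
    predicates = [f"{column} < {cuts[0]} OR {column} IS NULL"]
    for start, end in zip(cuts, cuts[1:]):
        predicates.append(f"{column} >= {start} AND {column} < {end}")
    predicates.append(f"{column} >= {cuts[-1]}")
    return predicates


if __name__ == "__main__":
    for predicate in range_predicates("id", 0, 100, 4):
        print(predicate)
```

Each predicate could then back a separate query, so that several executors read disjoint slices of the table in parallel instead of one executor reading everything.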

Continue Reading →

Partitioning in Spark

Partitioning is quite a common concept in distributed data processing. Spark is no exception and provides several operations related to partitions.

Continue Reading →