Partitioning is quite a common concept in distributed data processing. Spark is no exception, and it also provides several operations related to partitions.
Without any explicit partitioning definition, Spark SQL won't partition the data, i.e. all rows will be processed by a single executor. That's not optimal, since Spark was designed for parallel and distributed processing.
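As a minimal sketch of that behaviour, assuming a JDBC source with hypothetical connection details, table and `id` column: without partitioning options the whole table lands in a single partition, while `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` split the read into parallel range queries.

```scala
import org.apache.spark.sql.SparkSession

object JdbcPartitioningSketch extends App {
  val spark = SparkSession.builder()
    .appName("jdbc-partitioning-sketch")
    .master("local[*]")
    .getOrCreate()

  // Without partitioning options the whole table is read by a single task,
  // i.e. everything ends up in one partition handled by one executor core.
  val singlePartition = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/shop") // hypothetical database
    .option("dbtable", "orders")                            // hypothetical table
    .option("user", "test").option("password", "test")
    .load()
  println(singlePartition.rdd.getNumPartitions) // 1

  // With partitionColumn/lowerBound/upperBound/numPartitions, Spark issues
  // numPartitions parallel queries, one per generated id range.
  val parallelRead = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/shop")
    .option("dbtable", "orders")
    .option("user", "test").option("password", "test")
    .option("partitionColumn", "id") // hypothetical numeric column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
  println(parallelRead.rdd.getNumPartitions) // 8
}
```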
In October I published a post about partitioning in Spark. It was an introduction, mainly focused on the basics, such as partitioners and partitioning transformations (coalesce and repartition). Now it's a good moment to take up other partitioning topics.
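As a quick reminder of the two transformations mentioned above, here is a minimal sketch (with an arbitrary in-memory dataset) contrasting repartition, which shuffles and can increase or decrease the partition count, with coalesce, which only merges existing partitions and therefore can only decrease it:

```scala
import org.apache.spark.sql.SparkSession

object RepartitionVsCoalesceSketch extends App {
  val spark = SparkSession.builder()
    .appName("repartition-vs-coalesce-sketch")
    .master("local[4]")
    .getOrCreate()
  import spark.implicits._

  val numbers = (1 to 100).toDF("nr")
  println(numbers.rdd.getNumPartitions) // depends on the default parallelism of local[4]

  // repartition(n) triggers a full shuffle and can increase or decrease
  // the number of partitions
  println(numbers.repartition(10).rdd.getNumPartitions) // 10

  // coalesce(n) only merges existing partitions (no shuffle), so it can
  // only reduce their number
  println(numbers.coalesce(2).rdd.getNumPartitions) // 2
}
```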
The power of Big Data processing platforms resides mainly in the ability to parallelize processing across different nodes. Each framework has its own unit of parallelism: in Spark it's called a partition, while Apache Beam calls it a bundle.
To facilitate parallel processing, Apache Spark and Apache Kafka have their concept of partitions, Apache Beam works with bundles, and Gnocchi deals with sacks. Despite the different naming, sacks are to Gnocchi what partitions are to Spark or Kafka: the unit of work parallelization.