Massively Parallel Processing

Querying big amounts of data has never been so simple as nowadays. Amazon Redshift and Azure SQL Data Warehouse are one of the solutions. But using them wouldn't be possible without a more global concept known as MPP.

This post explains the meaning of that mysterious acronym. The first part defines the concepts without delving into details. The second section explains the main components of MPP architecture in the example of Azure SQL Data Warehouse.

Definition

MPP is a concept from parallel processing family. Before we focus on it, we'll see 2 other solutions from this category. The first one is called SMP for symmetric multiprocessors. In this shared-everything architecture multiple processors share memory and I/O. Thus, they're tightly coupled and it makes scaling them difficult. Another family member looks more like MPP and it's called SDC for shared-disk clusters. In this shared-something architecture the processors share disk but scaling them and providing high availability is easier than for SMP. However they require workload synchronization and balancing that makes them worse candidates than MPP.

MPP means Massively Parallel Processing. It's shared-nothing architecture used by popular data warehouse solutions as Amazon Redshift and Azure SQL Data Warehouse. Each processors has its own memory and external I/O. The cluster can be composed of a lot of commodity hardware nodes that makes the scaling cheaper than in 2 previous cases. However, this architecture also has a drawback. It requires pretty efficient communication layer to coordinate the work and communication of processors, e.g. during data shuffling for GROUP BY queries. But as told, scaling is easy thanks to resources separation.

Scalability is also influenced by the data organization. One of 3 strategies can be used:

We could summarize MPP in the following schema:

Azure Data Warehouse components

To see MPP implementation in details, let's analyze Azure SQL Data Warehouse architecture:

For the years divide-and-conquer strategy was one of solutions to speed up programs execution. It's especially true for Java world and ForkJoinPool. But with the arrival of Big Data this concept also made its proofs. One of the examples is Massively Parallel Processing architecture shown in this post. As described in the first section, it's based on a parallel execution of small chunks of a bigger query by processor units (commodity hardware nodes). The second part shown how this architecture is implemented in Azure cloud. We could clearly discover that separation of compute and storage helps to scale both layers independently and more efficiently in the case of growing load. However, this lack of coupling requires the existence of an extra layer to handle the communication between nodes, which adds some additional complexity to the whole architecture.