GCP BigQuery by an Apache Spark guy

One of the steps in my preparation for the GCP Data Engineer certificate was working with the "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale" book. And to be honest, I didn't expect that knowing Apache Spark would help me so much in understanding the architectural concepts. If you don't believe me, I will try to convince you in this blog post.

But first things first: if you think about learning BigQuery one day, "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale" by Jordan Tigani and Valliappa Lakshmanan should help you. And if you want to take a shortcut and discover BigQuery as an Apache Spark user, I hope this article will be somewhat useful. It starts with a section introducing both tools and explaining why, despite their apparent differences, I decided to compare them. The second part covers their architecture components, and the last one a few aspects of the data processing part.

To be, or not to be the same?

As you certainly know, Apache Spark is a distributed data processing framework that you can plug into different data stores, like an object store, a data warehouse, or a streaming broker. Even though it can also be used as an ad-hoc querying service, for example via notebook environments, its principal role - and it does it very well - is to process data in a horizontally scalable manner.

Regarding BigQuery, the boundary between compute and storage is a bit more blurry. The service is described as a serverless cloud data warehouse, which automatically places it in the data stores landscape. However, if your batch layer uses the ELT approach, you can also consider it a limited, SQL-only, distributed computing environment. I don't know if you agree, but I would say that it's a first-class data warehouse solution with an extra distributed data processing capability. Of course, this capability is not as flexible as Apache Spark's, but it still exists. And it's because of this capability that I decided to write this blog post and compare its compute concepts with Apache Spark's.

Architecture components

The query engine of BigQuery is called Dremel. Dremel is an internal Google product, in use since 2006, that has continuously evolved. And one of its features is the ability to change the execution plan at runtime, exactly like Adaptive Query Execution in Apache Spark. So we have the first similarity. But let's stop here and, as this section's title indicates, look at the architecture components.
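To give you an idea of the Apache Spark side, below is a minimal sketch enabling Adaptive Query Execution; the flag is the standard Spark 3.x configuration entry and everything else (app name, master) is just an example:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: turning on Adaptive Query Execution so that Spark can
// re-optimize the physical plan at runtime (enabled by default since Spark 3.2).
val spark = SparkSession.builder()
  .appName("aqe-demo")   // example values only
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
```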

The first of them is called the Query Master and indeed, it plays the role of Apache Spark's driver. It's the entry point for the query, i.e., the place where the query is prepared for execution. Apart from parsing the SQL query and preparing the initial query plan, the Query Master also performs the first optimizations, a bit like Apache Spark does with its logical and physical optimization rules. One of the most important of them is partition pruning, which reduces the amount of processed data. Thanks to this dataset reduction, the query will use fewer slots and be cheaper.
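To illustrate the Apache Spark counterpart, here is a hedged sketch of partition pruning; the path and the event_date partitioning column are made up:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pruning-demo").master("local[*]").getOrCreate()

// Hypothetical dataset stored as .../event_date=2024-01-01/..., .../event_date=2024-01-02/...
val events = spark.read.parquet("/tmp/events")

// The filter on the partitioning column lets Spark read only the matching
// directories instead of scanning the whole dataset.
val pruned = events.where("event_date = '2024-01-01'")
pruned.explain() // the scan node should list the predicate under PartitionFilters
```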

Next, the Query Master interacts with the Scheduler to discover the physical location of the data processors. It repeats this operation for every stage of the query. The Scheduler's main role is then to provide the Query Master with the information about where the work can be executed, so that it can be triggered. In Apache Spark this role belongs to the scheduler backend on the driver side (DriverEndpoint), and it's complemented by the DAGScheduler and TaskScheduler that allocate tasks to the executors.
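If you want to observe this stage-then-task scheduling on the Apache Spark side, a SparkListener is an easy way to do it. The sketch below only logs the events; it's an illustration, not production code:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskStart}

// Logs how the DAGScheduler submits stages and how their tasks are later
// assigned to executors by the TaskScheduler.
class SchedulingLogger extends SparkListener {
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    val stage = stageSubmitted.stageInfo
    println(s"Stage ${stage.stageId} submitted with ${stage.numTasks} tasks")
  }
  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
    println(s"Task ${taskStart.taskInfo.taskId} of stage ${taskStart.stageId} " +
      s"started on executor ${taskStart.taskInfo.executorId}")
  }
}

// Registration, assuming an already created SparkSession called spark:
// spark.sparkContext.addSparkListener(new SchedulingLogger())
```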

The Worker Shard is a task running in the Borg container manager (the inspiration for Kubernetes!) and is responsible for executing the allocated work. Initially, I was quite confused by it and thought it was a synonym of Apache Spark's task concept. Only when I read further that a Worker Shard can execute multiple tasks in parallel did I discover that it's more like Apache Spark's executor.
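For the record, the Apache Spark counterpart is requested with the executor settings; a hedged sketch with example values only (depending on the cluster manager, some of them must be passed at submit time rather than in the code):

```scala
import org.apache.spark.sql.SparkSession

// Example values only: executors (the Worker Shard counterparts) sized to run
// several tasks in parallel.
val spark = SparkSession.builder()
  .appName("executor-demo")
  .config("spark.executor.instances", "4") // how many executors to request
  .config("spark.executor.cores", "4")     // parallel task capacity per executor
  .config("spark.executor.memory", "8g")   // memory per executor
  .getOrCreate()
```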

So if the Worker Shard is an executor running some tasks, what is the slot? The official documentation says that a slot is "a virtual CPU used by BigQuery to execute SQL queries." What about Apache Spark? The spark.task.cpus property defines how many CPUs Apache Spark should request for every task. From that definition, I would consider a slot an alternative term used to describe a task in BigQuery. The authors of "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale" seem to confirm my intuition with this sentence: "The Worker Shard itself is capable of running multiple tasks in parallel; each of these tasks represents a schedulable unit, which is the aforementioned slot". But wait. Doesn't it sound like a stage? Maybe, but if you check the documentation, you will see that it's only a feeling: "BigQuery slots execute individual units of work at each stage of the query. For example, if BigQuery determines that a stage's optimal parallelization factor is 10, it requests 10 slots to process that stage." A slot is then a schedulable unit executed as a task on the Worker Shard.
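To translate the slot idea into Apache Spark numbers, a small back-of-the-envelope sketch computing how many tasks a single executor can run concurrently; the values are arbitrary:

```scala
// Arbitrary example values: how many task "slots" a single Spark executor exposes.
val executorCores = 8 // spark.executor.cores
val taskCpus = 2      // spark.task.cpus, the CPUs requested for every task
val slotsPerExecutor = executorCores / taskCpus
println(s"Each executor can run $slotsPerExecutor tasks in parallel") // 4
```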

Data processing

Besides the architecture components, you will also find a few similarities in the data processing part. First and foremost, the shuffle. As in Apache Spark, this special stage is used in grouping operations, and as in Apache Spark, the number of shuffle sinks can change dynamically (AQE shuffle partition coalesce). BigQuery's shuffle involves the same trade-offs as Spark's, i.e., disk spilling when the dataset is too big and, of course, the network traffic. One of the recommended approaches to avoid spilling in BigQuery is to shard the query into multiple ones and process different subsets of the main dataset. The same can be achieved with Apache Spark for a partitioned dataset.
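On the Apache Spark side, the dynamic number of shuffle sinks corresponds to the AQE partition coalescing, and the query sharding can be mimicked by processing one partition subset at a time. A hedged sketch, with made-up paths, columns and dates:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-demo")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")
  // Let AQE merge small post-shuffle partitions, i.e. change the number of shuffle sinks at runtime.
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .getOrCreate()

// "Sharding" a big aggregation: process one partition subset at a time to keep
// every shuffle small enough and limit the disk spilling.
val events = spark.read.parquet("/tmp/events") // hypothetically partitioned by event_date
Seq("2024-01-01", "2024-01-02").foreach { day =>
  events.where(s"event_date = '$day'")
    .groupBy("user_id").count()
    .write.mode("append").parquet("/tmp/daily_counts")
}
```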

By the way, another data organization similarity is the "partitions of partitions" concept, which in Apache Spark is called buckets. In BigQuery this feature is called clustering and it brings exactly the same advantages as in Apache Spark, i.e., bucket pruning and dynamic partition pruning.
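As an illustration, a hedged Apache Spark bucketing sketch; the table, path and column names are made up, and saveAsTable is used because the bucketing metadata must live in the catalog:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketing-demo").master("local[*]").getOrCreate()

// Hypothetical dataset: partitioned by event_date and bucketed
// ("partitions of partitions") by user_id into 16 buckets.
spark.read.parquet("/tmp/events")
  .write
  .partitionBy("event_date")
  .bucketBy(16, "user_id")
  .sortBy("user_id")
  .mode("overwrite")
  .saveAsTable("events_bucketed")
```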

In addition to the data organization points, there is also something related to the data ingestion. BigQuery supports schema inference. However, as for Apache Spark, this approach has a few drawbacks, especially if the dataset is not consistent and, for example, the same field is represented once as a string, once as a struct, and once as an array. That's why defining the schema explicitly is a recommended practice to avoid such inconsistencies.
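The Apache Spark version of that recommendation is to pass the schema explicitly instead of relying on inference; a small sketch with made-up field names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

val spark = SparkSession.builder().appName("schema-demo").master("local[*]").getOrCreate()

// Explicit schema instead of schema inference; the fields are made up.
val eventsSchema = StructType(Seq(
  StructField("event_id", StringType, nullable = false),
  StructField("event_time", TimestampType, nullable = true),
  StructField("payload", StringType, nullable = true) // kept as a raw string to avoid the string/struct/array ambiguity
))

// No inference pass over the data; rows that don't match the schema become
// nulls in the default PERMISSIVE mode.
val events = spark.read.schema(eventsSchema).json("/tmp/raw_events")
```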

Also, BigQuery comes with the same limitation as Apache Spark regarding loading compressed files. Namely, the loading process won't be parallelizable. It's up to you to make the trade-off between the transfer size (smaller with compressed files) and the loading performance (faster if the work can be parallelized).
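To show what it means in Apache Spark, a hedged sketch with a made-up path: a gzipped file is not splittable, so it lands in a single partition and only an explicit repartition parallelizes the downstream work:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compression-demo").master("local[*]").getOrCreate()

// A gzipped file is not splittable, so the whole file is read by a single task.
val rawEvents = spark.read.json("/tmp/events.json.gz")
println(rawEvents.rdd.getNumPartitions) // 1 for a single .gz file

// The load itself stays sequential, but the downstream processing can be
// parallelized with an explicit repartition, at the price of a shuffle.
val distributed = rawEvents.repartition(16)
```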

Of course, those are not the only similarities. Both tend to improve their ANSI SQL support and let you define your own User-Defined Functions. To get an even better idea of that, I recommend taking a look either at the "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale" book or at the Dremel and BigQuery technical documentation.
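And to close with a last illustration, a minimal Apache Spark UDF sketch; the function itself is made up, and BigQuery offers a comparable feature with SQL and JavaScript UDFs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("udf-demo").master("local[*]").getOrCreate()
import spark.implicits._

// A made-up User-Defined Function normalizing a country code.
val normalizeCountry = udf((code: String) => Option(code).map(_.trim.toUpperCase).orNull)

val users = Seq(("1", " fr"), ("2", "PL")).toDF("user_id", "country")
users.select(col("user_id"), normalizeCountry(col("country")).as("country")).show()
```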