AWS Redshift vs GCP BigQuery

Despite recent architectural proposals built around the lakehouse principle, a data warehouse is still an important part of a data system. But there is no single way to build one, and if you look at the cloud providers, you will see various offerings such as Redshift (AWS) or BigQuery (GCP), presented in this article.

In this blog post, I will compare AWS Redshift with GCP BigQuery, two quite popular cloud-native data warehouses. The post is organized into short sections, each presenting one point of comparison.

Architecture

Both solutions rely on a Massively Parallel Processing (MPP) architecture, but its exposition and implementation are different. As a serverless service, BigQuery hides all infrastructure details from the user. Redshift does the opposite and lets the user configure the cluster.

Ops

Having this infrastructure to manage also requires a bit more operational effort on Redshift, for example, when you run out of free disk space. Fortunately, you can easily configure an alarm to be notified of this event. Nonetheless, compared to BigQuery, it's an extra thing to manage.
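
As an illustration, here is a minimal sketch of such an alarm on the PercentageDiskSpaceUsed metric with boto3 (the cluster name and the SNS topic ARN are made up):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the hypothetical "my-warehouse" cluster uses more than 80% of its disk space.
cloudwatch.put_metric_alarm(
    AlarmName="redshift-disk-space-alarm",
    Namespace="AWS/Redshift",
    MetricName="PercentageDiskSpaceUsed",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "my-warehouse"}],
    Statistic="Average",
    Period=300,  # evaluate 5-minute averages
    EvaluationPeriods=1,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-notifications"],
)
```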

Workload management

But do not get me wrong, BigQuery is not a management-free solution. You don't have the infrastructure to manage, but you do have other things, like monitoring query costs or handling the concurrency among users. In BigQuery, the "compute power" allocated to query execution is measured in slots, and the solution uses fair scheduling to distribute them. Each project gets 2,000 slots, but with the API, you can control the amount of data that every user can query. In Redshift, to control the concurrency, you will use a concept called queues. Each queue has an allocated memory share and a number of allowed concurrent queries. Queries are routed to one of the queues according to the user group executing them or the query group label.
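
To give an idea of what a queue definition looks like, here is a sketch that pushes a hypothetical two-queue WLM configuration through a Redshift parameter group with boto3 (the parameter group name and the user group are made up):

```python
import json
import boto3

redshift = boto3.client("redshift")

# Hypothetical WLM setup: a queue reserved for the "reporting" user group,
# plus the default queue taking the remaining memory and concurrency.
wlm_configuration = [
    {"user_group": ["reporting"], "query_concurrency": 5, "memory_percent_to_use": 30},
    {"query_concurrency": 10, "memory_percent_to_use": 70},
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-warehouse-params",  # assumed to exist already
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_configuration),
    }],
)
```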

Long queries

The workload management brings another topic, the long queries. In Redshift, you can define a query monitoring rule on a queue and specify what should happen when a running query breaks this rule. For example, you could use it to stop queries running for too long after being initially allocated to a queue reserved for short-running queries. BigQuery also has a concept of "longness", but it applies to the size of the queried data, through the --maximum_bytes_billed parameter.
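
The same limit can be set from the BigQuery Python client; a minimal sketch, with a made-up dataset and table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Refuse to run the query if it would bill more than roughly 1 GB.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)

query_job = client.query(
    "SELECT user_id, COUNT(*) AS visits FROM `my_dataset.events` GROUP BY user_id",
    job_config=job_config,
)
rows = query_job.result()  # fails if the query would exceed the limit
```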

Pricing

The serverless model and the billing flag presented above hint at the BigQuery pricing model. Yes, it's based on the volume of processed data, and processing 1 TB of data will cost you between $5 and $6, depending on the region. And it will be only a part of the invoice. In addition to that, you'll have to pay for the storage, and for streaming inserts if you decide to ingest data directly from your streaming broker. Fortunately, some operations like export and import are free. But overall, the total cost of use may be difficult to predict. On the other hand, the server-full Redshift gives a much better idea of the bill, at least for the hardware part ($0.30 to $7.60 per hour, depending on the node type). Its serverless companion, Redshift Spectrum, has a cost similar to BigQuery ($5.00 per TB scanned).
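
To make the on-demand BigQuery bill less of a surprise, you can dry-run a query and convert the estimated bytes into dollars; a rough sketch assuming the $5 per TB rate mentioned above (the table is made up):

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run only estimates the bytes that would be processed; nothing is billed.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT * FROM `my_dataset.events`", job_config=job_config)

scanned_tb = job.total_bytes_processed / 1024**4
print(f"Estimated cost: ${scanned_tb * 5:.2f}")  # assumes the $5 per TB on-demand rate
```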

Scaling

Despite the hardware to manage, Redshift is scalable through 2 available resizing modes, called elastic and classic. The elastic mode is very fast (10-15 minutes) and supports changing the number and/or type of nodes by a factor of 2, e.g., from 2 to 4 or from 8 to 4. For the non-supported operations, you have to use the much slower classic resize, which can take 2 hours or more! Apart from the cluster resizing, you can also scale the compute power with a feature called concurrency scaling. The idea is to execute eligible queries on a concurrency scaling cluster whenever the main cluster doesn't have available compute power to process them. Regarding BigQuery, you may think there is no need to scale, but that's not completely true. Every project has 2,000 slots allocated, and if there are too many long-running queries, it can also become slow. To overcome the issue, you can use flex slots to burst your compute capacity when needed.
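
An elastic resize can be triggered with a single API call; a sketch with boto3 and a made-up cluster name:

```python
import boto3

redshift = boto3.client("redshift")

# Elastic resize of the hypothetical "my-warehouse" cluster to 4 nodes.
redshift.resize_cluster(
    ClusterIdentifier="my-warehouse",
    NumberOfNodes=4,
    Classic=False,  # False requests the fast, elastic resize
)
```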

Caching

That was one of my surprises: both data warehouses offer query results caching capabilities. The conditions for reusing query results are also similar. The query can't change, can't contain non-deterministic functions like NOW(), the data must remain the same, and the user issuing the query must have access to the underlying data. Of course, the data warehouse doesn't keep the cache indefinitely and doesn't cache all results.
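
On the BigQuery side, you can verify whether a given execution was served from the cache; a small sketch with the Python client (the table is made up):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT COUNT(*) FROM `my_dataset.events`"  # deterministic, hence cacheable

first_run = client.query(sql)
first_run.result()
second_run = client.query(sql)
second_run.result()

# The second execution should be answered from the results cache,
# as long as the table hasn't changed in the meantime.
print(first_run.cache_hit, second_run.cache_hit)  # typically: False True
```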

SQL support

If you know SQL, you shouldn't be lost when using one or the other. There are some differences in the supported types, the table definitions in the FROM clause, or the function naming (e.g. MEDIAN in Redshift = PERCENTILE_CONT(median_expression, 0.5) OVER() in BigQuery).
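
To illustrate the naming difference, here are two equivalent formulations of the median, using a hypothetical sales table:

```python
# Redshift: MEDIAN is an aggregate function.
redshift_query = """
    SELECT MEDIAN(amount) AS median_amount
    FROM sales
"""

# BigQuery: PERCENTILE_CONT is an analytic function, hence the OVER() and the DISTINCT.
bigquery_query = """
    SELECT DISTINCT PERCENTILE_CONT(amount, 0.5) OVER() AS median_amount
    FROM sales
"""
```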

Dimensional modeling

For the dimensional models so popular in the BI world related to data warehousing, there are 2 different approaches. Redshift - even though BigQuery's approach is still possible! - proposes a data colocation-based solution; i.e. you can locate the rows sharing the same value for a given key on the same node. Thanks to that, Redshift will perform all per-key operations, like aggregations or joins, locally. To learn about all the available data organization styles, look for distribution styles. BigQuery recommends a fully denormalized approach based on the NESTED and REPEATED column types.
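
A sketch of the two modeling styles, with made-up orders tables:

```python
# Redshift: colocate the orders of the same customer on the same node (KEY distribution),
# so joins and aggregations on customer_id stay local.
redshift_ddl = """
    CREATE TABLE orders (
        order_id BIGINT,
        customer_id BIGINT,
        amount DECIMAL(10, 2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
"""

# BigQuery: denormalize the order lines into a REPEATED (ARRAY) column of NESTED (STRUCT) values.
bigquery_ddl = """
    CREATE TABLE my_dataset.orders (
        order_id INT64,
        customer_id INT64,
        lines ARRAY<STRUCT<product STRING, quantity INT64, price NUMERIC>>
    )
"""
```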

Dataset reduction querying

I couldn't find a better name here, but the idea is to see the techniques proposed by both data warehouses to reduce the volume of queried data. In BigQuery, it's crucial since you are billed on the amount of processed data. To reduce the dataset size, you can use partitioning and clustering. With the former, you can organize the data into sub-tables based on an integer or temporal column. And you can further subdivide these sub-tables by storing correlated data together, composing the clusters. Both concepts don't exist in Redshift, but the AWS solution proposes another technique to accelerate query execution. This technique uses sort keys to get a better idea of which blocks to skip during a filter operation.
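
Two DDL sketches, one per engine, with a made-up events table:

```python
# BigQuery: partition the table by day and cluster each partition by user_id,
# so a filtered query only scans (and bills) the matching partitions and blocks.
bigquery_ddl = """
    CREATE TABLE my_dataset.events (
        event_time TIMESTAMP,
        user_id INT64,
        payload STRING
    )
    PARTITION BY DATE(event_time)
    CLUSTER BY user_id
"""

# Redshift: no partitions or clusters, but a sort key lets the engine skip the blocks
# whose min/max range doesn't match the filter predicate.
redshift_ddl = """
    CREATE TABLE events (
        event_time TIMESTAMP,
        user_id BIGINT,
        payload VARCHAR(1024)
    )
    SORTKEY (event_time)
"""
```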

Federated data sources

They are pretty evenly supported in both solutions. AWS Redshift can read data located in S3 with Redshift Spectrum, but it can also use RDS or Aurora. You will find similar options in BigQuery. In addition to GCS and Cloud SQL, you will also have the possibility to query Bigtable and Google Drive files.
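
Two sketches of how external data can be declared (the bucket, database, role and table names are made up):

```python
# Redshift Spectrum: expose an S3-backed Glue Data Catalog database as an external schema.
redshift_spectrum_ddl = """
    CREATE EXTERNAL SCHEMA lake
    FROM DATA CATALOG
    DATABASE 'raw_events_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role'
"""

# BigQuery: an external table backed by CSV files stored in a GCS bucket.
bigquery_external_ddl = """
    CREATE EXTERNAL TABLE my_dataset.raw_events (
        event_time TIMESTAMP,
        user_id INT64
    )
    OPTIONS (
        format = 'CSV',
        uris = ['gs://my-bucket/raw_events/*.csv']
    )
"""
```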

Encryption

Both data warehouses can store the datasets in an encrypted format. It has to be explicitly enabled in Redshift, whereas BigQuery encrypts the data by default. As the encryption key, you can use your own (aka Customer-Managed Encryption Key) integrated with the Key Management Service (KMS; what luck, it's called the same on both cloud providers!).
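
A sketch of both customer-managed key setups (all identifiers, key ARNs and key names are made up):

```python
import boto3
from google.cloud import bigquery

# Redshift: encryption must be requested explicitly, here with a customer-managed KMS key.
boto3.client("redshift").create_cluster(
    ClusterIdentifier="my-warehouse",
    NodeType="dc2.large",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="Replace-me-123",
    Encrypted=True,
    KmsKeyId="arn:aws:kms:eu-west-1:123456789012:key/11111111-2222-3333-4444-555555555555",
)

# BigQuery: encrypted by default; a CMEK only replaces the Google-managed key.
client = bigquery.Client()
table = bigquery.Table("my-project.my_dataset.orders")
table.encryption_configuration = bigquery.EncryptionConfiguration(
    kms_key_name="projects/my-project/locations/eu/keyRings/dw/cryptoKeys/orders-key"
)
client.create_table(table)
```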

Fine-grained access

Can we go further with the security and not only encrypt the data but also control who can access what? Yes, we can, and not necessarily with an intermediary data exposition layer like views! Redshift's GRANT command supports specifying the columns of a table granted to a user. BigQuery uses the Data Catalog service and its policy tags feature.
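
Two illustrative snippets (the table, user and policy tag resource name are made up):

```python
from google.cloud import bigquery

# Redshift: a column-level grant, exposing only the non-sensitive columns of a users table.
redshift_grant = """
    GRANT SELECT (user_id, country, signup_date)
    ON users
    TO analyst
"""

# BigQuery: a Data Catalog policy tag attached to a sensitive column of the schema.
email_field = bigquery.SchemaField(
    "email",
    "STRING",
    policy_tags=bigquery.PolicyTagList(
        names=["projects/my-project/locations/eu/taxonomies/123/policyTags/456"]
    ),
)
```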

Access

And what about the classical access? Redshift is often described as a distributed version of PostgreSQL because this database is the backbone of the service. That's also the reason why the data access is managed with classical database users. It doesn't apply to BigQuery, whose authorization relies on the IAM mechanism.
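
A short sketch of the difference (the schema, dataset and e-mail address are made up):

```python
from google.cloud import bigquery

# Redshift: PostgreSQL-style database users and grants.
redshift_statements = """
    CREATE USER analyst PASSWORD 'Replace-me-123';
    GRANT USAGE ON SCHEMA analytics TO analyst;
    GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO analyst;
"""

# BigQuery: no database users; a reader is added to the dataset's access entries,
# identified by an IAM principal (here, a user e-mail).
client = bigquery.Client()
dataset = client.get_dataset("my-project.my_dataset")
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(role="READER", entity_type="userByEmail",
                         entity_id="analyst@example.com")
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```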

I don't want to finish by saying whether one is better than the other, faster or cheaper. Everything is relative. If you don't have ops skills in the team and can't or don't want to spend money on hiring skilled people, then maybe BigQuery will be the better fit. If you do, and you want predictable bills, then perhaps Redshift will be more appropriate. Or maybe, if your financial department is OK with making some reservations on the cloud resources, the flat-rate pricing and serverless character of BigQuery will win. What is better then? Hard to say; the answer will always depend on the context of the question!