Schema management in cloud streaming services

When you hear "schema management" and "streaming", you'll most likely think of the schema registry from the Apache Kafka ecosystem. That's a fair association, but cloud streaming services manage schemas too, and in this blog post we'll see how.

Bartosz Konieczny

In this blog post, I'll check how the streaming services of the 3 major cloud providers (AWS, Azure, GCP) deal with schemas. Initially, I gave this article the catchy title "Schema registry is everywhere", but after a deeper analysis I understood that the term is inappropriate for GCP Pub/Sub. In fact, the service itself talks about schema management rather than a registry.

Schema management vs schema registry

Let's start by spotting some differences between GCP Pub/Sub, AWS Glue Schema Registry and Azure Event Hubs Schema Registry. Pub/Sub implements schema management more as a metadata annotation on the topics than as a separate metadata management layer because:

Similarities

Despite these differences, which mostly come from the different perception of the schema in Pub/Sub, the services share some similarities, such as:

As you can see, in the cloud you'll find 2 ways to work with schemas. The first method, used by Pub/Sub, relies on a static schema associated with the topic. It's the easy way because the clients don't need to manage the schema part. That management is required in the second method, which is based on the schema registry concept and implemented by Glue and Event Hubs. It requires a bit more effort on the client side but provides a more complete set of features, with schema versions and compatibility modes for migrations.
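To make the contrast more concrete, here is a minimal in-memory sketch of both approaches. It is a toy model and not the SDK of any of the 3 services: the names `TopicWithStaticSchema`, `SchemaRegistry` and `produce` are hypothetical, a schema is reduced to a set of required field names, and the compatibility rule (a new version must keep all fields of the previous one) is a simplifying assumption rather than the exact semantics of Glue or Event Hubs.

```python
from dataclasses import dataclass, field


@dataclass
class TopicWithStaticSchema:
    """Toy model of the first method: the schema is an attribute of the topic."""
    name: str
    required_fields: frozenset

    def publish(self, message: dict) -> bool:
        # The service side validates; the client only sends the payload.
        return self.required_fields.issubset(message.keys())


@dataclass
class SchemaRegistry:
    """Toy model of the second method: schemas live apart from the topics."""
    versions: dict = field(default_factory=dict)  # subject -> list of field sets

    def register(self, subject: str, fields: frozenset) -> int:
        history = self.versions.setdefault(subject, [])
        # Simplified compatibility rule (assumption): a new version may add
        # fields but must keep every field of the previous version.
        if history and not history[-1].issubset(fields):
            raise ValueError("incompatible schema change")
        history.append(fields)
        return len(history)  # 1-based version number

    def get(self, subject: str, version: int) -> frozenset:
        return self.versions[subject][version - 1]


def produce(registry: SchemaRegistry, subject: str, version: int, payload: dict) -> dict:
    """Client-side work in the registry method: fetch, validate, tag the record."""
    schema = registry.get(subject, version)
    if not schema.issubset(payload.keys()):
        raise ValueError("payload does not match the schema")
    # The record carries the schema version so that consumers can resolve it.
    return {"schema_version": version, "payload": payload}
```

In this sketch the publisher of `TopicWithStaticSchema` never touches the schema, whereas `produce` has to resolve a version from the registry and attach it to every record. That extra step is the client-side effort mentioned above, and it is also what makes versioning possible.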
