Schema management in cloud streaming services

When I say "schema management" and "streaming", you'll certainly think of the Schema Registry used with Apache Kafka. That's fair, but cloud streaming services manage schemas too, and in this blog post we'll see how.

In this blog post, I'll check how the streaming services of the 3 major cloud providers (AWS, Azure, GCP) deal with schemas. Initially, I gave this article the catchy title "Schema registry is everywhere", but after a deeper analysis I understood that the term is inappropriate for GCP Pub/Sub. The service itself talks about schema management, not about a registry.

Schema management vs schema registry

Let's start by spotting some differences between GCP Pub/Sub, AWS Glue Schema Registry and Azure Event Hubs Schema Registry. Pub/Sub implements schema management more as a metadata annotation on topics than as a separate metadata management layer because:

Similarities

Despite these differences, which mostly come from the different perception of the schema in Pub/Sub, there are some similarities between the services, such as:

As you can see, in the cloud you'll find 2 ways to work with schemas. The first method, used by Pub/Sub, relies on a static schema associated with the topic. It's the easy way, since the clients don't need to manage the schema part. That management is required in the second method, which is based on the schema registry concept and implemented by Glue and Event Hubs. It requires a bit more effort on the client side but provides a more complete set of features, with schema versions and compatibility modes for schema migration.
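
To make the contrast between the two methods more concrete, here is a minimal in-memory sketch in Python. It doesn't call any cloud SDK: the names `TopicWithStaticSchema`, `ToySchemaRegistry` and `publish_with_registry` are hypothetical, a schema is reduced to a set of required field names, and the compatibility rule is a simplified backward-style check that I assume only for illustration. It isn't the algorithm used by Glue or Event Hubs.

```python
from dataclasses import dataclass, field


@dataclass
class TopicWithStaticSchema:
    """Method 1: one schema, fixed on the topic (hypothetical model)."""
    name: str
    required_fields: frozenset

    def publish(self, message: dict) -> None:
        # Validation happens on the service side; the client only sends data.
        missing = self.required_fields - set(message)
        if missing:
            raise ValueError(f"rejected, missing fields: {sorted(missing)}")


@dataclass
class ToySchemaRegistry:
    """Method 2: schemas live in their own layer, with versions (hypothetical model)."""
    history: dict = field(default_factory=dict)  # schema name -> list of field sets

    def register(self, name: str, required_fields: frozenset) -> int:
        versions = self.history.setdefault(name, [])
        # Toy backward-style rule (an assumption for this sketch): a reader on
        # the new version must still understand data written with the latest
        # one, so the new version may drop required fields but not add any.
        if versions and not required_fields <= versions[-1]:
            added = sorted(required_fields - versions[-1])
            raise ValueError(f"incompatible with v{len(versions)}: adds {added}")
        versions.append(required_fields)
        return len(versions)  # 1-based version number

    def get(self, name: str, version: int) -> frozenset:
        return self.history[name][version - 1]


def publish_with_registry(registry: ToySchemaRegistry, name: str,
                          version: int, message: dict) -> dict:
    # The extra client-side effort: resolve the schema version, validate,
    # then ship the version reference along with the payload.
    missing = registry.get(name, version) - set(message)
    if missing:
        raise ValueError(f"invalid for v{version}: missing {sorted(missing)}")
    return {"schema": name, "version": version, "payload": message}
```

With the first class, the producer only calls `publish`. With the second, it has to register or look up a version first, but it gets a version history and a guard against incompatible changes in exchange.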