Schema management in cloud streaming services

When I tell you "schema management" and "streaming", you'll certainly think about the schema registry of Apache Kafka. That's true but also streaming cloud services do manage the schemas and in this blog post we'll see how.

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and are still writing the same jobs all over again? You have now different options. You can try to look for a new job, now or later, or learn from the others! "Become a Better Data Engineer" initiative is one of these places where you can find online learning resources where the theory meets the practice. They will help you prepare maybe for the next job, or at least, improve your current skillset without looking for something else.

👉 I'm interested in improving my data engineering skillset

See you there, Bartosz

In the blog post, I'll check how streaming services of 3 major cloud providers (AWS, Azure, GCP) deal with schemas. Initially, I called this article with a catchy "Schema registry is everywhere" title but after a deeper analysis, I understood that the term is inappropriate for GCP Pub/Sub. By the way, the service simply calls about schema management and not about a registry.

Schema management vs schema registry

Let's start then by spotting some differences between GCP Pub/Sub, AWS Glue Schema Registry and Azure Event Hubs Schema Registry. Pub/Sub implements the schemas management more as a metadata annotation for the topics than a separate metadata management layer because:

Similarities

Despite these differences mostly coming from a different perception of the schema in Pub/Sub, there are some similarities between the services, such as:

As you can see, on the cloud you'll find 2 ways to work with schemas. The first method used by Pub/Sub relies on a static schema associated with the topic. It's an easy way where the clients don't need to manage the schema part. This management is required in the second method based on a schema registry concept and implemented by Glue and Event Hubs. It requires a bit more effort on the client side but provides a more complete set of features with the schema versions and migration compatibility modes.