When I say "schema management" and "streaming", you'll certainly think of the Apache Kafka schema registry. That's a fair association, but streaming cloud services also manage schemas, and in this blog post we'll see how.
In this blog post, I'll check how the streaming services of 3 major cloud providers (AWS, Azure, GCP) deal with schemas. Initially, I gave this article the catchy title "Schema registry is everywhere", but after a deeper analysis I understood that the term is inappropriate for GCP Pub/Sub. In fact, the service itself talks about schema management and not about a registry.
Schema management vs schema registry
Let's start by spotting some differences between GCP Pub/Sub, AWS Glue Schema Registry and Azure Event Hubs Schema Registry. Pub/Sub implements schema management more as a metadata annotation for topics than as a separate metadata management layer because:
- the schema is associated with the topic; for Glue and Event Hubs it's a separate entity, grouped under a Schema Registry for Glue and under schema groups for Event Hubs
- the schema is immutable: once assigned, it cannot be modified, and deleting it is not a safe option either, since a deleted schema makes all writes to the topic impossible; on Glue and Event Hubs a deleted schema "only" means losing control over the data consistency (I put "only" in quotes because it's a big deal, but a smaller one compared to the inability to produce data)
- there is no way to define the compatibility mode
- the schema can be assigned only once, when the topic is created; Glue and Event Hubs are client-centric because the schema is managed at the client level
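To make the contrast more concrete, here is a minimal in-memory sketch of both styles. It doesn't use any cloud SDK; all the class names (`Schema`, `TopicWithSchema`, `SchemaRegistry`, `RegistryAwareProducer`) are hypothetical and invented for the illustration, and the "schema" is a simplified list of typed fields rather than real Avro.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Toy schema (hypothetical, not Avro): a name and (field, type) pairs."""
    name: str
    fields: tuple

    def validate(self, record: dict) -> bool:
        expected = dict(self.fields)
        # The record must have exactly the expected fields, with matching types.
        return record.keys() == expected.keys() and all(
            isinstance(record[key], expected[key]) for key in expected
        )


class TopicWithSchema:
    """Pub/Sub-like style: the schema is bound to the topic at creation time."""

    def __init__(self, topic_name: str, schema: Schema):
        self.topic_name = topic_name
        self._schema = schema  # assigned once, never replaced
        self.messages = []

    def publish(self, record: dict) -> None:
        # The topic validates the record; the client doesn't manage the schema.
        if not self._schema.validate(record):
            raise ValueError(f"record rejected by topic {self.topic_name}")
        self.messages.append(record)


class SchemaRegistry:
    """Registry style: schemas are versioned entities living outside topics."""

    def __init__(self):
        self._versions = {}  # schema name -> list of registered versions

    def register(self, schema: Schema) -> int:
        versions = self._versions.setdefault(schema.name, [])
        versions.append(schema)
        return len(versions)  # 1-based version number

    def get(self, name: str, version: int) -> Schema:
        return self._versions[name][version - 1]


class RegistryAwareProducer:
    """Client-side style: the producer fetches a schema version and validates."""

    def __init__(self, registry: SchemaRegistry, schema_name: str, version: int):
        self._schema = registry.get(schema_name, version)
        self.sent = []

    def send(self, record: dict) -> None:
        if not self._schema.validate(record):
            raise ValueError("record does not match the selected schema version")
        self.sent.append(record)
```

In the first style the validation lives in the topic and the producer only calls `publish`. In the second one the producer has to know which schema name and version it wants, which is the extra client-side effort, but it's also what makes several coexisting versions possible.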
Despite these differences, which mostly come from a different perception of the schema in Pub/Sub, there are some similarities between the services, such as:
- Apache Avro support - it's not the only available format but it's the most popular one since it's supported in all 3 solutions. In addition to it, there are some other allowed schema definitions: Pub/Sub also supports Protobuf, and Glue supports JSON.
- compatibility modes - unlike Pub/Sub, Glue and Event Hubs support compatibility modes
- versions - compatibility modes automatically involve the support for different versions of a schema.
- open - the registry works naturally with the cloud provider's services but can also interact with other Open Source tools like Kafka Streams or Kafka Connect.
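The link between compatibility modes and versions can be illustrated with a small sketch. It's not the algorithm used by Glue or Event Hubs, only a simplified version of the usual "backward" rule: a new schema version is accepted if it can still read the data written with the previous one, meaning that every added field has a default and the fields kept from the previous version don't change their type. The schemas are Avro-like dictionaries, the `is_backward_compatible` and `register_version` helpers are hypothetical names, and the check deliberately ignores type promotions, unions and aliases.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Simplified backward check between two Avro-like record schemas."""
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    for new_field in new_schema["fields"]:
        old_field = old_fields.get(new_field["name"])
        if old_field is None:
            # A field added in the new version needs a default value,
            # otherwise records written with the old version can't be read.
            if "default" not in new_field:
                return False
        elif old_field["type"] != new_field["type"]:
            # Simplification: any type change is considered incompatible.
            return False
    return True


def register_version(versions: list, candidate: dict) -> int:
    """Append the candidate as a new version if it passes the check."""
    if versions and not is_backward_compatible(versions[-1], candidate):
        raise ValueError("candidate is not backward compatible with the latest version")
    versions.append(candidate)
    return len(versions)  # 1-based version number


order_v1 = {
    "type": "record",
    "name": "Order",
    "fields": [{"name": "id", "type": "long"}],
}
# Adds an optional field with a default: accepted as version 2.
order_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "comment", "type": "string", "default": ""},
    ],
}
# Adds a field without a default: rejected by the check.
order_v3_broken = {
    "type": "record",
    "name": "Order",
    "fields": order_v2["fields"] + [{"name": "amount", "type": "double"}],
}
```

Registering `order_v1` and then `order_v2` with `register_version` produces versions 1 and 2, while `order_v3_broken` raises a `ValueError`. With a topic-bound, immutable schema none of this logic is needed, but there is also no evolution path.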
As you can see, in the cloud you'll find 2 ways to work with schemas. The first method, used by Pub/Sub, relies on a static schema associated with the topic. It's the easier way because the clients don't need to manage the schema part. This management is required in the second method, based on the schema registry concept and implemented by Glue and Event Hubs. It requires a bit more effort on the client side but provides a more complete set of features, with schema versions and compatibility modes for schema migrations.