Before I came to data engineering, I worked a lot with web services and messaging technologies like RabbitMQ and Spring Integration. When I started using streaming brokers, I was a little confused: everything seemed the same but was, in reality, slightly different. There were, and still are, some subtle differences between queues and streaming brokers. In this post, I will focus on them and try to give a better definition of queues and streams.
I will start this post by defining both messaging technologies in 2 separate sections. These won't be in-depth definitions but they should be enough to get the main points. If you are interested in more details, I invite you to read the "Understanding message brokers" e-book (link in the "Read more" section). In the last part of the post, I will compare queues and streams.
A message queue is also known as point-to-point communication, and that term makes it easier to understand. A real-world example of a point-to-point exchange is parcel delivery. Let's imagine that you ordered 3 products on a marketplace, each sold by a different seller. You then expect to receive 3 parcels. It's a point-to-point communication since each message (a parcel) is sent by a producer (a seller) to one specific consumer (you).
Of course, the comparison is a little simplistic because a message queue can have one or more consumers. In such a case, the broker will try to distribute the messages evenly among them. The only guarantee is that every single message will be delivered to only one of the available consumers.
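To make the distribution among several consumers more concrete, here is a minimal in-memory sketch. The SimpleQueue class, its methods and the consumer names are hypothetical and only illustrate the idea; a real broker has its own API and its own distribution strategy, and round-robin is just an assumption here.

```python
from collections import deque
from itertools import cycle


class SimpleQueue:
    """Toy in-memory point-to-point queue (hypothetical, not a real broker API)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def dispatch(self, consumer_names):
        """Hand every pending message to exactly one consumer, round-robin."""
        deliveries = {name: [] for name in consumer_names}
        targets = cycle(consumer_names)
        while self._messages:
            # popleft() removes the message, so no other consumer can receive it.
            deliveries[next(targets)].append(self._messages.popleft())
        return deliveries


queue = SimpleQueue()
for parcel in ["parcel-1", "parcel-2", "parcel-3", "parcel-4"]:
    queue.publish(parcel)

deliveries = queue.dispatch(["consumer-a", "consumer-b"])
print(deliveries)
```

Every parcel ends up with a single consumer, and the queue is empty once the dispatch is done.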
Message queues are characterized by 2 other properties, durability and persistence. When there are no active consumers, the broker keeps the messages in the queue instead of dropping them right away. We then say that the messages are durable. But if the broker fails and restarts in the meantime, it may lose the messages it only buffered in memory. To avoid this loss, we need to persist the messages, very often by writing them to a log file on disk.
Examples of message queues include RabbitMQ, AWS SQS, GCP Pub/Sub and Azure Queue Storage.
A streaming broker is often compared to a distributed append-only log file where every new message is added at the end of the persistent log. One message can be delivered to one or more consumers. In other words, the message will be consumed by all the subscribers of the given log.
Moreover, a message is not limited to the consumers that are active when it arrives. When a new consumer subscribes to the log, it can move to any position in it and start reading the messages from there.
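The append-only log and the free positioning of consumers can be sketched as follows. The AppendOnlyLog class and its method names are made up for the illustration and do not correspond to any real broker client; the point is only that reading does not remove anything.

```python
class AppendOnlyLog:
    """Toy append-only log (hypothetical, for illustration only)."""

    def __init__(self):
        self._records = []

    def append(self, message):
        """Add the message at the end of the log and return its offset."""
        self._records.append(message)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record from the given offset; nothing is removed."""
        return self._records[offset:]


log = AppendOnlyLog()
for event in ["event-0", "event-1", "event-2"]:
    log.append(event)

# A consumer that reads from the very beginning of the log.
early_reader = log.read_from(0)
# A consumer that subscribed later and decided to start at offset 1.
late_reader = log.read_from(1)
print(early_reader, late_reader)
```

Both readers see the same records because consuming a message leaves the log untouched.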
Among the implementations of streaming brokers you will find 2 major players: Apache Kafka if you want to benefit from a rich ecosystem, and AWS Kinesis if you want a fully managed service.
Queues vs streams
Now that we have some basic information about message queues and streaming brokers, it's a good moment to compare them. From the consumer's point of view, a common point is the need for an acknowledgment. Both queue and streaming consumers must inform the broker about the messages they have already consumed. For queues, the most popular mechanism is the acknowledgment message, whereas streaming consumers often commit their last processed position. But the major similarities stop there.
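The difference between acknowledging a message and committing a position can be sketched with two toy classes. AckQueue, OffsetLog, the "billing" group name and all method names are assumptions made for this example, not the API of RabbitMQ, Kafka or any other broker.

```python
class AckQueue:
    """Toy queue where a message goes away only after its acknowledgment."""

    def __init__(self, messages):
        self._pending = dict(enumerate(messages))  # delivery tag -> message

    def receive(self):
        """Return (tag, message) of the oldest unacknowledged message, or None."""
        return next(iter(self._pending.items()), None)

    def ack(self, tag):
        del self._pending[tag]

    def pending_count(self):
        return len(self._pending)


class OffsetLog:
    """Toy log where each consumer group commits the next position to read."""

    def __init__(self, records):
        self._records = list(records)
        self._committed = {}  # consumer group -> next offset to read

    def poll(self, group):
        start = self._committed.get(group, 0)
        return start, self._records[start:]

    def commit(self, group, next_offset):
        self._committed[group] = next_offset


queue = AckQueue(["a", "b"])
tag, message = queue.receive()
queue.ack(tag)  # "a" is gone from the queue for good

log = OffsetLog(["a", "b", "c"])
start, batch = log.poll("billing")
log.commit("billing", start + 2)  # "a" and "b" are processed but stay in the log
print(queue.pending_count(), log.poll("billing"))
```

In the queue, the acknowledgment concerns one message and removes it. In the log, the commit only moves a pointer that belongs to one group of consumers.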
Streaming and queue consumers differ in terms of delivery. A queue delivers a given message to only one consumer, whereas a streaming broker makes it available to all subscribers. Moreover, once delivered and acknowledged, a queue message is removed from the queue. If you want to reprocess it, you will need backup storage, for instance a batch layer. On the other hand, since a streaming broker is a big distributed log file, the consumers are able to move backward and forward, and reprocess already received messages. Of course, if you assign a TTL to the messages, you will only be able to reprocess the ones that have not expired yet.
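The effect of a TTL on reprocessing can be illustrated with a small sketch. The RetentionLog class is hypothetical, and time is passed explicitly as a number of seconds to keep the example deterministic; real brokers manage retention on their own.

```python
class RetentionLog:
    """Toy log that only replays records younger than a TTL (hypothetical)."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._records = []  # list of (append time in seconds, message)

    def append(self, message, now):
        self._records.append((now, message))

    def replay(self, now):
        """Return, in append order, the messages that are not expired at 'now'."""
        return [msg for appended_at, msg in self._records
                if now - appended_at < self._ttl]


log = RetentionLog(ttl_seconds=60)
log.append("old", now=0)
log.append("recent", now=50)

replay_early = log.replay(now=30)  # nothing has expired yet
replay_late = log.replay(now=70)   # "old" is 70 seconds old, so it is expired
print(replay_early, replay_late)
```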
Thanks to the above paragraph we can also draw a landscape of use cases. Message queues, mainly because every message is delivered to a single consumer, can be considered as a back-pressure mechanism in a microservices architecture. One example I like to quote, and that you may have already heard about, is image resizing. Instead of resizing the picture synchronously, directly after the upload, you can postpone it and put the resizing command in your message queue. You could use a streaming broker for that as well, but it's more like using a sledgehammer to crack a nut.
A streaming broker works better as a distributed data store that can be used as a source of truth in modern real-time and data-centric architectures. For instance, imagine that you need to expose a given message in different formats (search, graph, fast access in a key/value store, …). You can't do that with a single message queue because a message is delivered to only one consumer. On the other hand, you can do it pretty easily with a streaming broker. By the way, this technique is called polyglot persistence and I described it in the Polyglot persistence - definition and examples post.
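Exposing the same messages in different formats can be sketched like this. The events, the field names and both view builders are invented for the illustration; in a real architecture each view would be an independent consumer reading the same log and writing to its own data store.

```python
from collections import defaultdict

# Hypothetical events already stored in the streaming broker's log.
events = [
    {"id": 1, "title": "queues explained"},
    {"id": 2, "title": "streams explained"},
]


def build_key_value_view(records):
    """Fast access by id, like a key/value store would provide."""
    return {record["id"]: record for record in records}


def build_search_view(records):
    """Tiny inverted index: word -> ids of the records containing it."""
    index = defaultdict(set)
    for record in records:
        for word in record["title"].split():
            index[word].add(record["id"])
    return dict(index)


# Both views read the full log independently; neither consumes it.
key_value_view = build_key_value_view(events)
search_view = build_search_view(events)
print(key_value_view[2]["title"])
print(sorted(search_view["explained"]))
```

With a single queue, the first view to receive an event would prevent the second one from ever seeing it.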
Messaging technologies are a central piece of many of today's architectures. As explained in the post, queues and streams are different. They have different delivery semantics and, even though they can sometimes be used interchangeably, they often have very different use cases.