Immutability in Big Data
This time we'll focus on immutability in Big Data. The first part describes the idea behind this property. The second one shows some of its pros and cons. This is only a theoretical, introductory post. Later ones will try to describe real-world data-driven applications built on immutable data.
Immutability and data
Immutability in Big Data is based on principles similar to those of immutable data structures in programming. The goal is the same: do not change the data in place and create a new version instead. Existing data can be neither altered nor deleted. This rule can hold forever or only for a specific amount of time.
Immutability therefore prohibits the in-place changes that random access makes possible. Instead of overwriting existing data, we can only add new data. One of the attributes of the new record (e.g. a creation timestamp) helps to distinguish the old versions from the most recent one. A great example of an immutable data store is Apache Kafka - an append-only distributed log system.
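The append-only idea described above can be sketched in a few lines of Python. This is only an illustration and not how Kafka is implemented: the Record and AppendOnlyLog names, the in-memory list and the integer timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a record can't be modified once created
class Record:
    key: str
    value: str
    created_at: int  # creation timestamp, used to tell versions apart


class AppendOnlyLog:
    """Hypothetical in-memory log: records are added, never changed or removed."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, record: Record) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the appended record

    def history(self, key: str) -> List[Record]:
        # Every version ever written for the key, in write order.
        return [r for r in self._records if r.key == key]

    def latest(self, key: str) -> Optional[Record]:
        # The most recent version is resolved at read time.
        versions = self.history(key)
        return max(versions, key=lambda r: r.created_at) if versions else None


log = AppendOnlyLog()
log.append(Record("user-1", "Paris", created_at=100))
log.append(Record("user-1", "Berlin", created_at=200))  # an "update" is a new record
latest_city = log.latest("user-1").value
```

Here latest_city is "Berlin", while log.history("user-1") still returns both versions. Nothing was overwritten, so the older value remains available.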
Pros and cons
So why is it good to prefer immutable data? First of all, it protects us against data loss. Data loss is often caused by human errors such as software bugs, an incorrect SQL query and so on. With mutable data there is no easy way to recover from such a situation. Of course, we could rely on backups, but good backups are difficult to put in place and costly. Moreover, they don't guarantee data consistency (e.g. when a backup ends up with one half of valid data and the other half of invalid data). When the data is immutable, we can go back in time to the moment before the bug existed and recompute the data after fixing the problem in our code. In the case of the already mentioned Kafka, we could simply skip the offsets of the invalid data and restart the processing from a valid one.
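A minimal sketch of this recovery, with a plain Python list standing in for one partition of a log: the replay function, its parameters and the sample events are hypothetical. With a real Kafka consumer the same idea would typically rely on seeking to a chosen offset, which is not shown here.

```python
from typing import Callable, List, Set


def replay(
    log: List[str],
    process: Callable[[str], int],
    skip_offsets: Set[int],
    start_offset: int = 0,
) -> List[int]:
    """Reprocess the log from start_offset, ignoring offsets known to be invalid."""
    results = []
    for offset in range(start_offset, len(log)):
        if offset in skip_offsets:
            continue  # invalid record: skipped, but never deleted from the log
        results.append(process(log[offset]))
    return results


# The log is only read here, so the same records can be replayed again later.
events = ["10", "20", "corrupted", "40"]
amounts = replay(events, int, skip_offsets={2})
tail = replay(events, int, skip_offsets=set(), start_offset=3)
```

In this sketch amounts is [10, 20, 40] and tail is [40]. The events list itself is untouched, so a later fix in the processing code can simply be followed by another replay.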
Immutability thus facilitates data reprocessing. If your processing pipeline contains a bug and processes hundreds of millions of events per day, with a mutable approach you'll probably have to remove all the invalid data first and regenerate it later. This is very difficult to do quickly in time-constrained scenarios. The dataset to remove can be so big that deleting it could double, or more, the time needed for the recovery. With immutable data we can instead do what was described above, i.e. come back to a valid state, recompute everything and mark the regenerated elements with a new validity marker (e.g. a version suffix, a version column etc.).
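The versioning approach could look like the sketch below. It assumes a hypothetical VersionedOutput store where each regeneration is written next to the previous one under a new version number, and readers follow a pointer to the current version. The names, the "day-1" partition label and the sample bug are invented for the example.

```python
from typing import Dict, List, Tuple


class VersionedOutput:
    """Hypothetical store: regenerated data gets a new version, old data stays."""

    def __init__(self) -> None:
        self._partitions: Dict[Tuple[str, int], List[int]] = {}
        self._current: Dict[str, int] = {}  # partition -> version served to readers

    def publish(self, partition: str, rows: List[int]) -> int:
        version = self._current.get(partition, 0) + 1
        self._partitions[(partition, version)] = list(rows)  # written once
        self._current[partition] = version  # switch readers to the new version
        return version

    def read(self, partition: str) -> List[int]:
        return self._partitions[(partition, self._current[partition])]

    def read_version(self, partition: str, version: int) -> List[int]:
        return self._partitions[(partition, version)]


raw_events = [1, 2, 3]
store = VersionedOutput()
store.publish("day-1", [e * 10 for e in raw_events])   # buggy run: wrong factor
store.publish("day-1", [e * 100 for e in raw_events])  # fixed run, no delete needed
current_rows = store.read("day-1")
```

After the second publish, current_rows is [100, 200, 300] and the buggy output is still readable as version 1. The recovery consists of a recomputation and a pointer switch, without any mass deletion step.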
However, immutability is not a silver bullet. First, it increases the amount of stored data. This is less problematic today than 10 years ago because storage is cheaper. The second disadvantage is more important. Depending on its place in the pipeline, immutable data can introduce a lot of complexity: the applications must be able to distinguish invalid or old data from the valid, most recent data. This last point will be developed in later posts.
As we learned in the first section of this post, immutability is the opposite of mutability, which is often described as destructive updates. The destruction concerns the data that, once changed, can't be recovered later. With immutability, the data is considered an unalterable asset. As shown in the second section, this brings a lot of advantages: easier recovery and data that tolerates human and machine errors. However, it also comes at a cost, which most often translates into increased system complexity.