The next step of my multi-cloud exploration is object stores. In this article, I will try to find similarities between S3, Azure Storage Account, and GCS. The article is divided into 5 sections, each covering one feature present in all 3 object store services.
The first discussed feature is versioning. Versioning is an interesting strategy to prevent accidental data loss. When it's enabled, every write results in a new version of the object, so you can restore previous versions at any time. The same applies to deletes: a delete creates a delete marker and keeps the previous versions recoverable. Of course, the trade-off here is increased storage space, because you keep as many versions as there were writes. It can also sometimes lead to throttled requests when a small number of objects accumulates many versions. Also, please notice that versioning doesn't protect against bucket or container removal. Azure has a preview feature with container soft delete, and resource locks can eventually mitigate this problem. For all providers, you should be able to control the allowed actions on the bucket with the help of permissions.
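To make these semantics concrete, here is a toy, in-memory model of a versioned bucket. It is an illustration only (not any provider's SDK): writes append versions, a delete only appends a delete marker, and older versions stay readable.

```python
# Toy in-memory model of bucket versioning: every write appends a new
# version, a delete appends only a delete marker, so older versions
# remain recoverable. Illustration only, not a cloud SDK.
DELETE_MARKER = object()

class VersionedBucket:
    def __init__(self):
        self._versions = {}  # key -> list of payloads (or DELETE_MARKER)

    def put(self, key, data):
        self._versions.setdefault(key, []).append(data)

    def delete(self, key):
        # No data is destroyed: only a marker becomes the latest "version".
        self._versions.setdefault(key, []).append(DELETE_MARKER)

    def get(self, key, version_id=None):
        history = self._versions.get(key, [])
        if not history:
            return None
        data = history[-1] if version_id is None else history[version_id]
        return None if data is DELETE_MARKER else data

    def version_count(self, key):
        return len(self._versions.get(key, []))

bucket = VersionedBucket()
bucket.put("report.csv", b"v1")
bucket.put("report.csv", b"v2")          # a second write = a second version
bucket.delete("report.csv")              # adds a delete marker
assert bucket.get("report.csv") is None           # latest read hits the marker
assert bucket.get("report.csv", 1) == b"v2"       # previous version still readable
assert bucket.version_count("report.csv") == 3    # 2 writes + 1 delete marker
```

The last assertion also shows the trade-off mentioned above: the stored version count grows with every write and delete.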
Sometimes, we may also need to lock an object, i.e. write it once and keep it unchanged, for example, for regulatory purposes. S3 enables this with the S3 Object Lock feature, which you can configure in governance or compliance mode. The former allows changes for authorized users, whereas the latter disables them completely, even for the root user!
Regarding Azure Storage Account, it comes with a feature called immutable storage that you can use to put blobs into a read-only state. Here too, you can configure it in 2 different ways: either with a time-based retention policy or with an explicit legal hold.
On GCS you have a similar feature called object holds. Any object with a hold can be neither deleted nor modified. It has 2 configuration modes which behave differently when the bucket has a retention policy configured. The first one, called event-based, resets the object's retention time, whereas the second, called temporary, doesn't impact it. For example, suppose you put an event-based hold on an object in a bucket with a 6-month retention period, and you release the hold after 1 month. You will be able to delete the object only 6 months after the release, and not 5 - in other words, the time spent under the hold doesn't count toward the retention period.
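The difference between the two hold types can be expressed as a small date computation. This is a simplified model of the behavior described above, not an API call; the 180-day constant stands in for the "6-month" retention policy.

```python
from datetime import date, timedelta

RETENTION = timedelta(days=180)  # stand-in for a ~6-month bucket retention policy

def earliest_delete(created, hold_released=None, event_based=False):
    """Earliest date the object can be deleted, in this simplified model.

    An event-based hold resets the retention clock to the release date;
    a temporary hold leaves the original clock untouched."""
    if event_based and hold_released is not None:
        return hold_released + RETENTION
    return created + RETENTION

created = date(2021, 1, 1)
released = date(2021, 2, 1)  # hold released after ~1 month

# Event-based hold: the retention clock restarts at release time,
# so the object stays protected for ~6 more months.
assert earliest_delete(created, released, event_based=True) == date(2021, 7, 31)

# Temporary hold: retention is still counted from the creation date.
assert earliest_delete(created, released, event_based=False) == date(2021, 6, 30)
```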
Object stores also have various storage classes. The simplest one seems to be Azure Storage Account with its hot, cool and archive data distinction. But things get more complicated if you mix them with locally-redundant, zone-redundant, geo-redundant and geo-zone-redundant storage, which impact the fault-tolerance semantics.
GCS has an extra storage class compared to the Storage Account. You will find there: Standard, Nearline, Coldline and Archive classes. Here too, things get a bit more complicated when you involve fault tolerance, because you can store objects in single-region, dual-region or multi-region configurations.
Finally, AWS S3 has the most storage classes, because you can use: Standard, Intelligent-Tiering, Standard-Infrequent Access (Standard-IA), Glacier, Glacier Deep Archive, Outposts, and if you take the geographical considerations into account, also One Zone-IA or replicated buckets.
An important thing to notice is that data retrieval from the archival storage classes is much slower in S3 and Storage Account.
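A rough way to line the tiers up across the three providers is sketched below. It's an approximate mapping for orientation only, not an official equivalence - pricing, minimum storage duration and retrieval latency all differ.

```python
# Approximate alignment of storage classes across providers, based on the
# intended access frequency. Illustration only; the tiers are not strictly
# equivalent in pricing or retrieval latency.
TIERS = {
    "frequent access":   {"aws": "Standard",             "azure": "Hot",     "gcp": "Standard"},
    "infrequent access": {"aws": "Standard-IA",          "azure": "Cool",    "gcp": "Nearline"},
    "rare access":       {"aws": "Glacier",              "azure": None,      "gcp": "Coldline"},
    "archive":           {"aws": "Glacier Deep Archive", "azure": "Archive", "gcp": "Archive"},
}

# Azure has no dedicated class between Cool and Archive, which is why the
# Storage Account looks like the "simplest" of the three.
assert TIERS["rare access"]["azure"] is None
assert all("aws" in v and "azure" in v and "gcp" in v for v in TIERS.values())
```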
A concept closely related to storage classes is the object lifecycle. Thanks to this feature, you can automate the storage class transition. For example, you can define a rule that moves every object older than 6 months to the archive class. It can also be a good way to reduce the costs generated by old versions or unused objects, since the lifecycle configuration also supports object removal.
All 3 cloud providers described in this article offer object lifecycle management, with the possibility to configure storage class transition rules and expiration actions.
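As an illustration, on S3 such rules are expressed as a JSON document passed to the PutBucketLifecycleConfiguration API. Below is a sketch of a configuration combining both ideas from the paragraph above - a transition to an archival class after 6 months and the expiration of old versions. The rule ID is made up; only the document's shape matters here.

```python
import json

# Sketch of an S3 lifecycle configuration: transition objects to Glacier
# after ~6 months and drop noncurrent versions after a year. The rule ID
# is a made-up example; the shape follows the S3 lifecycle document format.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-after-6-months",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},         # apply to the whole bucket
            "Transitions": [
                {"Days": 180, "StorageClass": "GLACIER"}
            ],
            # Cleans up the versions accumulated by versioned writes.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
        }
    ]
}

# The document is plain JSON, e.g. for boto3's
# put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle).
assert json.loads(json.dumps(lifecycle)) == lifecycle
```

Azure and GCP expose the same two building blocks (transition and expiration) through their own rule formats.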
To wrap up this article, let's see what we can do with object stores. The first use case is their integration with classical data processing frameworks like Apache Spark or Apache Flink, where they're considered first-class citizens, a bit like HDFS in the past.
The second use case is event-driven processing, where you can launch your data processing in response to an object creation or deletion. All of the discussed cloud providers can trigger their serverless functions for that purpose. However, some of them go a bit further; for example, Azure Data Factory allows triggering a batch pipeline at blob creation or deletion.
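Taking S3 as an example, the serverless function receives an event document describing the objects concerned. A sketch of a Lambda-style handler extracting the bucket and key from the well-known S3 notification shape (the processing itself is just a placeholder):

```python
# Sketch of a Lambda-style handler for an S3 "object created" notification.
# The event shape follows the S3 event notification format; real code would
# start the data processing where the placeholder comment sits.
def handle(event):
    processed = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        processed.append((bucket, key))  # placeholder: trigger the pipeline here
    return processed

# Minimal sample event, reduced to the fields the handler reads.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "data/file.csv"}}}
    ]
}
assert handle(sample_event) == [("my-bucket", "data/file.csv")]
```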
The last use case is ad-hoc querying. Object stores are the backbone of external tables, like, for example, the ones you can use in AWS Redshift, Azure Synapse SQL or GCP BigQuery.
Apart from the integration aspect, it's interesting to see some read and write characteristics. For the reading part, if you know precisely the byte range you want to read, you can use the range download feature, which adds a Range header to the request and fetches only the data you want. For the writing part, you can accelerate the operation by using a concurrent upload, i.e. an upload that divides the object into multiple chunks assembled by the service once successfully transferred. GCS offers an extra related feature, object composition. Thanks to it, you can create a new object from multiple already stored ones.
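Both ideas are easy to sketch locally. A range read is just a byte slice driven by a Range header, and a concurrent upload splits the payload into chunks that the service reassembles in order. This is a toy model with plain bytes standing in for the object; the real services do this over HTTP and multipart upload APIs.

```python
# Toy model of range reads and multipart (chunked) upload; no network involved.

def range_read(blob: bytes, start: int, end: int) -> bytes:
    # Equivalent of sending "Range: bytes=start-end" (end inclusive, as in HTTP).
    return blob[start:end + 1]

def split_into_chunks(blob: bytes, chunk_size: int) -> list:
    # Client side of a concurrent upload: chunks can be sent in parallel.
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]

def assemble(chunks: list) -> bytes:
    # Service side: parts are stitched back in order once all are received.
    # GCS's object composition is similar in spirit, except the inputs are
    # already stored objects rather than in-flight parts.
    return b"".join(chunks)

blob = b"0123456789" * 100
assert range_read(blob, 0, 9) == b"0123456789"  # first 10 bytes only
chunks = split_into_chunks(blob, 64)
assert len(chunks) == 16                        # 1000 bytes in 64-byte chunks
assert assemble(chunks) == blob                 # reassembly restores the object
```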
The article listed the feature parity of the object stores in AWS, Azure and GCP, but please, don't get me wrong! Sure, they're very similar, and switching from one to another will probably be relatively easy, but each of them has some hidden powers, like the hierarchical namespaces in Azure, low-latency access to the archive class in GCP, or ad-hoc querying directly on S3 objects with S3 Select.