Make your data disappear on the cloud

Even though storage is cheap and virtually unlimited, that doesn't mean we have to store all the data all the time. To deal with this lifecycle requirement, we can either write a pipeline that removes obsolete records or rely on the data management features offered by cloud services. This blog post gives a short overview of them.

The blog post is composed of 4 sections. Each of them presents one cloud data service feature that can help automate data management.

Object lifecycle

The first - and if you have already worked on the cloud, probably the most obvious - solution is to use the object lifecycle policies available in object store services like S3, GCS and Blob Storage. In a nutshell, the idea is to configure:

the action - it can either be a transition between storage classes or the deletion of the object
the condition - it defines the objects the cleaning process can take action on

That's a very simplistic view because if you check the exact features, you will see that you can do a lot more. Really a lot! In all of the quoted cloud provider services (AWS S3, Azure Blob Storage, GCP GCS), you can create lifecycle rules that filter the objects by their prefixes. For example, you could say that every object located in the bucket "X" whose name starts with "2021/01" should be removed 30 days after creation.

By the way, this temporal criterion is the second condition you can apply. All providers come with the "age" condition, meaning that you can remove or transition the object once it has lived for a specific time. In addition to this, you can also perform more advanced and targeted actions. On AWS S3, for example, you can target the lifecycle policies at objects having specific tags, which are key-value pairs. You can also specify the temporal condition not only as a number of days but also as a specific date! You could then say, for example, that all objects tagged with "year=2021" should be removed or archived on December 31, 2021!
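As a rough illustration of the two S3 examples above, the rules can be written as the dictionary that boto3's `put_bucket_lifecycle_configuration` accepts. The helper names and rule IDs are made up for this sketch, and the final API call is left as a comment since it needs boto3 and valid credentials.

```python
from datetime import datetime, timezone


def expire_by_prefix(rule_id, prefix, days):
    """Rule deleting objects under `prefix` once they are `days` old."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Expiration": {"Days": days},
    }


def expire_by_tag_on_date(rule_id, tag_key, tag_value, when):
    """Rule deleting objects carrying a given tag on a fixed date."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        # S3 expects the date at midnight UTC.
        "Expiration": {"Date": when},
    }


lifecycle_configuration = {
    "Rules": [
        expire_by_prefix("remove-2021-01-after-30-days", "2021/01", 30),
        expire_by_tag_on_date(
            "remove-year-2021",
            "year",
            "2021",
            datetime(2021, 12, 31, tzinfo=timezone.utc),
        ),
    ]
}

# Applying it (assumes boto3 is installed and credentials are configured):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="X", LifecycleConfiguration=lifecycle_configuration
# )
```

Swapping "Expiration" for a "Transitions" entry would archive the objects instead of deleting them.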

Azure Blob Storage goes even further and comes with the possibility - as of this writing (January 26, 2021), it's still in Preview - to define actions based on the last access time. You can then automate the removal or archival of any object that hasn't been accessed for a while, which very likely means an object storing obsolete information.

Archive storage

When you decide to use object lifecycle policies, one of the possible choices is to transition the object to the archive storage class. The general idea of archive storage is to provide a storage place (class) for rarely accessed objects stored for the long term, like backups or compliance datasets. Even though archive storage doesn't make the data disappear, it reduces its visibility for day-to-day usage.

All 3 major cloud providers come with archive storage offerings. GCS has a storage class called Archive that should be used for datasets requiring at least a 365-day storage duration and rare access, since the data retrieval is more expensive than for the other classes. Fortunately, the retrieval latency doesn't differ a lot from the standard storage classes. In other words, the first byte should be available within tens of milliseconds.

I mentioned GCS first, and not by mistake. Unlike Glacier (AWS) and the Archive tier (Azure), it's the only one treating archived data as standard data from the retrieval latency standpoint. If you use the Azure Archive tier, you will have to wait several hours before working on your data. On the other hand, its minimum storage duration is only 180 days.

With Glacier, you can use one of 3 retrieval options, each corresponding to a use case. For Active Archive, you can use the expedited mode and get your data back in 1 to 5 minutes. Less time-sensitive retrievals, called standard, return the data within 3 to 5 hours, whereas the least time-sensitive and at the same time the cheapest ones, called bulk, take between 5 and 12 hours.
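To make the three Glacier options more concrete, here is a minimal sketch building the restore request that boto3's `restore_object` takes for an archived S3 object. The helper name, the bucket and the key are hypothetical, and the retrieval windows simply repeat the figures quoted above.

```python
# Retrieval windows as quoted in this post, keyed by the tier names S3 uses.
RETRIEVAL_WINDOWS = {
    "Expedited": "1 to 5 minutes",
    "Standard": "3 to 5 hours",
    "Bulk": "5 to 12 hours",
}


def build_restore_request(tier, days_available):
    """Build the RestoreRequest payload for a temporary restore.

    `days_available` is how long the restored copy stays readable.
    """
    if tier not in RETRIEVAL_WINDOWS:
        raise ValueError(f"Unknown retrieval tier: {tier}")
    return {
        "Days": days_available,
        "GlacierJobParameters": {"Tier": tier},
    }


request = build_restore_request("Bulk", days_available=2)

# Sending it (assumes boto3 is installed and credentials are configured):
# import boto3
# boto3.client("s3").restore_object(
#     Bucket="my-archive-bucket", Key="backups/2021/01/dump.gz",
#     RestoreRequest=request,
# )
```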

Tables expiration

In addition to managing the data on object stores, you can also manage it automatically in data warehouse services like BigQuery, and it can be configured in different ways. For partitioned tables, you can set the partition expiration so that a partition is removed once it reaches this age. The expiration policy can also apply to the whole table, in case you need a table to work on only during a specific period.
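Both BigQuery settings can be changed with DDL statements through the `partition_expiration_days` and `expiration_timestamp` table options. The sketch below only builds the SQL strings; the function names and the table names are invented for the example, and running the statements is left to whichever BigQuery client you use.

```python
from datetime import datetime, timezone


def partition_expiration_ddl(table, days):
    """DDL removing every partition once it is `days` old."""
    return (
        f"ALTER TABLE `{table}` "
        f"SET OPTIONS (partition_expiration_days = {days})"
    )


def table_expiration_ddl(table, expires_at):
    """DDL dropping the whole table at `expires_at` (timezone-aware)."""
    ts = expires_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return (
        f"ALTER TABLE `{table}` "
        f"SET OPTIONS (expiration_timestamp = TIMESTAMP '{ts}')"
    )


print(partition_expiration_ddl("my_dataset.events", 30))
print(table_expiration_ddl(
    "my_dataset.tmp_export",
    datetime(2021, 12, 31, tzinfo=timezone.utc),
))
```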

I didn't find a similar configuration for AWS Redshift or Azure Synapse Analytics, probably due to the architectural design differences. My feeling is - but I don't have any paper to confirm that - that the BigQuery expiration mechanism has the same backbone as the one responsible for GCS objects expiration. Technically it seems possible since the data is stored in a completely different layer (check my last post on BigQuery schema design) which may have similar characteristics to GCS. Anyway, if you have more precise information on that, please leave a comment. I will be happy to learn!

Records expiration

Finally, you can also control the data removal on a per-record basis. Of course, it's possible only in some specific data stores. Do not expect to be able to remove lines from an object on S3 or GCS.

Most of the time, this feature is available in key-value data stores like DynamoDB. On your table, you can define one attribute as the one storing the Time To Live (TTL) information as a timestamp. A background process will then delete all rows with a TTL value lower than the current time. The removal is not immediate. The process is asynchronous and doesn't guarantee that the item is deleted at the exact expiration time.
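A minimal sketch of how this could look from the application side, assuming a TTL attribute named `expires_at` (a hypothetical name) that holds an epoch timestamp in seconds. Since the deletion is asynchronous, the reader may still see expired items, hence the small `is_expired` helper to filter them out.

```python
import time

TTL_ATTRIBUTE = "expires_at"  # hypothetical attribute name

# Payload for boto3's update_time_to_live call, which enables TTL on a table.
ttl_specification = {"Enabled": True, "AttributeName": TTL_ATTRIBUTE}
# import boto3
# boto3.client("dynamodb").update_time_to_live(
#     TableName="sessions", TimeToLiveSpecification=ttl_specification
# )


def with_ttl(item, ttl_seconds, now=None):
    """Return a copy of `item` expiring `ttl_seconds` from `now`."""
    now = int(time.time()) if now is None else now
    return {**item, TTL_ATTRIBUTE: now + ttl_seconds}


def is_expired(item, now=None):
    """True if the item is past its TTL, even if not physically deleted yet."""
    now = int(time.time()) if now is None else now
    expires_at = item.get(TTL_ATTRIBUTE)
    return expires_at is not None and expires_at < now


session = with_ttl({"id": "abc"}, ttl_seconds=3600, now=1000)
print(session, is_expired(session, now=4000), is_expired(session, now=5000))
```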

On Cosmos DB, the expiration feature can be configured globally, for a table (container in Cosmos DB's terminology), or individually, for every item, exactly like in DynamoDB. But unlike DynamoDB, the Cosmos DB TTL configuration defines a number of seconds counted from the last modification time of the item. In other words, if the item is not modified within this period, it becomes a candidate for removal. The deletion process is asynchronous and runs only when there are enough Request Units left, i.e. the RUs that weren't used for reading and writing.
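The interplay between the container-level and the item-level setting can be summed up in a few lines. The function below is only a local model of the rules, not a call to the Cosmos DB SDK; it assumes the countdown starts at the item's last modification timestamp and that -1 means "never expire".

```python
def effective_expiry(container_default_ttl, item_ttl, last_modified_ts):
    """Return the epoch second when an item becomes a removal candidate.

    Returns None when the item never expires.
    """
    if container_default_ttl is None:
        # TTL is not enabled on the container: item-level values are ignored.
        return None
    # The item-level value, when present, overrides the container default.
    ttl = item_ttl if item_ttl is not None else container_default_ttl
    if ttl == -1:
        return None
    return last_modified_ts + ttl


# Container keeps items for 1 hour by default; one item overrides it to 60 s.
print(effective_expiry(3600, None, last_modified_ts=1000))
print(effective_expiry(3600, 60, last_modified_ts=1000))
```

Every update of the item moves `last_modified_ts` forward, so a regularly modified item keeps postponing its own removal.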

Finally, the record expiration can also be configured in Bigtable. But here, the TTL policy applies at the cell level, i.e. the intersection between the row and the column. Every cell can have different versions, and you can associate a timestamp with each of them that will be used by the Garbage Collection process to figure out which values are too old to be kept. You can also apply different strategies and define the max number of versions to keep for every cell. As you can see, the TTL is then a bit different since it applies at the attribute level and not to the row itself.
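To picture the cell-level behavior, here is a small in-memory simulation of the two strategies, max age and max number of versions. It doesn't use the Bigtable client; the function and its parameters are invented for the illustration, and it assumes a version is dropped as soon as it breaks either of the configured limits.

```python
def garbage_collect(versions, now, max_age_seconds=None, max_versions=None):
    """Return the cell versions that survive, newest first.

    `versions` is a list of (timestamp, value) pairs for a single cell.
    """
    newest_first = sorted(versions, key=lambda v: v[0], reverse=True)
    kept = []
    for rank, (timestamp, value) in enumerate(newest_first):
        too_old = max_age_seconds is not None and now - timestamp > max_age_seconds
        too_many = max_versions is not None and rank >= max_versions
        if not (too_old or too_many):
            kept.append((timestamp, value))
    return kept


cell = [(100, "a"), (200, "b"), (300, "c")]
print(garbage_collect(cell, now=350, max_age_seconds=200))
print(garbage_collect(cell, now=350, max_versions=1))
```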

As you can see, cloud providers come with managed solutions to deal with data removal, a little bit like they do for data security. For object stores, it can be automated through lifecycle management, whereas for the other data stores it takes the form of either an explicit expiration policy on the tables or a TTL attribute on a per-record basis. Some of them can also be plugged into messaging services and generate notification events.