Not losing data on the cloud - strategies

Data is a valuable asset and nobody wants to lose it. Unfortunately, it's possible, even with cloud services. Fortunately, thanks to their features, we can reduce this risk!

When I first heard about data loss prevention, it was in the context of GCP's Data Loss Prevention service. In that context, data loss means leaking sensitive data. However, in this article I will use an extended definition that also covers corrupted, deleted, or unreadable data. And the person behind the loss isn't necessarily "a hacker"; it can also be someone from your organization! Some strategies exist to avoid bad surprises, and I'll cover 5 of them just below.

Strategy 1 - Least privilege

It's the most common approach in data security: give only the needed access. For example, if an Apache Spark job only needs to read data from an S3 bucket, you don't need to give it write permissions. Thanks to that, you add a first security layer protecting the data against malicious users as well as human mistakes.
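To illustrate the idea, here is a minimal sketch with boto3 that creates such a read-only policy for the Spark job. The bucket and policy names are hypothetical and only serve the example:

```python
import json

import boto3

# Hypothetical bucket read by the Spark job.
BUCKET = "my-spark-input-bucket"

# Read-only policy: the job can list the bucket and get objects, nothing more.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}"]
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{BUCKET}/*"]
        }
    ]
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="spark-job-read-only",
    PolicyDocument=json.dumps(read_only_policy)
)
```

The policy can then be attached to the role assumed by the Spark job, so a bug or a compromised credential can at worst read the data, never overwrite or delete it.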

To implement this strategy, you can use custom permission policies (AWS IAM) or custom roles (GCP IAM, Azure IAM). In addition to these pure access control solutions, you can also hide your data in the cloud by blocking public access. It's quite easy to do for object stores, but it's also possible to implement it with the networking strategies explained in the Cloud networking aspects for data engineers article.
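On S3, blocking public access is a single API call. The following sketch, again with boto3 and a hypothetical bucket name, turns on all four public access blocks:

```python
import boto3

s3 = boto3.client("s3")

# Block every form of public access for the bucket (name is hypothetical).
s3.put_public_access_block(
    Bucket="my-spark-input-bucket",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True
    }
)
```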

Strategy 2 - Backups

But what if you missed removing the unnecessary write permissions and someone deleted your data? Nothing is lost if you configured backups. Data warehouses like AWS Redshift or Azure Synapse Analytics offer a classical backup strategy by creating point-in-time restore snapshots. They can be either automatic or manual, but keep in mind that the former are fully managed by the service and often have a limited retention period. To overcome that issue, you will have to copy them or take manual snapshots, which come with their own limitations, like a maximum number of snapshots.
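As an example, a manual Redshift snapshot, or a copy of an automated snapshot into a manual one that escapes the automated retention period, can be taken with boto3. The cluster and snapshot identifiers below are hypothetical:

```python
import boto3

redshift = boto3.client("redshift")

# Take a manual snapshot that is not subject to the automated retention period.
redshift.create_cluster_snapshot(
    ClusterIdentifier="analytics-cluster",
    SnapshotIdentifier="analytics-cluster-before-migration"
)

# Or copy an existing automated snapshot into a manual one to keep it longer.
redshift.copy_cluster_snapshot(
    SourceSnapshotIdentifier="rs:analytics-cluster-2021-06-08-03-00-00",
    TargetSnapshotIdentifier="analytics-cluster-kept-snapshot"
)
```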

It's also worth mentioning that backups exist for other data stores too, like Azure Storage, AWS DynamoDB, Azure Cosmos DB, or GCP Bigtable.

A drawback of the backup approach is that you can only restore the data up to the most recent snapshot. In the documentation, the term to look for is Recovery Point Objective (RPO). It represents the maximum amount of data that can be lost after recovering from a failure. Azure Synapse supports an 8-hour RPO, and AWS Redshift takes a snapshot every 8 hours or every 5 GB of data changes, whichever comes first. So the restored database will not always contain all the data written before the data loss event.

Strategy 3 - Data versioning

If you follow the data landscape, you have certainly heard about data versioning in Delta Lake. That's correct, but the data versioning concept is also present in cloud services! In BigQuery you can restore a previous version of your table with the FOR SYSTEM_TIME AS OF clause. You can even use it for deleted tables, within 7 days after the removal!
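A minimal sketch with the google-cloud-bigquery Python client, assuming hypothetical project, dataset, and table names, reads the table as it was one hour ago:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query the table state from 1 hour ago (time travel); names are hypothetical.
query = """
SELECT *
FROM `my_project.my_dataset.orders`
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
for row in client.query(query).result():
    print(row)
```

The result of such a query can be written back to a new table to materialize the restored version.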

Apart from BigQuery, data versioning is also supported by object stores. If someone overwrites or removes a versioned object, you can easily restore the previous version. One point to note, though: enabled versioning stores the physical data of the previous versions, so it automatically increases the storage cost. Fortunately, this can be easily mitigated with an appropriate lifecycle policy that, for example, could keep only the X most recent versions of each object.
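Below is a minimal boto3 sketch, with a hypothetical bucket name, that enables versioning and adds a lifecycle rule. Here the rule expires noncurrent versions 30 days after they are replaced, a time-based variant of keeping only the most recent versions:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-versioned-bucket"  # hypothetical name

# Enable object versioning on the bucket.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"}
)

# Limit the storage cost: expire noncurrent versions 30 days after replacement.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30}
            }
        ]
    }
)
```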

Unfortunately, the versioning approach very often applies to the data itself (objects, rows, documents, ...). It doesn't prevent you from removing the container of that data. For example, if you delete a BigQuery dataset, you will not be able to recover the versioned tables. The same goes for an S3 bucket. As of this writing (08/06/2021), only Azure Storage seems to have protection against that, with container soft deletes, i.e. deletes that you can roll back.

Strategy 4 - Immutability

Soft deletes also exist for objects, but there is still the risk that you don't notice the erroneous delete and don't restore your data in time. Fortunately, there is another strategy, based on immutable data. Immutable here means read-only data. The read-only character will often be limited in time, for example to comply with regulatory requirements.

Object stores have native support for this strategy with locks. On S3, you will find the Object Lock feature; GCS and Azure Storage expose it as a retention policy. Locks and retention policies seem to be reserved for object stores, though.
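Here is a minimal boto3 sketch of the S3 variant, with a hypothetical bucket name and assuming the default us-east-1 region. Note that Object Lock has to be enabled when the bucket is created:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-immutable-bucket"  # hypothetical name

# Object Lock must be enabled at bucket creation time.
# (Assumes us-east-1; other regions need a CreateBucketConfiguration.)
s3.create_bucket(Bucket=bucket, ObjectLockEnabledForBucket=True)

# Default retention: every new object stays read-only for 30 days.
s3.put_object_lock_configuration(
    Bucket=bucket,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "GOVERNANCE", "Days": 30}}
    }
)
```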

For the other services, you can apply the principle of least privilege and keep only a limited number of writers. It doesn't guarantee immutability, but together with a proper CI/CD flow and a secured environment, it should bring an extra layer of security.

Strategy 5 - Backfill

The last strategy doesn't use any cloud features, but it's important to mention. Very often, a dataset is built from one or more upstream datasets. You can use this dependency to regenerate it in case of removal. Of course, this strategy may not be the most cost-optimal, but if you value your data and the cheaper strategies didn't work correctly, you can consider it as a last resort. It can also compensate for the drawbacks of the other strategies, like the data missed because of the RPO of a backup. A simple sketch of the idea follows.
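The sketch below is framework-free Python; the rebuild function is hypothetical and, in a real pipeline, would re-read the upstream dataset for a given day and rewrite the lost partition:

```python
from datetime import date, timedelta
from typing import Callable


def backfill_daily_partitions(
    start: date,
    end: date,
    rebuild_partition: Callable[[date], None],
) -> None:
    """Regenerate every daily partition between start and end (inclusive)
    by re-running the transformation that built it from the upstream data."""
    current = start
    while current <= end:
        rebuild_partition(current)
        current += timedelta(days=1)


# Hypothetical rebuild step: a real implementation would read the upstream
# dataset for the given day, apply the transformation, and write the output.
def rebuild_orders_partition(day: date) -> None:
    print(f"Rebuilding orders partition for {day.isoformat()} from upstream data")


backfill_daily_partitions(date(2021, 6, 1), date(2021, 6, 7), rebuild_orders_partition)
```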

The article presented 5 different strategies that you can implement to reduce the risk of losing your data. None of them guarantees complete safety, though. A backup may not restore the most recently written data and may not be available on every service. Data versioning may not protect the data container, and immutability won't be easy to implement on all data services. Hence, there is no silver bullet, but rather a set of guidelines that, together, should strengthen the safety of your datasets!

