I first heard about the time travel feature in Delta Lake. But after digging a bit, I found that it's not a pure Delta Lake concept! In this blog post I will show you which cloud services implement it too.
The second data store where I discovered the time travel feature is GCP BigQuery. To query a historical version in this data warehouse service, you have to add the FOR SYSTEM_TIME AS OF expression to the FROM clause.
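For example, to read a table as it was one hour ago (the project, dataset, and table names below are made up):

```sql
SELECT *
FROM `my_project.my_dataset.my_table`
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```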
The operation has one gotcha, though. It doesn't work for deleted tables. In that scenario, for example to restore a mistakenly deleted table, you must use the snapshots feature and the bq cp command. This approach works only up to 7 days after the table deletion.
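The restore relies on the snapshot decorator, a timestamp in epoch milliseconds appended to the table name. Something along these lines, where the timestamp and the names are illustrative:

```shell
# Copy the table as it existed at the given epoch-millisecond timestamp
# (a moment before the deletion) into a new table
bq cp mydataset.mytable@1624046611000 mydataset.mytable_restored
```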
Temporal tables in SQL Database
Another cloud service implementing data versioning, and hence enabling the time travel capability, is Azure SQL Database. The service implements ANSI SQL:2011 temporal tables, a feature also available in SQL Server.
To create a temporal table, you have to enable versioning with the SYSTEM_VERSIONING = ON clause. When used, it automatically creates and associates a history table with the table you want to expose to your end users. You can then manipulate your data as if it were a normal table, except when you need to find an older version of the rows. In that scenario, use the already presented FOR SYSTEM_TIME clause.
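A minimal sketch of a temporal table, with made-up table and column names:

```sql
CREATE TABLE dbo.Product (
    Id INT NOT NULL PRIMARY KEY CLUSTERED,
    Price DECIMAL(10, 2) NOT NULL,
    -- Columns maintained automatically by the engine for versioning
    SysStartTime DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    SysEndTime DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (SysStartTime, SysEndTime)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

-- Later, read the rows as they were at a given point in time
SELECT Id, Price
FROM dbo.Product
FOR SYSTEM_TIME AS OF '2021-06-01T00:00:00';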
Object store versioning
Another type of service supporting time travel is the object store. AWS, Azure, and GCP object stores support object versioning. With this feature enabled, whenever you replace or delete an object, the service will create a new version of it instead of discarding the old data.
To restore a past version on AWS S3 and Azure Blob Storage, you have to reference a version id. On GCS you'll need to provide a generation number in the object get operation.
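To illustrate the version-id semantics without depending on a specific SDK, here is a toy in-memory model (everything in it is made up for the illustration): each put creates a new version, a get without a version id returns the latest one, and a restore is simply a copy of an old version back on top.

```python
import itertools

class VersionedBucket:
    """Toy in-memory model of object-store versioning (not a real SDK)."""

    def __init__(self):
        self._versions = {}          # key -> list of (version_id, body)
        self._ids = itertools.count(1)

    def put(self, key, body):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, body))
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions[key]
        if version_id is None:
            return versions[-1][1]   # the latest version wins
        return next(body for vid, body in versions if vid == version_id)

    def restore(self, key, version_id):
        # "Restoring" an old version = copying it on top as a new version
        return self.put(key, self.get(key, version_id))

bucket = VersionedBucket()
v1 = bucket.put("report.csv", "old content")
bucket.put("report.csv", "new content")
bucket.restore("report.csv", v1)
print(bucket.get("report.csv"))  # old content is the latest again
```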
The time travel feature is also present in GCP Bigtable. When creating a column family, you can set a parameter representing the number of versions to keep for each cell (the intersection of a row and a column). As an alternative strategy, you can also define how old the cell data can be. In Bigtable terms, this data versioning feature is called garbage collection.
To time travel within each cell, you can iterate over all the entries and choose the version you want to work with.
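The mechanics can be sketched with a toy cell model (not the real Bigtable client): each write adds a timestamped version, the garbage-collection policy trims the oldest ones, and time travel picks the newest version at or before a given timestamp.

```python
import time

class Cell:
    """Toy model of a Bigtable cell: a list of timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions   # "keep N versions" GC policy
        self.entries = []                  # newest first: (timestamp, value)

    def write(self, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.entries.insert(0, (ts, value))
        # Apply the garbage-collection policy
        del self.entries[self.max_versions:]

    def read(self, at=None):
        # Time travel: newest version written at or before `at`
        if at is None:
            return self.entries[0][1]
        return next(v for ts, v in self.entries if ts <= at)

cell = Cell(max_versions=2)
cell.write("a", timestamp=1)
cell.write("b", timestamp=2)
cell.write("c", timestamp=3)   # GC drops the version at timestamp 1
print(cell.read())             # c
print(cell.read(at=2))         # b
```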
Probably, I should have started with the streaming brokers! As you know, they store data in an immutable append-only log. Thanks to this type of storage, you can read the data from any point. You can define a specific starting offset or start from an ingestion timestamp.
On the cloud, AWS Kinesis Data Streams lets you read records from a specific timestamp by using the AT_TIMESTAMP starting position. On Azure Event Hubs you can specify a time-based starting position with a parameter called EventPosition. The timestamp then applies to the enqueued time attribute of the stream.
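The idea behind the timestamp-based starting position can be sketched with a toy append-only log (a simplification, not a real broker client): records keep their ingestion timestamps, and a read positions itself at the first record enqueued at or after the requested time.

```python
import bisect

class Stream:
    """Toy append-only log with timestamp-based positioning,
    in the spirit of AT_TIMESTAMP (not a real broker client)."""

    def __init__(self):
        self._timestamps = []   # monotonically increasing ingestion times
        self._records = []

    def append(self, timestamp, record):
        self._timestamps.append(timestamp)
        self._records.append(record)

    def read_from(self, timestamp):
        # First record enqueued at or after the given timestamp
        start = bisect.bisect_left(self._timestamps, timestamp)
        return self._records[start:]

stream = Stream()
stream.append(10, "order-1")
stream.append(20, "order-2")
stream.append(30, "order-3")
print(stream.read_from(20))  # ['order-2', 'order-3']
```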
A limited example of time travel is backups. A backup captures your data at a specific point in time and creates a recoverable file from it. So, you can time travel and restore a past version of your data. However, unlike the examples quoted above, you will always need to materialize the backup first; a backup itself is not queryable like a BigQuery table or the temporal tables in Azure SQL Database.
Custom versioning patterns
Finally, you can also implement a custom versioning strategy. For example, in a data store supporting partition and sort key combinations, like AWS DynamoDB, you could identify the versions thanks to the sort key column. An entity will then have multiple rows like v1, v2, v3, ... representing the values changing over time. It can also have a sort key called "latest" storing the same values as the last written version, to optimize access.
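The pattern can be sketched with a toy in-memory table (names and key layout are made up; no real DynamoDB calls): every write stores a new vN row and overwrites the "latest" row, so the newest version is always a single-key lookup away.

```python
class VersionedItems:
    """Toy model of the sort-key versioning pattern:
    one partition key, sort keys 'v1', 'v2', ... plus a 'latest' copy."""

    def __init__(self):
        self._rows = {}   # (partition_key, sort_key) -> attributes

    def put(self, pk, attributes):
        # Next version number = count of existing vN rows for this key + 1
        version = sum(1 for k, sk in self._rows if k == pk and sk != "latest") + 1
        self._rows[(pk, f"v{version}")] = attributes
        self._rows[(pk, "latest")] = attributes  # fast access to the newest version
        return f"v{version}"

    def get(self, pk, version="latest"):
        return self._rows[(pk, version)]

items = VersionedItems()
items.put("user#42", {"email": "old@example.com"})
items.put("user#42", {"email": "new@example.com"})
print(items.get("user#42"))        # the newest version
print(items.get("user#42", "v1"))  # time travel to the first version
```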
I hope that you can see now that the time travel feature is not tool-specific. It exists in cloud services of various types: serverless data warehouses, streaming brokers, and object stores.