What cloud features for data processing patterns (ETL/ELT)?

During my study of BigQuery I found an ETL pattern called the feedback loop. Since I had never heard of it before, I decided to spend some time searching for other ETL patterns and the cloud features we could use to implement them.


The article is organized into sections, each presenting one data processing pattern. I put the accent on the cloud features we could use in each of them rather than on the pattern itself. I won't cover the orchestration part either; if you are curious about the patterns from an orchestrator's perspective, you can read my other article about ETL data patterns with Apache Airflow.

Output invalidation pattern

Let's start with the one I already covered on the blog, the output invalidation pattern. The goal is to invalidate the previously generated data so that the consumer always uses the most recent dataset.

To deal with the invalidation you can use different strategies. The first of them is based on the partitioning concept. If you use an object store, you'll have to create the partitions from your data processing layer with methods like partitionBy in Apache Spark. If you use a natively partitioned data store like BigQuery, you can work directly on partitioned tables.

A common point of these 2 approaches is that you'll have to delete the previously written data every time. Apache Spark handles it transparently if you use the overwrite save mode. For BigQuery partitioned tables you will have to use the WRITE_TRUNCATE write disposition.
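To give you an idea, below is a minimal PySpark sketch of the partition-based invalidation; the bucket paths and the event_time/event_date columns are only illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch: overwrite only the partitions present in the processed batch, so a
# re-run invalidates the previously generated files for those partitions only.
spark = (SparkSession.builder
         .appName("output-invalidation")
         # without this setting, the overwrite mode would drop *all* existing partitions
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

daily_batch = spark.read.json("gs://my-input-bucket/events/2021-01-10/")  # made-up path

(daily_batch
 .withColumn("event_date", F.to_date("event_time"))
 .write
 .partitionBy("event_date")   # one directory per date
 .mode("overwrite")           # replaces only the matching partitions
 .parquet("gs://my-output-bucket/events/"))
```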

Change Data Capture

The second pattern, Change Data Capture, is a way to synchronize one database with other data stores in near real time. The idea is to read the transaction logs and replay them in the connected data stores. If you use cloud services, you have different choices to implement it.

On AWS you can use the Database Migration Service (DMS) to enable continuous data replication between 2 databases, not necessarily of the same engine. You can also precede the replication with a full load.
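For illustration, here is a hedged boto3 sketch of such a task combining the full load with the continuous (CDC) replication; all the ARNs and identifiers are placeholders:

```python
import boto3

# Sketch of a DMS task doing an initial full load followed by ongoing CDC replication.
# All ARNs and identifiers below are placeholders, not real resources.
dms = boto3.client("dms", region_name="eu-west-1")

task = dms.create_replication_task(
    ReplicationTaskIdentifier="orders-cdc-task",
    SourceEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:source-mysql",
    TargetEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:target-redshift",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:replication-instance",
    MigrationType="full-load-and-cdc",   # full load first, then continuous replication
    TableMappings='{"rules": [{"rule-type": "selection", "rule-id": "1", '
                  '"rule-name": "1", "object-locator": {"schema-name": "public", '
                  '"table-name": "orders"}, "rule-action": "include"}]}',
)

# once the task reaches the "ready" status, start the replication
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```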

On Azure, you can use the Azure Data Factory service to perform continuous replication. It works a bit differently than DMS because it uses a SQL query to fetch all changes and writes them to another storage like Azure Blob Storage. On SQL Server, you can retrieve the changes to replicate from the cdc.fn_cdc_get_all_changes_${table to replicate} table-valued function.
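Conceptually, the change-fetching query looks like in the following sketch, here executed with pyodbc against a hypothetical dbo_orders capture instance and placeholder credentials:

```python
import pyodbc

# Hedged sketch of the query conceptually run by the replication pipeline: pull all
# changes recorded by SQL Server CDC for one table between two log sequence numbers.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my-server.database.windows.net;"
    "DATABASE=sales;UID=etl_user;PWD=..."   # placeholder connection string
)
cursor = conn.cursor()

cursor.execute("""
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_orders');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, 'all');
""")
for change in cursor.fetchall():
    print(change)   # in a real pipeline, these rows would be copied to Blob Storage
```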

Regarding GCP, one of the recommended approaches suggests using Debezium, or any other CDC-related project, to extract the logs and write them to a Pub/Sub topic. These records are later continuously processed by Dataflow and loaded into another storage.
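The Dataflow part of that approach could look like this minimal Apache Beam sketch; the subscription, table and schema are made up for the example:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch of the Dataflow side: consume CDC events published (e.g. by Debezium) to
# Pub/Sub and append them to a BigQuery table. Topic, table and schema are made up.
options = PipelineOptions(streaming=True, project="my-project", runner="DataflowRunner",
                          region="europe-west1", temp_location="gs://my-bucket/tmp")

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "ReadChanges" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/cdc-orders")
     | "Decode" >> beam.Map(lambda message: json.loads(message.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            table="my-project:cdc.orders_changes",
            schema="order_id:INTEGER,operation:STRING,changed_at:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```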

Slowly Changing Dimensions

The idea behind Slowly Changing Dimensions (SCD) is to propose a mechanism to deal with the unpredictable but rather infrequent changes on the dimension tables of dimensional schemas (star, snowflake). But dimensions, especially in a Big Data context, can be problematic due to the distributed nature of the joins. That's why one of the accepted approaches to deal with them is data denormalization.

Besides this data denormalization approach, the dimensions can still be managed as separate tables. If they're small enough, you can leverage the data locality concept to optimize join execution. One example of such locality is the ALL distribution style in Redshift, where a table - usually a small one, to reduce the synchronization cost - is replicated across all nodes of the cluster. Thanks to that, the joins can be performed locally, without moving the data across the network.
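As an illustration of that replication, the following sketch creates such a small dimension table with the ALL distribution style; the cluster endpoint, credentials and table are of course made up:

```python
import psycopg2

# Hedged sketch: create a small dimension table replicated on every Redshift node
# (DISTSTYLE ALL), so joins against the fact tables stay local. Names are placeholders.
connection = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="..."
)
with connection, connection.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS dim_country (
            country_id   INTEGER      NOT NULL,
            country_name VARCHAR(128) NOT NULL
        )
        DISTSTYLE ALL;   -- full copy on each node, no network shuffle on joins
    """)
```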

Apart from this data locality aspect, you can also fully reload the dimension table or apply any of the popular SCD types, like row overwriting with UPDATE (type 1) or history logging (type 2), where a new row is added and the previous one is marked as expired.
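To make the history logging variant more concrete, here is a hedged sketch of a BigQuery MERGE handling one SCD type 2 update; the dataset and column names are illustrative, and a complete implementation would need a second pass to insert the new version of the changed rows:

```python
from google.cloud import bigquery

# Hedged sketch of SCD type 2 on BigQuery: expire the current version of changed rows
# and insert brand new dimension members. Dataset, tables and columns are made up.
client = bigquery.Client(project="my-project")

scd2_merge = """
MERGE `my-project.dwh.dim_customer` AS target
USING `my-project.staging.customer_updates` AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND target.address != source.address THEN
  -- mark the previous version of the row as expired
  UPDATE SET is_current = FALSE, valid_to = CURRENT_DATE()
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, valid_from, valid_to, is_current)
  VALUES (source.customer_id, source.address, CURRENT_DATE(), NULL, TRUE)
"""
client.query(scd2_merge).result()  # waits for the DML job to finish
# a second statement would then insert the fresh version of the rows expired above
```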

Fan-in pattern

Even though the fan-in pattern is mostly known from messaging systems, we can also implement it for batch pipelines with the help of cloud services. The idea of this pattern is to take multiple inputs and generate a single output. Of course, it can still be done from a data processing framework, but it's also supported by the cloud services offering a federated querying feature.

On GCP, BigQuery comes with federated query support for data stored on GCS, Cloud SQL and even Bigtable!
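For example, a fan-in query joining a native BigQuery table with data fetched live from Cloud SQL could look like this sketch; the connection id, tables and columns are made up:

```python
from google.cloud import bigquery

# Hedged sketch of a fan-in query: join a native BigQuery table with rows fetched live
# from Cloud SQL through a federated connection. Connection id and tables are made up.
client = bigquery.Client(project="my-project")

fan_in_query = """
SELECT o.order_id, o.amount, c.segment
FROM `my-project.sales.orders` AS o
JOIN EXTERNAL_QUERY(
  'my-project.europe-west1.crm-connection',        -- Cloud SQL connection resource
  'SELECT customer_id, segment FROM customers'     -- query pushed down to Cloud SQL
) AS c
ON o.customer_id = c.customer_id
"""
for row in client.query(fan_in_query).result():
    print(row.order_id, row.segment)
```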

On AWS, Redshift supports querying data directly from S3 with Redshift Spectrum, but it also has built-in support for federated queries executed on top of RDS PostgreSQL. Redshift is not the only service supporting federated queries, though. Another one is Athena, which can execute a SQL query on top of RDBMS, data warehouses and NoSQL stores (DynamoDB).
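On the Redshift Spectrum side, the fan-in could look like this hedged sketch, where Parquet files on S3 are exposed through an external schema and joined with a native table; the role ARN, bucket, cluster endpoint and tables are placeholders:

```python
import psycopg2

# Hedged sketch of the fan-in on Redshift: expose Parquet files on S3 through a
# Spectrum external schema, then join them with a regular Redshift table in one query.
connection = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="..."
)
connection.autocommit = True  # CREATE EXTERNAL TABLE can't run inside a transaction

with connection.cursor() as cursor:
    cursor.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'spectrum_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)
    cursor.execute("""
        CREATE EXTERNAL TABLE spectrum.raw_events (
            event_id   BIGINT,
            event_time TIMESTAMP,
            user_id    BIGINT
        )
        STORED AS PARQUET
        LOCATION 's3://my-raw-bucket/events/';
    """)
    cursor.execute("""
        SELECT u.country, COUNT(*) AS events
        FROM spectrum.raw_events e
        JOIN users u ON u.user_id = e.user_id   -- users is a native Redshift table
        GROUP BY u.country;
    """)
    print(cursor.fetchall())
```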

On Azure, federated queries are available in the Azure Data Lake Analytics service, where, thanks to U-SQL, you can process both files and Azure SQL tables.

One point to notice, though: the queries on external data sources will generally be less performant than the queries made on the native storage backend. The difference is mostly due to data organization differences and the lack of optimizations for querying big volumes of data at once.

Event-based processing

Finally, event-based processing. The goal is to process the data in response to an event, like an object creation or removal, or a new message in a message queue. To implement it, you will very often use a serverless function available on the cloud provider, like Lambda on AWS, Azure Functions on Azure or Cloud Functions on GCP.

All of these functions seamlessly integrate with the events happening in the cloud through the concept of triggers. You can use them not only to process the data directly, but also to delegate the processing to a batch pipeline.
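As an example of that delegation, here is a hedged sketch of an AWS Lambda function triggered by an S3 object creation event that hands the new file over to a Glue batch job; the job name and argument are made up:

```python
import boto3

# Hedged sketch of the delegation idea: a Lambda function triggered by an S3 object
# creation event passes the new file to a batch pipeline, here a Glue job.
glue = boto3.client("glue")

def handler(event, context):
    for record in event["Records"]:                  # S3 event notification format
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        glue.start_job_run(
            JobName="daily-events-etl",              # placeholder Glue job name
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
```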

Apache Airflow, or rather its GCP-managed version called Cloud Composer, has a feature called external triggering, where a DAG starts only when its trigger HTTP endpoint is invoked. You can then use a Cloud Function to call this endpoint. You can use the same strategy if you manage Airflow on your own, for example on a managed Kubernetes service.
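A hedged sketch of that strategy, with a GCS-triggered Cloud Function calling the Airflow 2 stable REST API, could look like this; the Airflow URL, credentials and DAG id are placeholders, and on Cloud Composer the call would additionally have to be authenticated with IAM:

```python
import requests

# Hedged sketch: a background Cloud Function, invoked by a google.storage.object.finalize
# event, triggers an Airflow DAG run through the stable REST API. Everything here
# (URL, credentials, DAG id) is a placeholder.
AIRFLOW_URL = "https://my-airflow.example.com/api/v1/dags/daily_ingestion/dagRuns"

def trigger_dag(event, context):
    response = requests.post(
        AIRFLOW_URL,
        json={"conf": {"bucket": event["bucket"], "name": event["name"]}},
        auth=("airflow_api_user", "..."),   # placeholder basic-auth credentials
        timeout=30,
    )
    response.raise_for_status()
```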

Regarding Azure, Azure Data Factory natively supports pipelines triggered by events. Or, not quite, since as of this writing only blob creation (create + edit) and deletion events on an Azure Storage container are supported. Anyway, it's a great opportunity to remove the overhead of maintaining a triggering function.

Naturally, all these patterns can be implemented with a data processing framework executed from an orchestrator. However, it was an interesting exercise to try to find similarities across different cloud providers. If you see something missing or can think of a different implementation for one of the listed patterns, the comments are yours! I will be happy to learn more.