Data is a living thing. It gets queried, written, overwritten, backfilled and... migrated. Since the last point is the least obvious of the list, I've recently spent some time trying to understand it better in the context of the cloud.
Data migration is not as simple as moving the data from database A to database B. Having the data in a different place is only the final outcome; the whole process involves more steps:
- Initial assessment. It's a preparation step. You scope the work to do, detect potential problems in the process, and identify the affected consumers and any data-related changes caused by the migration.
- Define the migration. It's time to choose the data migration type. Lift & shift is the easiest one because it simply moves the data to the new data store. A more complex scenario involves additional steps, such as applying new data transformations or combining the data with other datasets. At the end of this step, you should have the migration plan ready, including the technical part.
- Create the technical solution. It's time to implement the data migration job or configure an existing tool for your scenario. Test the solution against the worst-case scenarios to migrate, so that you can avoid surprises when running it in production.
- Perform the migration. This is the moment when the tool takes the data from one database and moves it to another. The process should expose monitoring metrics useful for spotting any issues.
- Assert the migration. Do not communicate success to your data consumers right after performing the migration. Although it executed successfully from the technical standpoint (the migration job succeeded), you don't know the state of the data yet. Maybe the migration job has some hidden issues and migrated several fields incorrectly, for example by losing precision in floating-point numbers or mis-formatting date fields. You should therefore assert the migration to ensure the completeness and correctness of the new dataset.
- Communicate the change. Finally, you can ask your data consumers to switch their input path to the new database. You can skip this step when there are no active consumers.
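The assertion step can be as simple as comparing row counts and per-row checksums between the two databases. Here's a minimal sketch in Python; the `row_checksum` and `assert_migration` helpers are hypothetical names, not part of any cloud SDK:

```python
import hashlib


def row_checksum(row):
    """Build a deterministic checksum from a row's values."""
    return hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()


def assert_migration(source_rows, target_rows):
    """Check completeness (row counts) and correctness (checksums)."""
    if len(source_rows) != len(target_rows):
        raise AssertionError(
            f"Row count mismatch: {len(source_rows)} != {len(target_rows)}"
        )
    source_sums = {row_checksum(r) for r in source_rows}
    target_sums = {row_checksum(r) for r in target_rows}
    missing = source_sums - target_sums
    if missing:
        # A non-empty set means some rows were altered, e.g. by a
        # precision loss or a date formatting issue in the migration job.
        raise AssertionError(f"{len(missing)} rows differ after migration")
    return True
```

In a real migration you would fetch both row sets from the databases (or compare aggregates per partition to avoid a full scan), but the idea stays the same: verify before you communicate.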
Data migration and cloud services
There are different categories of data migration. The first one divides the pipelines by continuity: we talk then about continuous and offline data migrations. The difference between them is when the migration runs. In continuous migration, the data flows from one database to another in real time. Offline migration, on the other hand, involves stopping the data source and moving its data to another data store. Which cloud services fit each mode?
- Continuous migration. You'll find here:
- Streaming systems (AWS Kinesis, Azure Event Hubs, GCP Pub/Sub) and their consumers (AWS Lambda, AWS EMR, Azure Databricks, GCP Dataflow). You can natively perform continuous migration with the streaming consumers. Often, the consumers will be easily extensible with custom data transformations.
- Messaging queues (AWS SQS, Azure Event Grid) and event notifications (S3, GCS, ...). Here the strategy is to expose a data migration endpoint that will be notified about new data to migrate.
- Change Data Capture (CDC) systems. You can use here NoSQL stores (AWS DynamoDB Streams, Azure Cosmos DB Change Feed) and data migration services (AWS Database Migration Service, GCP Datastream).
- Delta column. This strategy works for batch processing. It requires identifying a column that returns the data changes since the previous batch execution. You can use here a custom batch job executed from a data processing service (AWS EMR, Azure Databricks, GCP Dataflow), or a built-in mechanism (Azure Data Factory with its Copy Activity).
- Offline migration. You can use here a data migration service (AWS Database Migration Service, Azure Data Factory with its Copy Data tool, GCP Database Migration Service).
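The delta column strategy from the list above can be sketched in a few lines. The example below assumes a hypothetical dataset with an `updated_at` column acting as the delta column; each batch run only extracts the rows changed since the previous execution:

```python
from datetime import datetime, timezone


def extract_delta(rows, delta_column, last_run):
    """Return only the rows changed since the previous batch execution."""
    return [row for row in rows if row[delta_column] > last_run]


# Hypothetical source rows; `updated_at` is the delta column.
rows = [
    {"id": 1, "updated_at": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2023, 3, 10, tzinfo=timezone.utc)},
]

# Watermark persisted by the previous run of the migration job.
last_run = datetime(2023, 2, 1, tzinfo=timezone.utc)

changed = extract_delta(rows, "updated_at", last_run)  # only id=2
```

In a real job the filter would be pushed down to the source database as a `WHERE updated_at > :last_run` predicate, and the watermark would be stored durably between runs.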
The second categorization of data migration is based on the differences between the databases. When the source and the destination are of the same type, we talk about a homogeneous migration. In the opposite case we perform a heterogeneous migration. You'll find pretty much the same services as for the continuity category:
- Streaming systems and their consumers support both migration types. For the homogeneous scenario, they'll be simple passthrough jobs and for the heterogeneous case, they should be enriched with an extra data mapping stage.
- Similarly, messaging queues and event notifications behave differently depending on the scenario. In the homogeneous case, the notified applications use the copy method of the cloud SDK to duplicate the data. The heterogeneous scenario, on the other hand, requires accessing and transforming the data.
- Change Data Capture (CDC) for NoSQL systems works exactly like the streaming systems. The consumer can transform the data to a different format or just move it without transformation. The heterogeneous scenario expects some custom mapping code, though. On the other hand, CDC-based data migration tools (AWS Database Migration Service, GCP Datastream) don't have a custom schema transformation capability. For AWS Database Migration Service you have to perform the mapping before the migration with the help of the Schema Conversion Tool. In the case of GCP Datastream there is nothing to do, but it only supports GCS as the destination.
- Delta column. This approach supports heterogeneous and homogeneous scenarios, with a possibility to apply a custom transformation at the data processing service level if needed.
- Offline migration services also support both scenarios, but with an extra mapping step in the heterogeneous migration. For example, Azure Data Factory Copy Activity supports schema and data type mappings as part of the pipeline, but AWS Database Migration Service expects the schema definition before the data migration.
Data migration isn't just about moving the data. It's also about planning, ensuring consistency, and tooling. When you perform the migration on the cloud, you can use a dedicated service like AWS Database Migration Service or GCP Datastream, or a custom mechanism with some extra transformation logic.