Data migration on the cloud

Data is a living thing. It gets queried, written, overwritten, backfilled and ... migrated. Since the last point is the least obvious of the list, I've recently spent some time trying to understand it better in the context of the cloud.

Data migration

Data migration is not as simple as moving the data from database A to database B. Having the data in its new location is only the end result; the whole process involves more steps:

  1. Initial assessment. It's a preparation step: you scope the work to do, detect the potential problems in the process, and identify the affected consumers and any data-related changes caused by the migration.
  2. Define the migration. It's time to define the data migration type. The lift & shift is the easiest one because it simply consists of moving the data as-is to the new data store. A more complex scenario involves additional steps, such as adding new data transformations or combining the data with other datasets. You should have the migration plan, including the technical part, ready at the end of this step.
  3. Create the technical solution. It's time to implement the data migration job or set up an existing tool for your scenario. The solution should be tested against the worst-case scenarios to migrate, so that you can avoid surprises while running it in production.
  4. Perform the migration. It's the moment when the tool takes the data from one database and moves it to the other. The process should provide you with monitoring metrics useful for spotting any issues.
  5. Assert the migration. Don't communicate success to your data consumers right after performing the migration. Although it executed successfully from the technical standpoint (the migration job succeeded), you don't know the state of the data yet. Maybe the migration job has some hidden issue and migrated several fields incorrectly, for example by losing precision in floating-point numbers or incorrectly formatting date fields. You should therefore assert the migration to ensure the completeness and correctness of the new dataset.
  6. Communicate the change. Finally, you can ask your data consumers to switch their input path to the new database. You can skip this step when there are no active consumers.
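The assertion step above can be sketched in a few lines. This is a minimal, illustrative check, not a production validator: it compares row counts and an order-insensitive checksum between the source and the migrated dataset, using hypothetical in-memory tables in place of real database reads.

```python
import hashlib

def table_checksum(rows):
    """Order-insensitive checksum: hash each row, XOR the digests together."""
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
        acc ^= int(digest, 16)
    return acc

def assert_migration(source_rows, target_rows):
    """Return a list of detected problems; an empty list means the check passed."""
    issues = []
    if len(source_rows) != len(target_rows):
        issues.append(f"row count mismatch: {len(source_rows)} vs {len(target_rows)}")
    if table_checksum(source_rows) != table_checksum(target_rows):
        issues.append("checksum mismatch: at least one row was altered")
    return issues

source = [{"id": 1, "price": 19.99}, {"id": 2, "price": 5.5}]
# Row order changed but the data is intact: the check passes.
migrated_ok = [{"id": 2, "price": 5.5}, {"id": 1, "price": 19.99}]
# A precision loss on one field: the checksum catches it.
migrated_bad = [{"id": 1, "price": 19.99}, {"id": 2, "price": 5.5000001}]

print(assert_migration(source, migrated_ok))   # []
print(assert_migration(source, migrated_bad))  # ['checksum mismatch: ...']
```

A real assertion would run aggregate queries (counts, min/max, per-column checksums) against both databases rather than loading all rows, but the idea is the same: verify the data, not just the job status.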

Data migration and cloud services

There are different categories of data migration. The first one divides migrations in terms of continuity: we talk about continuous and offline data migrations. The difference between them is the migration time. In a continuous migration, the data flows from one database to another in near real-time. An offline migration, on the other hand, involves stopping the data source and moving its data to the other data store. What are the cloud services to put in front of each mode?
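The two modes can be contrasted with a short sketch. This is an illustration under simplified assumptions, with in-memory structures standing in for real databases and a hypothetical change stream standing in for a CDC feed:

```python
def offline_migration(source_rows, target_rows):
    """Offline mode: writes to the source are stopped first, so a single
    bulk copy captures the complete dataset."""
    target_rows.extend(source_rows)

def continuous_migration(snapshot, change_stream, target_table):
    """Continuous mode: bulk-load an initial snapshot, then replay the
    change events (CDC-style) captured while the copy was running."""
    target_table.update(snapshot)
    for operation, key, value in change_stream:
        if operation == "upsert":
            target_table[key] = value
        elif operation == "delete":
            target_table.pop(key, None)

# Offline: the source was frozen, nothing changed during the copy.
offline_target = []
offline_migration([{"id": 1}, {"id": 2}], offline_target)

# Continuous: row 2 was updated and row 1 deleted while the snapshot loaded.
online_target = {}
continuous_migration(
    snapshot={1: "v1", 2: "v1"},
    change_stream=[("upsert", 2, "v2"), ("delete", 1, None)],
    target_table=online_target,
)
print(online_target)  # {2: 'v2'}
```

The continuous variant is what CDC-based services automate: the hard parts they solve are ordering the change events and knowing when the stream has caught up.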

The second category for data migration is based on the database differences. When the source and destination are of the same type, we talk about a homogeneous migration; in the opposite case we perform a heterogeneous migration. You'll find pretty much the same services as for the continuity category.
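The extra work in the heterogeneous case is largely schema conversion. As a minimal sketch, assuming a MySQL-like source and a BigQuery-like target (real tools such as AWS DMS ship far more complete conversion rules), a type-mapping step could look like this:

```python
# Hypothetical conversion table from relational source types
# to analytical target types; only a handful of rules for illustration.
TYPE_MAP = {
    "INT": "INT64",
    "VARCHAR": "STRING",
    "DATETIME": "TIMESTAMP",
    "DECIMAL": "NUMERIC",
}

def convert_schema(source_schema):
    """Translate each column's type; fail fast on anything unmapped so the
    problem surfaces in the assessment step, not mid-migration."""
    converted = {}
    for column, source_type in source_schema.items():
        if source_type not in TYPE_MAP:
            raise ValueError(f"no conversion rule for type {source_type}")
        converted[column] = TYPE_MAP[source_type]
    return converted

print(convert_schema({"id": "INT", "created_at": "DATETIME"}))
# {'id': 'INT64', 'created_at': 'TIMESTAMP'}
```

In a homogeneous migration this whole step disappears, which is one reason it's considered the simpler scenario.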

Data migration isn't just about moving the data. It's also about planning, ensuring consistency, and tooling. When you perform the migration on the cloud, you can use a dedicated service like AWS Database Migration Service (DMS) or GCP Datastream, or a custom mechanism with some extra transformation logic.