Data ingestion to the cloud object store

The volume of data to migrate from an on-premises environment to the cloud is probably smaller than in previous years, since many organizations are already on the cloud. Still, it's interesting to see the different methods for bringing the data there, and that's what I'll show you in this blog post.



Bad news. There is no single way to transfer data to the cloud. Firstly, there are 2 types of data ingestion, online and offline. The difference? The offline mode works asynchronously: the users first copy the data from an on-premises environment to a physical device, then ship this device to the cloud provider, who takes care of ingesting it into its data center. In the online mode, the process copies the data synchronously over the network.

Moreover, you will find several implementations within these 2 modes. To choose the migration tool, you'll often need to consider these 3 points:

Those are the generic points to consider in the data ingestion process. But what cloud services can you use?

Offline transfer

Let's start with the offline ingestion. The cloud provider ships you a physical device, which can be small or really huge, and you're responsible for loading onto it all the data to integrate into cloud storage. In this category, you'll find the following services:

| Service | Storage capacity | Misc |
|---------|------------------|------|
| AWS Snowcone | 14TB | Optimized for transferring large files, starting from 5MB. Supports AWS KMS encryption keys. |
| AWS Snowball | 80TB | The device can have extra compute capacity to perform local processing on Lambda functions or EC2-compatible instances. Data is encrypted with keys managed in AWS KMS. |
| AWS Snowmobile | 100PB | AWS sends a 45-foot long shipping container that you can connect to your network with the help of a dedicated AWS engineer. All transferred data is encrypted with keys from AWS Key Management Service (KMS). Additionally, the container has GPS tracking, alarm monitoring, and 24/7 video surveillance. |
| Azure Data Box Disk | 35TB | Uses a USB 3.0 connection to transfer the data and stores it encrypted. |
| Azure Data Box | 80TB | Exposes 1-Gbps or 10-Gbps network interfaces and stores the data encrypted. |
| Azure Data Box Heavy | 800TB | Transfers data over 40-Gbps network interfaces and encrypts it at rest. |
| Azure Import/Export | variable | In this mode, you use your own SSD/HDD SATA II or SATA III disks. You install the Azure Import/Export binary, copy the data, and ship the disks to Azure. This mode is supported for Azure Blobs and Azure Files. |
| GCP Transfer Appliance | 300TB | Recommended for 10TB or more, when the network transfer would take more than a week. Stores the data encrypted, with keys managed in the Cloud Key Management Service (KMS). |
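GCP's rule of thumb from the last row can be turned into a tiny decision helper. A minimal sketch, assuming decimal terabytes, an idealized fully utilized link, and hypothetical function names of my own:

```python
def recommend_offline_transfer(size_tb: float, bandwidth_gbps: float) -> bool:
    """Ship a physical appliance when the dataset is 10 TB or more,
    or when pushing it over the network would take longer than a week
    (GCP's guidance for the Transfer Appliance)."""
    size_bits = size_tb * 8 * 10**12                    # decimal TB -> bits
    days = size_bits / (bandwidth_gbps * 10**9) / 86400  # ideal transfer time
    return size_tb >= 10 or days > 7

# A 300 TB dataset qualifies by size alone; 0.5 TB over a fast
# 10 Gbps link is better sent online.
print(recommend_offline_transfer(300, 1))   # True
print(recommend_offline_transfer(0.5, 10))  # False
```

Real planning should also budget for protocol overhead and the fraction of the link you can actually dedicate to the migration.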

Online transfer

The online ingestion works best for small datasets or high-bandwidth connections. If you need to transfer a big volume of data over a poor connection, you can ask your cloud provider for a dedicated high-throughput network. AWS provides it with Direct Connect, Azure with ExpressRoute, and GCP with Cloud Interconnect. The idea is simple: they provide up to 100 Gbps on a dedicated connection between your data center and the cloud provider. You can then leverage it for a faster data transfer, even for bigger datasets. How fast can it go? According to GCP's calculator, moving 1TB of data over a 100-Gbps network takes only about 2 minutes!
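You can sanity-check that calculator figure with back-of-the-envelope arithmetic, assuming decimal units and zero protocol overhead:

```python
size_bits = 1 * 8 * 10**12      # 1 TB (decimal) expressed in bits
link_bps = 100 * 10**9          # 100 Gbps dedicated link
seconds = size_bits / link_bps  # ideal transfer time, no overhead
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # 80 s (~1.3 min)
```

The ideal figure is about 80 seconds; the gap up to the quoted 2 minutes is what protocol overhead and less-than-perfect link utilization cost you.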

But even if you don't have a dedicated fast connection to the cloud, you can still work in the online mode and use one of the dedicated cloud data ingestion services:

| Service | Description |
|---------|-------------|
| AWS DataSync | Synchronizes data between on-premises and cloud storage services, including S3, EFS, FSx, NFS, SMB, HDFS, and on-premises object stores. |
| AWS Storage Gateway | Comes with 3 gateway types (file-based, volume-based, tape-based). It's a kind of proxy between on-premises appliances and the virtually unlimited cloud storage on S3. Applications can use one of the supported protocols (iSCSI, SMB, NFS) to interact with S3 without needing to rewrite their I/O exchange part. |
| AWS Transfer Family | Supports SFTP, FTPS, and FTP-based data transfers to S3 and EFS. |
| AWS Database Migration Service | Works with any database supporting S3 as the target destination. The service performs either a full load or an incremental load from an on-premises database to the cloud. |
| GCP Storage Transfer Service | Exchanges data between GCS and an on-premises environment. It comes with a native scheduling capability to support incremental copies. |
| Azure Data Factory | Moves data between a Storage Account and on-premises storage with a Copy activity. The operation also works with services of other cloud providers, such as BigQuery or RDS. |
| Azure Data Box Gateway | Unlike the other Data Box products, the Gateway is a virtual machine provisioned in your own virtualized environment or hypervisor. It sits on your premises and supports the NFS and SMB protocols to transfer data continuously to a Storage Account (block blob, page blob, Files). |

In addition to the dedicated services, AWS, Azure, and GCP have command-line utilities for accelerated data transfer: on AWS the AWS CLI (`aws s3 cp`/`sync`, or S3DistCp on EMR), on Azure AzCopy, and on GCP gsutil.
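The three utilities share the same basic shape. A dry-run sketch, with hypothetical bucket, container, and path names; the `run` wrapper only prints each command, so drop the `echo` to execute for real:

```shell
#!/bin/sh
SRC=/data/exports
run() { echo "+ $*"; }  # dry-run wrapper: prints the command instead of running it

# AWS CLI: recursive copy to an S3 bucket
run aws s3 cp "$SRC" s3://my-bucket/exports --recursive

# AzCopy: recursive copy to a Blob Storage container (auth, e.g. a SAS token, assumed)
run azcopy copy "$SRC" "https://myaccount.blob.core.windows.net/exports" --recursive

# gsutil: parallel (-m) recursive copy to a GCS bucket
run gsutil -m cp -r "$SRC" gs://my-bucket/exports
```

Each tool also offers tuning knobs (multipart sizes, parallelism, bandwidth caps) worth checking in its documentation before a large migration.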

While preparing this blog post, I almost satisfied my curiosity about data ingestion to the cloud. I still have one remaining topic to cover: API-based ingestion.